A Distributed Tracking Algorithm for Counting People in Video by Head Detection? Denis Kuplyakov1,2[0000−0002−2957−3297] , Yaroslav Geraskin1[0000−0002−2764−272X] , Timur Mamedov1,2[0000−0001−6554−7988] , and Anton Konushin1,3[0000−0002−6152−0021] 1 Faculty of Computational Mathematics and Cybernetics, Moscow State University, Moscow, Russia 2 Video Analysis Technologies, Moscow, Russia 3 NRU Higher School of Economics, Moscow, Russia {denis.kuplyakov,yaroslav.geraskin, timur.mamedov,anton.konushin}@graphics.cs.msu.ru Abstract. We consider the problem of people counting in video surveillance. This is one of the most popular tasks in video analysis, because this data can be used for predictive analytics and improvement of customer services, traffic con- trol, etc. Our method is based on the object tracking in video with low framerate. We use the algorithm from [1] as a baseline and propose several modifications that improve the quality of people counting. One of the main modifications is to use a head detector instead of a body detector in the tracking pipeline. Head tracking is proved to be more robust and accurate as the heads are less suscep- tible to occlusions. To find the intersection of a person with a signal line, we either raise the signal lines to the level of the heads or perform a regression of bodies based on the available head detections. Our experimental evaluation has demonstrated that the modified algorithm surpasses original in both accuracy and computational efficiency, showing a lower counting error on a lower detection frequency. Keywords: Computer Vision, Video Analytics, Tracking, People Counting. 1 Introduction Counting people passing through certain zones of a public infrastructure, such as pedes- trian crossings, sidewalks, squares, etc., is a practically important task. There are many solutions to this problem, one of them is object tracking. The task of object tracking is to create tracks for each person. A track unambiguously corresponds to a person. It marks this particular person locations in all frames in which he or she is visible. In order to count people a signal line is usually specified in the frame (see fig. 1). If the track crosses the signal line, we can say with confidence that the person also crossed it. Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). ? Publication is supported by RFBR grant 18-08-01484 2 D. Kuplyakov et al. Fig. 1. An example of signal line for estimating the load of a crossing. We propose a fully automatic people counting algorithm. The algorithm takes as in- put a video stream {Fi }i=1 of frames captured by stationary camera and signal line that specified by an ordered pair of points (La , Lb ) on the frame. The output of the algo- rithm is a set of events {Ei }i=1 that represented by triples of values Ei = (ki , ri , di ). The first value indicates the number of the frame where the signal line was crossed, the second value specifies the coordinates of the bounding box and the last value indicates the direction of the signal line intersection. Our solution is an extension of the algorithm described in [1]. In this paper we propose the following improvements: – use of the head detector instead of full-body detector; – detection matching procedure is modified in order to work with small head bound- ing boxes; – the use of body regression on the heads at different stages of tracking; – an algorithm to automatically determine a region of interest (ROI) by signal line position is proposed to speed up detection. 2 Related Work The modern methods of tracking are based on tracking-by-detection. There are a lot of ways to detect the desired object on the frame. Three most popular methods are: detection of body [2,3,4], detection of head [5,6] and using of key points. The first solution is popular because there are many datasets and ready-made solutions. The head-tracking approach is well suited to track people in a crowd: usually video surveillance cameras are installed above the height of the person, where heads in a crowd can be seen better than full bodies. Heads are more resistant to overlapping than A Distributed Tracking Alg. for Counting People in Video by Head Det... 3 bodies. The number of ready-made solutions and data for training is less than for bodies. There are methods that use body parts detectors for tracking [7], key points of human pose [8], combined solutions (body and head) [9], and detector ensembles [10]. After detection we need to bind all the detections to tracks. As in the detection task, there are a lot of methods. The first group of algorithms for creating tracks is greedy algorithms. In most online algorithms tracks are constructed frame by frame, each frame creates a matrix of the cost of matching new detections and existing tracks, then the problem of matching is solved by a greedy algorithm (searching for the maximum in each row/column) [11,12] or by a Hungarian algorithm [13] [2,3,4]. Sometimes MCMC is used to bind all the detections to tracks [5,14]. Recently, neural networks have been used more often in tracking. For example, in paper [15] authors suggest using detector to obtain new detection by regression of detection on previous frame. However, this method has disadvantages. For example, it is able to work well only at a high frame rate and it also increases the load on the detector. 3 The Proposed Algorithm 3.1 Baseline We use the solution from [1] as the baseline, which is an extension of SORT tracking algorithm [2]. We choose this algorithm as it is capable to work by detection on a sparse set of frames, which is performed on remote servers. This significantly reduces the amount of computational resources required for large-scale video surveillance systems (see fig. 2). Our proposed method inherits distributed nature of baseline. Baseline works in online mode and use the Hungarian algorithm [13] to match de- tections. To improve results at a low detection rate ASMS visual tracking [16] is used to evaluate a speed of people between frames. The same approach with visual tracking speed estimation is used in [11]. The baseline algorithm consists of the following steps: (1) detection; (2) evaluation of the speed of detections using visual tracking; (3) prediction of the position of tracks by the Kalman filter; (4) matching; (5) extrapolation of the tracks; (6) detection of signal line crossing events. Proposed improvements to some of the steps above are described below. 3.2 Detection Since heads are seen better in the video and are less prone to occlusions we decided to use the head detector based on SSD [17] approach instead of the body detector. An- other advantage of the head detector is that neural network body detectors can combine nearby people into one bounding box, which is less frequent for heads. The detector were trained on CrowdHuman [18] public dataset and on the dataset collected by Video Analysis Technologies. Experimental evaluation showed AUC of 0.66 on the test part of CrowdHuman. Usage of the head detector leads to a sub-task of restoring a bounding box of an entire body to find intersections of the signal line, which is located on the ground. We describe it in section 3.4. 4 D. Kuplyakov et al. 8 Mb/s 15 FPS 8 Mb/s 15 FPS 105 pcs. ... 8 Mb/s 15 FPS Datacenter 8 Mb/s 15 FPS ... 240 Gb/s 8 Mb/s 2 FPS 15 FPS ≈ 4500 pcs. ×105 ... ≈ 13400 pcs. 8 Mb/s 15 FPS Fig. 2. Resources (CPUs, GPUs) required for large-scale video surveillance system (105 cam- eras) for (top) traditional tracking algorithms; (bottom) baseline algorithm with 2 Hz detection frequency. 3.3 Matching During the experiments we realized that IOU metric used in the basic algorithm is not suitable for head matching. As head bounding boxes are small enough due to the errors in position and speed estimation bounding box pairs required to match are closely posi- tioned, but have no intersection. IOU equal to zero acquired in the case which leads to track partitioning. Therefore we suggest to increase by s times the size of the bounding boxes (while saving the center of the bounding box) before matching them using IOU metric. This approach solves the problem described above. 3.4 Signal Line Crossing We got the problem of missing signal line intersections as head tracks are located on a human height and the signal line is located on the ground. Problem reveals itself most on the scenes where the signal line is placed orthogonal to the camera. Therefore we offer two solutions: rising of signal lines and body bounding box regression by head. Rising of Signal Lines We propose to raise the signal line to the level of the human head (see fig. 3). This solution reduces the computational cost and time of the algorithm as it doesn’t require any extra steps to project head track to the ground plane. The signal line can be raised either manually or automatically. Automatic approach we propose is using the following anatomical fact: the height of the head is 81 of the A Distributed Tracking Alg. for Counting People in Video by Head Det... 5 height of the entire human body. We can use a body detector to calculate average height of human body for the scene. And raise the signal line to a height equal to 87 of the average height of a person on the scene. This approach has a drawback: people’s growth is different, which means that peo- ple’s heads are on different planes, while people’s legs are always on the same plane. Therefore, the raised signal line doesn’t look so clearly since it’s not clear where the plane is located. Fig. 3. Raised signal line (left picture) and signal line on a ground plane (right picture). Bounding Box Regression by Head To solve the drawbacks that arose in the previous paragraph we propose to use regression of a body by a head, which is trained for the specific scene. Our idea consists of two stages: combining the detections of heads and bodies for the scene and training linear regression on the received data. Combining head and body detections First of all we launch head and body detectors. To combine heads and bodies a matrix of head and body correspondence is built on each frame. CM k, i, j = costmatch (Bk, i , Hk, j ), i ∈ [1, nbk ], j ∈ [1, nhk ] (1) In eq. (1) nbk , nhk – the number of body and head detections on the k-th frame, Bk, i , Hk, j – the bounding box of the i-th body and j-th head on the k-th frame. We calculate matching cost as: 6 D. Kuplyakov et al. ( iohk, i, j , if iohk, i, j ≥ I1 ∧ τ1 ≤ dist(Bk, i , Hk, j ) ≤ τ2 costmatch (Bk, i , Hk, j ) = 0, else (2) area (intersection(Bk, i , Hk, j )) iohk, i, j = (3) area(Hk, j ) centery (Hk, j ) − top(Bk, i ) dist(Bk, i , Hk, j ) = (4) height(Hk, j ) In eq. (2) I1 is threshold for iohk, i, j (we are using I1 = 0.5), τ1 , τ2 – minimum and maximum normalized distances between the center of the head and the upper point of the body (we are using τ1 = −1 and τ2 = 1). So at least half of head bounding box area should be inside body bounding box area and vertically head center shouldn’t be far from body top by at least one head height. Next the assignment problem is solved using the Hungarian algorithm to maxi- mize cost. Combined head and body detections with non-zero match cost form training dataset for a regression (see fig. 4). Linear regression After the previous step we have the data to train a linear regression. The training dataset consists of head and body bounding box pairs: (Bi , Hi ). The re- gressor predicts the following values: (Bh , Bw , shif tx ), where Bh , Bw – height and width of the body bounding box, shif tx – a shift from the center of the head to the center of the predicted body normalized by head width: centerx (B) − centerx (H) shif tx = (5) Hw Prediction is done by linear regression with quadratic members: ch = k0 + k1 Hx + k2 Hy + k3 Hw + k4 Hh + k5 Hx2 + k6 Hx Hy + B . . . + k13 Hh Hw + k14 Hh2 (6) The same way B \ cw , shif tx are predicted, but with separate sets of coefficients. After learning coefficients of linear regression on training dataset, we use predicted value to restore body bounding box: cx = centerx (H) − Bw + shif tx Hw c B (7) 2 B cy = Hy (8) This approach helps us keep the signal line on the ground plane, which solves draw- backs of the previous approach. A Distributed Tracking Alg. for Counting People in Video by Head Det... 7 Now we have a choice to do speed estimation by visual tracking with head or re- gressed body detections. As heads are less prone to occlusions visual tracking of them may be more reliable. But body bounding boxes are several times larger and have more visual information to track. We check both choices on experimental evaluation. Fig. 4. Combined detections (left picture): head without pair (pink, blue), body without pair (dark blue), combined head and body (green). Results of body regression (right picture). 3.5 Limiting the Detection Area The entire frame is not required to find events of crossing the signal line. We can limit the detection area to the area around the signal line – region of interest (ROI). This solution speeds up the SSD detector as it is slower for images with higher resolution. If the signal line is horizontal then the ROI is located on the top and bottom of the line. Otherwise the ROI is located on the left and right side of the line, as well as on the top of the line. Let be α ≤ π2 – the minimum angle between signal line and horizon, wmean , hmean – the average width and height of body detections and (xa , ya ), (xb , yb ) – the coordi- nates of the beginning and the end of the signal line.   sin α x1 = min(xa , xb ) − sw wmean 1 + (9) 2   sin α x2 = max(xa , xb ) + sw wmean 1 + (10) 2 y1 = min(ya , yb ) − sh hmean cos α (11) y2 = max(ya , yb ) + sh hmean cos α (12) 8 D. Kuplyakov et al. Then A = (x1 , y1 , x2 , y2 ) is the region of interest. sw , sh in equations 9-12 are parameters. After visual testing we selected the following values for these variables: sw = 2, sh = 1. 4 Experimental Evaluation 4.1 Datasets For an experimental evaluation of our algorithm we need datasets filmed by static cam- era with body tracks markup. Video sequences should be long enough to evaluate people counting quality. If dataset provides head tracks markup it allows us to check effect of body regression by comparing with head tracking and raised signal lines (section 3.4). Most of the public datasets including popular MOTChallenge dataset [19] have short videos, only body tracks markup or filmed by moving camera. So we used 19 videos from the collection of the Video Analysis Technologies company and the Towncentre dataset [5] to test our algorithm. For all videos signal lines were manually drawn at ground level as well as at head level. The table 1 provides detailed information about each test video. Table 1. Videos that we used to test the algorithm. Video name Duration Format Number of events SignalineBase/02 00:17:40 1280x720@25 254 SignalineBase/03 00:14:40 1280x720@25 134 SignalineBase/04 00:05:20 1280x720@25 69 SignalineBase/05 00:11:40 1280x720@25 172 SignalineBase/06 00:12:00 1280x720@25 219 SignalineBase/07 00:08:29 800x450@25 124 SignalineBase/10 00:38:59 704x576@25 120 SignalineBase/11 00:22:48 704x576@25 91 SignalineBase/13 00:05:01 720x576@12 9 SignalineBase/14 00:05:23 720x576@12 17 SignalineBase/17 00:16:42 1280x1280@20 48 SignalineBase/18 00:16:26 1280x1000@25 180 SignalineBase/20 00:15:00 1024x768@20 144 SignalineBase/21 00:19:40 800x600@25 149 SignalineBase/26 00:10:29 1280x720@15 90 SignalineBase/28 00:11:40 1280x960@25 72 SignalineBase/29 00:40:00 704x576@25 159 SignalineBase/31 00:29:10 1920x1080@25 381 Towncentre 00:03:00 1920x1080@25 162 4.2 Metrics As a quality metric we use the average error of counting the number of intersections (events) [1]. The resulting events can include both true and false ones. The false events A Distributed Tracking Alg. for Counting People in Video by Head Det... 9 have no correspondences in the reference labeling. We say that an event Ei in the input set of data matches the event E ci in the reference labeling if they correspond to the same person crossing the signal line at the same time. We match all events as described in the paper [1]. After the events have been matched we divide videos to segments with 10 reference events and calculate the following characteristics on them: – GT seg is the number of reference events on the segment; – F P seg is the number of unmatched events from the algorithm on the segment; – F N seg is the number of unmatched events from the reference events on the seg- ment; F P seg −F N seg – Eseg = GT seg is an error on the segment. PN Eseg Then final error is calculated as E = i=1 N , where N is the number of the seg- ments. 4.3 Experimental Results Rising of Signal Lines At first we have tested baseline with modifications proposed in sections 3.2, 3.3 and manually raised signal lines as described in section 3.4. The algo- rithm is marked as heads-no-regression in the results table (see table 2) and parameter s shows how many times head bounding boxes have been increased. Table 2. Results of experiments with rising the signal lines. The table shows counting error E in percents for the algorithms depending on the detection frequency. Algorithm / Detection frequency 5 Hz 3 Hz 2 Hz 4/3 Hz SORT [2] 8.3 10.8 17.3 - baseline [1] 7.5 7.5 7.5 8.1 heads-no-regression s = 1 4.6 5.5 8.2 9.3 heads-no-regression s = 2 3.9 4.5 5.1 6.5 Experiments with rising the signal line clearly shows the advantage of the head detector over the body detector. The head detector gives a significant increase in an accuracy reducing the error by ≈ 2 times. The algorithm without matching modification (s = 1) has poor results at low FPS due to the previously voiced problems which shows importance of the modification proposed in the section 3.3. Body Bounding Box Regression by Head Next we have tested body bounding box regression by head. As mentioned in section 3.4 there are two alternative ways to apply it. There are two algorithms we have tested (see table 3): – heads-regression-vistrk — baseline with modifications proposed in sections 3.2, 3.3, visual tracking of regressed bodies to estimate speed and signal lines on the ground plane (section 3.4); 10 D. Kuplyakov et al. – heads-vistrk-regression — baseline with modifications proposed in sections 3.2, 3.3, visual tracking of heads to estimate speed and signal lines on the ground plane (section 3.4). Table 3. Results of experiments with bounding box regression by head. The table shows counting error E in percents for the algorithms depending on the detection frequency. Algorithm / Detection frequency 5 Hz 3 Hz 2 Hz 4/3 Hz baseline [1] 7.5 7.5 7.5 8.1 heads-regression-vistrk 4.5 4.9 5.7 6.8 heads-vistrk-regression 4.1 4.1 4.3 5.9 heads-no-regression 3.9 4.5 5.1 6.5 As you can see the second configuration gives better results. Visualization showed us that visual tracking of heads is more reliable as they are less prone to occlusions. It worth noting that heads-vistrk-regression performed better than heads-no-regression (s = 2) on low detection frequency. Limiting the Detection Area Next we tested limiting of the detection area (sec- tion 3.5). It allowed us to increase the speed of the algorithm almost without affecting the counting error (see table 4). Detection area reduced by almost 60% on some of the videos. Table 4. Results of experiments with limiting the detection area. The table shows counting error E in percents for the algorithms depending on the detection frequency. Algorithm / Detection frequency 5 Hz 3 Hz 2 Hz 4/3 Hz baseline + detection ROI 7.6 7.6 7.6 8.3 heads-regression-vistrk + detection ROI 4.8 4.9 5.8 7.0 heads-vistrk-regression + detection ROI 4.0 4.0 4.3 5.9 5 Conclusion We have proposed the algorithm of counting people in a video, which is an extension of the algorithm described in [1]. Usage of the head detector and body bounding box regression allowed to increase a counting accuracy. Detection area limiting saves com- putational resources. Comparing with the baseline the proposed method is able to work on a lower detection frequency showing lower counting error. For the future work we plan to use automatic camera calibration algorithms (see fig. 5) [20,21]. This will allow to perform person tracking on the ground map and further improve accuracy of people counting. A Distributed Tracking Alg. for Counting People in Video by Head Det... 11 Fig. 5. Visualization of the ground plane restored from the calibration, estimated from the human poses from the video. References 1. Kuplyakov, D.A., Shalnov, E.V., Konushin, V.S., Konushin, A.S.: A Distributed Tracking Algorithm for Counting People in Video. Programming and Computer Software 45(4), 163- 170 (2019) 2. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple Online and Realtime Tracking. CoRR abs/1602.00763 (2016) 3. Wojke, N., Bewley, A., Paulus, D.: Simple Online and Realtime Tracking with a Deep Asso- ciation Metric. ArXiv e-prints (2017) 4. Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., Yan, J.: Poi: Multiple object tracking with high perfor- mance detection and appearance feature. In: European Conference on Computer Vision, pp. 36-42 (2016) 5. Benfold, B., Reid, I.: Stable Multi-Target Tracking in Real-Time Surveillance Video. In: CVPR, pp. 3457-3464 (2011) 6. Shalnov, E., Konushin, V., Konushin, A.: An improvement on an MCMC-based video track- ing algorithm. Pattern Recognition and Image Analysis 25, 532-540 (2015) 7. Shu, G., Dehghan, A., Oreifej, O., Hand, E., Shah, M.: Part-based multiple-person tracking with partial occlusion handling. In: CVPR, pp. 1815-1821. IEEE Computer Society (2012) 8. Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose Flow: Efficient Online Pose Tracking. In: (2018) 9. Henschel, R., Leal-Taixe, L., Cremers, D., Rosenhahn, B.: Improvements to Frank-Wolfe optimization for multi-detector multi-object tracking. CoRR abs/1705.08314 (2017) 10. Cobos, R., Hernandez, J., Abad, A.: A fast multi-object tracking system using an object detector ensemble. (2019) 11. Bochinski, E., Senst, T., Sikora, T.: Extending IOU Based Multi-Object Tracking by Visual Information. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-6 (2018) 12 D. Kuplyakov et al. 12. Bochinski, E., Eiselein, V., Sikora, T.: High-Speed tracking-by-detection without using im- age information. In: Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on, pp. 1-6 (2017) 13. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics (NRL) 2(1-2), 83-97 (1955) 14. Kuplyakov, D., Shalnov, E., Konushin, A.: Markov chain Monte Carlo based video tracking algorithm. Programming and Computer Software 43(4), 224-229 (2017) 15. Bergmann, P., Meinhardt, T., Leal-Taixe, L.: Tracking Without Bells and Whistles. In: The IEEE International Conference on Computer Vision (ICCV), (2019) 16. Vojir, T., Noskova, J., Matas, J.: Robust scale-adaptive mean-shift for tracking. In: Scandi- navian Conference on Image Analysis, pp. 652-663 (2013) 17. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: Single Shot MultiBox Detector. Lecture Notes in Computer Science (2016) 18. Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: CrowdHuman: A Benchmark for Detecting Human in a Crowd. arXiv preprint arXiv:1805.00123 (2018) 19. Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixe, L.: MOT20: A benchmark for multi object tracking in crowded scenes. arXiv:2003.09003[cs] (2020) 20. Shalnov, E.V., Konushin, A.S., Konushin, V.S.: Convolutional neural network for camera pose estimation from object detections. ISPRS - International Archives of the Photogram- metry, Remote Sensing and Spatial Information Sciences 42(2-W4), 1-6 (2017) 21. Shalnov, E.V., Gringauz, A.D., Konushin, A.S.: Estimation of the people position in the world coordinate system for video surveillance. Programming and Computer Software 42(6), 361-366 (2016)