A Distributed Tracking Algorithm for Counting People
              in Video by Head Detection?

Denis Kuplyakov1,2[0000−0002−2957−3297] , Yaroslav Geraskin1[0000−0002−2764−272X] ,
                  Timur Mamedov1,2[0000−0001−6554−7988] , and
                     Anton Konushin1,3[0000−0002−6152−0021]
1
    Faculty of Computational Mathematics and Cybernetics, Moscow State University, Moscow,
                                           Russia
                       2
                         Video Analysis Technologies, Moscow, Russia
                    3
                       NRU Higher School of Economics, Moscow, Russia
                      {denis.kuplyakov,yaroslav.geraskin,
             timur.mamedov,anton.konushin}@graphics.cs.msu.ru


        Abstract. We consider the problem of people counting in video surveillance.
        This is one of the most popular tasks in video analysis, because this data can be
        used for predictive analytics and improvement of customer services, traffic con-
        trol, etc. Our method is based on the object tracking in video with low framerate.
        We use the algorithm from [1] as a baseline and propose several modifications
        that improve the quality of people counting. One of the main modifications is
        to use a head detector instead of a body detector in the tracking pipeline. Head
        tracking is proved to be more robust and accurate as the heads are less suscep-
        tible to occlusions. To find the intersection of a person with a signal line, we
        either raise the signal lines to the level of the heads or perform a regression of
        bodies based on the available head detections. Our experimental evaluation has
        demonstrated that the modified algorithm surpasses original in both accuracy and
        computational efficiency, showing a lower counting error on a lower detection
        frequency.

        Keywords: Computer Vision, Video Analytics, Tracking, People Counting.


1     Introduction

Counting people passing through certain zones of a public infrastructure, such as pedes-
trian crossings, sidewalks, squares, etc., is a practically important task. There are many
solutions to this problem, one of them is object tracking. The task of object tracking
is to create tracks for each person. A track unambiguously corresponds to a person. It
marks this particular person locations in all frames in which he or she is visible. In order
to count people a signal line is usually specified in the frame (see fig. 1). If the track
crosses the signal line, we can say with confidence that the person also crossed it.
    Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons
    License Attribution 4.0 International (CC BY 4.0).
?
    Publication is supported by RFBR grant 18-08-01484
2 D. Kuplyakov et al.


             Fig. 1. An example of signal line for estimating the load of a crossing.


    We propose a fully automatic people counting algorithm. The algorithm takes as in-
put a video stream {Fi }i=1 of frames captured by stationary camera and signal line that
specified by an ordered pair of points (La , Lb ) on the frame. The output of the algo-
rithm is a set of events {Ei }i=1 that represented by triples of values Ei = (ki , ri , di ).
The first value indicates the number of the frame where the signal line was crossed, the
second value specifies the coordinates of the bounding box and the last value indicates
the direction of the signal line intersection.
    Our solution is an extension of the algorithm described in [1]. In this paper we
propose the following improvements:
    – use of the head detector instead of full-body detector;
    – detection matching procedure is modified in order to work with small head bound-
      ing boxes;
    – the use of body regression on the heads at different stages of tracking;
    – an algorithm to automatically determine a region of interest (ROI) by signal line
      position is proposed to speed up detection.


2     Related Work
The modern methods of tracking are based on tracking-by-detection. There are a lot
of ways to detect the desired object on the frame. Three most popular methods are:
detection of body [2,3,4], detection of head [5,6] and using of key points. The first
solution is popular because there are many datasets and ready-made solutions.
    The head-tracking approach is well suited to track people in a crowd: usually video
surveillance cameras are installed above the height of the person, where heads in a
crowd can be seen better than full bodies. Heads are more resistant to overlapping than
                   A Distributed Tracking Alg. for Counting People in Video by Head Det... 3

bodies. The number of ready-made solutions and data for training is less than for bodies.
There are methods that use body parts detectors for tracking [7], key points of human
pose [8], combined solutions (body and head) [9], and detector ensembles [10].
     After detection we need to bind all the detections to tracks. As in the detection task,
there are a lot of methods. The first group of algorithms for creating tracks is greedy
algorithms. In most online algorithms tracks are constructed frame by frame, each frame
creates a matrix of the cost of matching new detections and existing tracks, then the
problem of matching is solved by a greedy algorithm (searching for the maximum in
each row/column) [11,12] or by a Hungarian algorithm [13] [2,3,4]. Sometimes MCMC
is used to bind all the detections to tracks [5,14].
     Recently, neural networks have been used more often in tracking. For example,
in paper [15] authors suggest using detector to obtain new detection by regression of
detection on previous frame. However, this method has disadvantages. For example,
it is able to work well only at a high frame rate and it also increases the load on the
detector.


3     The Proposed Algorithm
3.1   Baseline
We use the solution from [1] as the baseline, which is an extension of SORT tracking
algorithm [2]. We choose this algorithm as it is capable to work by detection on a sparse
set of frames, which is performed on remote servers. This significantly reduces the
amount of computational resources required for large-scale video surveillance systems
(see fig. 2). Our proposed method inherits distributed nature of baseline.
    Baseline works in online mode and use the Hungarian algorithm [13] to match de-
tections. To improve results at a low detection rate ASMS visual tracking [16] is used
to evaluate a speed of people between frames. The same approach with visual tracking
speed estimation is used in [11].
    The baseline algorithm consists of the following steps: (1) detection; (2) evaluation
of the speed of detections using visual tracking; (3) prediction of the position of tracks
by the Kalman filter; (4) matching; (5) extrapolation of the tracks; (6) detection of signal
line crossing events.
    Proposed improvements to some of the steps above are described below.

3.2   Detection
Since heads are seen better in the video and are less prone to occlusions we decided
to use the head detector based on SSD [17] approach instead of the body detector. An-
other advantage of the head detector is that neural network body detectors can combine
nearby people into one bounding box, which is less frequent for heads. The detector
were trained on CrowdHuman [18] public dataset and on the dataset collected by Video
Analysis Technologies. Experimental evaluation showed AUC of 0.66 on the test part
of CrowdHuman. Usage of the head detector leads to a sub-task of restoring a bounding
box of an entire body to find intersections of the signal line, which is located on the
ground. We describe it in section 3.4.
4 D. Kuplyakov et al.

                                             8 Mb/s
                                             15 FPS

                                             8 Mb/s
                                             15 FPS           105 pcs.
                                       ...
                                             8 Mb/s
                                             15 FPS


                   Datacenter                                     8 Mb/s
                                                                  15 FPS
                            ...
                                         240 Gb/s                 8 Mb/s
                                         2 FPS                    15 FPS
                        ≈ 4500 pcs.
                                         ×105              ...
                        ≈ 13400 pcs.                              8 Mb/s
                                                                  15 FPS

Fig. 2. Resources (CPUs, GPUs) required for large-scale video surveillance system (105 cam-
eras) for (top) traditional tracking algorithms; (bottom) baseline algorithm with 2 Hz detection
frequency.


3.3   Matching

During the experiments we realized that IOU metric used in the basic algorithm is not
suitable for head matching. As head bounding boxes are small enough due to the errors
in position and speed estimation bounding box pairs required to match are closely posi-
tioned, but have no intersection. IOU equal to zero acquired in the case which leads to
track partitioning. Therefore we suggest to increase by s times the size of the bounding
boxes (while saving the center of the bounding box) before matching them using IOU
metric. This approach solves the problem described above.


3.4   Signal Line Crossing

We got the problem of missing signal line intersections as head tracks are located on a
human height and the signal line is located on the ground. Problem reveals itself most
on the scenes where the signal line is placed orthogonal to the camera. Therefore we
offer two solutions: rising of signal lines and body bounding box regression by head.


Rising of Signal Lines We propose to raise the signal line to the level of the human
head (see fig. 3). This solution reduces the computational cost and time of the algorithm
as it doesn’t require any extra steps to project head track to the ground plane.
     The signal line can be raised either manually or automatically. Automatic approach
we propose is using the following anatomical fact: the height of the head is 81 of the
                    A Distributed Tracking Alg. for Counting People in Video by Head Det... 5

height of the entire human body. We can use a body detector to calculate average height
of human body for the scene. And raise the signal line to a height equal to 87 of the
average height of a person on the scene.
    This approach has a drawback: people’s growth is different, which means that peo-
ple’s heads are on different planes, while people’s legs are always on the same plane.
Therefore, the raised signal line doesn’t look so clearly since it’s not clear where the
plane is located.


    Fig. 3. Raised signal line (left picture) and signal line on a ground plane (right picture).


Bounding Box Regression by Head To solve the drawbacks that arose in the previous
paragraph we propose to use regression of a body by a head, which is trained for the
specific scene. Our idea consists of two stages: combining the detections of heads and
bodies for the scene and training linear regression on the received data.

Combining head and body detections First of all we launch head and body detectors.
To combine heads and bodies a matrix of head and body correspondence is built on
each frame.

            CM k, i, j = costmatch (Bk, i , Hk, j ), i ∈ [1, nbk ], j ∈ [1, nhk ]                  (1)

    In eq. (1) nbk , nhk – the number of body and head detections on the k-th frame,
Bk, i , Hk, j – the bounding box of the i-th body and j-th head on the k-th frame. We
calculate matching cost as:
6 D. Kuplyakov et al.


                             (
                              iohk, i, j , if iohk, i, j ≥ I1 ∧ τ1 ≤ dist(Bk, i , Hk, j ) ≤ τ2
costmatch (Bk, i , Hk, j ) =
                              0, else
                                                                                           (2)

                                    area (intersection(Bk, i , Hk, j ))
                     iohk, i, j =                                                         (3)
                                              area(Hk, j )

                                            centery (Hk, j ) − top(Bk, i )
                   dist(Bk, i , Hk, j ) =                                                 (4)
                                                  height(Hk, j )
    In eq. (2) I1 is threshold for iohk, i, j (we are using I1 = 0.5), τ1 , τ2 – minimum
and maximum normalized distances between the center of the head and the upper point
of the body (we are using τ1 = −1 and τ2 = 1). So at least half of head bounding box
area should be inside body bounding box area and vertically head center shouldn’t be
far from body top by at least one head height.
    Next the assignment problem is solved using the Hungarian algorithm to maxi-
mize cost. Combined head and body detections with non-zero match cost form training
dataset for a regression (see fig. 4).

Linear regression After the previous step we have the data to train a linear regression.
The training dataset consists of head and body bounding box pairs: (Bi , Hi ). The re-
gressor predicts the following values: (Bh , Bw , shif tx ), where Bh , Bw – height and
width of the body bounding box, shif tx – a shift from the center of the head to the
center of the predicted body normalized by head width:

                                      centerx (B) − centerx (H)
                         shif tx =                                                        (5)
                                                 Hw
    Prediction is done by linear regression with quadratic members:


  ch = k0 + k1 Hx + k2 Hy + k3 Hw + k4 Hh + k5 Hx2 + k6 Hx Hy +
  B
                                                            . . . + k13 Hh Hw + k14 Hh2   (6)

    The same way B       \
                   cw , shif  tx are predicted, but with separate sets of coefficients. After
learning coefficients of linear regression on training dataset, we use predicted value to
restore body bounding box:

                         cx = centerx (H) − Bw + shif tx Hw
                                            c
                         B                                                                (7)
                                             2

                                          B
                                          cy = Hy                                         (8)
   This approach helps us keep the signal line on the ground plane, which solves draw-
backs of the previous approach.
                    A Distributed Tracking Alg. for Counting People in Video by Head Det... 7

    Now we have a choice to do speed estimation by visual tracking with head or re-
gressed body detections. As heads are less prone to occlusions visual tracking of them
may be more reliable. But body bounding boxes are several times larger and have more
visual information to track. We check both choices on experimental evaluation.


Fig. 4. Combined detections (left picture): head without pair (pink, blue), body without pair (dark
blue), combined head and body (green). Results of body regression (right picture).


3.5   Limiting the Detection Area
The entire frame is not required to find events of crossing the signal line. We can limit
the detection area to the area around the signal line – region of interest (ROI). This
solution speeds up the SSD detector as it is slower for images with higher resolution.
    If the signal line is horizontal then the ROI is located on the top and bottom of the
line. Otherwise the ROI is located on the left and right side of the line, as well as on the
top of the line.
    Let be α ≤ π2 – the minimum angle between signal line and horizon, wmean , hmean
– the average width and height of body detections and (xa , ya ), (xb , yb ) – the coordi-
nates of the beginning and the end of the signal line.
                                                                   
                                                              sin α
                      x1 = min(xa , xb ) − sw wmean 1 +                                  (9)
                                                                2
                                                                   
                                                              sin α
                      x2 = max(xa , xb ) + sw wmean 1 +                                 (10)
                                                                2
                            y1 = min(ya , yb ) − sh hmean cos α                               (11)
                            y2 = max(ya , yb ) + sh hmean cos α                               (12)
8 D. Kuplyakov et al.

    Then A = (x1 , y1 , x2 , y2 ) is the region of interest. sw , sh in equations 9-12 are
parameters. After visual testing we selected the following values for these variables:
sw = 2, sh = 1.

4     Experimental Evaluation
4.1   Datasets
For an experimental evaluation of our algorithm we need datasets filmed by static cam-
era with body tracks markup. Video sequences should be long enough to evaluate people
counting quality. If dataset provides head tracks markup it allows us to check effect of
body regression by comparing with head tracking and raised signal lines (section 3.4).
Most of the public datasets including popular MOTChallenge dataset [19] have short
videos, only body tracks markup or filmed by moving camera. So we used 19 videos
from the collection of the Video Analysis Technologies company and the Towncentre
dataset [5] to test our algorithm. For all videos signal lines were manually drawn at
ground level as well as at head level. The table 1 provides detailed information about
each test video.

                        Table 1. Videos that we used to test the algorithm.

                 Video name       Duration Format       Number of events
                 SignalineBase/02 00:17:40 1280x720@25 254
                 SignalineBase/03 00:14:40 1280x720@25 134
                 SignalineBase/04 00:05:20 1280x720@25 69
                 SignalineBase/05 00:11:40 1280x720@25 172
                 SignalineBase/06 00:12:00 1280x720@25 219
                 SignalineBase/07 00:08:29 800x450@25 124
                 SignalineBase/10 00:38:59 704x576@25 120
                 SignalineBase/11 00:22:48 704x576@25 91
                 SignalineBase/13 00:05:01 720x576@12 9
                 SignalineBase/14 00:05:23 720x576@12 17
                 SignalineBase/17 00:16:42 1280x1280@20 48
                 SignalineBase/18 00:16:26 1280x1000@25 180
                 SignalineBase/20 00:15:00 1024x768@20 144
                 SignalineBase/21 00:19:40 800x600@25 149
                 SignalineBase/26 00:10:29 1280x720@15 90
                 SignalineBase/28 00:11:40 1280x960@25 72
                 SignalineBase/29 00:40:00 704x576@25 159
                 SignalineBase/31 00:29:10 1920x1080@25 381
                 Towncentre       00:03:00 1920x1080@25 162


4.2   Metrics
As a quality metric we use the average error of counting the number of intersections
(events) [1]. The resulting events can include both true and false ones. The false events
                    A Distributed Tracking Alg. for Counting People in Video by Head Det... 9

have no correspondences in the reference labeling. We say that an event Ei in the input
set of data matches the event E
                              ci in the reference labeling if they correspond to the same
person crossing the signal line at the same time. We match all events as described in the
paper [1].
    After the events have been matched we divide videos to segments with 10 reference
events and calculate the following characteristics on them:
 – GT seg is the number of reference events on the segment;
 – F P seg is the number of unmatched events from the algorithm on the segment;
 – F N seg is the number of unmatched events from the reference events on the seg-
   ment;
            F P seg −F N seg
 – Eseg =        GT seg      is an error on the segment.
                                           PN       Eseg
Then final error is calculated as E =         i=1    N , where N is the number of the seg-
ments.

4.3   Experimental Results
Rising of Signal Lines At first we have tested baseline with modifications proposed in
sections 3.2, 3.3 and manually raised signal lines as described in section 3.4. The algo-
rithm is marked as heads-no-regression in the results table (see table 2) and parameter
s shows how many times head bounding boxes have been increased.


Table 2. Results of experiments with rising the signal lines. The table shows counting error E in
percents for the algorithms depending on the detection frequency.

                   Algorithm / Detection frequency 5 Hz 3 Hz 2 Hz 4/3 Hz
                   SORT [2]                        8.3 10.8 17.3 -
                   baseline [1]                    7.5 7.5 7.5 8.1
                   heads-no-regression s = 1       4.6 5.5 8.2 9.3
                   heads-no-regression s = 2       3.9 4.5 5.1 6.5


    Experiments with rising the signal line clearly shows the advantage of the head
detector over the body detector. The head detector gives a significant increase in an
accuracy reducing the error by ≈ 2 times. The algorithm without matching modification
(s = 1) has poor results at low FPS due to the previously voiced problems which shows
importance of the modification proposed in the section 3.3.

Body Bounding Box Regression by Head Next we have tested body bounding box
regression by head. As mentioned in section 3.4 there are two alternative ways to apply
it. There are two algorithms we have tested (see table 3):
 – heads-regression-vistrk — baseline with modifications proposed in sections 3.2,
   3.3, visual tracking of regressed bodies to estimate speed and signal lines on the
   ground plane (section 3.4);
10 D. Kuplyakov et al.

    – heads-vistrk-regression — baseline with modifications proposed in sections 3.2,
      3.3, visual tracking of heads to estimate speed and signal lines on the ground plane
      (section 3.4).


Table 3. Results of experiments with bounding box regression by head. The table shows counting
error E in percents for the algorithms depending on the detection frequency.

                    Algorithm / Detection frequency 5 Hz 3 Hz 2 Hz 4/3 Hz
                    baseline [1]                    7.5 7.5 7.5 8.1
                    heads-regression-vistrk         4.5 4.9 5.7 6.8
                    heads-vistrk-regression         4.1 4.1 4.3 5.9
                    heads-no-regression             3.9 4.5 5.1 6.5


    As you can see the second configuration gives better results. Visualization showed
us that visual tracking of heads is more reliable as they are less prone to occlusions. It
worth noting that heads-vistrk-regression performed better than heads-no-regression
(s = 2) on low detection frequency.

Limiting the Detection Area Next we tested limiting of the detection area (sec-
tion 3.5). It allowed us to increase the speed of the algorithm almost without affecting
the counting error (see table 4). Detection area reduced by almost 60% on some of the
videos.

Table 4. Results of experiments with limiting the detection area. The table shows counting error
E in percents for the algorithms depending on the detection frequency.

                Algorithm / Detection frequency         5 Hz 3 Hz 2 Hz 4/3 Hz
                baseline + detection ROI                7.6 7.6 7.6 8.3
                heads-regression-vistrk + detection ROI 4.8 4.9 5.8 7.0
                heads-vistrk-regression + detection ROI 4.0 4.0 4.3 5.9


5     Conclusion
We have proposed the algorithm of counting people in a video, which is an extension
of the algorithm described in [1]. Usage of the head detector and body bounding box
regression allowed to increase a counting accuracy. Detection area limiting saves com-
putational resources. Comparing with the baseline the proposed method is able to work
on a lower detection frequency showing lower counting error.
     For the future work we plan to use automatic camera calibration algorithms (see
fig. 5) [20,21]. This will allow to perform person tracking on the ground map and further
improve accuracy of people counting.
                     A Distributed Tracking Alg. for Counting People in Video by Head Det... 11


Fig. 5. Visualization of the ground plane restored from the calibration, estimated from the human
poses from the video.


References
 1. Kuplyakov, D.A., Shalnov, E.V., Konushin, V.S., Konushin, A.S.: A Distributed Tracking
    Algorithm for Counting People in Video. Programming and Computer Software 45(4), 163-
    170 (2019)
 2. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple Online and Realtime Tracking.
    CoRR abs/1602.00763 (2016)
 3. Wojke, N., Bewley, A., Paulus, D.: Simple Online and Realtime Tracking with a Deep Asso-
    ciation Metric. ArXiv e-prints (2017)
 4. Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., Yan, J.: Poi: Multiple object tracking with high perfor-
    mance detection and appearance feature. In: European Conference on Computer Vision, pp.
    36-42 (2016)
 5. Benfold, B., Reid, I.: Stable Multi-Target Tracking in Real-Time Surveillance Video. In:
    CVPR, pp. 3457-3464 (2011)
 6. Shalnov, E., Konushin, V., Konushin, A.: An improvement on an MCMC-based video track-
    ing algorithm. Pattern Recognition and Image Analysis 25, 532-540 (2015)
 7. Shu, G., Dehghan, A., Oreifej, O., Hand, E., Shah, M.: Part-based multiple-person tracking
    with partial occlusion handling. In: CVPR, pp. 1815-1821. IEEE Computer Society (2012)
 8. Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose Flow: Efficient Online Pose Tracking. In:
    (2018)
 9. Henschel, R., Leal-Taixe, L., Cremers, D., Rosenhahn, B.: Improvements to Frank-Wolfe
    optimization for multi-detector multi-object tracking. CoRR abs/1705.08314 (2017)
10. Cobos, R., Hernandez, J., Abad, A.: A fast multi-object tracking system using an object
    detector ensemble. (2019)
11. Bochinski, E., Senst, T., Sikora, T.: Extending IOU Based Multi-Object Tracking by Visual
    Information. In: 2018 15th IEEE International Conference on Advanced Video and Signal
    Based Surveillance (AVSS), pp. 1-6 (2018)
12 D. Kuplyakov et al.

12. Bochinski, E., Eiselein, V., Sikora, T.: High-Speed tracking-by-detection without using im-
    age information. In: Advanced Video and Signal Based Surveillance (AVSS), 2017 14th
    IEEE International Conference on, pp. 1-6 (2017)
13. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics
    (NRL) 2(1-2), 83-97 (1955)
14. Kuplyakov, D., Shalnov, E., Konushin, A.: Markov chain Monte Carlo based video tracking
    algorithm. Programming and Computer Software 43(4), 224-229 (2017)
15. Bergmann, P., Meinhardt, T., Leal-Taixe, L.: Tracking Without Bells and Whistles. In: The
    IEEE International Conference on Computer Vision (ICCV), (2019)
16. Vojir, T., Noskova, J., Matas, J.: Robust scale-adaptive mean-shift for tracking. In: Scandi-
    navian Conference on Image Analysis, pp. 652-663 (2013)
17. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: Single
    Shot MultiBox Detector. Lecture Notes in Computer Science (2016)
18. Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: CrowdHuman: A Benchmark
    for Detecting Human in a Crowd. arXiv preprint arXiv:1805.00123 (2018)
19. Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler,
    K., Leal-Taixe, L.: MOT20: A benchmark for multi object tracking in crowded scenes.
    arXiv:2003.09003[cs] (2020)
20. Shalnov, E.V., Konushin, A.S., Konushin, V.S.: Convolutional neural network for camera
    pose estimation from object detections. ISPRS - International Archives of the Photogram-
    metry, Remote Sensing and Spatial Information Sciences 42(2-W4), 1-6 (2017)
21. Shalnov, E.V., Gringauz, A.D., Konushin, A.S.: Estimation of the people position in the
    world coordinate system for video surveillance. Programming and Computer Software 42(6),
    361-366 (2016)