<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>GraphiCon</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timur Mamedov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Kuplyakov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anton Konushin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lomonosov Moscow State University</institution>
          ,
          <addr-line>1, Leninskie Gory, Moscow, 119991, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NRU Higher School of Economics</institution>
          ,
          <addr-line>11, Pokrovsky Bulvar, 109028, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Video Analysis Technologies</institution>
          ,
          <addr-line>7, Sculptora Mukhina, 119634, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>31</volume>
      <fpage>27</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>In this paper, we consider the problem of people counting in video surveillance. This is an important task in video analysis, because the resulting data can be used for predictive analytics, improvement of customer services, traffic control, etc. The proposed methods are based on object tracking and can work on sparse frames, which makes them faster and keeps the required computing resources low. As a baseline we use the algorithm from [1], which is based on object tracking by head detections. Head tracking in the baseline proved to be more robust and accurate, because heads are less susceptible to occlusions. However, this approach has two disadvantages: people differ in height, so their heads lie in different planes, which makes the raised signal line ambiguous and may decrease the accuracy of people counting. In the baseline, these problems were solved with a head-to-body linear regression, which had to be retrained for each scene and therefore complicated the practical use of the algorithm. In this paper, we propose a new neural network head-to-body regressor that solves both problems at once. We also use a new visual tracking algorithm that speeds up our solution. We introduce two methods: a distributed modified baseline algorithm with high people counting accuracy and a solution that can run on a single processor core. Our experimental evaluation showed that the proposed modifications are consistent.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Vision</kwd>
        <kwd>Video Analytics</kwd>
        <kwd>Tracking</kwd>
        <kwd>People Counting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Counting people passing through certain zones of public infrastructure, such as pedestrian
crossings, sidewalks and squares, is a practically important task. Since solving this
problem means processing large data streams, the counting has to be automated.
There are many approaches to this problem; one of them is object
tracking. The task of object tracking is to build a track for each person. A track is uniquely
associated with one person and contains that person's location on every frame where the person is visible. To
count people, a signal line is usually specified in the frame (see fig. 1). If a track crosses the
signal line, we can say with confidence that the person also crossed it.</p>
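      <p>The crossing test described above can be sketched as follows. This is only a minimal illustration and not the implementation used in this work: it assumes that each track is reduced to one reference point per frame, and the names crossing_direction and count_crossings are hypothetical. A step that ends exactly on the line is not counted, which is a simplification.</p>
      <preformat><![CDATA[
```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def crossing_direction(line_a: Point, line_b: Point,
                       prev: Point, curr: Point) -> Optional[int]:
    """Return +1 or -1 if the step prev -> curr crosses the segment
    line_a -> line_b, otherwise None. The sign depends on the order of
    the two line points, so an ordered pair defines the direction."""
    side_prev = _cross(line_a, line_b, prev)
    side_curr = _cross(line_a, line_b, curr)
    # The two track points must lie strictly on opposite sides of the line.
    if side_prev * side_curr >= 0:
        return None
    # The line end points must not lie on the same side of the track step,
    # otherwise the step passes beside the segment.
    if _cross(prev, curr, line_a) * _cross(prev, curr, line_b) > 0:
        return None
    return 1 if side_curr > 0 else -1


def count_crossings(line_a: Point, line_b: Point,
                    track: List[Point]) -> List[Tuple[int, int]]:
    """Return (frame index, direction) for every crossing in a track."""
    events = []
    for i in range(1, len(track)):
        direction = crossing_direction(line_a, line_b, track[i - 1], track[i])
        if direction is not None:
            events.append((i, direction))
    return events
```
]]></preformat>
      <p>For example, with the line from (0, 0) to (10, 0), a track that moves from (5, -1) to (5, 1) produces one event, while the same movement at x = 15 passes beside the segment and produces none.</p>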
      <p>
        Such solutions, however, have one big drawback: integrating modern tracking algorithms
into real people counting systems is economically unprofitable in practice. They
require a large number of GPUs, which are very expensive, especially since the growth in
popularity of cryptocurrency mining. That is why, in our previous works [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], we introduced
distributed algorithms that allow us to effectively count people in real-world scenarios by sharing
GPU resources (see fig. 3), taking into account the mass processing of video streams.
      </p>
      <p>
        However, our latest work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which achieves the best people counting quality
among our algorithms, has two drawbacks that prevent it from being used in practice:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>
            the algorithm is based on object tracking by head detections, but people differ in height, so their heads lie in different planes. As a result, the raised signal line is ambiguous (see fig. 2) and the accuracy of people counting may decrease. Moreover, the automatic method of raising the signal line proposed in the previous work [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] requires calibration for the scene, since it uses the average height of a person in the particular scene to calculate the height of the signal line;
          </p>
        </list-item>
        <list-item>
          <p>
            in [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] these problems were solved using a head-to-body linear regression, but this regression needs to be retrained for each scene, which also complicates the use of the algorithm in real scenarios.
          </p>
        </list-item>
      </list>
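      <p>In this paper, we remove the need for the per-scene setup described above and also
propose an additional method that can count people on a single processor core without
GPUs. We introduce fully automatic algorithms for counting people in a video sequence shot by
a stationary camera. The algorithms take as input a video stream <inline-formula><tex-math>\{F_i\}_{i=1}^{N}</tex-math></inline-formula> of frames captured by a
stationary camera and a signal line specified by an ordered pair of points <inline-formula><tex-math>(A, B)</tex-math></inline-formula> on the
frame. The output of the algorithms is a set of events <inline-formula><tex-math>\{E_j\}_{j=1}^{M}</tex-math></inline-formula>, each represented by a triple of values
<inline-formula><tex-math>E_j = (t_j, b_j, d_j)</tex-math></inline-formula>, where <inline-formula><tex-math>t_j</tex-math></inline-formula> is the index of the frame in which the signal line was crossed, <inline-formula><tex-math>b_j</tex-math></inline-formula> specifies
the coordinates of the bounding box, and <inline-formula><tex-math>d_j</tex-math></inline-formula> indicates the direction of the signal
line intersection.</p>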
      <p>
        As mentioned above, our solutions extend the algorithm from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this article,
we propose the following changes, designed to speed up the algorithms and make them
usable in practice:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>a novel neural network regression of the body from the head, which allows us to use our algorithms on any scene without retraining;</p>
        </list-item>
        <list-item>
          <p>
            the Staple visual tracking algorithm [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], which significantly speeds up our solution while maintaining the quality of counting;
          </p>
        </list-item>
        <list-item>
          <p>a simpler head detector, which allows us to run our algorithm on a single processor core and maintain a people counting quality acceptable for practical purposes.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Modern methods of tracking people are based on tracking by detection. They can be
divided into two groups depending on the type of detection used: body detection [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] and
head detection [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ]. The first group is the most popular because many datasets
and ready-made solutions are available for it.
      </p>
      <p>
        The head-tracking approach is well suited to tracking people in a crowd: video
surveillance cameras are usually installed above head height, so heads in a crowd are seen
better than full bodies, and heads are less prone to overlapping than bodies. On the other hand, fewer
ready-made solutions and less training data are available for heads than for bodies. There are also methods that use
body parts detectors for tracking [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], key points of human pose [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], combined solutions (body
and head) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and detector ensembles [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        After detection, all detections have to be bound to tracks. As with detection, there are
many methods for this. The first group of algorithms for creating tracks is greedy algorithms. In most
online algorithms, tracks are constructed frame by frame: for each frame, a matrix of the cost
of matching new detections to existing tracks is built, and the matching problem is then solved by a
greedy algorithm (searching for the maximum in each row/column) [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] or by the Hungarian
algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. Sometimes MCMC is used to bind the detections to tracks [
        <xref ref-type="bibr" rid="ref15 ref7">7, 15</xref>
        ].
      </p>
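      <p>The difference between the two matching strategies can be illustrated with a small sketch. It is not taken from any of the cited trackers: the matrix holds hypothetical matching scores (higher is better), greedy_match is an illustrative name, and optimal_match is a brute-force stand-in for the Hungarian algorithm that is only practical for tiny square matrices.</p>
      <preformat><![CDATA[
```python
from itertools import permutations
from typing import List, Tuple


def greedy_match(scores: List[List[float]],
                 min_score: float = 0.0) -> List[Tuple[int, int]]:
    """Pair tracks (rows) with detections (columns), always taking the
    best remaining score first."""
    candidates = sorted(
        ((s, r, c)
         for r, row in enumerate(scores)
         for c, s in enumerate(row)
         if s > min_score),
        reverse=True,
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, r, c in candidates:
        if r in used_rows or c in used_cols:
            continue
        used_rows.add(r)
        used_cols.add(c)
        pairs.append((r, c))
    return sorted(pairs)


def optimal_match(scores: List[List[float]]) -> List[Tuple[int, int]]:
    """Brute-force assignment with the maximum total score. It returns the
    optimum that the Hungarian algorithm would find, at factorial cost."""
    n = len(scores)
    best = max(permutations(range(n)),
               key=lambda p: sum(scores[r][p[r]] for r in range(n)))
    return [(r, best[r]) for r in range(n)]
```
]]></preformat>
      <p>For the scores [[0.9, 0.8], [0.85, 0.1]], the greedy strategy first takes 0.9 and is then left with 0.1 (total 1.0), while the optimal assignment pairs 0.8 with 0.85 (total 1.65). This is why the greedy choice is cheaper but can produce worse matches.</p>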
      <p>
        Recently, neural networks have been used more often in tracking. For example, in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
the authors suggest using the detector to obtain a new detection by regressing the detection from the previous
frame. However, this method has disadvantages: it works well only at a
high frame rate, and it also increases the load on the detector.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>
        We use the solution from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as the baseline, which is an extension of the SORT tracking algorithm
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We chose this algorithm because it can work with detections on a sparse set of frames.
This significantly reduces the amount of computational resources required for large-scale video
surveillance systems (see fig. 3).
      </p>
      <p>
        The baseline works in online mode and uses the Hungarian algorithm [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to match detections.
To improve the results at a low detection rate, ASMS visual tracking [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is used to estimate the speed
of people between frames. The same approach to speed estimation with visual tracking is used in
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>The baseline algorithm consists of the following steps: (1) detection; (2) estimation of the
speed of detections using visual tracking; (3) prediction of the position of tracks by the Kalman
filter; (4) matching; (5) extrapolation of the tracks; (6) detection of signal line crossing events.</p>
      <p>In this paper, we propose improvements in the following steps in the baseline: detection,
evaluation of the speed of detections using visual tracking, detection of signal line crossing
events. Proposed improvements are described below.</p>
      <sec id="sec-3-1">
        <title>3.1. Detection</title>
        <p>
          Since heads are seen better in the video and are less prone to occlusions, we keep the idea,
proposed in the baseline, of using a head detector instead of a body detector for object tracking.
Another advantage of the head detector is that neural network body detectors can
merge nearby people into one bounding box, which happens less often with heads. Therefore, in
this paper we use a head detector based on SSD [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. We use two backbones for
the head detector: ResNet50 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for the distributed modified baseline algorithm with high people
counting accuracy and MobileNet0.5 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] for the solution that can run on a single processor core.
The detectors were trained on the public CrowdHuman [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] dataset and on a dataset collected by
Video Analysis Technologies.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Visual Tracking</title>
        <p>
          As mentioned above, ASMS visual tracking is used in the baseline. In this paper, we use the Staple
visual tracking algorithm [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which significantly speeds up our solution while
maintaining the quality of counting. Staple consists of two models working in parallel. The first
model is invariant to changes of lighting but sensitive to changes in the shape of the
object, while the second model is invariant to transformations of the object but sensitive to
the lighting. Used together, the two models compensate for each other's weaknesses.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Signal Line Crossing</title>
        <p>The baseline is based on object tracking by head detections. Head tracking in the baseline proved
to be more robust and accurate, because heads are less susceptible to occlusions. However, this approach
has two disadvantages: people differ in height, so their heads
lie in different planes, which makes the raised signal line ambiguous and
may decrease the accuracy of people counting. In the baseline, these problems were solved using a
head-to-body linear regression (because people's feet are always in the same plane), which
had to be retrained for each scene and therefore complicated the practical use of the algorithm.
In this paper, we propose a new neural network head-to-body regressor that solves
both problems at once. Our neural network regressor has to be trained only
once and can then be used for any scene.</p>
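        <p>The proposed head-to-body regressor consists of two steps:</p>
        <list list-type="order">
          <list-item>
            <p>using heuristics, we find the approximate position of the body;</p>
          </list-item>
          <list-item>
            <p>using neural network regression, we refine the position of the body.</p>
          </list-item>
        </list>
        <p>[Figure 4: architecture of the proposed neural network for the head-to-body regressor; block labels: MobileNetV1, Dropout, Dense, Dropout.]</p>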
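        <p>The First Step. Our heuristic uses the following fact from anatomy: the width of
a person's body is on average three times the width of the head, and the height of the
body is on average eight times the height of the head. These
proportions were slightly corrected experimentally:
          <disp-formula><label>(1)</label><tex-math>\mathrm{body}_{\mathrm{left}} = \mathrm{head}_{\mathrm{left}} - \mathrm{head}_{\mathrm{width}},</tex-math></disp-formula>
          <disp-formula><label>(2)</label><tex-math>\mathrm{body}_{\mathrm{top}} = \mathrm{head}_{\mathrm{top}},</tex-math></disp-formula>
          <disp-formula><label>(3)</label><tex-math>\mathrm{body}_{\mathrm{width}} = 3 \cdot \mathrm{head}_{\mathrm{width}},</tex-math></disp-formula>
          <disp-formula><label>(4)</label><tex-math>\mathrm{body}_{\mathrm{height}} = 8.5 \cdot \mathrm{head}_{\mathrm{height}},</tex-math></disp-formula>
where <inline-formula><tex-math>(\mathrm{head}_{\mathrm{left}}, \mathrm{head}_{\mathrm{top}}, \mathrm{head}_{\mathrm{width}}, \mathrm{head}_{\mathrm{height}})</tex-math></inline-formula> are the coordinates of the head bounding box and
<inline-formula><tex-math>(\mathrm{body}_{\mathrm{left}}, \mathrm{body}_{\mathrm{top}}, \mathrm{body}_{\mathrm{width}}, \mathrm{body}_{\mathrm{height}})</tex-math></inline-formula> are the coordinates of the body bounding box.</p>
        <p>The Second Step. At the second step we refine the position of the body bounding
box obtained in the previous step. To do this, we developed a neural network that regresses three
values: the left and right extreme points of the human body (<inline-formula><tex-math>\mathrm{body}_{\mathrm{regLeft}}</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{body}_{\mathrm{regRight}}</tex-math></inline-formula>,
respectively) and the height <inline-formula><tex-math>\mathrm{body}_{\mathrm{regHeight}}</tex-math></inline-formula> of the human body. Fig. 4 shows the architecture
of the proposed neural network for the head-to-body regressor. The network has
a simple architecture, because our people counting algorithm has to work in real time.
Despite the simple architecture, the proposed neural network has a good regression quality.
As a result, we obtain the final coordinates of the body bounding box: <inline-formula><tex-math>(\mathrm{body}_{\mathrm{regLeft}}, \mathrm{body}_{\mathrm{top}}, \mathrm{body}_{\mathrm{regWidth}}, \mathrm{body}_{\mathrm{regHeight}})</tex-math></inline-formula>,
where <inline-formula><tex-math>\mathrm{body}_{\mathrm{regWidth}} = \mathrm{body}_{\mathrm{regRight}} - \mathrm{body}_{\mathrm{regLeft}}</tex-math></inline-formula>.</p>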
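        <p>The two steps can be written down as a short sketch. The first function is a direct transcription of equations (1)-(4). The second only assembles the final box from the three regressed values; the regression network itself is not modelled here, and the names Box, approximate_body and refine_body are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import NamedTuple


class Box(NamedTuple):
    """Bounding box given by its top-left corner and its size."""
    left: float
    top: float
    width: float
    height: float


def approximate_body(head: Box) -> Box:
    """First step: heuristic body box from a head box, equations (1)-(4)."""
    return Box(
        left=head.left - head.width,   # (1) body is three heads wide, centred on the head
        top=head.top,                  # (2) body starts at the top of the head
        width=3.0 * head.width,        # (3)
        height=8.5 * head.height,      # (4)
    )


def refine_body(approx: Box, reg_left: float, reg_right: float,
                reg_height: float) -> Box:
    """Second step: final box from the three regressed values.
    reg_left, reg_right and reg_height stand for the network outputs."""
    return Box(
        left=reg_left,
        top=approx.top,
        width=reg_right - reg_left,
        height=reg_height,
    )
```
]]></preformat>
        <p>For a head box with left = 100, top = 50, width = 20 and height = 24, the heuristic gives a body box with left = 80, top = 50, width = 60 and height = 204.</p>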
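        <p>In our solution, we use head detections for object tracking, but we detect
signal line crossing events using body detections regressed by the proposed head-to-body
neural network regressor. That is, each head detection is associated with a
body detection. In addition, when we add a new body detection to a track, we adjust its width and height,
because occlusions may cause regression errors that lead to
false crossings of the signal line:
          <disp-formula><label>(5)</label><tex-math>\mathrm{bodyNew}_{\mathrm{width}} = \alpha \cdot \mathrm{bodyNew}_{\mathrm{width}} + (1 - \alpha) \cdot \mathrm{bodyPrev}_{\mathrm{width}},</tex-math></disp-formula>
          <disp-formula><label>(6)</label><tex-math>\mathrm{bodyNew}_{\mathrm{height}} = \alpha \cdot \mathrm{bodyNew}_{\mathrm{height}} + (1 - \alpha) \cdot \mathrm{bodyPrev}_{\mathrm{height}},</tex-math></disp-formula>
where <inline-formula><tex-math>\mathrm{bodyNew}_{\mathrm{width}}</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{bodyNew}_{\mathrm{height}}</tex-math></inline-formula> are the width and height of the body bounding box that we
are adding, <inline-formula><tex-math>\mathrm{bodyPrev}_{\mathrm{width}}</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{bodyPrev}_{\mathrm{height}}</tex-math></inline-formula> are the width and height of the last body bounding box
in the track, and <inline-formula><tex-math>\alpha</tex-math></inline-formula> is a specially selected constant.</p>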
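        <p>Equations (5) and (6) blend the size of the new body box with the size of the previous one. A minimal sketch is given below; the function name smooth_size is hypothetical, and the value of the constant is not given in the text, so 0.5 in the example is only an assumption for illustration.</p>
        <preformat><![CDATA[
```python
from typing import Tuple


def smooth_size(new_width: float, new_height: float,
                prev_width: float, prev_height: float,
                alpha: float) -> Tuple[float, float]:
    """Blend the size of a new body box with the last box of the track.
    alpha = 1 keeps the new size, alpha = 0 keeps the previous size."""
    width = alpha * new_width + (1.0 - alpha) * prev_width      # equation (5)
    height = alpha * new_height + (1.0 - alpha) * prev_height   # equation (6)
    return width, height


# Illustrative value only: the text does not state the constant.
print(smooth_size(60.0, 200.0, 40.0, 180.0, alpha=0.5))
```
]]></preformat>
        <p>With the assumed constant 0.5, a new box of 60 by 200 after a previous box of 40 by 180 is stored as 50 by 190, so a single outlier regression changes the track size only partly.</p>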
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>
          Head-to-Body Regression. To train our regression neural network (see section 3.3) we used
videos from the 2DMOT2015 [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and MOT17 [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] challenges and data collected by the Video Analysis
Technologies company. To obtain training data for the neural network model from the videos
(one frame was taken every 2 seconds), the following strategy was used:
        </p>
        <list list-type="order">
          <list-item>
            <p>using the head detector (see section 3.1), head detections were obtained for each frame;</p>
          </list-item>
          <list-item>
            <p>
              using the linear regression from [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ], body detections were obtained for each frame;
            </p>
          </list-item>
          <list-item>
            <p>using our heuristic (see section 3.3), the approximate positions of the bodies were found for each frame. This data was used to make the crops for training our neural network head-to-body regressor;</p>
          </list-item>
          <list-item>
            <p>using the body detections corresponding to the head detections, the left and right extreme points of the human body and the height of the human body were found for each frame. This data was used as the regression targets for training our neural network head-to-body regressor.</p>
          </list-item>
        </list>
        <p>
          Counting People Algorithms. For an experimental evaluation of our algorithms we need
datasets filmed by a static camera. The video sequences should be long enough to evaluate the people
counting quality. Most public datasets, including the popular MOTChallenge dataset [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
contain short videos or videos filmed by a moving camera. We therefore used 19 videos from the collection of the
Video Analysis Technologies company and the Towncentre dataset [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] to test our algorithms.
For all videos, signal lines were manually drawn at ground level. Table 1 provides detailed
information about each test video.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Metrics</title>
        <p>
          As a quality metric we use the average error of counting the number of intersections (events)
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The resulting events can include both true and false ones. False events have no
counterparts in the reference labeling. We say that an event <inline-formula><tex-math>e</tex-math></inline-formula> in the input set of data
matches an event <inline-formula><tex-math>\hat{e}</tex-math></inline-formula> in the reference labeling if they correspond to the same person crossing
the signal line at the same time. We match all events as described in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
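        <p>After the events have been matched, we divide each video into segments containing 10 reference events
and calculate the following characteristics on every segment:</p>
        <list list-type="bullet">
          <list-item>
            <p>the number of reference events on the segment;</p>
          </list-item>
          <list-item>
            <p>the number of unmatched events from the algorithm on the segment;</p>
          </list-item>
          <list-item>
            <p>the number of unmatched events from the reference events on the segment;</p>
          </list-item>
          <list-item>
            <p>the error on the segment, which is computed from the quantities above as defined in [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
          </list-item>
        </list>
        <p>The final error is then calculated from the sum of the segment errors over all segments.</p>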
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Results</title>
        <p>
          In this section, we present experimental results for the proposed algorithms. We consider three
types of experiments:
        </p>
        <list list-type="bullet">
          <list-item>
            <p>
              With Body Regression: in these experiments, we used the baseline from [
              <xref ref-type="bibr" rid="ref1">1</xref>
              ] and replaced the head-to-body linear regression with our neural network head-to-body regression (see section 3.3);
            </p>
          </list-item>
          <list-item>
            <p>With Body Regression and Staple: in these experiments, we used the modification from the "With Body Regression" experiments and replaced ASMS visual tracking with Staple (see section 3.2);</p>
          </list-item>
          <list-item>
            <p>Single Processor Core Algorithm: in these experiments, we used the modifications from the previous experiments together with the lightweight head detector with the MobileNet0.5 backbone (see section 3.1).</p>
          </list-item>
        </list>
        <p>Table 2 provides detailed information on the results of each of the above experiments.</p>
        <p>
          The experimental evaluation shows that our distributed modified baseline algorithm has
a high people counting quality, which significantly exceeds the results of our old algorithm
from [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and slightly lags behind the results of the algorithm from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. However, the proposed
algorithm can be used for practical purposes, since our modifications allow us to use it without
configuring it for a specific scene.
        </p>
        <p>
          In addition, table 2 shows that our solution, which can run on a single processor core, has
a quality acceptable for practical purposes. Moreover, our lightweight algorithm outperforms the
algorithm based on the classical SORT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] tracking in terms of the quality of people counting.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Speed Estimation</title>
        <p>
          Since our algorithm consists of several stages, its running time is the sum of the running
times of its components: the detector, visual tracking and head-to-body regression. Our
lightweight head detector (see section 3.1) takes ≈ 250 milliseconds on a Full HD frame. Staple
visual tracking [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] (see section 3.2) takes ≈ 5 milliseconds for a single detection, whereas ASMS visual
tracking [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], used in the baseline [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], takes ≈ 20 milliseconds for a single detection. The
proposed neural network head-to-body regressor takes ≈ 130 milliseconds for 10 crops.
        </p>
        <p>Given that the rest of the calculations are not so expensive, our algorithm, running on a
single processor core, is able to work with a frequency of 2 Hz for 10 people per frame (≈ 500
milliseconds). All speed measurements were made on a single core of the Intel Core i5-9400
processor.</p>
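        <p>The time budget above can be checked with a few lines of arithmetic. The sketch below uses the timings reported in this section and assumes, for illustration only, that the regression time scales linearly with the number of crops; the function name frame_time_ms is hypothetical.</p>
        <preformat><![CDATA[
```python
def frame_time_ms(people: int,
                  detector_ms: float = 250.0,
                  tracking_ms_per_detection: float = 5.0,
                  regressor_ms_per_10_crops: float = 130.0) -> float:
    """Estimated time for one processed frame on a single core.
    Linear scaling of the regressor with the crop count is an assumption."""
    tracking = people * tracking_ms_per_detection
    regression = regressor_ms_per_10_crops * people / 10.0
    return detector_ms + tracking + regression


staple = frame_time_ms(10)                                 # Staple: 5 ms per detection
asms = frame_time_ms(10, tracking_ms_per_detection=20.0)   # ASMS: 20 ms per detection
print(staple, asms)
```
]]></preformat>
        <p>For 10 people per frame the three components add up to 250 + 50 + 130 = 430 milliseconds with Staple and to 580 milliseconds with ASMS. The remaining calculations bring the Staple variant to the ≈ 500 milliseconds stated above, which corresponds to 2 Hz.</p>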
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>
        In this paper, we focused on the practical applicability of people counting algorithms and
introduced two algorithms: a distributed modified baseline algorithm with high people counting
accuracy and a solution that can run on a single processor core. We also proposed a
new neural network head-to-body regressor, which removes the drawbacks of the baseline
from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] related to configuring the algorithm for a specific scene. This modification allows us
to use our algorithms for practical purposes.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuplyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Geraskin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mamedov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Konushin</surname>
          </string-name>
          ,
          <article-title>A distributed tracking algorithm for counting people in video by head detection</article-title>
          ,
          <source>in: Proceedings of the 30th International Conference on Computer Graphics and Machine Vision</source>
          , volume
          <volume>2744</volume>
          of
          <series>CEUR Workshop Proceedings</series>
          ,
          <publisher-name>M. Jeusfeld c/o Redaktion Sun SITE, Informatik V, RWTH Aachen</publisher-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.51130/graphicon-2020-2-3-26</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Kuplyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Shalnov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Konushin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Konushin</surname>
          </string-name>
          ,
          <article-title>A Distributed Tracking Algorithm for Counting People in Video</article-title>
          ,
          <source>Programming and Computer Software</source>
          <volume>45</volume>
          (
          <year>2019</year>
          )
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1134/S0361768819040042</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bertinetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Valmadre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Golodetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Miksik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <article-title>Staple: Complementary learners for real-time tracking</article-title>
          ,
          <year>2016</year>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPR.2016.156</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bewley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Upcroft</surname>
          </string-name>
          ,
          <article-title>Simple online and realtime tracking</article-title>
          ,
          <source>2016 IEEE International Conference on Image Processing (ICIP)</source>
          (
          <year>2016</year>
          ). URL: http://dx.doi.org/10.1109/ICIP.2016.7533003. doi:
          <pub-id pub-id-type="doi">10.1109/icip.2016.7533003</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wojke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bewley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paulus</surname>
          </string-name>
          ,
          <article-title>Simple online and realtime tracking with a deep association metric</article-title>
          ,
          <year>2017</year>
          . arXiv:
          <pub-id pub-id-type="arxiv">1703.07402</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>POI: Multiple object tracking with high performance detection and appearance feature</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuplyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shalnov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Konushin</surname>
          </string-name>
          ,
          <article-title>Further improvement on an MCMC-based video tracking algorithm</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on Computer Graphics and Vision GraphiCon'2016</source>
          , GraphiCon,
          <year>2016</year>
          , pp.
          <fpage>440</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dehghan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Oreifej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Part-based multiple-person tracking with partial occlusion handling</article-title>
          ,
          <source>in: 2012 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1815</fpage>
          -
          <lpage>1821</lpage>
          . doi:10.1109/CVPR.2012.6247879.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Pose flow: Efficient online pose tracking</article-title>
          , CoRR abs/1802.00977 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1802.00977. arXiv:1802.00977.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Henschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Leal-Taixé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cremers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rosenhahn</surname>
          </string-name>
          ,
          <article-title>Improvements to Frank-Wolfe optimization for multi-detector multi-object tracking</article-title>
          ,
          <source>CoRR abs/1705.08314</source>
          (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1705.08314. arXiv:1705.08314.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cobos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Abad</surname>
          </string-name>
          ,
          <article-title>A fast multi-object tracking system using an object detector ensemble</article-title>
          , CoRR abs/1908.04349 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1908.04349. arXiv:1908.04349.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bochinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Senst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          ,
          <article-title>Extending IoU based multi-object tracking by visual information</article-title>
          ,
          <source>in: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/AVSS.2018.8639144.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bochinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Eiselein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          ,
          <article-title>High-speed tracking-by-detection without using image information</article-title>
          ,
          <source>in: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/AVSS.2017.8078516.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <article-title>The Hungarian method for the assignment problem</article-title>
          ,
          <source>Naval Research Logistics Quarterly</source>
          <volume>2</volume>
          (
          <year>1955</year>
          )
          <fpage>83</fpage>
          -
          <lpage>97</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109. doi:10.1002/nav.3800020109.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuplyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shalnov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Konushin</surname>
          </string-name>
          ,
          <article-title>Markov chain Monte Carlo based video tracking algorithm</article-title>
          ,
          <source>Programming and Computer Software</source>
          <volume>43</volume>
          (
          <year>2017</year>
          )
          <fpage>224</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Meinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Leal-Taixé</surname>
          </string-name>
          ,
          <article-title>Tracking without bells and whistles</article-title>
          , CoRR abs/1903.05625 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1903.05625. arXiv:1903.05625.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Vojir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Noskova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <article-title>Robust scale-adaptive mean-shift for tracking</article-title>
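          , in:
          <string-name>
            <given-names>J.-K.</given-names>
            <surname>Kämäräinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koskela</surname>
          </string-name>
          (Eds.),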
          <source>Image Analysis</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>652</fpage>
          -
          <lpage>663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <article-title>SSD: Single shot multibox detector</article-title>
          , in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Leibe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          (Eds.),
          <source>Computer Vision - ECCV 2016</source>
          , Springer International Publishing, Cham,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <year>2015</year>
          . arXiv:1512.03385.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weyand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andreetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>MobileNets: Efficient convolutional neural networks for mobile vision applications</article-title>
          ,
          <year>2017</year>
          . arXiv:1704.04861.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Crowdhuman: A benchmark for detecting human in a crowd</article-title>
          ,
          <year>2018</year>
          . arXiv:1805.00123.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Leal-Taixé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Milan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <article-title>MOTChallenge 2015: Towards a benchmark for multi-target tracking</article-title>
          ,
          <source>arXiv:1504.01942 [cs]</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1504.01942, arXiv:1504.01942.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Milan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Leal-Taixé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <article-title>MOT16: A benchmark for multi-object tracking</article-title>
          ,
          <source>arXiv:1603.00831 [cs]</source>
          (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1603.00831, arXiv:1603.00831.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dendorfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rezatofighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Milan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cremers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Leal-Taixé</surname>
          </string-name>
          ,
          <article-title>MOT20: A benchmark for multi object tracking in crowded scenes</article-title>
          ,
          <year>2020</year>
          . arXiv:2003.09003.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Benfold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <article-title>Stable multi-target tracking in real-time surveillance video</article-title>
          ,
          <source>in: CVPR 2011</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>3457</fpage>
          -
          <lpage>3464</lpage>
          . doi:10.1109/CVPR.2011.5995667.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>