<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Counting Vehicles with Cameras</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Ciampi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Gennaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fausto Rabitti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Science and Technologies of the National Research Council of Italy (ISTI-CNR)</institution>
          ,
          <addr-line>via G. Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>This paper aims to develop a method that can accurately count vehicles from images of parking areas captured by smart cameras. To this end, we propose a deep learning-based approach for car detection that handles input images with arbitrary perspectives, illumination conditions, and occlusions. No other information about the scenes is needed, such as the positions of the parking lots or perspective maps. The solution is tested on Counting CNRPark-EXT, a new dataset created for this specific task, which is a further contribution of our research. Our experiments show that our solution outperforms state-of-the-art approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Counting</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper is motivated by the need to address a challenging real-world
counting problem: estimating the number of vehicles present in a parking
area from images captured by smart cameras. The visual understanding of the
collected images must face many challenges common to most counting
tasks, such as variations in scale and perspective, inter-object occlusions,
non-uniform illumination of the scene, and many others.</p>
      <p>To address these challenges, we propose a deep learning-based
approach that accurately counts cars in images without any extra
information about the scenes, such as the positions of the parking lots or the perspective
map. This is a key feature, since it makes our solution directly
applicable in unconstrained contexts.</p>
      <p>
        To validate our approach, we also built a dataset, called Counting
CNRPark-EXT, collecting images from the parking lots on the campus of the
National Research Council (CNR) in Pisa. The images are taken by nine different
cameras in challenging conditions: they are captured from different
perspectives, under different illumination, and contain many occlusions. The
proposed methodology significantly outperforms the state-of-the-art baseline
methods.
      </p>
    </sec>
    <sec id="sec-related-work">
      <title>Related Work</title>
      <p>
        Object counting has been tackled in computer vision with various techniques,
especially for estimating the number of people in crowded scenes. Following
the taxonomies adopted in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we can divide counting approaches into
four main categories: counting by detection, counting by clustering, counting by
regression and counting by density estimation.
      </p>
      <p>
        Counting by detection is a supervised approach in which a previously trained
sliding-window detector (i.e., a classifier that is slid over the entire image) is
used to detect the objects in the scene; the detections are then used to count
the objects. In monolithic detection, the classifier is trained
to recognize the whole object we want to detect [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], while part-based
detection looks for specific parts of the object (such as the head and shoulders
in people detection) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Finally, in shape-matching detection, the
classifier matches object shapes, for example shapes composed of ellipses [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Although these
methods are simple to understand, they perform poorly in scenes with occlusions.
      </p>
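      <p>To make the counting-by-detection pipeline concrete, the following is a minimal sketch on a toy binary image, not the method of any cited work. The scoring function <monospace>window_score</monospace> (fraction of foreground pixels in a window) is a hypothetical stand-in for a trained classifier, and the greedy non-maximum suppression step, the window size, and the thresholds are illustrative assumptions.</p>
      <preformat preformat-type="code">
```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (row, col, height, width)


def window_score(image: List[List[int]], box: Box) -> float:
    """Hypothetical stand-in for a trained classifier: fraction of foreground pixels."""
    r, c, h, w = box
    total = sum(image[i][j] for i in range(r, r + h) for j in range(c, c + w))
    return total / (h * w)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom = min(a[0] + a[2], b[0] + b[2])
    right = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, bottom - top) * max(0, right - left)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0


def count_by_detection(image: List[List[int]], win: int = 2, stride: int = 1,
                       score_thr: float = 1.0, iou_thr: float = 0.3,
                       score_fn: Callable[[List[List[int]], Box], float] = window_score) -> int:
    rows, cols = len(image), len(image[0])
    # 1) Slide the (hypothetical) classifier over every window position.
    candidates = []
    for r in range(0, rows - win + 1, stride):
        for c in range(0, cols - win + 1, stride):
            box = (r, c, win, win)
            score = score_fn(image, box)
            if score >= score_thr:
                candidates.append((score, box))
    # 2) Greedy non-maximum suppression: keep the best box, drop heavy overlaps.
    candidates.sort(key=lambda t: t[0], reverse=True)
    kept: List[Box] = []
    for _, box in candidates:
        if not any(iou(box, k) > iou_thr for k in kept):
            kept.append(box)
    # 3) The count is the number of surviving detections.
    return len(kept)


toy = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
print(count_by_detection(toy))  # 2
```
      </preformat>
      <p>The 2x3 blob produces two overlapping full-score windows; suppression merges them, so the toy image yields two objects rather than three. Merging overlapping detections is also where occluded, touching objects tend to be lost.</p>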
      <p>
        Counting by clustering tackles the counting problem in an unsupervised way.
A clear advantage is that such an approach does not need to be trained and works
out of the box. However, the counting accuracy of such fully unsupervised
methods is in general limited. The clustering-by-self-similarities technique relies
on tracking simple image features and probabilistically grouping them into
clusters, as in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; the clusters belonging to a certain category can then be counted. The
clustering-by-motion-similarities approach instead relies on the assumption that
a pair of points that appear to move together is likely to belong to the same
individual, so coherent feature trajectories can be grouped together to
represent independently moving entities, as in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The main drawback of this
approach is that it only works on continuous video frames and not on static
images.
      </p>
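      <p>The motion-similarity idea can be sketched with a union-find over feature trajectories. This is an illustrative simplification, not the algorithm of Rabaud and Belongie: two trajectories are assumed to belong to the same entity when their first positions lie within <monospace>max_dist</monospace> of each other and their frame-to-frame displacements differ by at most <monospace>tol</monospace>. Both parameters and the trajectory format are hypothetical.</p>
      <preformat preformat-type="code">
```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position of a tracked feature in one frame


def moves_together(a: List[Point], b: List[Point], tol: float) -> bool:
    """True if two equally long trajectories have (almost) the same frame-to-frame motion."""
    for t in range(1, len(a)):
        dxa, dya = a[t][0] - a[t - 1][0], a[t][1] - a[t - 1][1]
        dxb, dyb = b[t][0] - b[t - 1][0], b[t][1] - b[t - 1][1]
        if abs(dxa - dxb) > tol or abs(dya - dyb) > tol:
            return False
    return True


def count_by_motion_clustering(tracks: List[List[Point]], tol: float = 0.5,
                               max_dist: float = 5.0) -> int:
    parent = list(range(len(tracks)))  # union-find forest over trajectories

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            (x0, y0), (x1, y1) = tracks[i][0], tracks[j][0]
            dist = ((x0 - x1) ** 2 + (y0 - y1) ** 2) ** 0.5
            if max_dist >= dist and moves_together(tracks[i], tracks[j], tol):
                parent[find(i)] = find(j)  # merge the two groups
    # Each group of coherently moving trajectories is one counted entity.
    return len({find(i) for i in range(len(tracks))})


tracks = [
    [(0, 0), (1, 0), (2, 0)],        # object A, moving right
    [(1, 0), (2, 0), (3, 0)],        # object A
    [(0, 1), (1, 1), (2, 1)],        # object A
    [(20, 20), (20, 21), (20, 22)],  # object B, moving down
    [(21, 20), (21, 21), (21, 22)],  # object B
]
print(count_by_motion_clustering(tracks))  # 2
```
      </preformat>
      <p>Because the grouping depends on displacements across frames, a single static image provides no trajectories to cluster, which is the limitation noted above.</p>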
      <p>Counting by regression is a supervised method that tries to establish a direct
mapping (linear or not) from image features to the number of objects present
in an image, without explicit object detection or tracking. Since it does not rely
on a previously trained object-specific classifier or model, it is more robust to occlusions
and perspective distortions.</p>
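      <p>A minimal counting-by-regression sketch, assuming a single scalar image feature (a hypothetical foreground-area measurement) and an ordinary least-squares linear fit. Real systems use richer features and possibly non-linear regressors; the point here is only that the count is predicted directly from features, without locating any object.</p>
      <preformat preformat-type="code">
```python
from typing import List, Tuple


def fit_linear_count(features: List[float], counts: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for: count = slope * feature + intercept."""
    n = len(features)
    mean_f = sum(features) / n
    mean_c = sum(counts) / n
    cov = sum((f - mean_f) * (c - mean_c) for f, c in zip(features, counts))
    var = sum((f - mean_f) ** 2 for f in features)
    slope = cov / var
    return slope, mean_c - slope * mean_f


def predict_count(feature: float, slope: float, intercept: float) -> float:
    """Map an image feature directly to a count: no detection, no tracking."""
    return slope * feature + intercept


# Hypothetical training pairs: foreground area (pixels) and number of cars per image.
train_area = [1200.0, 2500.0, 3600.0, 5000.0]
train_cars = [3.0, 6.0, 9.0, 12.0]
slope, intercept = fit_linear_count(train_area, train_cars)
print(round(predict_count(4200.0, slope, intercept), 1))
```
      </preformat>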
      <p>
        Finally, counting by density estimation, introduced in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], is a supervised technique that in some way extends the counting-by-regression
approach. In this case, the (linear or not) mapping is between image features and a corresponding
density map (i.e., a continuous-valued function), rather than between features and
the number of objects. Integrating the density map over any region then yields
the number of objects within that region. Like counting by regression, this approach
is robust to occlusions and perspective distortions because it does not rely on a
previously trained object-specific classifier; the key difference is that it exploits a pixel-level mapping:
each pixel of the image is represented by a feature vector and mapped to a pixel
of the corresponding density map. In this way, unlike counting by regression,
spatial information is incorporated into the learning process.
      </p>
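      <p>The sketch below illustrates only the counting step of density estimation, under stated assumptions: the density map is built from hypothetical point annotations by placing one Gaussian per object, each normalized to sum to one over the image (a common way of constructing ground-truth density maps), instead of being predicted from image features. Summing the map over a region is the discrete counterpart of the integral described above.</p>
      <preformat preformat-type="code">
```python
import math
from typing import List, Tuple


def density_map(shape: Tuple[int, int], points: List[Tuple[int, int]],
                sigma: float = 1.5) -> List[List[float]]:
    """Sum of one Gaussian per annotated object, each normalized to total mass 1."""
    rows, cols = shape
    dmap = [[0.0] * cols for _ in range(rows)]
    for pr, pc in points:
        kernel = [[math.exp(-((i - pr) ** 2 + (j - pc) ** 2) / (2 * sigma ** 2))
                   for j in range(cols)] for i in range(rows)]
        total = sum(map(sum, kernel))
        for i in range(rows):
            for j in range(cols):
                dmap[i][j] += kernel[i][j] / total
    return dmap


def count_in_region(dmap: List[List[float]], top: int, left: int,
                    bottom: int, right: int) -> float:
    """Discrete integral of the density map over rows [top, bottom) and cols [left, right)."""
    return sum(dmap[i][j] for i in range(top, bottom) for j in range(left, right))


cars = [(5, 5), (5, 15), (15, 10)]  # hypothetical annotated car centers (row, col)
dmap = density_map((20, 20), cars)
print(round(count_in_region(dmap, 0, 0, 20, 20), 3))  # 3.0, one unit of mass per car
print(round(count_in_region(dmap, 0, 0, 10, 10), 3))  # roughly one car (top-left quadrant)
```
      </preformat>
      <p>Because the count is a sum of fractional mass, a region can contain a non-integer number of objects, for example when a car straddles a region boundary.</p>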
      <p>
        The extraction of suitable features is a crucial operation for all four
approaches described above. Since handcrafted features generally
suffer a drop in accuracy in challenging situations (variations
in illumination, perspective distortion, severe occlusion, etc.), most
state-of-the-art counting methods employ deep learning, in particular
Convolutional Neural Networks (CNNs), to extract suitable task-specific
features automatically. Examples of works that use CNNs are [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for crowd
counting, and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for vehicle counting.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>
        A contribution of this paper is the creation of "Counting CNRPark-EXT", a
dataset of roughly 4,000 images containing more than 79,000 labeled cars. This
dataset is based on the CNRPark-EXT dataset presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a collection of
roughly 150,000 bounding-box annotated images of vacant and occupied parking
slots on the campus of the National Research Council (CNR). This dataset is
challenging and covers most of the difficult situations that can be found in a
real scenario: the images are captured by nine different cameras under various
weather conditions, viewing angles, and lighting conditions. Furthermore, many
scenes contain partial occlusions due to obstacles (trees, lampposts, other cars),
as well as shadowed cars.
      </p>
      <p>The CNRPark-EXT dataset is specifically designed for parking lot occupancy
detection and is not directly usable for the counting task, since each image,
called a patch, contains a single parking space labeled according to its occupancy
status: 0 for vacant and 1 for occupied. Since the purpose of this work is
to count the cars present in an image, building the ground truth requires full
images (not patches), the total number of vehicles in each entire scene, and, at
least for some techniques, the locations of these vehicles. Therefore, starting
from the CNRPark-EXT dataset, we summed the occupancy statuses of the patches
belonging to the same full image (note that a parking space is occupied if and
only if a car is present in it), obtaining the total number of cars in each
whole scene (i.e., a per-image ground truth). Only the bounding boxes corresponding
to occupied spaces are kept, identifying the cars to be counted (i.e., a per-object ground
truth). Table 1 reports the composition of the Counting CNRPark-EXT dataset.</p>
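      <p>The ground-truth construction above can be sketched as follows. The record layout (image id, slot id, occupancy label, slot bounding box) and all identifiers are hypothetical and do not reproduce the actual CNRPark-EXT file format.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x, y, width, height): hypothetical layout

# Hypothetical patch records: (full_image_id, slot_id, occupancy 0/1, slot bounding box).
patches = [
    ("cam1_0001", "s1", 1, (10, 20, 40, 30)),
    ("cam1_0001", "s2", 0, (60, 20, 40, 30)),
    ("cam1_0001", "s3", 1, (110, 20, 40, 30)),
    ("cam2_0001", "s1", 0, (15, 40, 50, 35)),
]


def build_counting_ground_truth(records):
    per_image_count: Dict[str, int] = defaultdict(int)
    per_object_boxes: Dict[str, List[BBox]] = defaultdict(list)
    for image_id, _slot, occupied, bbox in records:
        # A slot is occupied iff it holds a car, so summing labels counts the cars.
        per_image_count[image_id] += occupied
        if occupied:
            # Only boxes of occupied slots identify cars to be counted.
            per_object_boxes[image_id].append(bbox)
    return dict(per_image_count), dict(per_object_boxes)


counts, boxes = build_counting_ground_truth(patches)
print(counts)  # {'cam1_0001': 2, 'cam2_0001': 0}
```
      </preformat>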
    </sec>
    <sec id="sec-method">
      <title>A New Vehicle Counting Solution Using Mask R-CNN</title>
      <p>
        Our solution is based on Mask R-CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a popular deep convolutional
neural network employed in many detection systems. Unlike previous methods
that tackle the localization problem by building a sliding-window detector, Mask
R-CNN operates within the 'recognition using regions'
paradigm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: it takes an image as input and outputs a label for each
detected object, together with the bounding box and mask localizing it. The
authors distinguish between the convolutional backbone architecture, used for
feature extraction over the entire image, and the network head for bounding-box
recognition (classification and regression) and mask prediction, which is applied to
each proposed region.
      </p>
      <p>
        As a starting point, we considered a Mask R-CNN model pre-trained on
the COCO dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a large dataset of images depicting complex
everyday scenes of common objects in their natural context, annotated with 80
object categories. To count vehicles, we considered the detected objects
belonging to the car and truck categories. Since this network is a generic object
detector, we then specialized it to recognize the vehicles we want to count.
      </p>
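      <p>A minimal sketch of the counting rule above, assuming the detector output has already been converted to a list of dictionaries with a category name and a confidence score. The dictionary layout and the 0.5 confidence threshold are assumptions for illustration; the text does not state which threshold, if any, was used.</p>
      <preformat preformat-type="code">
```python
from typing import Dict, Iterable, List

# COCO categories counted as vehicles, as described in the text.
VEHICLE_CATEGORIES = {"car", "truck"}


def count_vehicles(detections: Iterable[Dict], score_thr: float = 0.5) -> int:
    """Count detections labeled car or truck whose confidence reaches score_thr.

    The {"label": str, "score": float} layout and the default threshold are
    illustrative assumptions, not values reported in the paper.
    """
    return sum(
        1 for d in detections
        if d["label"] in VEHICLE_CATEGORIES and d["score"] >= score_thr
    )


frame_detections: List[Dict] = [
    {"label": "car", "score": 0.97},
    {"label": "truck", "score": 0.81},
    {"label": "person", "score": 0.92},
    {"label": "car", "score": 0.31},
]
print(count_vehicles(frame_detections))  # 2
```
      </preformat>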
      <p>The first step was the creation of a suitable labeled training set. For
Mask R-CNN, these labels consist of masks and bounding boxes. As
mentioned before, the labels of our dataset are only bounding boxes (and
not very accurate ones, since they localize parking lots rather than cars), so we needed
to add masks on the vehicles to be detected to make the Counting
CNRPark-EXT dataset suitable for our purposes. Mask creation is a very
time-consuming operation: for each car in each image, the pixels that
belong to the vehicle must be associated with a label. Since the dataset contains hundreds of
cars, we solved this problem by taking advantage of the output of the
pre-trained Mask R-CNN model. Since this network outputs accurate
masks localizing objects, the idea is to save these automatically generated masks,
parse them, and then manually add the masks of the remaining cars
that were not detected.</p>
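      <p>The bootstrapping idea above, together with the relabeling step described in the next paragraph, can be sketched as simple bookkeeping over saved predictions. The annotation dictionaries, the polygon format, and the helper names are hypothetical. As a simplification, every airplane or boat prediction is remapped to car here, whereas in the described procedure the wrong associations were identified by analyzing the saved masks.</p>
      <preformat preformat-type="code">
```python
from typing import Dict, List

# Categories impossible in a parking area that turned out to be vehicles
# (airplane and boat are the examples given in the text).
IMPLAUSIBLE_TO_CAR = {"airplane": "car", "boat": "car"}
# Assumption: keep only the vehicle categories that are later counted.
KEEP = {"car", "truck"}


def build_pseudo_labels(predictions: List[Dict], manual: List[Dict]) -> List[Dict]:
    """Turn saved detector outputs into training annotations.

    Each annotation is a hypothetical dict {"label": str, "polygon": [(x, y), ...]}.
    """
    annotations = []
    for p in predictions:
        label = IMPLAUSIBLE_TO_CAR.get(p["label"], p["label"])
        if label in KEEP:
            annotations.append({"label": label, "polygon": p["polygon"]})
    # Masks for cars missed by the pre-trained model are drawn by hand and appended.
    annotations.extend(manual)
    return annotations


preds = [
    {"label": "car", "polygon": [(0, 0), (4, 0), (4, 2), (0, 2)]},
    {"label": "boat", "polygon": [(10, 0), (14, 0), (14, 2), (10, 2)]},
    {"label": "person", "polygon": [(20, 0), (21, 0), (21, 3), (20, 3)]},
]
manual = [{"label": "car", "polygon": [(30, 0), (34, 0), (34, 2), (30, 2)]}]
labels = [a["label"] for a in build_pseudo_labels(preds, manual)]
print(labels)  # ['car', 'car', 'car']
```
      </preformat>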
      <p>First, we randomly selected a subset of our training set (about 10%), fed
these images to the pre-trained Mask R-CNN model, and saved the
vertex coordinates of the polygons surrounding the masks that the network outputs,
along with the categories associated with them.
Next, we analyzed these saved masks further, correcting the
wrong associations in which objects were labeled with categories that are not
possible in our context (such as airplane and boat) but were actually vehicles
we want to detect. In this way, we forced the network to learn to recognize these
objects as cars rather than as other, wrong objects. Finally, we manually added
the remaining masks localizing vehicles that were not automatically detected.
At this point, we retrained the network on this new mask-labeled,
task-specific dataset, freezing the weights of the backbone, and saved the new weights
of the head after a few epochs. Figure 1 reports the pipeline of the described
approach.
      </p>
    </sec>
    <sec id="sec-experiments">
      <title>Experimental Evaluation</title>
      <p>In this section, we present the methodology and the results of the experimental
evaluation. For testing purposes, we used the test subset of the Counting
CNRPark-EXT dataset.</p>
      <p><bold>Evaluation Metrics.</bold> Following other counting benchmarks, we use Mean
Absolute Error (MAE) and Root Mean Square Error (RMSE) as the metrics for
comparing the performance of our solution against other counting approaches
in the literature. MAE and RMSE are defined as follows:</p>
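      <p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}\left|c_n^{gt} - c_n^{pred}\right|</tex-math>
        </disp-formula>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(c_n^{gt} - c_n^{pred}\right)^2}</tex-math>
        </disp-formula>
      </p>
      <p>where <inline-formula><tex-math>N</tex-math></inline-formula> is the total number of test images, and <inline-formula><tex-math>c_n^{gt}</tex-math></inline-formula> and <inline-formula><tex-math>c_n^{pred}</tex-math></inline-formula> are the actual and the predicted count of the <inline-formula><tex-math>n</tex-math></inline-formula>-th image, respectively. Note that, as a result of squaring each difference, RMSE effectively penalizes large errors more heavily than small ones; RMSE is therefore more useful when large errors are particularly undesirable.</p>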
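      <p>A direct transcription of Eqs. (1) and (2) in Python, with hypothetical counts chosen to illustrate the point made above: two predictors with the same MAE can have very different RMSE when one of them makes a single large error.</p>
      <preformat preformat-type="code">
```python
import math
from typing import Sequence


def mae(gt: Sequence[float], pred: Sequence[float]) -> float:
    """Mean Absolute Error, Eq. (1): mean of |c_gt - c_pred| over the N test images."""
    return sum(abs(g - p) for g, p in zip(gt, pred)) / len(gt)


def rmse(gt: Sequence[float], pred: Sequence[float]) -> float:
    """Root Mean Square Error, Eq. (2): square root of the mean squared difference."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / len(gt))


# Hypothetical counts for four test images: same MAE, very different RMSE.
gt = [10, 10, 10, 10]
pred_a = [12, 8, 12, 8]    # four small errors of 2 cars each
pred_b = [10, 10, 10, 2]   # one large error of 8 cars
print(mae(gt, pred_a), rmse(gt, pred_a))  # 2.0 2.0
print(mae(gt, pred_b), rmse(gt, pred_b))  # 2.0 4.0
```
      </preformat>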
      <p>
        These two metrics quantify the error in estimating
the object count. However, as pointed out by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], they contain
no information about the relation between the error and the total number of objects
present in the image. To this end, we also take into account another performance metric,
essentially a normalized MAE, which we call Mean Occupancy
Error (MOE) because in this work it quantifies the error in estimating the
occupancy of a car park. It is defined as:
      </p>
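      <p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>\mathrm{MOE} = \frac{1}{N}\sum_{n=1}^{N}\frac{\left|c_n^{gt} - c_n^{pred}\right|}{\mathit{num\_slots}_n}</tex-math>
        </disp-formula>
      </p>
      <p>where <inline-formula><tex-math>\mathit{num\_slots}_n</tex-math></inline-formula> is the total number of parking lots in the <inline-formula><tex-math>n</tex-math></inline-formula>-th scene. In the following, this evaluation metric is expressed as a percentage.</p>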
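      <p>A direct transcription of Eq. (3), expressed as a percentage; the per-image slot counts are hypothetical inputs. The example shows why the normalization matters: the same absolute error of two cars is a 20% occupancy error in a 10-slot scene but only 5% in a 40-slot scene.</p>
      <preformat preformat-type="code">
```python
from typing import Sequence


def moe_percent(gt: Sequence[int], pred: Sequence[int], num_slots: Sequence[int]) -> float:
    """Mean Occupancy Error, Eq. (3): mean of |c_gt - c_pred| / num_slots, in percent."""
    per_image = [abs(g - p) / s for g, p, s in zip(gt, pred, num_slots)]
    return 100.0 * sum(per_image) / len(per_image)


# Hypothetical scenes: the same 2-car error in a small and in a large car park.
print(moe_percent([8], [6], [10]))    # 20.0
print(moe_percent([38], [36], [40]))  # 5.0
```
      </preformat>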
      <p>
        <bold>Comparisons with the state of the art.</bold> We compared our vehicle
counting solution against the method proposed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a state-of-the-art approach
for car parking occupancy detection based on mAlexNet, a deep CNN specifically
designed for smart cameras. This work is an indirect method for
counting cars in a car park, in which counting is cast as a
classification problem: every parking lot classified as occupied increments
the total number of cars.
      </p>
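      <p>To illustrate how this baseline turns classification into counting, here is a minimal sketch assuming a hypothetical dictionary of per-slot occupancy probabilities and a 0.5 decision threshold; neither is taken from the mAlexNet-based method. Note that the slot identifiers must exist before any counting can happen, which is the limitation discussed next.</p>
      <preformat preformat-type="code">
```python
from typing import Dict


def count_by_occupancy(slot_probs: Dict[str, float], thr: float = 0.5) -> int:
    """Indirect counting: one car per pre-annotated slot classified as occupied.

    slot_probs maps each known slot id to a hypothetical probability of
    'occupied'; the threshold is an illustrative assumption.
    """
    return sum(1 for p in slot_probs.values() if p >= thr)


scores = {"slot_01": 0.93, "slot_02": 0.12, "slot_03": 0.71, "slot_04": 0.49}
print(count_by_occupancy(scores))  # 2
```
      </preformat>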
      <p>The main drawback of such an approach is that the locations of the monitored
objects in a scene must be known in advance. The technique cannot be applied
to a new camera added to the car park, simply because it has no
knowledge of where the parking slots are located: annotating the new
camera view and retraining the network are then mandatory. This
issue makes the method not directly applicable in unconstrained
contexts.</p>
      <p>On the other hand, as already mentioned, our counting approach is
able to count vehicles without any extra information about the parking lot
locations. Nevertheless, the results of this classification approach on the
Counting CNRPark-EXT dataset can be used as a baseline for
comparison.</p>
      <p><bold>Results.</bold> The results of the experimental evaluation are reported in Table 2. Our
solution not only outperforms the original pre-trained Mask R-CNN
but also works better than the state-of-the-art approach based on mAlexNet
on all three performance metrics.</p>
      <p>Since counting errors could be due to different weather conditions (which
might produce significantly different illumination of the scenes) and/or to
different camera views (which correspond to different perspectives), we further
analyzed the results by grouping them according to the camera and the weather
condition they belong to. Figure 2 reports the scatter plots of the errors. In
general, our solution tends to underestimate the number of vehicles present in
the scene, but it responds well to perspective and illumination changes, showing
only a small performance decrease for frames captured by camera nine.</p>
    </sec>
    <sec id="sec-conclusions">
      <title>Conclusions</title>
      <p>In this paper, we presented an efficient solution for counting vehicles in parking
areas that exploits deep Convolutional Neural Networks (CNNs) to detect vehicles
in challenging scenes. Our methodology does not require any
manually entered information about the parking lot locations, allowing a simple
'plug-and-play' installation. The results outperform those obtained with
state-of-the-art baseline methods.</p>
      <p>As a further contribution, we collected Counting CNRPark-EXT, a dataset
containing images of real parking areas captured by nine cameras under different
weather conditions and from different perspective views.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Ahuja</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Todorovic</surname>, <given-names>S.</given-names></string-name>:
          <article-title>Extracting texels in 2.1D natural textures</article-title>.
          In: <source>Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on</source>, pp.
          <fpage>1</fpage>–<lpage>8</lpage>.
          <publisher-name>IEEE</publisher-name> (<year>2007</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Amato</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Carrara</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Falchi</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Gennaro</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Meghini</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Vairo</surname>, <given-names>C.</given-names></string-name>:
          <article-title>Deep learning for decentralized parking lot occupancy detection</article-title>.
          <source>Expert Systems with Applications</source>
          <volume>72</volume>, <fpage>327</fpage>–<lpage>334</lpage> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Boominathan</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Kruthiventi</surname>, <given-names>S.S.</given-names></string-name>,
          <string-name><surname>Babu</surname>, <given-names>R.V.</given-names></string-name>:
          <article-title>CrowdNet: A deep convolutional network for dense crowd counting</article-title>.
          In: <source>Proceedings of the 2016 ACM on Multimedia Conference</source>, pp.
          <fpage>640</fpage>–<lpage>644</lpage>.
          <publisher-name>ACM</publisher-name> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Ciregan</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Meier</surname>, <given-names>U.</given-names></string-name>,
          <string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Multi-column deep neural networks for image classification</article-title>.
          In: <source>Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on</source>, pp.
          <fpage>3642</fpage>–<lpage>3649</lpage>.
          <publisher-name>IEEE</publisher-name> (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Gu</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Lim</surname>, <given-names>J.J.</given-names></string-name>,
          <string-name><surname>Arbelaez</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Malik</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Recognition using regions</article-title>.
          In: <source>Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on</source>, pp.
          <fpage>1030</fpage>–<lpage>1037</lpage>.
          <publisher-name>IEEE</publisher-name> (<year>2009</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>He</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Gkioxari</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Dollar</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Girshick</surname>, <given-names>R.</given-names></string-name>:
          <article-title>Mask R-CNN</article-title>.
          In: <source>Computer Vision (ICCV), 2017 IEEE International Conference on</source>, pp.
          <fpage>2980</fpage>–<lpage>2988</lpage>.
          <publisher-name>IEEE</publisher-name> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Leibe</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Seemann</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Schiele</surname>, <given-names>B.</given-names></string-name>:
          <article-title>Pedestrian detection in crowded scenes</article-title>.
          In: <source>Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on</source>, vol. <volume>1</volume>, pp.
          <fpage>878</fpage>–<lpage>885</lpage>.
          <publisher-name>IEEE</publisher-name> (<year>2005</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Lempitsky</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Zisserman</surname>, <given-names>A.</given-names></string-name>:
          <article-title>Learning to count objects in images</article-title>.
          In: <source>Advances in Neural Information Processing Systems</source>, pp.
          <fpage>1324</fpage>–<lpage>1332</lpage> (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Li</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Huang</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Tan</surname>, <given-names>T.</given-names></string-name>:
          <article-title>Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection</article-title>.
          In: <source>Pattern Recognition, 2008. ICPR 2008. 19th International Conference on</source>, pp.
          <fpage>1</fpage>–<lpage>4</lpage>.
          <publisher-name>IEEE</publisher-name> (<year>2008</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Lin</surname>, <given-names>T.Y.</given-names></string-name>,
          <string-name><surname>Maire</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Belongie</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Hays</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Perona</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Ramanan</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Dollar</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Zitnick</surname>, <given-names>C.L.</given-names></string-name>:
          <article-title>Microsoft COCO: Common objects in context</article-title>.
          In: <source>European Conference on Computer Vision</source>, pp.
          <fpage>740</fpage>–<lpage>755</lpage>.
          <publisher-name>Springer</publisher-name> (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Loy</surname>, <given-names>C.C.</given-names></string-name>,
          <string-name><surname>Chen</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Gong</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Xiang</surname>, <given-names>T.</given-names></string-name>:
          <article-title>Crowd counting and profiling: Methodology and evaluation</article-title>.
          In: <source>Modeling, Simulation and Visual Analysis of Crowds</source>, pp.
          <fpage>347</fpage>–<lpage>382</lpage>.
          <publisher-name>Springer</publisher-name> (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Onoro-Rubio</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Lopez-Sastre</surname>, <given-names>R.J.</given-names></string-name>:
          <article-title>Towards perspective-free object counting with deep learning</article-title>.
          In: <source>European Conference on Computer Vision</source>, pp.
          <fpage>615</fpage>–<lpage>629</lpage>.
          <publisher-name>Springer</publisher-name> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name><surname>Rabaud</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Belongie</surname>, <given-names>S.</given-names></string-name>:
          <article-title>Counting crowded moving objects</article-title>.
          In: <source>Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on</source>, vol. <volume>1</volume>, pp.
          <fpage>705</fpage>–<lpage>711</lpage>.
          <publisher-name>IEEE</publisher-name> (<year>2006</year>)
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name><surname>Sindagi</surname>, <given-names>V.A.</given-names></string-name>,
          <string-name><surname>Patel</surname>, <given-names>V.M.</given-names></string-name>:
          <article-title>A survey of recent advances in CNN-based single image crowd counting and density estimation</article-title>.
          <source>Pattern Recognition Letters</source> (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name><surname>Zhao</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Nevatia</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Wu</surname>, <given-names>B.</given-names></string-name>:
          <article-title>Segmentation and tracking of multiple humans in crowded environments</article-title>.
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>30</volume>(<issue>7</issue>), <fpage>1198</fpage>–<lpage>1211</lpage> (<year>2008</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>