<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Domain Randomization for Object Counting⋆</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Insight Centre for Data Analytics</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recently, the use of synthetic datasets based on game engines has been shown to improve the performance of several tasks in computer vision. However, these datasets are typically only appropriate for the specific domains depicted in computer games, such as urban scenes involving vehicles and people. In this paper, we present an approach to generate synthetic datasets for object counting for any domain without the need for photo-realistic techniques manually generated by expensive teams of 3D artists. We introduce a domain randomization approach for object counting based on synthetic datasets that are quick and inexpensive to generate. We deliberately avoid photorealism and drastically increase the variability of the dataset, producing images with random textures and 3D transformations, which improves generalization. Experiments show that our method facilitates good performance on various real-world object counting datasets for multiple domains: people, vehicles, penguins, and fruit. The source code is available at: https://github.com/enric1994/dr4oc</p>
        <p>⋆ This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765140. This publication has emanated from research supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2, co-funded by the European Regional Development Fund.</p>
      </abstract>
      <kwd-group>
        <kwd>Domain Randomization</kwd>
        <kwd>Synthetic Data</kwd>
        <kwd>Object Counting</kwd>
        <kwd>Computer Vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Object counting is a computer vision task, the goal of which is to automatically
estimate the number of objects in an image or video. It has gained a lot of
interest in recent years because of its many potential uses: it can help to identify
the congestion level in a shopping center [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] (people counting), the level of
traffic on a road [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (vehicle counting), the status of a penguin colony [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (habitat
monitoring), or even to monitor a harvest [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] (fruit counting).
      </p>
      <p>The main challenge of object counting is that the model has to learn all the
variations of the objects in terms of their size, shape, and pose whilst also dealing
with occlusion and perspective effects. Furthermore, object counting algorithms
tend to overfit because of the small amount of annotated data available, which
degrades their performance when applying the model on other slightly different
domains.</p>
      <p>
        To address the problems above, some computer vision algorithms are trained
or pretrained using synthetic data, which can be automatically annotated with
perfect precision for a range of application domains, including those where
collecting data is problematic. Many computer vision tasks such as optical flow,
detection, segmentation, or counting have benefited from the use of synthetic
data. There are many well-known artificially generated datasets [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] that are
particularly useful because of their size, quality of the annotations, and the
variability within the dataset.
      </p>
      <p>
        The task of developing novel approaches to people counting in particular has
benefited from the use of synthetic data. However, when counting objects from
other domains such as wildlife, food, or everyday arbitrary objects, the datasets
produced by game engines are not useful because there are no realistic video
game renderings of these types of objects. It is not practical to create realistic
datasets for many different tasks because of the significant manual effort and
production costs required [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Furthermore, models trained with synthetic images from a particular domain
perform poorly when tested on a different target domain because of the domain
gap, which has posed considerable obstacles to real-world adoption of synthetic
data for computer vision applications. The main cause is that convolutional
neural networks (CNNs) are biased towards textures, memorizing them
instead of shapes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For object counting, understanding the shape of the items
is of paramount importance in order to address the challenges of overlapping
objects and occlusions.
      </p>
      <p>
        Domain randomization (DR) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] can reduce the impact of the domain gap
by generating highly variable samples at the cost of increasing the complexity
of the task. The objective of increasing variability is to expand the spectrum
of possibilities of the source domain whereby the real-world domain becomes
just another variation. The synthetic samples generated with this technique tend
to look less realistic because of the random textures, lighting, and backgrounds
used. DR avoids photorealism, minimizing the need for artistic design. Figure 1
shows several examples of DR applied to different environments.
      </p>
      <p>The contributions of this paper are as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>We train an object counting algorithm without labeling any data. The ground truth is calculated automatically during the generation of the synthetic images.</p>
        </list-item>
        <list-item>
          <p>We introduce the first domain randomization approach for object counting based entirely on synthetic images. We increase the variability of the synthetic dataset by applying random textures, backgrounds, and lighting effects to the 3D scene. We demonstrate good performance on real-world datasets that is consistent across multiple domains.</p>
        </list-item>
        <list-item>
          <p>We introduce a set of 3D transformations that increase the variability of the 3D models while preserving their inner shape, making the task more complex during training but improving generalization at test time. To the best of our knowledge, we are the first to use 3D transformations to randomize synthetic images in this way.</p>
        </list-item>
      </list>
      <p>[Figure 1. Examples of DR applied to different environments: (a) crowd counting; (b) vehicle counting; (c) environmental survey; (d) harvest study.]</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Early object counting algorithms mainly targeted crowd counting. They applied
detection-based approaches such as R-CNN [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and YOLO [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to estimate the
number of people in an image and demonstrated reasonable accuracy in sparse
scenes [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. However, the performance dropped on densely crowded scenes where
people overlapped with heavy occlusion. Alternative regression-based methods [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
can extract features (textures, gradients, shapes) to overcome occlusion and learn
a mapping function to evaluate how sparse the scene is, but they ignore the
spatial information. In general, CNN-based approaches [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] predict density maps
to estimate the number of instances in the scene and use the spatial information
contained in the density map. Currently, most state-of-the-art object counting
algorithms are based on fully convolutional networks [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] combined with other
techniques such as analyzing the context [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], using the perspective information [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
or applying a multi-column architecture [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
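      <p>The density map idea mentioned above can be illustrated with a minimal sketch. It assumes dot annotations given as (x, y) pixel coordinates and a fixed Gaussian width; the function name, the kernel width, and the per-dot normalization are illustrative choices, not details taken from the cited methods.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def density_map(points: np.ndarray, height: int, width: int, sigma: float = 4.0) -> np.ndarray:
    """Build a density map from dot annotations given as (x, y) pixel coordinates.

    Assumption: every dot is spread with an isotropic Gaussian of fixed width
    `sigma` and renormalized, so each object contributes exactly one unit of mass.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        density += blob / blob.sum()
    return density


# Hypothetical dot annotations for three objects in a 64 x 64 image.
dots = np.array([[10.0, 12.0], [40.0, 30.0], [41.0, 31.0]])
dmap = density_map(dots, height=64, width=64)

# Integrating (summing) the map recovers the object count.
print(float(dmap.sum()))
```
]]></preformat>
      <p>Because each dot is normalized to unit mass, the sum of the map equals the number of annotated objects, while the map itself keeps the spatial layout that regression-based methods discard.</p>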
      <p>
        Many object counting datasets have appeared in recent years [
        <xref ref-type="bibr" rid="ref1 ref10 ref31 ref8">31,1,8,10</xref>
        ],
especially for crowd counting. In general they are annotated with dots indicating
the position of the objects, e.g. crowd counting datasets define a dot on the head
of each person. The annotation of object counting datasets is expensive because
it requires precise dot annotations performed by an expert; hence the datasets
tend to be small, as shown in Table 1.
      </p>
      <p>Performance is measured using two main metrics: MAE (mean absolute error)
and MSE (mean squared error). They compute the average L1 and L2 distance
between the predicted count and the ground truth, respectively. MAE and MSE
are scale-dependent and therefore cannot be used to make comparisons between
datasets using different scales, e.g. they cannot be used to compare performance
between different counting datasets because they may have a different average number of
objects per image. They are computed as follows:</p>
      <p>
        <disp-formula id="eq1"><label>(1)</label><tex-math>\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - y_i \rvert ,</tex-math></disp-formula>
      </p>
      <p>
        <disp-formula id="eq2"><label>(2)</label><tex-math>\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2 ,</tex-math></disp-formula>
      </p>
      <p>where x<sub>i</sub> is the predicted count, y<sub>i</sub> is the ground truth and n is the number of
images evaluated.</p>
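      <p>As a quick illustration of the two metrics, the following sketch evaluates them on a handful of made-up counts. The function names and the example values are hypothetical and are not results from this paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence


def mae(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth counts."""
    if len(predicted) != len(truth) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(predicted, truth)) / len(predicted)


def mse(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Mean squared error between predicted and ground-truth counts."""
    if len(predicted) != len(truth) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(predicted, truth)) / len(predicted)


# Hypothetical per-image counts, for illustration only.
predicted_counts = [10.0, 22.0, 31.0]
true_counts = [12.0, 20.0, 30.0]

print(mae(predicted_counts, true_counts))  # (2 + 2 + 1) / 3
print(mse(predicted_counts, true_counts))  # (4 + 4 + 1) / 3
```
]]></preformat>
      <p>Both values grow with the typical number of objects per image, which is why they are not comparable across datasets.</p>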
      <p>
        Recently, synthetic datasets [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] have been used to train deep networks for
computer vision. The environments used to create the synthetic datasets range
from very simple methods using basic shapes and colors [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], to scenes generated
by complex game engines [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] that render photo-realistic images and videos. The
main benefit is that the data is labeled automatically, saving a substantial amount
of time especially in densely annotated datasets such as those for segmentation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Also, a virtual environment makes it possible to reproduce rare scenes that are hard
to capture in the real world, e.g. remote, hard-to-access locations or unusual
weather phenomena.
      </p>
      <p>
        DR aims to make CNNs robust against challenges posed by novel domains
outside the training set, a phenomenon known as the domain gap. It was first
explained by Tobin et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] by varying the texture of the objects, background
image, and lighting in a semantic segmentation task. The objective is to generate
enough variations of synthetic data that the model views real data as just another
variation, even if the variations used for training appear unrealistic to humans.
Expanding the spectrum of possibilities also raises the complexity of the task,
requiring a model with a higher capacity. If the model is trained on a sufficient
number of environments it will interpolate well to novel ones. This method can
be considered to be an evolved form of data augmentation.
      </p>
      <p>[Figure 2. Low-poly 3D models: (a) human; (b) vehicles; (c) penguin; (d) apple.]</p>
      <p>
        Conversely, domain adaptation (DA) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] tries to bring the training data
closer to the real-world data [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] by matching the distributions of both datasets
and learning the shared properties. When DA is applied on synthetic images,
they will look more realistic. DA is useful when data can be easily obtained
by modeling the distribution of the synthetic features to match the real ones.
However, many domains cannot benefit from this technique because of the high
variability of the data and the small number of real images available, e.g. face
recognition datasets contain few images of smiling infants [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Approach</title>
      <sec id="sec-3-1">
        <title>Scene creation</title>
        <p>
          Our DR datasets are generated using a mixture of 3D models, textures,
background images, and lighting effects. The 3D software used to render the scenes is
Blender [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which can be easily automated. We restrict the 3D models to low-poly
meshes with fewer than 200 faces, as shown in Figure 2. We found that using
highly realistic models, which can have a thousand faces, does not improve the
results while significantly increasing the rendering time.
        </p>
        <p>The structure of the low-poly 3D models is modified to produce more variations,
some of them unrealistic. The generated structures are, however, constrained
to keep the basic shape, e.g. humans with one head and two legs; otherwise
the model will not learn the inner properties of the object. Learning a vast
number of shapes improves generalization to novel scenarios. Figure 3 shows the
different 3D transformations that we used to produce the synthetic datasets:
scale, randomization, and extrusion. Scale smoothly expands or contracts all the
vertices along the same axis. It is useful when objects tend to have very different
sizes, e.g. adults tend to be twice as big as children.</p>
        <p>The scale of every 3D model is determined by K ∼ U(1/N<sub>o</sub>, 8/N<sub>o</sub>), where
U is a uniform distribution and N<sub>o</sub> is the number of objects in the image.
Randomizing the vertices of the mesh translates all the vertices by different
lengths and in different directions, uniformly by a factor of 40%. This method improves the
performance in environments where the pose of the objects is variable, e.g. people
can have multiple poses while vehicles do not. Extrusion alters the surfaces of the
mesh to increase the thickness by adding depth. This helps to make objects bigger
or smaller but adds bumps and holes. We used the built-in Solidify transform in
Blender and modified the thickness by T, where T ∼ U(−0.1, 0.5).</p>
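        <p>The sampling of the three transformation parameters can be sketched outside Blender with NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the scale is applied to all axes at once, and the 40% factor is interpreted as a per-axis offset of up to 40% of the mesh extent, which the text does not specify.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def sample_scale(num_objects: int, rng: np.random.Generator) -> float:
    """Draw the scale factor K ~ U(1/N_o, 8/N_o)."""
    return float(rng.uniform(1.0 / num_objects, 8.0 / num_objects))


def randomize_vertices(vertices: np.ndarray, rng: np.random.Generator,
                       factor: float = 0.4) -> np.ndarray:
    """Translate every vertex by its own random offset.

    Assumption: each coordinate moves by at most `factor` times the extent of
    the mesh along that axis; the paper only states a factor of 40%.
    """
    extent = vertices.max(axis=0) - vertices.min(axis=0)
    offsets = rng.uniform(-factor, factor, size=vertices.shape) * extent
    return vertices + offsets


def sample_thickness(rng: np.random.Generator) -> float:
    """Draw the Solidify thickness T ~ U(-0.1, 0.5)."""
    return float(rng.uniform(-0.1, 0.5))


rng = np.random.default_rng(0)

# A unit cube stands in for a low-poly mesh (8 vertices, 3 coordinates each).
cube = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])

num_objects = 20                      # N_o, hypothetical value
k = sample_scale(num_objects, rng)    # scale factor for this model
scaled = cube * k                     # assumption: uniform scaling on all axes
jittered = randomize_vertices(scaled, rng)
t = sample_thickness(rng)             # value that would be passed to Solidify

print(k, t, jittered.shape)
```
]]></preformat>
        <p>In an actual pipeline the sampled values would be handed to the corresponding Blender operations; the sketch only shows how the random parameters are drawn.</p>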
        <p>
          Textures from the Describable Textures Dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] are applied to the 3D
models as shown in Figure 4. The dataset contains 5640 textures organized into 47
categories. Textures are mapped to the different parts of the 3D models, e.g. hair,
skin, shirt, pants. This technique helps our DR approach to transcend realism by
producing unrealistic sets of randomly textured 3D models.
        </p>
        <p>The 3D models are placed in the scene by sampling positions from a standard
Gaussian mixture distribution as follows:</p>
        <p>
          <disp-formula id="eq3"><label>(3)</label><tex-math>p(x) = \sum_{i=1}^{K} \lambda_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),</tex-math></disp-formula>
        </p>
        <p>where x is the three-dimensional (x, y, z) position, λ<sub>i</sub> are the mixture component
weights, μ<sub>i</sub> are the means, and Σ<sub>i</sub> = I. The number of components K is sampled
for each scene as K ∼ U(1 + N<sub>o</sub>/20, 2 + N<sub>o</sub>/8), where U is a uniform distribution
and N<sub>o</sub> is the number of objects in the scene. The mean vectors are uniformly
sampled from the rendered area in the 3D space.</p>
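        <p>A minimal sketch of this placement step is given below. Several details are assumptions rather than facts from the paper: K is rounded to an integer, the mixture weights are taken to be equal because the text does not say how they are chosen, and the bounds of the rendered volume are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def sample_positions(num_objects: int, low: np.ndarray, high: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """Sample 3D object positions from a Gaussian mixture with identity covariances."""
    # K ~ U(1 + N_o/20, 2 + N_o/8); rounding to an integer is an assumption.
    k_low = 1.0 + num_objects / 20.0
    k_high = 2.0 + num_objects / 8.0
    num_components = max(1, int(round(float(rng.uniform(k_low, k_high)))))

    # Mean vectors are drawn uniformly from the rendered volume [low, high].
    means = rng.uniform(low, high, size=(num_components, 3))

    # Assumption: equal mixture weights lambda_i.
    weights = np.full(num_components, 1.0 / num_components)
    component = rng.choice(num_components, size=num_objects, p=weights)

    # Sigma_i = I, so a sample is its component mean plus standard normal noise.
    return means[component] + rng.standard_normal((num_objects, 3))


rng = np.random.default_rng(1)
low = np.array([-10.0, -10.0, 0.0])    # hypothetical bounds of the rendered area
high = np.array([10.0, 10.0, 5.0])
positions = sample_positions(40, low, high, rng)
print(positions.shape)
```
]]></preformat>
        <p>Objects that share a component end up close together, which is what produces clusters with occlusion alongside empty regions of background.</p>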
        <p>
          This method creates occlusion in the clusters but also produces large empty
areas where the background image is displayed. It also mimics how objects are
distributed in the real world, e.g. people are not uniformly distributed [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]; they
tend to form small groups on the street. We found that when the objects are
distributed uniformly the test mean absolute error (MAE) increased to 63.4 on
the SHT B dataset for crowd counting, compared with 23.2 using the Gaussian
mixture approach.
        </p>
        <p>
          Images from the Places2 dataset [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] are used as the background image. The
dataset contains a wide range of scenes from 365 different environments including
indoors, streets, and nature. The fact that the background images are very
different makes the task more complex but improves generalization. Depending on
the task, some image categories have been removed to avoid unlabeled instances
of the relevant objects in the background, e.g. the “stadium-football” category
when counting people or the “iceberg” category when counting penguins. Finally,
a combination of colored lights is randomly placed around the scene to produce
different exposure levels and cast shadows around the 3D objects.
        </p>
        <p>[Figure 3. 3D transformations: (a) scale; (b) randomization; (c) extrude.]</p>
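        <p>The task-dependent filtering of background categories could look roughly like the sketch below. The mapping, the function name, and the file paths are hypothetical; only the two excluded categories named in the text are listed, and the full exclusion lists are not given in the paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import Dict, List, Set

# Hypothetical mapping from counting task to background categories removed
# for that task. Only the two examples mentioned in the text are included.
EXCLUDED_CATEGORIES: Dict[str, Set[str]] = {
    "people": {"stadium-football"},
    "penguins": {"iceberg"},
}


def pick_background(images_by_category: Dict[str, List[str]], task: str,
                    rng: random.Random) -> str:
    """Pick a random background image whose category is allowed for the task."""
    excluded = EXCLUDED_CATEGORIES.get(task, set())
    allowed = sorted(
        name for name, images in images_by_category.items()
        if name not in excluded and images
    )
    if not allowed:
        raise ValueError("no background categories left for task: " + task)
    category = rng.choice(allowed)
    return rng.choice(images_by_category[category])


# Hypothetical index of background images grouped by category.
backgrounds = {
    "stadium-football": ["stadium-football/0001.jpg"],
    "iceberg": ["iceberg/0001.jpg"],
    "street": ["street/0001.jpg", "street/0002.jpg"],
}
print(pick_background(backgrounds, "people", random.Random(0)))
```
]]></preformat>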
      </sec>
      <sec id="sec-3-2">
        <title>Counting procedure</title>
        <p>
          We used the Distribution Matching for Crowd Counting [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] approach as a
baseline. The authors use Optimal Transport to measure the similarity between
the normalized predicted density map and the normalized ground truth density
map. They also include a total variation loss between the two normalized density
maps, which stabilizes training. The baseline performance is particularly good on scenes
where the density and overlapping of the objects are high. We used a ResNet50 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
model, pretrained on ImageNet LSVRC [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], as the base model for all of the
counting tasks. Horizontal flips are applied to double the amount of available
synthetic images. In addition, training images are randomly cropped into multiple
smaller images (512 × 512) to obtain more samples.
        </p>
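        <p>The two augmentations described above can be sketched with NumPy as follows. The sketch assumes that the ground truth is stored as dot annotations in (x, y) pixel coordinates, so the annotations must be transformed together with the image; the function names and the image size are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def horizontal_flip(image: np.ndarray, points: np.ndarray):
    """Mirror an image and its (x, y) dot annotations left to right."""
    width = image.shape[1]
    flipped_points = points.copy()
    flipped_points[:, 0] = (width - 1) - points[:, 0]
    return image[:, ::-1].copy(), flipped_points


def random_crop(image: np.ndarray, points: np.ndarray, size: int,
                rng: np.random.Generator):
    """Cut a random size x size window and keep only the dots inside it."""
    height, width = image.shape[:2]
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    crop = image[top:top + size, left:left + size]
    inside = (
        (points[:, 0] >= left) & (points[:, 0] < left + size)
        & (points[:, 1] >= top) & (points[:, 1] < top + size)
    )
    # Shift the surviving dots into the coordinate frame of the crop.
    return crop, points[inside] - np.array([left, top])


rng = np.random.default_rng(0)
image = np.zeros((768, 1024, 3), dtype=np.uint8)              # hypothetical render
points = rng.uniform([0.0, 0.0], [1024.0, 768.0], size=(50, 2))

flipped_image, flipped_points = horizontal_flip(image, points)
crop, crop_points = random_crop(image, points, 512, rng)
print(crop.shape, len(crop_points))  # the crop's count is the number of dots kept
```
]]></preformat>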
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>In this section we analyze the performance of our DR method and evaluate the
models trained entirely with synthetic data on real-world datasets. Two types of
experiments are conducted: 1) testing on real-world datasets; 2) analyzing the
effect of 3D transformations.</p>
      <sec id="sec-4-1">
        <title>Comparison with real-world datasets</title>
        <p>We propose a new scheme to remedy the lack of datasets in non-urban
environments. By training our model only on synthetic data we obtain good performance
on multiple real-world environments. Table 2 compares the performance of
training with real data and synthetic data.</p>
        <p>
          Table 3 compares our DR approach with Wang et al. [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] for crowd counting.
Their method is based on DA applied on images from a realistic video game.
Using real-world images to feed a GAN they improve the textures of the video
game images. DA is successfully applied to domains where it is easy to obtain
real-world images and produce synthetic data using a video game, e.g. urban
environments involving people and vehicles. Our DR approach obtains similar
results without using real-world data and with very simple rendering techniques. Whilst
the approach of [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] performs better than our approach, it should be noted that
our performance is achieved with an automatically generated synthetic dataset.
        </p>
        <p>
3D transformations increase the variability of the dataset in terms of shape,
improving the generalization to novel domains. To the best of our knowledge,
we are the first to use 3D transformations on synthetic datasets for generating
training data. Table 4 shows how 3D transformations affect performance on the
object counting task. For each experiment we generated 2k synthetic images with
the given 3D transformation.
        </p>
        <p>We also observed that when applying strong 3D transformations the training
process takes longer because the task becomes more complex.</p>
        <p>Randomizing the vertices works better on environments with objects that
can present different poses, e.g. people. The results obtained with the extrude
transform are similar to the randomize ones because it also creates small
irregularities in the shape. Extrude exhibits good performance on environments where
the objects are solid, e.g. vehicles.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we proposed a domain randomization approach for object counting
that can be easily applied to any domain. The counting model was trained only
with synthetic images and achieves good performance on different real-world
counting datasets: crowd counting, vehicle counting, penguin counting, and fruit
counting. Applying the right 3D transformations to the meshes increases the
counting accuracy when evaluating on real-world datasets. The impact of 3D
transformations depends on the nature of the object, e.g. variable pose and size.
Future work in this area will look to extend the proposed domain randomization
approach to the video domain and to use the depth information from the synthetic
data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arteta</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lempitsky</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Counting in the wild</article-title>
          .
          <source>In: European conference on computer vision</source>
          . pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Scale aggregation network for accurate and efficient crowd counting</article-title>
          .
          <source>In: Proceedings of the European Conference on Computer Vision (ECCV)</source>
          . pp.
          <fpage>734</fpage>
          -
          <lpage>750</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cimpoi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maji</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kokkinos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vedaldi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Describing textures in the wild</article-title>
          .
          <source>In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Community</surname>
            ,
            <given-names>B.O.</given-names>
          </string-name>
          :
          <article-title>Blender - a 3D modelling and rendering package</article-title>
          .
          <source>Blender Foundation</source>
          , Stichting Blender Foundation, Amsterdam (
          <year>2018</year>
          ), http://www.blender.org
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cordts</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Omran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rehfeld</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Enzweiler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benenson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franke</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiele</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The cityscapes dataset for semantic urban scene understanding</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>June 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Pcc net: Perspective crowd counting via spatial convolutional network</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Geirhos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubisch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michaelis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethge</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wichmann</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brendel</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Guerrero-Gómez-Olmedo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torre-Jiménez</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>López-Sastre</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maldonado-Bascón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oñoro-Rubio</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Extremely overlapping vehicle counting</article-title>
          .
          <source>In: Iberian Conference on Pattern Recognition and Image Analysis</source>
          . pp.
          <fpage>423</fpage>
          -
          <lpage>431</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zha</surname>
            ,
            <given-names>Z.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>DADNet: Dilated-attention-deformable ConvNet for crowd counting</article-title>
          .
          <source>In: Proceedings of the 27th ACM International Conference on Multimedia</source>
          . pp.
          <fpage>1823</fpage>
          -
          <lpage>1832</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Häni</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isler</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Minneapple: A benchmark dataset for apple detection and segmentation</article-title>
          .
          <source>IEEE Robotics and Automation Letters</source>
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <fpage>852</fpage>
          -
          <lpage>858</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gkioxari</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Mask R-CNN</article-title>
          .
          <source>In: Proceedings of the IEEE international conference on computer vision</source>
          . pp.
          <fpage>2961</fpage>
          -
          <lpage>2969</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Idrees</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saleemi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seibert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multi-source multi-scale counting in extremely dense crowd images</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>2547</fpage>
          -
          <lpage>2554</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Typical features of pedestrian spatial distribution in the inflow process</article-title>
          .
          <source>Physics Letters A</source>
          <volume>380</volume>
          (
          <issue>17</issue>
          ),
          <fpage>1526</fpage>
          -
          <lpage>1534</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Marsden</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Little</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>N.E.</given-names>
          </string-name>
          :
          <article-title>People, penguins and petri dishes: Adapting object counting models to new visual domains and object types without forgetting</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>8070</fpage>
          -
          <lpage>8079</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>V.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozakaya</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamaguchi</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Okada</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          . pp.
          <fpage>3253</fpage>
          -
          <lpage>3261</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Radau</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Connelly</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paul</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dick</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Evaluation framework for algorithms segmenting short axis cardiac MRI</article-title>
          .
          <source>The MIDAS Journal-Cardiac MR Left Ventricle Segmentation Challenge</source>
          <volume>49</volume>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rahnemoonfar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheppard</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Deep count: fruit counting based on deep simulated learning</article-title>
          .
          <source>Sensors</source>
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <elocation-id>905</elocation-id>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Redmon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Divvala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>You only look once: Unified, real-time object detection</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Djahel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A novel YOLO-based real-time people counting approach</article-title>
          .
          <source>In: 2017 International Smart Cities Conference (ISC2)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . IEEE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vineet</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koltun</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Playing for data: Ground truth from computer games</article-title>
          .
          <source>In: European conference on computer vision</source>
          . pp.
          <fpage>102</fpage>
          -
          <lpage>118</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Imagenet large scale visual recognition challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>115</volume>
          (
          <issue>3</issue>
          ),
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Tobin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Domain randomization for transferring deep neural networks from simulation to the real world</article-title>
          .
          <source>In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          . pp.
          <fpage>23</fpage>
          -
          <lpage>30</lpage>
          . IEEE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samaras</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoai</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Distribution matching for crowd counting</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Deep visual domain adaptation: A survey</article-title>
          .
          <source>Neurocomputing</source>
          <volume>312</volume>
          ,
          <fpage>135</fpage>
          -
          <lpage>153</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning from synthetic data for crowd counting in the wild</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>8198</fpage>
          -
          <lpage>8207</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Detecting smiles of young children via deep transfer learning</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision Workshops</source>
          . pp.
          <fpage>1673</fpage>
          -
          <lpage>1681</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Perspective-guided convolution networks for crowd counting</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          . pp.
          <fpage>952</fpage>
          -
          <lpage>961</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Crowd counting via scale-adaptive convolutional neural network</article-title>
          .
          <source>In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)</source>
          . pp.
          <fpage>1113</fpage>
          -
          <lpage>1121</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Costeira</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moura</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras</article-title>
          .
          <source>In: Proceedings of the IEEE international conference on computer vision</source>
          . pp.
          <fpage>3667</fpage>
          -
          <lpage>3676</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Single-image crowd counting via multi-column convolutional neural network</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>589</fpage>
          -
          <lpage>597</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapedriza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Places: A 10 million image database for scene recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>