<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unsupervised Vehicle Counting via Multiple Camera Domain Adaptation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Ciampi</string-name>
          <email>luca.ciampi@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Santiago</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joao Paulo Costeira</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Gennaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International</institution>
          ,
          <addr-line>CC BY 4.0</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Science and Technologies (ISTI)</institution>
          ,
          <addr-line>Italian National</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto Superior Técnico (LARSyS/IST)</institution>
          ,
          <addr-line>Portugal, Lisbon</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Monitoring vehicle flows in cities is crucial to improve the urban environment and quality of life of citizens. Images are the best sensing modality to perceive and assess the flow of vehicles in large areas. Current technologies for vehicle counting in images hinge on large quantities of annotated data, preventing their scalability to city-scale as new cameras are added to the system. This is a recurrent problem when dealing with physical systems and a key research area in Machine Learning and AI. We propose and discuss a new methodology to design image-based vehicle density estimators with few labeled data via multiple camera domain adaptations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Artificial Intelligence (AI) systems dedicated to analyzing and
interacting with the physical world can significantly impact
human life. These systems can process massive amounts of data and
make or suggest decisions that help solve many real-world problems
where humans are at the epicenter.</p>
      <p>Crucial examples of human-centered artificial intelligence, whose
aim is to create a better world by achieving common goals beneficial
to our societies, are city mobility, pollution monitoring, or critical
infrastructure management, where decision-makers require, for
instance, measurements about flows of bicycles, cars or people. Like
no other sensing mechanism, networks of city cameras can observe
such large dimensions and simultaneously provide visual data to AI
systems to extract relevant information from this deluge of data.</p>
      <p>Different smart cameras across the city are subject to various
visual conditions (luminance, position, context). As a result, their
performance differs from camera to camera, which makes it harder to
scale up the learning task effectively. In this paper, we address this issue
by proposing a methodology that performs unsupervised domain
adaptation among different cameras to reliably compute the number of vehicles
in a city. We focus on vehicle counting, but the approach is
applicable to counting any other type of object.</p>
    </sec>
    <sec id="sec-2">
      <title>Counting as a supervised learning task</title>
      <p>
        The counting problem is the estimation of the number of object
instances in still images or video frames [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Current systems
address the counting problem as a supervised learning process. They
fall into two main classes of methods: a) detection-based approaches
([
        <xref ref-type="bibr" rid="ref1 ref2 ref4">2, 4, 1</xref>
        ]) that try to identify and localize single instances of objects
in the image, and b) density-based techniques that rely on regression
to estimate a density map from the image, where the
final count is given by summing all pixel values [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Figure 1
illustrates the mapping of such a regression. For vehicle counting
in urban spaces, where images are of very low resolution and most
objects are partially occluded, density-based methods have a clear
advantage over detection methods [
        <xref ref-type="bibr" rid="ref15 ref3 ref6 ref8">15, 6, 8, 3</xref>
        ].
      </p>
      <p>Hinging on Convolutional Neural Networks (CNNs) to learn the
regressor, this class of approaches has proven very effective,
especially in single-camera scenarios. However, since they require
pixel-level ground truth for supervised learning, they may not generalize
well to unseen images, especially when there is a large domain gap
between the training (source) and the test (target) sets, such as
different camera perspectives, weather, or illumination. This gap severely
hampers the application of counting methods to very large-scale
scenarios, since annotating images for all the possible cases is infeasible.</p>
    </sec>
    <sec id="sec-3">
      <title>Unsupervised domain adaptation</title>
      <p>This paper proposes to generalize the counting process through a new
domain adaptation algorithm for density map estimation and
counting. Specifically, we assume that an annotated training set is available for
a source domain, and we want to adapt the system to perform well
in an unseen and unlabelled target domain. For instance, the source
domain may consist of images taken from one set of cameras, while
the target domain consists of pictures taken from different cameras,
with different luminances, perspectives, and contexts. This class of
algorithms is commonly referred to as Unsupervised Domain
Adaptation.</p>
      <p>
        We conduct preliminary experiments using the WebCamT dataset
introduced in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In particular, we consider a test set containing
images from cameras with different perspectives from the training
ones, showing that our unsupervised domain adaptation technique
can mitigate the perspective domain gap.
      </p>
      <p>
        Traditional approaches to Unsupervised Domain Adaptation have
been developed to address the problem of image classification, and
they try to align features across the two domains ([
        <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
        ]). However,
as pointed out in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], they do not perform well in other tasks, such
as semantic segmentation.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Method</title>
      <p>
        We propose an end-to-end CNN-based unsupervised domain
adaptation algorithm for traffic density estimation and counting. Inspired
by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we base our method on adversarial learning in the output
space (density maps), which contains rich information such as scene
layout and context. In our approach, we rely on the adversarial
learning scheme to make the predicted density distributions of the source
and target domains consistent.
      </p>
      <p>The proposed framework, shown in Fig. 2, consists of two
modules: 1) a CNN that predicts traffic density maps and estimates the
number of vehicles occurring in the scene, and 2) a discriminator that
determines whether the density map (received from the density map
estimator) was generated from an image of the source domain or
the target domain. In the training phase, the density map predictor
learns to map images to densities, based on annotated data from the
source domain. At the same time, it learns to fool the
discriminator by exploiting an adversarial loss, computed using the predicted
density maps of unlabeled images from the target domain. Consequently,
the output space is forced to have similar distributions for both the
source and target domains. In the inference phase, the discriminator
is discarded, and only the density map predictor is used for the target
images. A description of each module and their training is provided
in the following subsections.</p>
    </sec>
    <sec id="sec-5">
      <title>Density Estimation Network</title>
      <p>
        We formulate the counting task as a density map estimation problem
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The density (weight) of each pixel in the map depends on its
proximity to a vehicle centroid and the size of the vehicle in the
image, as shown in Fig. 1, so that each vehicle contributes a total
value of 1 to the map. Therefore, the map provides statistical information
about the vehicles’ location and allows the count to be estimated
by summing all density values.
      </p>
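      <p>This task is performed by a CNN-based model, whose goal is
to automatically determine the vehicle density map associated with
a given input image. Formally, the density map estimator,
<inline-formula><tex-math>\Psi: \mathbb{R}^{C \times W \times H} \mapsto \mathbb{R}^{W \times H}</tex-math></inline-formula>,
transforms a <italic>C</italic>-channel <inline-formula><tex-math>W \times H</tex-math></inline-formula> input image,
<italic>I</italic>, into a density map, <inline-formula><tex-math>D = \Psi(I) \in \mathbb{R}^{W \times H}</tex-math></inline-formula>.</p>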
    </sec>
    <sec id="sec-6">
      <title>Discriminator Network</title>
      <p>The discriminator network, denoted by Θ, also consists of a CNN
model. It takes as input the density map, <italic>D</italic>, estimated by the
network Ψ. Its output is a lower-resolution probability map, in which each pixel
represents the probability that the corresponding area (from the input
density map) comes from the source or the target domain. The goal
of the discriminator is to learn to distinguish between density maps
belonging to the source or target domains. This, in turn, forces the
density estimator to provide density maps with similar distributions in
both domains, i.e., the density maps, <italic>D</italic>, of the target domain have to
look realistic, even if network Ψ was not trained with an annotated
training set from that domain.</p>
    </sec>
    <sec id="sec-7">
      <title>Domain Adaptation Learning</title>
      <p>The proposed framework is trained by alternately
optimizing the density estimation network, Ψ, and the discriminator network,
Θ. Regarding the former, the training process relies on two
components: 1) density estimation using pairs of images and ground truth
density maps, which we assume are only available in the source
domain; and 2) adversarial training, which aims to make the
discriminator fail to distinguish between the source and target domains. As
for the latter, images from both domains are used to train the
discriminator to correctly classify each pixel of the probability map
as either source or target.</p>
      <p>To implement the above training procedure, we introduce two loss
functions: one is employed in the first step of the algorithm to train
network Ψ. The other is used in the second step to train the
discriminator Θ. These loss functions are detailed next.</p>
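      <p>Network Ψ Training. We formulate the loss function for Ψ as the
sum of two main components:</p>
      <p><disp-formula><label>(1)</label><tex-math>\mathcal{L}(I^S, I^T) = \mathcal{L}_{density}(I^S) + \lambda_{adv} \, \mathcal{L}_{adv}(I^T),</tex-math></disp-formula>
where <inline-formula><tex-math>\mathcal{L}_{density}</tex-math></inline-formula> is a composite loss computed using the ground truth
annotations available in the source domain, <inline-formula><tex-math>\mathcal{L}_{adv}</tex-math></inline-formula> is the
adversarial loss responsible for bringing the distributions of the target
and source domains close to each other, and <inline-formula><tex-math>\lambda_{adv}</tex-math></inline-formula> weights
the adversarial term. In particular, we define the
density loss <inline-formula><tex-math>\mathcal{L}_{density}</tex-math></inline-formula> as:</p>
      <p><disp-formula><label>(2)</label><tex-math>\mathcal{L}_{density}(I^S) = \mathcal{L}_{density\ map}(I^S) + \mathcal{L}_{regression}(I^S),</tex-math></disp-formula>
where <inline-formula><tex-math>\mathcal{L}_{density\ map}</tex-math></inline-formula> is the mean squared error between the
predicted and ground truth density maps, i.e.,
<inline-formula><tex-math>\mathcal{L}_{density\ map} = MSE(D^S, D^S_{GT})</tex-math></inline-formula>,
while <inline-formula><tex-math>\mathcal{L}_{regression}</tex-math></inline-formula> is the Euclidean loss between the
predicted and ground truth counts.</p>
      <p>To compute the adversarial loss <inline-formula><tex-math>\mathcal{L}_{adv}(I^T)</tex-math></inline-formula>, we first forward the
images belonging to the target domain and generate the predicted
density maps <inline-formula><tex-math>D^T</tex-math></inline-formula>. Then, we compute</p>
      <p><disp-formula><label>(3)</label><tex-math>\mathcal{L}_{adv}(I^T) = -\sum_{h,w} \log(\Theta(D^T)).</tex-math></disp-formula>
This loss forces the distribution of <inline-formula><tex-math>D^T</tex-math></inline-formula> to be closer to that of <inline-formula><tex-math>D^S</tex-math></inline-formula> by training
Ψ to fool the discriminator, maximizing the probability that the target
predicted density map is considered a source prediction.</p>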
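      <p>As an illustration of how the estimator loss in Eqs. (1)-(3) might be assembled, the following minimal sketch works on plain nested lists instead of network tensors. The function names, the value of lambda_adv, and the reading of the Euclidean count loss as an absolute difference between the two scalar counts are assumptions made for this example; they are not taken from the implementation described in this paper.</p>
      <preformat><![CDATA[
```python
import math


def mse(pred, target):
    """Mean squared error between two same-sized 2-D maps (lists of rows)."""
    values = [(p - t) ** 2
              for pred_row, target_row in zip(pred, target)
              for p, t in zip(pred_row, target_row)]
    return sum(values) / len(values)


def count(density):
    """A count is the sum of all density values."""
    return sum(map(sum, density))


def density_loss(pred_src, gt_src):
    """Eq. (2): density-map MSE plus a count regression term.

    Assumption: the Euclidean loss between two scalar counts is taken
    as their absolute difference.
    """
    return mse(pred_src, gt_src) + abs(count(pred_src) - count(gt_src))


def adversarial_loss(source_prob_on_target):
    """Eq. (3): minus the summed log-probability that the discriminator
    assigns to 'source' for each location of a target prediction."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(max(p, eps))
                for row in source_prob_on_target for p in row)


def estimator_loss(pred_src, gt_src, source_prob_on_target, lambda_adv=0.001):
    """Eq. (1). The default lambda_adv is a hypothetical value."""
    return (density_loss(pred_src, gt_src)
            + lambda_adv * adversarial_loss(source_prob_on_target))


# Toy inputs: a 2x2 source prediction, its ground truth, and a 1x2
# discriminator output computed on a target prediction.
pred_src = [[0.5, 0.5], [0.0, 0.0]]
gt_src = [[1.0, 0.0], [0.0, 0.0]]
probs = [[0.5, 0.5]]
total = estimator_loss(pred_src, gt_src, probs)
```
]]></preformat>
      <p>In this toy case the counts agree, so the density term reduces to the map MSE, and the adversarial term shrinks toward zero as the discriminator's source probabilities on target predictions approach 1.</p>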
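      <p>Discriminator Θ Training. Given the estimated density map
<inline-formula><tex-math>D = \Psi(I) \in \mathbb{R}^{W \times H}</tex-math></inline-formula>, we forward <italic>D</italic> to a fully-convolutional
discriminator Θ using a binary cross-entropy loss <inline-formula><tex-math>\mathcal{L}_{disc}</tex-math></inline-formula> for the two classes (i.e.,
source and target domains). We formulate the loss as:</p>
      <p><disp-formula><label>(4)</label><tex-math>\mathcal{L}_{disc}(D) = -\sum_{h,w} \left[ (1 - y) \log(\Theta(D)^{(h,w,0)}) + y \log(\Theta(D)^{(h,w,1)}) \right],</tex-math></disp-formula>
where <italic>y</italic> = 0 if the sample is taken from the target domain, and <italic>y</italic> = 1
if the sample is taken from the source domain.</p>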
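      <p>The discriminator loss of Eq. (4) can be sketched in the same spirit. The example below assumes that the discriminator output is available as a 2-D map of per-location probabilities for the source class (channel 1) and that the target-class probability (channel 0) is its complement; both the function name and this representation are assumptions of the sketch.</p>
      <preformat><![CDATA[
```python
import math


def discriminator_loss(source_prob, y):
    """Eq. (4) for a single density map.

    source_prob: 2-D map (list of rows) with the probability of the
                 'source' class at each location (channel 1). The
                 'target' probability (channel 0) is assumed to be
                 1 - source_prob.
    y:           1 if the density map comes from the source domain,
                 0 if it comes from the target domain.
    """
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for row in source_prob:
        for p1 in row:
            p0 = 1.0 - p1
            loss -= ((1 - y) * math.log(max(p0, eps))
                     + y * math.log(max(p1, eps)))
    return loss


# A discriminator that leans towards 'source' at both locations.
probs = [[0.9, 0.8]]
loss_if_source = discriminator_loss(probs, y=1)  # prediction matches the label
loss_if_target = discriminator_loss(probs, y=0)  # prediction contradicts the label
```
]]></preformat>
      <p>As expected for a cross-entropy, the loss is small when the probability map agrees with the domain label and large when it contradicts it.</p>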
    </sec>
    <sec id="sec-8">
      <title>Implementation Details</title>
      <p>
        Density Map Estimation and Counting Network. We build our
density map estimation network based on U-Net [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. U-Net is a
popular end-to-end encoder-decoder network for semantic
segmentation, first used for biomedical image segmentation. The encoder part
consists of convolution blocks, followed by max-pooling blocks that
downscale the feature representations at multiple levels. The decoder
part of the network upsamples the features through upsampling
layers followed by regular convolution operations. Furthermore, the
upsampled features are concatenated with the same-scale features from
the encoder, which contain more detailed spatial information and
prevent the network from losing spatial awareness due to
downsampling.
      </p>
      <p>[Figure 2. Diagram labels: Source Domain, Target Domain, Density Map Estimation and Counting Network, Source Prediction, Target Prediction, Discriminator, Adversarial Loss.]</p>
      <p>
        Discriminator. We use a Fully Convolutional Network similar to
[
        <xref ref-type="bibr" rid="ref11 ref9">11, 9</xref>
        ], composed of 5 convolution layers with a 4×4 kernel and a stride
of 2. The numbers of channels are {64, 128, 256, 512, 1}, respectively.
Each convolution layer is followed by a leaky ReLU with a
parameter equal to 0.2.
      </p>
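      <p>To give a feeling for how much lower the resolution of the discriminator's probability map is, the sketch below applies the standard convolution output-size formula to five layers with a 4×4 kernel and a stride of 2. The padding of 1 is an assumption, since the padding is not stated here; the helper names are hypothetical.</p>
      <preformat><![CDATA[
```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Output length of one convolution along a single dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_output_shape(width, height, num_layers=5, padding=1):
    """Spatial shape of the probability map after num_layers convolutions.

    padding=1 is an assumption of this sketch.
    """
    for _ in range(num_layers):
        width = conv_output_size(width, padding=padding)
        height = conv_output_size(height, padding=padding)
    return width, height


# A 352x240 density map, the image resolution used in the experiments.
shape = discriminator_output_shape(352, 240)
print(shape)  # (11, 7)
```
]]></preformat>
      <p>Under this padding assumption, each location of the resulting 11×7 map summarizes a region of the input density map, which matches the description of the output as a lower-resolution probability map.</p>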
    </sec>
    <sec id="sec-9">
      <title>EXPERIMENTAL SETUP</title>
      <p>
        We conduct preliminary experiments using the WebCamT dataset
introduced in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This dataset is a collection of traffic scenes recorded
using city cameras, and it is particularly challenging for analysis due
to the low resolution (352×240), high occlusion, and large
perspective. We consider a total of about 42,000 images belonging to
10 different cameras and consequently having different perspectives.
We employ the existing bounding box annotations of the dataset to
generate ground truth density maps, one for each image. In
particular, we consider one Gaussian kernel for each vehicle present
in the scene, with a mean μ equal to the center of the bounding box
surrounding the vehicle and a standard deviation σ proportional to
the length of that bounding box.
      </p>
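      <p>The ground truth generation described above could look roughly like the following sketch, which places one Gaussian kernel per bounding box and normalizes it so that each vehicle contributes a total value of 1. The proportionality factor sigma_scale, the use of the longer box side as its "length", and the normalization over the image area are assumptions of this example.</p>
      <preformat><![CDATA[
```python
import math


def gaussian_density_map(boxes, width, height, sigma_scale=0.2):
    """Build a density map from bounding boxes.

    boxes:       list of (x_min, y_min, x_max, y_max) in pixels.
    sigma_scale: hypothetical factor linking box length and sigma.
    """
    density = [[0.0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        # Mean of the kernel: center of the bounding box.
        mu_x = (x_min + x_max) / 2.0
        mu_y = (y_min + y_max) / 2.0
        # Sigma proportional to the box length (longer side, by assumption).
        length = max(x_max - x_min, y_max - y_min)
        sigma = max(sigma_scale * length, 1e-6)
        kernel = [[math.exp(-((x - mu_x) ** 2 + (y - mu_y) ** 2)
                            / (2.0 * sigma ** 2))
                   for x in range(width)]
                  for y in range(height)]
        total = sum(map(sum, kernel))
        if total == 0.0:
            continue  # kernel lies entirely outside the image
        # Normalize so that this vehicle adds exactly 1 to the map.
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density


def count_from_density(density):
    """The estimated count is the sum of all density values."""
    return sum(map(sum, density))


boxes = [(2, 2, 8, 6), (10, 4, 18, 9)]
density = gaussian_density_map(boxes, width=24, height=16)
estimated_count = count_from_density(density)
```
]]></preformat>
      <p>Summing the resulting map recovers the number of annotated boxes up to floating-point error, which is the property the density-based formulation relies on.</p>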
      <p>First, we show the domain gap that we want to address. We
generate a first pair of training and validation subsets by picking images
randomly from the whole dataset. Then, we create a second pair of
training and validation subsets, this time picking images belonging
to seven different cameras for the first and pictures belonging to the
three remaining ones for the second (per-camera splits of the whole
dataset). We show the domain gap by training our model without the
discriminator on the training subsets and comparing the results obtained
over the validation splits.</p>
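      <p>The two kinds of splits can be sketched as follows. The sample representation (a dictionary with hypothetical "camera" and "image" keys), the validation fraction of the random split, and the choice of which three cameras are held out are assumptions made only for illustration.</p>
      <preformat><![CDATA[
```python
import random


def random_split(samples, val_fraction=0.3, seed=0):
    """Random split: both subsets may contain images of every camera.

    val_fraction is a hypothetical value.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def per_camera_split(samples, val_cameras):
    """Per-camera split: validation cameras are never seen in training."""
    train = [s for s in samples if s["camera"] not in val_cameras]
    val = [s for s in samples if s["camera"] in val_cameras]
    return train, val


# Toy dataset: 10 cameras with 5 images each.
samples = [{"camera": cam, "image": "cam%d_%d.jpg" % (cam, i)}
           for cam in range(10) for i in range(5)]

rand_train, rand_val = random_split(samples)
cam_train, cam_val = per_camera_split(samples, val_cameras={7, 8, 9})
```
]]></preformat>
      <p>With the per-camera split, the sets of cameras in the two subsets are disjoint, so the validation error reflects how well the model transfers to unseen perspectives.</p>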
      <p>Having quantified this domain gap, we try to mitigate
it by conducting experiments on the per-camera splits using our
solution, i.e., the network Ψ and the discriminator Θ that acts on the
output space. In particular, during training, we also use the images
belonging to the validation subset, without their labels, to generate an
adversarial loss aimed at bringing the source domain (i.e., the training
subset) and the target domain (i.e., the validation subset) close to each
other.</p>
      <p>We base the evaluation of the models on three metrics: (i) Mean
Absolute Error (MAE) that measures the absolute count error of each
image; (ii) Mean Squared Error (MSE) that penalizes large errors
more heavily than small ones; (iii) Average Relative Error (ARE),
which measures the absolute count error divided by the true count.</p>
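      <p>For concreteness, the three metrics can be computed from lists of predicted and true counts as in the sketch below. The function names are hypothetical, and skipping images with a true count of zero in the relative error is an assumption of this example, since that case is not discussed here.</p>
      <preformat><![CDATA[
```python
def mean_absolute_error(pred, true):
    """MAE: average absolute count error per image."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mean_squared_error(pred, true):
    """MSE: average squared count error, penalizing large errors more."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def average_relative_error(pred, true):
    """ARE: average of absolute count error divided by the true count.

    Assumption: images whose true count is zero are skipped.
    """
    ratios = [abs(p - t) / t for p, t in zip(pred, true) if t > 0]
    return sum(ratios) / len(ratios)


predicted_counts = [10, 22, 5]
true_counts = [12, 20, 5]

mae = mean_absolute_error(predicted_counts, true_counts)
mse = mean_squared_error(predicted_counts, true_counts)
are = average_relative_error(predicted_counts, true_counts)
```
]]></preformat>
      <p>In this toy example, the same absolute error of 2 vehicles weighs more in the ARE for the image with 12 vehicles than for the one with 20, which is why the relative metric complements MAE and MSE.</p>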
    </sec>
    <sec id="sec-10">
      <title>RESULTS AND DISCUSSION</title>
      <p>Figure 3 (a) shows the results for the two validation sets (the random
one and the per-camera one), using the density estimation network
without the discriminator, trained on the corresponding training subsets (the
random one and the per-camera one, respectively). Each plot
corresponds to one of the three metrics. As we can see, the domain gap is
significant: even though all the subsets’ images belong to the same dataset
and are collected in the same city under similar conditions, small
changes in perspective cause a remarkable loss in performance.
In other words, the network cannot generalize well to views that have
not been seen during training.</p>
      <p>When combining the density estimation network with the
adversarial component, the performance of the system improves
considerably. These results are shown in Figure 3 (b), where the
improvements obtained using our model (red line) compared to the baseline
model, without the discriminator, are visible in all three metrics. The
discriminator mitigates the domain gap, and the network can
generalize better over images having different perspectives from the ones
employed during training. The results refer to the specific
value of <inline-formula><tex-math>\lambda_{adv}</tex-math></inline-formula> that showed the most promising results.</p>
      <p>Since all the metrics that we considered take into account only the
counting errors, we also plot some examples of the density maps
predicted by our model both with and without the discriminator.
Figure 4 shows the ground truth and the predicted density maps for
two random samples of the validation subset. As we can see, the
density maps predicted using the model with the discriminator show
a decrease in noise compared with the ones obtained using the
baseline model without the discriminator.</p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSIONS</title>
      <p>In this article, we tackle the problem of determining the density and
the number of objects present in large sets of images. Building on a
CNN-based density estimator, the proposed methodology can
generalize to new sources of data for which there is no training data
available. We achieve this generalization by adversarial learning, whereby
a discriminator attached to the output induces similar density
distribution in the target and source domains. Experiments show a
significant improvement relative to the performance of the model without
domain adaptation. Given the conventional structure of the
estimator, the improvement obtained by just monitoring the output entails
a great capacity to generalize training, thus suggesting applying
similar principles to the inner layers of the network. In our view, this
work’s surprising outcome opens new perspectives to deal with the
scalability of learning methods for large physical systems with scarce
supervisory resources.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work was partially supported by LARSyS - FCT Plurianual
funding 2020-2023 and by H2020 project AI4EU under GA 825619.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Amato</surname>
          </string-name>
          , Paolo Bolettieri, Davide Moroni, Fabio Carrara, Luca Ciampi, Gabriele Pieri, Claudio Gennaro, Giuseppe Riccardo Leone, and Claudio Vairo, '
          <article-title>A wireless smart camera network for parking monitoring'</article-title>
          ,
          <source>in 2018 IEEE Globecom Workshops (GC Wkshps)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE, (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Amato</surname>
          </string-name>
          , Luca Ciampi, Fabrizio Falchi, and Claudio Gennaro, '
          <article-title>Counting vehicles with deep learning in onboard uav imagery'</article-title>
          ,
          <source>in 2019 IEEE Symposium on Computers and Communications (ISCC)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE, (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Lokesh</given-names>
            <surname>Boominathan</surname>
          </string-name>
          , Srinivas SS Kruthiventi, and
          <string-name>
            <given-names>R Venkatesh</given-names>
            <surname>Babu</surname>
          </string-name>
          , '
          <article-title>Crowdnet: A deep convolutional network for dense crowd counting'</article-title>
          ,
          <source>in Proceedings of the 24th ACM international conference on Multimedia</source>
          , pp.
          <fpage>640</fpage>
          -
          <lpage>644</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Ciampi</surname>
          </string-name>
          , Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, and Fausto Rabitti, '
          <article-title>Counting vehicles with cameras</article-title>
          .',
          <source>in SEBD</source>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yaroslav</given-names>
            <surname>Ganin</surname>
          </string-name>
          and Victor Lempitsky, '
          <article-title>Unsupervised domain adaptation by backpropagation'</article-title>
          ,
          <source>arXiv preprint arXiv:1409.7495</source>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Guerrero-Gómez-Olmedo</surname>
          </string-name>
          , Beatriz Torre-Jiménez, Roberto López-Sastre, Saturnino Maldonado-Bascón, and Daniel Onoro-Rubio, '
          <article-title>Extremely overlapping vehicle counting'</article-title>
          ,
          <source>in Iberian Conference on Pattern Recognition and Image Analysis</source>
          , pp.
          <fpage>423</fpage>
          -
          <lpage>431</lpage>
          . Springer, (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Victor</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , '
          <article-title>Learning to count objects in images'</article-title>
          ,
          <source>in Advances in neural information processing systems</source>
          , pp.
          <fpage>1324</fpage>
          -
          <lpage>1332</lpage>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Yuhong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaofan</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and Deming Chen, '
          <article-title>Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes'</article-title>
          ,
          <source>in Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pp.
          <fpage>1091</fpage>
          -
          <lpage>1100</lpage>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Luke Metz, and Soumith Chintala, '
          <article-title>Unsupervised representation learning with deep convolutional generative adversarial networks'</article-title>
          ,
          <source>arXiv preprint arXiv:1511.06434</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Fischer</surname>
          </string-name>
          , and Thomas Brox, '
          <article-title>U-net: Convolutional networks for biomedical image segmentation'</article-title>
          ,
          <source>in International Conference on Medical image computing and computer-assisted intervention</source>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          . Springer, (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yi-Hsuan</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei-Chih</given-names>
            <surname>Hung</surname>
          </string-name>
          , Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker, '
          <article-title>Learning to adapt structured output space for semantic segmentation'</article-title>
          ,
          <source>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>7472</fpage>
          -
          <lpage>7481</lpage>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Eric</given-names>
            <surname>Tzeng</surname>
          </string-name>
          , Judy Hoffman, Kate Saenko, and Trevor Darrell, '
          <article-title>Adversarial discriminative domain adaptation'</article-title>
          ,
          <source>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>7167</fpage>
          -
          <lpage>7176</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Shanghang</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Guanhang Wu, Joao P Costeira, and Jose MF Moura, '
          <article-title>Understanding traffic density from large-scale web camera data'</article-title>
          ,
          <source>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>5898</fpage>
          -
          <lpage>5907</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Yang</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Philip David, and Boqing Gong, '
          <article-title>Curriculum domain adaptation for semantic segmentation of urban scenes'</article-title>
          ,
          <source>in Proceedings of the IEEE International Conference on Computer Vision</source>
          , pp.
          <fpage>2020</fpage>
          -
          <lpage>2030</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yingying</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Desen Zhou, Siqin Chen,
          <string-name>
            <given-names>Shenghua</given-names>
            <surname>Gao</surname>
          </string-name>
          , and Yi Ma, '
          <article-title>Single-image crowd counting via multi-column convolutional neural network'</article-title>
          ,
          <source>in Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pp.
          <fpage>589</fpage>
          -
          <lpage>597</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>