Unsupervised Vehicle Counting via Multiple Camera Domain Adaptation¹

Luca Ciampi², Carlos Santiago³, Joao Paulo Costeira³, Claudio Gennaro² and Giuseppe Amato²

¹ Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
² Institute of Information Science and Technologies (ISTI), Italian National Research Council (CNR), Pisa, Italy. Email: luca.ciampi@isti.cnr.it
³ Instituto Superior Técnico (LARSyS/IST), Lisbon, Portugal.


Abstract. Monitoring vehicle flows in cities is crucial to improving the urban environment and the quality of life of citizens. Images are the best sensing modality to perceive and assess the flow of vehicles in large areas. Current technologies for vehicle counting in images hinge on large quantities of annotated data, preventing their scalability to city scale as new cameras are added to the system. This is a recurrent problem when dealing with physical systems and a key research area in Machine Learning and AI. We propose and discuss a new methodology to design image-based vehicle density estimators with few labeled data via multiple-camera domain adaptation.

Figure 1. Example of an image with the bounding box annotations (left) and the corresponding density map that sums to the counting value (right).
1 INTRODUCTION

Artificial Intelligence (AI) systems dedicated to analyzing and interacting with the physical world can significantly impact human life. These systems can process a massive amount of data and make or suggest decisions that help solve many real-world problems where humans are at the epicenter.

Crucial examples of human-centered artificial intelligence, whose aim is to create a better world by achieving common goals beneficial to our societies, are city mobility, pollution monitoring, and critical infrastructure management, where decision-makers require, for instance, measurements of the flows of bicycles, cars, or people. Like no other sensing mechanism, networks of city cameras can observe such large areas and simultaneously provide visual data to AI systems that extract relevant information from this deluge of data.

Different smart cameras across the city are subject to different visual conditions (luminance, position, context). This results in different performances for each of them and added difficulty in effectively scaling up the learning task. In this paper, we address this issue by proposing a methodology that performs unsupervised domain adaptation among different cameras to reliably compute the number of vehicles in a city. We focus on vehicle counting, but the approach is applicable to counting any other type of object.

1.1 Counting as a supervised learning task

The counting problem is the estimation of the number of object instances in still images or video frames [7]. Current systems address the counting problem as a supervised learning process. They fall into two main classes of methods: a) detection-based approaches ([2, 4, 1]) that try to identify and localize single instances of objects in the image, and b) density-based techniques that rely on regression to estimate a density map from the image, where the final count is given by summing all pixel values [7]. Figure 1 illustrates the mapping learned by such a regression. Concerning vehicle counting in urban spaces, where images have very low resolution and most objects are partially occluded, density-based methods have a clear advantage over detection-based methods [15, 6, 8, 3].

Hinging on Convolutional Neural Networks (CNNs) to learn the regressor, this class of approaches has proven very effective, especially in single-camera scenarios. However, since these methods require pixel-level ground truth for supervised learning, they may not generalize well to unseen images, especially when there is a large domain gap between the training (source) and the test (target) sets, such as different camera perspectives, weather, or illumination. This gap severely hampers the application of counting methods to very large-scale scenarios, since annotating images for all possible cases is unfeasible.

1.2 Unsupervised domain adaptation

This paper proposes to generalize the counting process through a new domain adaptation algorithm for density map estimation and counting. Specifically, we assume we have an annotated training set for a source domain, and we want to adapt the system to perform well in an unseen and unlabelled target domain. For instance, the source domain consists of images taken from a set of cameras, while the target domain consists of pictures taken from different cameras, with different luminance, perspectives, and contexts. This class of algorithms is commonly referred to as Unsupervised Domain Adaptation.

We conduct preliminary experiments using the WebCamT dataset introduced in [13]. In particular, we consider a test set containing images from cameras with perspectives different from the training ones, showing that our unsupervised domain adaptation technique can mitigate the perspective domain gap.

Traditional approaches to Unsupervised Domain Adaptation have been developed to address the problem of image classification, and they try to align features across the two domains ([5, 12]). However, as pointed out in [14], they do not perform well in other tasks, such as semantic segmentation.
2 Proposed Method

We propose an end-to-end CNN-based unsupervised domain adaptation algorithm for traffic density estimation and counting. Inspired by [11], we base our method on adversarial learning in the output space (density maps), which contains rich information such as scene layout and context. In our approach, we rely on the adversarial learning scheme to make the predicted density distributions of the source and target domains consistent.

The proposed framework, shown in Fig. 2, consists of two modules: 1) a CNN that predicts traffic density maps and estimates the number of vehicles occurring in the scene, and 2) a discriminator that distinguishes whether the density map (produced by the density map estimator) was generated from an image of the source domain or the target domain. In the training phase, the density map predictor learns to map images to densities, based on annotated data from the source domain. At the same time, it learns to fool the discriminator by exploiting an adversarial loss, computed using the predicted density maps of unlabeled images from the target domain. Consequently, the output space is forced to have similar distributions for both the source and target domains. In the inference phase, the discriminator is discarded, and only the density map predictor is used for the target images. A description of each module and its training is provided in the following subsections.
2.1 Density Estimation Network

We formulate the counting task as a density map estimation problem [7]. The density (weight) of each pixel in the map depends on its proximity to a vehicle centroid and on the size of the vehicle in the image, as shown in Fig. 1, so that each vehicle contributes a total value of 1 to the map. The map therefore provides statistical information about the vehicles' locations and allows the count to be estimated by summing all density values.

This task is performed by a CNN-based model whose goal is to automatically determine the vehicle density map associated with a given input image. Formally, the density map estimator, $\Psi : \mathbb{R}^{C \times W \times H} \mapsto \mathbb{R}^{W \times H}$, transforms a C-channel W × H input image, I, into a density map, $D = \Psi(I) \in \mathbb{R}^{W \times H}$.
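To make this mapping concrete, the following minimal sketch (ours, assuming a PyTorch setting; it is not the authors' released code) shows how the count is obtained from a predicted density map by summing all its values:

```python
import torch

def count_from_density(density_map: torch.Tensor) -> torch.Tensor:
    """Estimate the vehicle count by integrating a predicted density map
    over its spatial dimensions. Works on a batch shaped (B, H, W)."""
    return density_map.sum(dim=(-2, -1))

# Example: two hypothetical 240 x 352 maps (the WebCamT resolution)
maps = torch.rand(2, 240, 352) * 1e-3
counts = count_from_density(maps)  # one estimated count per image
```
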
2.2 Discriminator Network

The discriminator network, denoted by Θ, also consists of a CNN model. It takes as input the density map, D, estimated by the network Ψ. Its output is a lower-resolution probability map in which each pixel represents the probability that the corresponding area of the input density map comes from the source or the target domain. The goal of the discriminator is to learn to distinguish between density maps belonging to the source and target domains. This, in turn, forces the density estimator to provide density maps with similar distributions in both domains, i.e., the density maps, D, of the target domain have to look realistic, even though network Ψ was not trained on an annotated training set from that domain.

2.3 Domain Adaptation Learning

The proposed framework is trained by alternately optimizing the density estimation network, Ψ, and the discriminator network, Θ. For the former, the training process relies on two components: 1) density estimation using pairs of images and ground truth density maps, which we assume are available only in the source domain; and 2) adversarial training, which aims to make the discriminator fail to distinguish between the source and target domains. For the latter, images from both domains are used to train the discriminator to correctly classify each pixel of the probability map as either source or target.

To implement the above training procedure, we introduce two loss functions: one is employed in the first step of the algorithm to train the network Ψ, and the other is used in the second step to train the discriminator Θ. These loss functions are detailed next.

Network Ψ Training. We formulate the loss function for Ψ as the sum of two main components:

$$\mathcal{L}(I^S, I^T) = \mathcal{L}_{density}(I^S) + \lambda_{adv}\,\mathcal{L}_{adv}(I^T), \tag{1}$$

where $\mathcal{L}_{density}$ is a composite loss computed using ground truth annotations available in the source domain, while $\mathcal{L}_{adv}$ is the adversarial loss responsible for bringing the distributions of the target and source domains close to each other. In particular, we define the density loss $\mathcal{L}_{density}$ as:

$$\mathcal{L}_{density}(I^S) = \mathcal{L}_{density\_map}(I^S) + \mathcal{L}_{regression}(I^S), \tag{2}$$

where $\mathcal{L}_{density\_map}$ is the mean squared error between the predicted and ground truth density maps, i.e., $\mathcal{L}_{density\_map} = MSE(D^S, D^S_{GT})$, while $\mathcal{L}_{regression}$ is the Euclidean loss between the predicted and ground truth counts.

To compute the adversarial loss $\mathcal{L}_{adv}(I^T)$, we first forward the images belonging to the target domain and generate the predicted density maps $D^T$. Then, we compute

$$\mathcal{L}_{adv}(I^T) = -\sum_{h,w} \log\big(\Theta(D^T)^{(h,w,1)}\big). \tag{3}$$

This loss forces the distribution of $D^T$ to be closer to that of $D^S$ by training Ψ to fool the discriminator, maximizing the probability that the predicted target density map is considered a source prediction.

Discriminator Θ Training. Given the estimated density map $D = \Psi(I) \in \mathbb{R}^{W \times H}$, we forward D through a fully-convolutional discriminator Θ using a binary cross-entropy loss $\mathcal{L}_{disc}$ for the two classes (i.e., source and target domains). We formulate the loss as:

$$\mathcal{L}_{disc}(D) = -\sum_{h,w} \left[(1-y)\log\big(\Theta(D)^{(h,w,0)}\big) + y\,\log\big(\Theta(D)^{(h,w,1)}\big)\right], \tag{4}$$

where y = 0 if the sample is taken from the target domain, and y = 1 if the sample is taken from the source domain.
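To illustrate the alternating scheme, the sketch below (our reconstruction; the names `train_step`, `psi`, and `theta` are hypothetical, and the default `lambda_adv` value is an assumption) implements one optimization step corresponding to Eqs. (1)-(4) in PyTorch. For simplicity, it uses the equivalent single-channel binary formulation, where the discriminator outputs one source/target logit per region instead of the two-channel map of Eq. (4):

```python
import torch
import torch.nn.functional as F

def train_step(psi, theta, opt_psi, opt_theta,
               img_src, gt_density_src, img_tgt, lambda_adv=0.001):
    """One alternating optimization step. `psi` maps images to density
    maps; `theta` maps a density map to a per-region source/target logit."""
    SOURCE, TARGET = 1.0, 0.0

    # Step 1: train the density estimator Psi (Eq. 1).
    opt_psi.zero_grad()
    d_src = psi(img_src)
    # Density loss (Eq. 2): MSE on the maps plus a Euclidean loss on counts.
    l_map = F.mse_loss(d_src, gt_density_src)
    l_reg = (d_src.sum(dim=(-2, -1))
             - gt_density_src.sum(dim=(-2, -1))).pow(2).mean()
    # Adversarial loss (Eq. 3): push target maps to be classified as source.
    d_tgt = psi(img_tgt)
    p_tgt = theta(d_tgt)
    l_adv = F.binary_cross_entropy_with_logits(
        p_tgt, torch.full_like(p_tgt, SOURCE))
    (l_map + l_reg + lambda_adv * l_adv).backward()
    opt_psi.step()

    # Step 2: train the discriminator Theta (Eq. 4) on detached maps.
    opt_theta.zero_grad()
    p_src = theta(d_src.detach())
    p_tgt = theta(d_tgt.detach())
    l_disc = (
        F.binary_cross_entropy_with_logits(p_src, torch.full_like(p_src, SOURCE))
        + F.binary_cross_entropy_with_logits(p_tgt, torch.full_like(p_tgt, TARGET)))
    l_disc.backward()
    opt_theta.step()
```
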
Figure 2. Algorithm overview. Given images of size C × H × W from the source and target domains, we pass them through the density map estimation and counting network to obtain output predictions. For source predictions, a density and counting loss is computed based on the source ground truth. To make target predictions closer to the source ones, we employ a discriminator that aims to distinguish whether the input (i.e., the density map) belongs to the source or target domain. An adversarial loss is then computed on the target prediction and back-propagated to the density map estimation and counting network.


2.4 Implementation Details

Density Map Estimation and Counting Network. We build our density map estimation network based on U-Net [10], a popular end-to-end encoder-decoder network for semantic segmentation, first used for biomedical image segmentation. The encoder part consists of convolution blocks followed by max-pooling blocks that downscale the feature representations at multiple levels. The decoder part of the network upsamples the features through upsampling layers followed by regular convolution operations. Furthermore, the upsampled features are concatenated with the same-scale features from the encoder, which contain more detailed spatial information and prevent the network from losing spatial awareness due to downsampling.
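A compact sketch of such an encoder-decoder (ours; the depth and channel widths are illustrative assumptions, not the exact configuration used in the paper) could look as follows:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    """Two 3x3 convolutions with ReLU, as in a standard U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class DensityUNet(nn.Module):
    """U-Net-style density estimator: max-pooled encoder blocks, an
    upsampling decoder, and skip connections that concatenate same-scale
    encoder features to preserve spatial detail."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = block(3, 64), block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = block(128, 256)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)
        self.dec2, self.dec1 = block(256 + 128, 128), block(128 + 64, 64)
        self.head = nn.Conv2d(64, 1, 1)  # one-channel density map

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.head(d1)  # (B, 1, H, W) density map
```
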
Discriminator. We use a Fully Convolutional Network similar to [11, 9], composed of 5 convolution layers with 4 × 4 kernels and a stride of 2. The numbers of channels are {64, 128, 256, 512, 1}, respectively. Each convolution layer is followed by a leaky ReLU with a parameter equal to 0.2.
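This architecture is specific enough to sketch directly; in the PyTorch snippet below (ours), the padding and the omission of an activation after the final classification layer are assumptions:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Fully convolutional source/target classifier: five 4x4 convolutions
    with stride 2 and channel widths {64, 128, 256, 512, 1}; each layer
    except the final one is followed by a leaky ReLU with slope 0.2."""
    def __init__(self, in_channels=1):
        super().__init__()
        layers, prev = [], in_channels
        for width in [64, 128, 256, 512]:
            layers += [nn.Conv2d(prev, width, kernel_size=4, stride=2,
                                 padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            prev = width
        # Final layer: a one-channel, lower-resolution logit map.
        layers.append(nn.Conv2d(prev, 1, kernel_size=4, stride=2, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, density_map):
        # density_map: (B, 1, H, W) -> per-region source/target logits
        return self.net(density_map)
```
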
3 EXPERIMENTAL SETUP

We conduct preliminary experiments using the WebCamT dataset introduced in [13]. This dataset is a collection of traffic scenes recorded using city cameras, and it is particularly challenging for analysis due to its low resolution (352 × 240), high occlusion, and large perspective variation. We consider a total of about 42,000 images belonging to 10 different cameras and consequently having different perspectives. We employ the existing bounding box annotations of the dataset to generate ground truth density maps, one for each image. In particular, we consider one Gaussian kernel for each vehicle present in the scene, with mean µ equal to the center of the bounding box surrounding the vehicle and standard deviation σ proportional to the length of that box.
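A sketch of this ground truth generation step (ours; the proportionality constant `sigma_scale` is an assumption, since the paper only states that σ is proportional to the box length) is given below:

```python
import numpy as np

def density_map_from_boxes(boxes, height, width, sigma_scale=0.3):
    """Build a ground truth density map from bounding boxes: one Gaussian
    per vehicle, centered on the box, with sigma proportional to the box
    length, normalized so each vehicle contributes a total of 1 to the map."""
    density = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sigma = max(1.0, sigma_scale * (x1 - x0))
        kernel = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        density += kernel / kernel.sum()  # each vehicle sums to 1
    return density

# Example for a 240 x 352 WebCamT frame with two annotated vehicles
gt = density_map_from_boxes([(10, 50, 60, 90), (200, 120, 260, 170)], 240, 352)
```
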
First, we demonstrate the domain gap that we want to face. We generate a first pair of training and validation subsets by picking images randomly from the whole dataset. Then, we create a second pair of training and validation subsets, this time picking images belonging to seven different cameras for the first and pictures belonging to the three remaining ones for the second (per-camera splits of the whole dataset). We expose the domain gap by training our model without the discriminator on the training subsets and comparing the results obtained over the validation splits.

Once we have quantified this domain gap, we try to mitigate it by conducting experiments on the per-camera splits using our solution, i.e., the network Ψ and the discriminator Θ that acts on the output space. In particular, during training, we also use the images belonging to the validation subset, without their labels, to generate an adversarial loss aimed at bringing the source domain (i.e., the training subset) and the target domain (i.e., the validation subset) close to each other.

We base the evaluation of the models on three metrics: (i) Mean Absolute Error (MAE), which measures the absolute count error of each image; (ii) Mean Squared Error (MSE), which penalizes large errors more heavily than small ones; and (iii) Average Relative Error (ARE), which measures the absolute count error divided by the true count.
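These three metrics can be computed directly from per-image predicted and true counts, as in the following sketch (ours):

```python
import numpy as np

def mae(pred, true):
    """Mean Absolute Error over per-image counts."""
    return np.mean(np.abs(pred - true))

def mse(pred, true):
    """Mean Squared Error: penalizes large errors more heavily."""
    return np.mean((pred - true) ** 2)

def are(pred, true, eps=1e-6):
    """Average Relative Error: absolute count error divided by true count."""
    return np.mean(np.abs(pred - true) / (true + eps))

# Example over five images
pred = np.array([23.4, 11.0, 7.8, 31.2, 18.5])
true = np.array([25.0, 10.0, 8.0, 30.0, 20.0])
print(mae(pred, true), mse(pred, true), are(pred, true))
```
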
4 RESULTS AND DISCUSSION

Figure 3 (a) shows the results for the two validation sets (the random one and the per-camera one), using the density estimation network without the discriminator trained over the two training subsets (the random one and the per-camera one, respectively). Each plot corresponds to one of the three metrics. As we can see, the domain gap is significant: even though all the subsets' images belong to the same dataset and are collected in the same city under similar conditions, small changes in perspective cause a remarkable loss in performance. In other words, the network cannot generalize well to views that have not been seen during training.

When combining the density estimation network with the adversarial component, the performance of the system improves considerably. These results are shown in Figure 3 (b), where the improvement obtained using our model (red line) over the baseline model without the discriminator is visible in all three metrics. The discriminator mitigates the domain gap, and the network generalizes better over images with perspectives different from the ones employed during training. The results correspond to a specific value of λ_adv that showed the most promising results.

Since all the metrics that we considered take into account only the counting errors, we also plot some examples of the density maps predicted by our model with and without the discriminator. Figure 4 shows the ground truth and the predicted density maps for two random samples of the validation subset. As we can see, the density maps predicted using the model with the discriminator show a decrease in noise compared with the ones obtained using the baseline model without the discriminator.

Figure 3. Performance during training: (a) comparison between the random and the per-camera validation splits, showing the domain gap; (b) comparison between the proposed approach with and without the discriminator. Each row corresponds to a specific evaluation metric.

Figure 4. Examples of predicted density maps for two samples of the validation subset (each row corresponds to a sample). From left to right: the original image, the ground truth density map, the predicted density map obtained using the model without the discriminator, and the predicted density map using our domain adaptation algorithm.
5 CONCLUSIONS

In this article, we tackle the problem of determining the density and the number of objects present in large sets of images. Building on a CNN-based density estimator, the proposed methodology can generalize to new sources of data for which no training data are available. We achieve this generalization through adversarial learning, whereby a discriminator attached to the output induces similar density distributions in the target and source domains. Experiments show a significant improvement relative to the performance of the model without domain adaptation. Given the conventional structure of the estimator, the improvement obtained by just monitoring the output suggests a great capacity to generalize training, and hints that similar principles could be applied to the inner layers of the network. In our view, this work's surprising outcome opens new perspectives for dealing with the scalability of learning methods for large physical systems with scarce supervisory resources.

ACKNOWLEDGEMENTS

This work was partially supported by LARSyS (FCT Plurianual funding 2020-2023) and by the H2020 project AI4EU under GA 825619.

REFERENCES

[1] Giuseppe Amato, Paolo Bolettieri, Davide Moroni, Fabio Carrara, Luca Ciampi, Gabriele Pieri, Claudio Gennaro, Giuseppe Riccardo Leone, and Claudio Vairo, 'A wireless smart camera network for parking monitoring', in 2018 IEEE Globecom Workshops (GC Wkshps), pp. 1–6. IEEE, (2018).
[2] Giuseppe Amato, Luca Ciampi, Fabrizio Falchi, and Claudio Gennaro, 'Counting vehicles with deep learning in onboard UAV imagery', in 2019 IEEE Symposium on Computers and Communications (ISCC), pp. 1–6. IEEE, (2019).
[3] Lokesh Boominathan, Srinivas SS Kruthiventi, and R Venkatesh Babu, 'CrowdNet: A deep convolutional network for dense crowd counting', in Proceedings of the 24th ACM International Conference on Multimedia, pp. 640–644, (2016).
[4] Luca Ciampi, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, and Fausto Rabitti, 'Counting vehicles with cameras', in SEBD, (2018).
[5] Yaroslav Ganin and Victor Lempitsky, 'Unsupervised domain adaptation by backpropagation', arXiv preprint arXiv:1409.7495, (2014).
[6] Ricardo Guerrero-Gómez-Olmedo, Beatriz Torre-Jiménez, Roberto López-Sastre, Saturnino Maldonado-Bascón, and Daniel Oñoro-Rubio, 'Extremely overlapping vehicle counting', in Iberian Conference on Pattern Recognition and Image Analysis, pp. 423–431. Springer, (2015).
[7] Victor Lempitsky and Andrew Zisserman, 'Learning to count objects in images', in Advances in Neural Information Processing Systems, pp. 1324–1332, (2010).
[8] Yuhong Li, Xiaofan Zhang, and Deming Chen, 'CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100, (2018).
[9] Alec Radford, Luke Metz, and Soumith Chintala, 'Unsupervised representation learning with deep convolutional generative adversarial networks', arXiv preprint arXiv:1511.06434, (2015).
[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, 'U-Net: Convolutional networks for biomedical image segmentation', in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, (2015).
[11] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker, 'Learning to adapt structured output space for semantic segmentation', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481, (2018).
[12] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell, 'Adversarial discriminative domain adaptation', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176, (2017).
[13] Shanghang Zhang, Guanhang Wu, Joao P Costeira, and Jose MF Moura, 'Understanding traffic density from large-scale web camera data', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5898–5907, (2017).
[14] Yang Zhang, Philip David, and Boqing Gong, 'Curriculum domain adaptation for semantic segmentation of urban scenes', in Proceedings of the IEEE International Conference on Computer Vision, pp. 2020–2030, (2017).
[15] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma, 'Single-image crowd counting via multi-column convolutional neural network', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597, (2016).