<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Traffic Density Estimation via Unsupervised Domain Adaptation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Luca</forename><surname>Ciampi</surname></persName>
							<email>luca.ciampi@isti.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies</orgName>
								<orgName type="institution">National Research Council, Pisa</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carlos</forename><surname>Santiago</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Instituto Superior Técnico (LARSyS/IST)</orgName>
								<address>
									<settlement>Lisbon</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">João</forename><forename type="middle">Paulo</forename><surname>Costeira</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Instituto Superior Técnico (LARSyS/IST)</orgName>
								<address>
									<settlement>Lisbon</settlement>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Claudio</forename><surname>Gennaro</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies</orgName>
								<orgName type="institution">National Research Council, Pisa</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Amato</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies</orgName>
								<orgName type="institution">National Research Council, Pisa</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Traffic Density Estimation via Unsupervised Domain Adaptation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AD78D85FF54F7BF032B44DB51C418A4A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Unsupervised Domain Adaptation</term>
					<term>Synthetic Datasets</term>
					<term>Deep Learning</term>
					<term>Counting Vehicles</term>
					<term>0000-0002-6985-0439 (L. Ciampi)</term>
					<term>0000-0002-4737-0020 (C. Santiago)</term>
					<term>0000-0001-6769-2935 (J. P. Costeira)</term>
					<term>0000-0002-3715-149X (C. Gennaro)</term>
					<term>0000-0003-0171-4315 (G. Amato)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Monitoring traffic flows in cities is crucial to improving urban mobility, and images are the best sensing modality to perceive and assess the flow of vehicles in large areas. However, current machine-learning-based technologies using images hinge on large quantities of annotated data, preventing their scalability to city scale as new cameras are added to the system. We propose a new methodology to design image-based vehicle density estimators that require little labeled data, via an unsupervised domain adaptation technique.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Traffic problems are constantly increasing, and tomorrow's cities can only be smart if they enable smart mobility. This issue is becoming ever more critical: congestion caused by the growing number of people traveling on road infrastructure imposes extra costs that make all activities more expensive and hamper development.</p><p>Smart mobility applications such as smart parking and road traffic management are nowadays widely employed worldwide, making our cities more livable, improving quality of life, reducing costs, and improving energy usage.</p><p>Images are probably the best sensing modality to perceive and assess the flow of vehicles in large areas. Like no other sensing mechanism, networks of city cameras can observe such large areas and simultaneously provide visual data to AI systems that extract relevant information from this deluge of data.</p><p>In this work, we propose a CNN-based system that can estimate traffic density and count the vehicles present in urban scenes directly on board smart city cameras, analyzing the images they capture. Current systems address the counting problem as a supervised learning process. They fall into two main classes of methods: a) detection-based approaches <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref> that try to identify and localize single instances of objects in the image, and b) density-based techniques that rely on regression to estimate a density map from the image, where the final count is given by summing all pixel values <ref type="bibr" target="#b3">[4]</ref>. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the mapping of such a regression. 
Concerning vehicle counting in urban spaces, where images are of low resolution and most objects are partially occluded, density-based methods have a clear advantage over detection-based methods <ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref>.</p><p>However, since this class of approaches requires pixel-level ground truth for supervised learning, it may not generalize well to unseen images, especially when there is a large domain gap between the training (source) and test (target) sets, such as different camera perspectives, weather, or illumination. The direct transfer of features learned in one domain to another does not work well because the distributions differ. Thus, a model trained on the source domain usually experiences a drastic drop in performance when applied to the target domain. This problem is commonly referred to as Domain Shift <ref type="bibr" target="#b6">[7]</ref>, and it severely hampers the application of counting methods to very large-scale scenarios, since annotating images for all possible cases is unfeasible.</p><p>To mitigate this problem, we introduce a methodology that performs Unsupervised Domain Adaptation (UDA) among different scenarios. UDA techniques address the domain shift by exploiting a labeled source dataset and an unlabeled target one. The challenge is to automatically infer some knowledge from the target data that reduces the gap between the two domains. Specifically, in this work, we propose an end-to-end CNN-based UDA algorithm for traffic density estimation and counting, based on adversarial learning performed directly on the generated density maps, i.e., in the output space, since in this specific case the output space contains valuable information such as scene layout and context. 
We focus on vehicle counting, but the approach is suitable for counting other types of objects as well.</p><p>Another contribution of this work is the creation of two new per-pixel-annotated datasets made available to the scientific community. One of them is a collection of synthetic images taken from a photo-realistic video game, where the labels are automatically assigned by interacting with the API of the graphical engine. We conducted our experiments on these two datasets and on another collection of images already present in the literature, validating our approach over different types of domain shifts: i) the Camera2Camera domain shift, where the source images belong to specific cameras and the target ones are taken from different perspectives and contexts; ii) the Day2Night domain shift, where the source domain comprises images taken during the day and the target domain pictures taken at night; iii) the Synthetic2Real domain shift, where source images are collected using a video game and automatically annotated, while the target ones are real urban pictures. Experiments show a significant improvement compared to the performance of the model without domain adaptation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The Datasets</head><p>This section describes the datasets exploited in this work, focusing mainly on the two novel datasets purposely created for it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">NDISPark Dataset</head><p>The NDISPark (Night and Day Instance Segmented Park) dataset is a small, manually annotated dataset for counting cars in parking lots, consisting of about 250 images. This dataset is challenging and describes the most difficult situations that can be found in a real scenario: seven different cameras capture the images under various weather conditions and angles of view. Furthermore, pictures are taken both during the day and at night, showing markedly different lighting conditions. The images are precisely annotated with instance segmentation labels, which allowed us to generate accurate ground-truth density maps for the counting task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">GTA Dataset</head><p>The GTA (Grand Traffic Auto) dataset is a vast collection of about 15,000 synthetic images of urban traffic scenes collected from the highly photo-realistic video game GTA V (Grand Theft Auto V). We deploy a framework that can automatically and precisely annotate the vehicles present in the scene with per-pixel annotations. To the best of our knowledge, it is the first instance-segmentation synthetic dataset of city traffic scenarios. Figure <ref type="figure" target="#fig_1">2</ref> shows some example images belonging to this dataset, together with their annotations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">WebCamT Dataset</head><p>The WebCamT dataset is a collection of traffic scenes recorded using city cameras, introduced by <ref type="bibr" target="#b5">[6]</ref>. It is particularly challenging for analysis due to the low resolution (352 × 240), high occlusion, and strong perspective distortion. We considered images belonging to different cameras, consequently having different views.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed Method</head><p>Our method relies on a CNN model trained end-to-end with adversarial learning in the output space (i.e., the density maps), which contains rich information such as scene layout and context. The peculiarity of our adversarial learning scheme is that it forces the predicted density maps in the target domain to have local similarities with the ones in the source domain.</p><p>Figure <ref type="figure" target="#fig_2">3</ref> depicts the proposed framework, consisting of two modules: 1) a CNN that predicts traffic density maps, from which we estimate the number of vehicles in the scene, and 2) a discriminator that identifies whether a density map (produced by the density map estimator) was generated from an image of the source domain or the target domain.</p><p>In the training phase, the density map predictor learns to map images to densities based on annotated data from the source domain. At the same time, it learns to predict realistic density maps for the target domain by trying to fool the discriminator with an adversarial loss. The discriminator's output is a pixel-wise classification of a low-resolution map, as illustrated in Figure <ref type="figure" target="#fig_2">3</ref>, where each pixel corresponds to a small region in the density map. Consequently, the output space is forced to be locally similar for both the source and target domains. In the inference phase, the discriminator is discarded, and only the density map predictor is used for the target images. We describe each module and how it is trained in the following subsections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Density Estimation Network</head><p>We formulate the counting task as a density map estimation problem <ref type="bibr" target="#b3">[4]</ref>. The density (intensity) of each pixel in the map depends on its proximity to a vehicle centroid and on the size of the vehicle in the image, so that each vehicle contributes a total value of 1 to the map. The map therefore provides statistical information about the vehicles' locations and allows the count to be estimated by summing all density values. This task is performed by a CNN-based model <ref type="bibr" target="#b4">[5]</ref>, whose goal is to automatically determine the vehicle density map associated with a given input image. Formally, the density map estimator, Ψ : ℛ^(𝒞×ℋ×𝒲) → ℛ^(ℋ×𝒲), transforms a 𝒲 × ℋ input image ℐ with 𝒞 channels into a density map 𝐷 = Ψ(ℐ) ∈ ℛ^(ℋ×𝒲).</p></div>
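As a concrete illustration of this formulation, a ground-truth density map with per-vehicle unit mass can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's actual pipeline (which derives per-vehicle extents from instance segmentation masks); the function name and the fixed Gaussian spreads are our own assumptions.

```python
import numpy as np

def gaussian_density_map(shape, centroids, sigmas):
    """Build a ground-truth density map: each vehicle adds a Gaussian blob
    centered on its centroid, normalized so it contributes total mass 1."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    density = np.zeros(shape)
    for (cy, cx), s in zip(centroids, sigmas):
        # sigma would be chosen from the vehicle's apparent size in the image
        blob = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * s ** 2))
        density += blob / blob.sum()  # per-vehicle normalization: sums to 1
    return density
```

Because each blob is normalized to unit mass, summing the map recovers the vehicle count, which is exactly the property the counting-by-density formulation relies on.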
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Discriminator Network</head><p>The discriminator network, denoted by Θ, also consists of a CNN model. It takes as input the density map, 𝐷, estimated by the network Ψ. Its output is a lower resolution probability map where each pixel represents the probability that the corresponding region (from the input density map) comes either from the source or the target domain. The goal of the discriminator is to learn to distinguish between density maps belonging to source or target domains. Through an adversarial loss, this discriminator will, in turn, force the density estimator to provide density maps with similar distributions in both domains. In other words, the target domain density maps have to look realistic, even though the network Ψ was not trained with an annotated training set from that domain.</p></div>
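A toy version of such a fully-convolutional discriminator can be sketched with plain NumPy strided convolutions. This is only an illustration of the input/output behavior (a density map in, a lower-resolution probability map out); the real discriminator is a CNN trained by backpropagation, and the kernel size, stride, and leaky-ReLU activation below are our assumptions.

```python
import numpy as np

def conv2d(x, k, stride=2):
    """Valid-mode strided 2D convolution for a single-channel map."""
    kh, kw = k.shape
    H, W = x.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * k)
    return out

def discriminator(density_map, kernels, stride=2):
    """Tiny fully-convolutional discriminator: stacked strided convolutions
    with leaky-ReLU, then a sigmoid, yielding a low-resolution map where
    each pixel is the probability that the region comes from the source."""
    x = density_map
    for k in kernels[:-1]:
        x = conv2d(x, k, stride)
        x = np.where(x > 0, x, 0.1 * x)  # leaky ReLU
    x = conv2d(x, kernels[-1], stride)
    return 1.0 / (1.0 + np.exp(-x))      # sigmoid -> probabilities in (0, 1)
```

Each output pixel aggregates a receptive field in the input density map, which is what makes the adversarial signal local rather than a single image-level real/fake decision.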
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Domain Adaptation Learning</head><p>The proposed framework is trained based on an alternate optimization of the density estimation network, Ψ, and the discriminator network, Θ. Regarding the former, the training process relies on two components: 1) density estimation using pairs of images and ground truth density maps, which we assume are only available in the source domain; and 2) adversarial training, which aims to make the discriminator fail to distinguish between the source and target domains. As for the latter, images from both domains are used to train the discriminator on correctly classifying each pixel of the probability map as either source or target. To implement the above training procedure, we use two loss functions: one is employed in the first step of the algorithm to train network Ψ, and the other is used in the second step to train the discriminator Θ. These loss functions are detailed next.</p><p>Network Ψ Training. We formulate the loss function for Ψ as the sum of two main components:</p><formula xml:id="formula_0">ℒ(ℐ 𝒮 , ℐ 𝒯 ) = ℒ 𝑑𝑒𝑛𝑠𝑖𝑡𝑦 (ℐ 𝒮 ) + 𝜆 𝑎𝑑𝑣 ℒ 𝑎𝑑𝑣 (ℐ 𝒯 ),<label>(1)</label></formula><p>where ℒ 𝑑𝑒𝑛𝑠𝑖𝑡𝑦 is the loss computed using ground truth annotations available in the source domain, while ℒ 𝑎𝑑𝑣 is the adversarial loss that is responsible for making the distribution of the target and the source domain closer to each other. In particular, we define the density loss ℒ 𝑑𝑒𝑛𝑠𝑖𝑡𝑦 as the mean square error between the predicted and ground truth density maps, i.e. ℒ 𝑑𝑒𝑛𝑠𝑖𝑡𝑦 = 𝑀 𝑆𝐸(𝐷 𝒮 , 𝐷 𝒮_𝒢𝒯 ).</p><p>To compute the adversarial loss ℒ 𝑎𝑑𝑣 , we first forward the images belonging to the target domain through network Ψ, to generate the predicted density maps 𝐷 𝒯 . Then, we forward 𝐷 𝒯 through network Θ, to generate the probability map 𝑃 = Θ(Ψ(ℐ 𝒯 )) ∈ [0, 1] 𝐻 ′ ×𝑊 ′ , where 𝐻 ′ &lt; 𝐻 and 𝑊 ′ &lt; 𝑊 . 
The adversarial loss is given by</p><formula xml:id="formula_1">ℒ 𝑎𝑑𝑣 (ℐ 𝒯 ) = − ∑ ℎ,𝑤 log(𝑃 ℎ,𝑤 ),<label>(2)</label></formula><p>where the subscript ℎ, 𝑤 denotes a pixel in 𝑃 . This loss makes the distribution of 𝐷 𝒯 closer to 𝐷 𝒮 by forcing Ψ to fool the discriminator, through the maximization of the probability of 𝐷 𝒯 being locally classified as belonging to the source domain.</p><p>Network Θ Training. Given an image ℐ and the corresponding predicted density map 𝐷, we feed 𝐷 as input to the fully-convolutional discriminator Θ to obtain the probability map 𝑃 . The discriminator is trained by comparing 𝑃 with the ground truth label map 𝑌 ∈ {0, 1} 𝐻 ′ ×𝑊 ′ using a pixel-wise binary cross-entropy loss</p><formula xml:id="formula_2">ℒ 𝑑𝑖𝑠𝑐 (ℐ) = − ∑ ℎ,𝑤 (1 − 𝑌 ℎ,𝑤 ) log(1 − 𝑃 ℎ,𝑤 ) + 𝑌 ℎ,𝑤 log(𝑃 ℎ,𝑤 ),<label>(3)</label></formula><p>where 𝑌 ℎ,𝑤 = 0 ∀ ℎ, 𝑤 if ℐ is taken from the target domain and 𝑌 ℎ,𝑤 = 1 otherwise.</p></div>
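The three loss terms of this training scheme (the MSE density loss, the adversarial loss of Eq. (2), and the discriminator loss of Eq. (3)) translate directly into code. The following is a minimal NumPy sketch, assuming the predicted maps and the discriminator's probability map are given; the small epsilon guarding the logarithms is our addition for numerical safety.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); not part of the formulas in the paper

def density_loss(D_pred, D_gt):
    """Source-domain term: mean squared error between predicted and GT maps."""
    return np.mean((D_pred - D_gt) ** 2)

def adversarial_loss(P_target):
    """Eq. (2): pushes target density maps to be locally classified as
    source (i.e., probabilities P close to 1), fooling the discriminator."""
    return -np.sum(np.log(P_target + EPS))

def discriminator_loss(P, from_source):
    """Eq. (3): pixel-wise binary cross-entropy with label map Y,
    where Y = 1 everywhere for source images and Y = 0 for target images."""
    Y = np.ones_like(P) if from_source else np.zeros_like(P)
    return -np.sum((1 - Y) * np.log(1 - P + EPS) + Y * np.log(P + EPS))
```

In the alternating scheme described above, Ψ would be updated on `density_loss + λ_adv * adversarial_loss` while Θ is updated on `discriminator_loss` for batches from both domains.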
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Results</head><p>We validate the proposed UDA method for density estimation and counting of traffic scenes under different settings. First, we employ the NDISPark dataset and test the Day2Night domain shift, considering pictures taken during the day as the source domain and night images as the target domain. Then, we utilize the WebCamT dataset to take into account the Camera2Camera performance gap, tackling the domain shift that takes place when we consider a camera different from the ones used during the training phase. Finally, we use the GTA dataset to assess the Synthetic2Real domain difference, training the algorithm on the synthetic images and then testing it on real data, considering the WebCamT dataset again. For all the experiments, we base the evaluation of the models on three metrics widely used for the counting task: (i) Mean Absolute Error (MAE), which measures the absolute count error of each image; (ii) Mean Squared Error (MSE), which instead quantifies the squared count error for each image; (iii) Average Relative Error (ARE), which measures the absolute count error divided by the true count. Note that, as a result of the squaring of each error, the MSE effectively penalizes large errors more heavily than small ones. The ARE, instead, is the only metric that considers the relation between the count error and the total number of vehicles present in each image. Results are summarized in Table <ref type="table" target="#tab_0">1</ref>. We achieved better results than the baseline model in all the considered scenarios and for all three metrics.</p></div>
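The three evaluation metrics can be computed as follows (a straightforward NumPy sketch; the helper name is ours):

```python
import numpy as np

def counting_metrics(pred_counts, true_counts):
    """MAE, MSE, and ARE over per-image predicted and true vehicle counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    err = pred - true
    mae = np.mean(np.abs(err))         # Mean Absolute Error
    mse = np.mean(err ** 2)            # Mean Squared Error (penalizes outliers)
    are = np.mean(np.abs(err) / true)  # Average Relative Error (error vs. true count)
    return mae, mse, are
```

Note how the same absolute error of two vehicles yields a different ARE depending on how crowded the image is, which is why ARE complements MAE and MSE.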
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>In this article, we tackled the problem of determining the density and the number of objects present in large sets of images. Building on a CNN-based density estimator, the proposed methodology can generalize to new data sources for which no annotations are available. We achieved this generalization by exploiting an Unsupervised Domain Adaptation strategy, whereby a discriminator attached to the output forces similar density distributions in the target and source domains. Experiments show a significant improvement relative to the performance of the model without domain adaptation. To the best of our knowledge, we are the first to introduce a UDA scheme for counting that reduces the gap between the source and the target domain without using additional labels. Given the conventional structure of the estimator, the improvement obtained by monitoring just the output entails a great capacity to generalize learned knowledge, thus suggesting the application of similar principles to the inner layers of the network. Another contribution is the creation of two new per-pixel-annotated datasets made available to the scientific community. One of them is a synthetic dataset created from a photo-realistic video game, where the labels are automatically assigned by interacting with the API of the graphical engine. 
Using this synthetic dataset, we demonstrated that it is possible to train a model with a precisely annotated and automatically generated synthetic dataset and perform UDA toward a real-world scenario, obtaining very good performance without using additional manual annotations.</p><p>In our view, this work's outcome opens new perspectives to deal with the scalability of learning methods for large physical systems with scarce supervisory resources.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of an image with the bounding box annotations (left) and the corresponding density map that sums to the counting value (right).</figDesc><graphic coords="2,298.24,84.19,88.01,88.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Some examples of images of our Grand Traffic Auto dataset, together with the automatically generated instance segmentation annotations.</figDesc><graphic coords="4,128.18,148.93,104.17,61.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Algorithm overview.Given 𝐶 × 𝐻 × 𝑊 images from source and target domains, we pass them through the density map estimation network to obtain output predictions. A density loss is computed for source predictions based on the ground truth. In order to improve target predictions, a discriminator is used to locally classify whether a density map belongs to the source or target domain. Then, an adversarial loss is computed on the target prediction and is back-propagated to the density map estimation and counting network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Experimental results obtained for the three considered domain shifts in terms of MAE, MSE, and ARE. We achieved performance improvements in all scenarios, considering all three metrics.</figDesc><table><row><cell></cell><cell>MAE</cell><cell>MSE</cell><cell>ARE</cell></row><row><cell cols="3">Day2Night Domain Shift -NDISPark Dataset</cell><cell></cell></row><row><cell>Baseline -CSRNet [5]</cell><cell>3.95</cell><cell>27.45</cell><cell>0.43</cell></row><row><cell>Our Approach</cell><cell>3.49</cell><cell>20.90</cell><cell>0.39</cell></row><row><cell cols="3">Camera2Camera Domain Shift -WebCamT Dataset [6]</cell><cell></cell></row><row><cell>Baseline -CSRNet [5]</cell><cell>3.24</cell><cell>16.83</cell><cell>0.21</cell></row><row><cell>Our Approach</cell><cell>2.86</cell><cell>13.03</cell><cell>0.19</cell></row><row><cell cols="3">Synthetic2Real Domain Shift -GTA Dataset</cell><cell></cell></row><row><cell>Baseline -CSRNet [5]</cell><cell>4.10</cell><cell>25.83</cell><cell>0.28</cell></row><row><cell>Our Approach</cell><cell>3.88</cell><cell>23.80</cell><cell>0.27</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was partially supported by H2020 project AI4EU under GA 825619.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Counting vehicles with deep learning in onboard UAV imagery</title>
		<author>
			<persName><forename type="first">G</forename><surname>Amato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ciampi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Falchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gennaro</surname></persName>
		</author>
		<idno type="DOI">10.1109/ISCC47284.2019.8969620</idno>
		<ptr target="https://doi.org/10.1109/ISCC47284.2019.8969620" />
	</analytic>
	<monogr>
		<title level="m">2019 IEEE Symposium on Computers and Communications, ISCC 2019</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019-07-03">June 29 - July 3, 2019</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Counting vehicles with cameras</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ciampi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Amato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Falchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gennaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rabitti</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-2161/paper12.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th Italian Symposium on Advanced Database Systems</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Bergamaschi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">D</forename><surname>Noia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Maurino</surname></persName>
		</editor>
		<meeting>the 26th Italian Symposium on Advanced Database Systems<address><addrLine>Castellaneta Marina (Taranto), Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">June 24-27, 2018</date>
			<biblScope unit="volume">2161</biblScope>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A wireless smart camera network for parking monitoring</title>
		<author>
			<persName><forename type="first">G</forename><surname>Amato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bolettieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Carrara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ciampi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gennaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">R</forename><surname>Leone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vairo</surname></persName>
		</author>
		<idno type="DOI">10.1109/GLOCOMW.2018.8644226</idno>
		<ptr target="https://doi.org/10.1109/GLOCOMW.2018.8644226" />
	</analytic>
	<monogr>
		<title level="m">IEEE Globecom Workshops, GC Wkshps 2018</title>
				<meeting><address><addrLine>Abu Dhabi, United Arab Emirates</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">December 9-13, 2018</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Learning to count objects in images</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">S</forename><surname>Lempitsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2010/hash/fe73f687e5bc5280214e0486b273a5f9-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Shawe-Taylor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Zemel</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Culotta</surname></persName>
		</editor>
		<meeting><address><addrLine>Vancouver, British Columbia, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2010-12-09">December 6-9, 2010</date>
			<biblScope unit="page" from="1324" to="1332" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00120</idno>
		<ptr target="http://openaccess.thecvf.com/content_cvpr_2018/html/Li_CSRNet_Dilated_Convolutional_CVPR_2018_paper.html" />
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018</title>
				<meeting><address><addrLine>Salt Lake City, UT, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2018-06-18">June 18-22, 2018</date>
			<biblScope unit="page" from="1091" to="1100" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Understanding traffic density from large-scale web camera data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Costeira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M F</forename><surname>Moura</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2017.454</idno>
		<ptr target="http://doi.ieeecomputersociety.org/10.1109/CVPR.2017.454" />
	</analytic>
	<monogr>
		<title level="m">2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017</title>
				<meeting><address><addrLine>Honolulu, HI, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2017">July 21-26, 2017</date>
			<biblScope unit="page" from="4264" to="4273" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Unbiased look at dataset bias</title>
		<author>
			<persName><forename type="first">A</forename><surname>Torralba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Efros</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2011.5995347</idno>
		<ptr target="https://doi.org/10.1109/CVPR.2011.5995347" />
	</analytic>
	<monogr>
		<title level="m">The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011</title>
				<meeting><address><addrLine>Colorado Springs, CO, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2011-06-25">June 20-25, 2011</date>
			<biblScope unit="page" from="1521" to="1528" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
