Global-Local Feature Fusion for Image Classification of Flood Affected Roads from Social Multimedia

Benjamin Bischke 1,2, Patrick Helber 1,2, Andreas Dengel 1,2
1 TU Kaiserslautern, Germany
2 German Research Center for Artificial Intelligence (DFKI), Germany

Copyright held by the owner/author(s).
MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This paper presents the solution of the DFKI team for the Multimedia Satellite Task 2018 at MediaEval. We address the challenge of classifying social multimedia with respect to road passability during flooding events. Information about road passability is an important aspect of emergency response and has not been well studied in the past. We primarily investigate visual classification based on global, local, and global-local fused image features. We show that local features of objects can be used efficiently for road passability classification and achieve results with local features that are similar to those of global features. When we fused global and local visual features, we did not achieve a significant improvement over global features alone, but we see a lot of potential for future research in this direction.

1 INTRODUCTION
The Multimedia Satellite Task 2018 [3] continues the focus on flooding events of last year's Task 2017 [2], since, among high-impact natural disasters, flooding events represent, according to the United Nations Office for the Coordination of Humanitarian Affairs, the most common type of disaster worldwide. The task looks at road passability, namely whether or not it is possible to travel through a flooded region. This work focuses on social multimedia and is based on the benchmark dataset, which contains 7,387 tweets with accompanying images and labels for evidence of road passability as well as for the actual road passability (passable vs. non passable).

Figure 1: Local image features of objects provide strong evidence for the classification of road passability (left: passable, right: non passable).

2 APPROACH
Our solution for classifying tweets with respect to road passability follows a two-step approach. We first categorize all images that provide evidence for road passability during a flooding event and then classify the relevant images with respect to road passability. Our approach is based only on the visual modality, since we could not obtain any meaningful results by taking the metadata of tweets (e.g., text, location) into consideration.

2.1 Evidence classification of flood passability
The approach for the evidence classification of images relies on last year's solution [1] for the Multimedia Satellite Task 2017 [2]. The goal of that challenge was to retrieve all images from a Flickr dataset that provide evidence of a flooding event. We applied a pre-trained CNN to obtain the feature representation of images and used an SVM with a radial basis function (RBF) kernel as classifier. One important insight of our approach was the importance of the dataset on which the network was pre-trained. We achieved a significant improvement when relying on a network that was trained on scene-level information rather than on object classes as in the ImageNet dataset. Building upon this approach, we evaluated models pre-trained on different datasets containing scene-level and object-level classes for the visual classification of flood passability evidence. We achieved the best results on our internal validation set with features extracted from a Wide-ResNet38 pre-trained on Places365 [8], an improvement of 3% over the features of a ResNet152 pre-trained on ImageNet [6]. These findings are in line with the insights from last year's solution [1].
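To make this pipeline concrete, the following is a minimal sketch in PyTorch and scikit-learn. It illustrates the feature-plus-SVM approach under assumptions, not our exact setup: the torchvision ResNet152 stands in for the various pre-trained backbones we compared, and train_paths and train_labels are hypothetical placeholder names for the development data.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import SVC

# Pre-trained backbone with the classification head removed, so the
# forward pass yields the penultimate (global) feature vector.
backbone = models.resnet152(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature(path):
    """Return one global feature vector per image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(x).squeeze(0).numpy()

# train_paths / train_labels are placeholders for the development set.
X_train = [extract_feature(p) for p in train_paths]
clf = SVC(kernel="rbf")          # RBF-kernel SVM as described above
clf.fit(X_train, train_labels)   # binary: evidence vs. no evidence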
2.2 Flood passability image classification
In this paper we investigate three strategies for the road passability classification of images. We use an SVM (RBF kernel) as classifier and visual features based on the following approaches:
(1) Global features of CNNs pre-trained on Places365 [8], ImageNet [6] and the Visual Sentiment Ontology (VSO) [4]
(2) Local features of objects extracted with a Faster R-CNN [7] pre-trained on Pascal VOC [5]
(3) Fusion of global and local features

Global Features. We follow the same approach as described in Section 2.1 and extract global image features with pre-trained CNNs. We analyzed models pre-trained on the ImageNet [6], Places365 [8] and VSO [4] datasets and obtained the best results with scene-level features (VSO and Places365) on the internal validation set (see Table 1).

Table 1: Road passability classification based on global, local and fused features on the internal validation set

Approach                                             F1 score
ImageNet ResNet101 (global)                          73.29%
ImageNet ResNet152 (global)                          75.98%
VSO X-ResNet50 Adjective (global)                    79.16%
VSO X-ResNet50 Noun (global)                         78.97%
Places365 Wide-ResNet38 (global)                     77.31%
Local features based on Faster R-CNN                 77.24%
Places365 Wide-ResNet38 (global) + local features    78.60%
VSO X-ResNet50 Adjective (global) + local features   78.45%

Local Features. In our second strategy we investigated local image features. Our hypothesis is that local features corresponding to objects and their surrounding context, such as the cars, persons and traffic signs shown in Figure 1, provide strong evidence for the discrimination of road passability. We trained the object detection network Faster R-CNN [7] on the Pascal VOC dataset [5] and applied it to the images of the provided Twitter dataset. Whenever Faster R-CNN identified an instance of one of the classes C = {bus, boat, person, car}, we cropped a small patch out of the image based on the bounding box of the particular object. We combined the patches of the bus, car and boat classes into one dataset and resized all patches to the same size of 224x224 pixels. These three classes covered 45% of the images in the development set with at least one object. Based on the created dataset, we trained a CNN with two road passability classes that follows the same architecture as LeNet, with a kernel size of 7x7 on all convolutional layers. In case Faster R-CNN detected multiple objects in an image, we followed a late-fusion approach in which we calculated the mean of the predictions and mapped values above 0.5 to the passable road class. A sketch of this patch pipeline is given at the end of this subsection.

We also tried to classify the 3,275 patches belonging to objects of the person class, but our classifier was not significantly better than random guessing. By visual inspection we noticed that there is a lot of variation in the image patches of persons, which also made it very difficult for us to classify single patches with respect to road passability. The dataset contains, for example, images with persons being fully visible (evidence of passability) and at the same time persons walking in hip-deep water (evidence of no passability). Since we were not able to achieve sufficient results for patches of persons, we suppressed this class in our current approach and leave it open for future research.
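The following sketch illustrates the local-feature pipeline under stated assumptions: torchvision's COCO-trained Faster R-CNN stands in for our VOC-trained detector (hence the COCO class ids), PatchNet only approximates the LeNet-style classifier with 7x7 kernels described above, and the helper name local_prediction is illustrative.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms.functional as TF

# Off-the-shelf detector as a stand-in for the VOC-trained Faster R-CNN.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()
KEPT = {3: "car", 6: "bus", 9: "boat"}  # COCO ids; a VOC model uses different ids

class PatchNet(nn.Module):
    """LeNet-style binary classifier with 7x7 kernels on 224x224 patches."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))

patch_net = PatchNet().eval()  # assumed to be trained on the patch dataset

def local_prediction(img_tensor, det_thresh=0.5):
    """Mean passability score over all object patches, or None if no
    relevant object was detected; values above 0.5 map to 'passable'."""
    scores = []
    with torch.no_grad():
        det = detector([img_tensor])[0]
        for box, label, conf in zip(det["boxes"], det["labels"], det["scores"]):
            if label.item() in KEPT and conf >= det_thresh:
                x1, y1, x2, y2 = box.int().tolist()
                patch = TF.resize(img_tensor[:, y1:y2, x1:x2], [224, 224])
                scores.append(patch_net(patch.unsqueeze(0)).item())
    return sum(scores) / len(scores) if scores else None

An image with at least one detected bus, car or boat thus receives the mean of its patch predictions, which is exactly the late-fusion rule above.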
Global-Local Feature Fusion. In our third strategy, we combined global and local features. We extracted the global features as described in Section 2.1 and appended to this vector the prediction based on the local features from Section 2.2. In case no local feature could be extracted from an image, we appended a special label to the vector. The resulting feature vector was classified with an SVM (RBF kernel) as described in Section 2.1.
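A minimal sketch of this fusion follows, reusing the hypothetical helpers extract_feature and local_prediction from the previous sketches; the sentinel value NO_LOCAL for images without detected objects is an illustrative choice, not the exact label we used.

import numpy as np
from sklearn.svm import SVC

NO_LOCAL = -1.0  # assumed sentinel for "no local feature extractable"

def fused_vector(path, img_tensor):
    g = extract_feature(path)         # global CNN feature (Section 2.1)
    p = local_prediction(img_tensor)  # mean patch score or None (Section 2.2)
    return np.concatenate([g, [NO_LOCAL if p is None else p]])

# train_paths / train_tensors / passability_labels are placeholders.
X = np.stack([fused_vector(p, t) for p, t in zip(train_paths, train_tensors)])
fusion_clf = SVC(kernel="rbf")
fusion_clf.fit(X, passability_labels)  # passable vs. non passable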
3 EXPERIMENTS AND RESULTS
We first evaluated our three approaches on the internal validation set. Table 1 shows the results for the classification of road passability using the F1-score as metric. In the table, we can see that (1) for global features the best results are achieved with scene-level features from the VSO, followed by Places365 and then ImageNet. (2) The classification using local features, with an F1-score of 77.24%, performed similarly to the global features (which range between 73.29% and 79.16%) and better than the features extracted from ImageNet pre-trained models. However, it is worth mentioning that this comparison is not completely fair, since the dataset used for the local features was smaller, as not every image contained local features. (3) For the global-local feature fusion, we see a small improvement when using the features of the Places365 Wide-ResNet38 and a decrease in performance when using the VSO X-ResNet50 Adjective features.

The final results on the private test set are shown in Table 2. Run 1 gives the results for the global feature VSO X-ResNet50 Adjective, run 2 for the same feature fused with the local predictions, and run 3 for the features of the Places365 Wide-ResNet38 fused with the local predictions. The official metric is the average of the F1-scores of (C1) images with evidence and passable roads and (C2) images with evidence and non passable roads. In the table, we can see that the fusion of local and global feature information slightly decreased the results for the VSO X-ResNet50 Adjective feature, whereas for the Places365 Wide-ResNet38 feature the opposite can be observed, similar to the internal validation set. The results show that the global-local fusion neither significantly improved nor worsened the results. We believe that an improvement can be achieved with a more sophisticated fusion strategy and better local features. The resizing of the extracted patches of different objects to 224x224 pixels could have a negative influence on the classification, since the aspect ratio of the objects can get distorted. A deeper network for classifying the local patches could additionally improve the results.

Table 2: Results on the private test set for the F1-score of evidence vs. no evidence for passability (row 1) and the average F1-score of evidence passable and evidence non passable (row 2)

                             Run 1     Run 2     Run 3
evidence vs. no evidence     87.70%    87.70%    87.70%
passable vs. non passable    65.21%    64.96%    66.48%

4 CONCLUSION
In this paper, we presented our approach for the Multimedia Satellite Task 2018 at MediaEval. In line with previous research [1], we again observed the advantages of scene-level features over object-related features when classifying images with respect to road passability. We could confirm our hypothesis in this work and showed that local features of a few object instances corresponding to classes in Pascal VOC can be used for the visual classification of road passability. We achieved results with local features that are similar to those of global features and see a lot of potential to improve our current approach.

When we fused global and local visual features, we did not achieve a significant improvement over global features alone. We strongly believe that better and more general local features, which can be extracted from more than only half of the images, play an important role in this context. We will continue this work with additional classes that are not covered by the Pascal VOC dataset. One direction would be to extract semantic segmentation classes from a model pre-trained on the Cityscapes dataset. This dataset contains additional classes, such as traffic signs and poles, that could be important for road passability classification as well.

ACKNOWLEDGMENTS
The authors would like to thank NVIDIA for support within the NVAIL program. Additionally, this work was supported by the BMBF project DeFuseNN (01IW17002).

REFERENCES
[1] Benjamin Bischke, Prakriti Bhardwaj, Aman Gautam, Patrick Helber, Damian Borth, and Andreas Dengel. 2017. Detection of flooding events in social multimedia and satellite imagery using deep neural networks. In Proceedings of the Working Notes of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Venkat Srinivasan, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2018 Workshop (Oct. 29-31, 2018). Sophia Antipolis, France.
[4] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. 2013. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 223-232.
[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580-587.
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91-99.
[8] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2018. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2018), 1452-1464.