=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_15
|storemode=property
|title=CNN and GAN Based Satellite and Social Media Data Fusion for Disaster Detection
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_15.pdf
|volume=Vol-1984
|authors=Kashif Ahmad,Konstantin Pogorelov,Michael Riegler,Nicola Conci,Pål Halvorsen
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AhmadPRCH17
}}
==CNN and GAN Based Satellite and Social Media Data Fusion for Disaster Detection==
Kashif Ahmad (1), Konstantin Pogorelov (2), Michael Riegler (2), Nicola Conci (1), Pål Halvorsen (2)

(1) DISI, University of Trento, Italy
(2) Simula Research Laboratory, Oslo, Norway

kashif.ahmad@unitn.it, konstantin@simula.no, michael@simula.no, nicola.conci@unitn.it, paalh@ifi.uio.no

ABSTRACT

This paper presents the methods proposed by team UTAOS for the MediaEval 2017 Multimedia Satellite task. For the first sub-task, we mainly rely on features from different Convolutional Neural Network (CNN) models combined with two different late fusion methods, and we additionally exploit the information available in the form of meta-data. Our best run achieves a precision at cut-off 480 of 84.94% and a mean precision over different cut-offs of 95.11%. For the second sub-task, we utilize a Generative Adversarial Network (GAN); the mean Intersection-over-Union (IoU) of our best run is 0.8315.

1 INTRODUCTION

Linking social media information to remotely sensed data holds large possibilities for society and research [1-3]. The Multimedia Satellite task at MediaEval 2017 [4] aims to integrate information from both sources, satellite data and social media, to provide a better overview of a disaster. This paper provides a detailed description of the methods developed by the UTAOS team for this task. The challenge consists of two sub-tasks: (i) Disaster Image Retrieval from Social Media (DIRSM) and (ii) Flood Detection in Satellite Images (FDSI).

[Figure 1: Block diagram of the proposed methodology for the DIRSM task.]

2 PROPOSED APPROACH

2.1 Methodology for DIRSM Task

To tackle sub-task (i), we rely on Convolutional Neural Network (CNN) features. In detail, we first extract CNN features from seven different models based on state-of-the-art architectures pre-trained on the ImageNet [5] and Places [14] datasets. These models include AlexNet [8] (pre-trained on both ImageNet and Places), GoogleNet [12] (pre-trained on ImageNet), VGGNet-19 [10] (pre-trained on both ImageNet and Places), and ResNet [7] in configurations with 50, 101 and 152 layers. For feature extraction from AlexNet and VGGNet we use the Caffe toolbox (http://caffe.berkeleyvision.org/), while for GoogleNet and ResNet we use VLFeat MatConvNet (http://www.vlfeat.org/matconvnet/).

All in all, we extract eight feature vectors from four different network architectures for the same image. AlexNet and VGGNet provide feature vectors of size 4096, while GoogleNet and ResNet provide feature vectors of size 1024 and 2048, respectively. Subsequently, the extracted features are fed into ensembles of Support Vector Machines (SVMs), which provide classification scores in terms of posterior probabilities. We also consider the user's tags, the date taken and the GPS information from the available meta-data; for the meta-data we rely on the Random Tree classifier provided by the WEKA toolbox [6]. Finally, the classification scores obtained from the Random Trees and the SVMs trained on meta-data and visual features are combined using late fusion. For the late fusion we propose two different methods: (i) a fusion scheme inspired by the Induced Ordered Weighted Averaging (IOWA) operators of Yager et al. [13], and (ii) Particle Swarm Optimization (PSO). Figure 1 provides a block diagram of the proposed methodology for the DIRSM task.
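To make the PSO-based fusion concrete, the sketch below shows one way the posterior scores of the individual models could be combined with a weight vector tuned by a small particle swarm on held-out labels. This is a minimal illustration under our own assumptions (array shapes, swarm hyper-parameters, and an accuracy-based fitness), not the authors' implementation.

```python
# Minimal sketch of PSO-tuned late fusion of classifier posteriors.
# All names, shapes and hyper-parameters here are illustrative assumptions.
import numpy as np

def pso_fusion_weights(scores, labels, n_particles=30, n_iter=100, seed=0):
    """scores: (n_models, n_samples) posterior probabilities of the positive class.
    labels: (n_samples,) binary ground truth. Returns normalized fusion weights."""
    rng = np.random.default_rng(seed)
    n_models = scores.shape[0]

    def fitness(w):
        fused = w @ scores                                   # weighted sum of posteriors
        return np.mean((fused >= 0.5 * w.sum()) == labels)   # accuracy of the fused decision

    pos = rng.random((n_particles, n_models))                # particle positions = weight vectors
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_fit)].copy()

    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmax(pbest_fit)].copy()

    return gbest / gbest.sum()                               # weights summing to one
```

In a fused run of this kind, the ranked image list would then simply be obtained by sorting images according to the weighted score.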
2.2 Methodology for FDSI Task

For sub-task (ii), we started with a visual analysis of the provided development set. We observed that it is not possible to use any existing open-source framework directly, due to the nature of the provided satellite data. In particular, the four-channel 16-bit TIFF file format used is too specific and cannot be correctly processed, or even viewed, by existing libraries.

To perform the visual analysis, we therefore developed conversion code that transforms each geo-TIFF into a pair of images: RGB and infrared (IR). For the RGB image we use a per-three-channel normalization that maps all R, G and B pixel values of the input geo-image into the standard 0-255 range. The normalization coefficients are the same for all three channels in order to preserve the real color balance even when one of the components has low variation. The IR component is normalized separately:

\[ rgb_{\min} = \min\Big(\min_{i \in R} r_i,\; \min_{i \in G} g_i,\; \min_{i \in B} b_i\Big), \qquad rgb_{\max} = \max\Big(\max_{i \in R} r_i,\; \max_{i \in G} g_i,\; \max_{i \in B} b_i\Big) \]

\[ ir_{\min} = \min_{k \in IR} ir_k, \qquad ir_{\max} = \max_{k \in IR} ir_k \]

\[ \{r|g|b\}^{*}_{i} = \frac{(\{r|g|b\}_{i} - rgb_{\min}) \cdot 255}{rgb_{\max} - rgb_{\min}} \;\; \forall i \in \{R|G|B\}, \qquad ir^{*}_{k} = \frac{(ir_{k} - ir_{\min}) \cdot 255}{ir_{\max} - ir_{\min}} \;\; \forall k \in IR \]
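The same normalization can be written in a few lines of numpy. The sketch below assumes the geo-image has already been loaded as an (H, W, 4) array with channels ordered R, G, B, IR; that layout and the function name are our assumptions, not something specified in the paper.

```python
# Sketch of the geo-TIFF -> (RGB, IR) conversion described above.
# The channel order and array layout are assumptions for illustration.
import numpy as np

def geo_image_to_rgb_ir(geo):
    """geo: (H, W, 4) array with channels R, G, B, IR. Returns 8-bit RGB and IR images."""
    rgb = geo[..., :3].astype(np.float64)
    ir = geo[..., 3].astype(np.float64)

    # A single min/max pair shared by R, G and B preserves the color balance.
    rgb_min, rgb_max = rgb.min(), rgb.max()
    rgb_8bit = ((rgb - rgb_min) * 255.0 / (rgb_max - rgb_min)).astype(np.uint8)

    # The IR channel is normalized independently of the color channels.
    ir_min, ir_max = ir.min(), ir.max()
    ir_8bit = ((ir - ir_min) * 255.0 / (ir_max - ir_min)).astype(np.uint8)

    return rgb_8bit, ir_8bit
```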
Moreover, we performed a human-expert-driven visual analysis of the images and found them all to be low-contrast, blurry and limited in color range. Based on our previous experience [9], we decided to use a generative adversarial network (GAN; see http://en.wikipedia.org/wiki/Generative_adversarial_networks). GANs are a class of machine learning models, typically used in unsupervised settings, in which two neural networks contest with each other in a zero-sum game framework.

As the basis for our method, we selected a network architecture originally used for retinal vessel segmentation in fundoscopic images with generative adversarial networks (V-GAN, https://bitbucket.org/woalsdnd/v-gan). The V-GAN architecture is designed [11] for processing retinal images, which have comparable visual properties, and provides the required output in the form of one-class image segmentation masks.

V-GAN is implemented in Python on top of Keras with a GPU-enabled TensorFlow back-end. We modified the network architecture by changing the top-layer configuration in order to support the four-channel floating-point input of the geo-images. The final generator output layer, which produces the probabilistic segmentation image, was extended with a simple threshold activation layer to generate the binary segmentation map.

First, we performed experiments on the development set and found that the modified V-GAN is able to segment the provided satellite images, but the estimated performance metrics were below the expected level. Additional visual analysis of the converted RGB and IR images showed that the IR component of the source geo-images is sometimes unrelated to the flooded areas, which probably biased our GAN during training and prevented it from learning the correct properties of the flooded areas. We therefore decided to exclude the IR component from the model input and to process only the RGB components of the converted, normalized geo-images. This resulted in a significant performance improvement and correct segmentation of most of the development-set flooding areas, except for some images taken under uncommon lighting or cloudy conditions.
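The threshold activation step can be pictured as a thin wrapper that appends a fixed binarization layer to a generator which already outputs a per-pixel probability map. The snippet below is only a schematic sketch of that idea; the wrapper function, its name and the default threshold value are our assumptions, not the actual modification applied to V-GAN.

```python
# Schematic sketch: appending a binarization threshold to a probabilistic
# segmentation generator. The generator model is assumed to exist already.
from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def with_threshold_output(generator, threshold=0.5):
    """Wrap a generator that outputs per-pixel probabilities so that it
    emits a binary segmentation map instead."""
    inp = Input(shape=generator.input_shape[1:])
    prob_map = generator(inp)  # per-pixel probabilities in [0, 1]
    binary_map = Lambda(lambda p: K.cast(K.greater(p, threshold), 'float32'))(prob_map)
    return Model(inputs=inp, outputs=binary_map)
```

Because a hard threshold has no useful gradient, such a layer only makes sense at inference time, after the GAN has been trained on its probabilistic output.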
3 RESULTS AND ANALYSIS

3.1 Runs Description in DIRSM Task

For DIRSM, we submitted five different runs. Table 1 reports the official results of our methods in terms of average precision at cut-off 480 and the mean precision over the cut-offs 50, 100, 250 and 480. Run 1 and run 4 are based on visual information only, extracted with seven different CNN models and combined with PSO-based and IOWA-based fusion, respectively. As can be seen in Table 1, the PSO-based fusion outperforms IOWA with a significant gain of 3.79% and 5.34% in the two metrics. Run 2, on the other hand, is based on meta-data only and achieves the worst results among all runs. Similarly, run 3 and run 5 represent two variations of our method for combining meta-data and visual information: run 3 is based on IOWA, while run 5 uses our PSO-based fusion of meta-data and visual information. Again, the PSO-based fusion performs better. One of the main limitations of the IOWA-based fusion is its mechanism of assigning more weight to the more confident model. In this particular case, we noticed that the classifier trained on meta-data produces very confident decisions with high probabilities, which causes a significant reduction in performance. This can also be concluded from the results of run 2, where meta-data alone obtains the worst results. The degradation in performance caused by including the meta-data shows that this additional information is of limited use.

Table 1: Evaluation of the proposed approach in terms of precision at cut-off 480 and mean precision over the cut-offs 50, 100, 250 and 480.

| Run | Features             | Precision at 480 | Mean precision |
|-----|----------------------|------------------|----------------|
| 1   | Visual only          | 84.94%           | 95.11%         |
| 2   | Meta-data only       | 25.88%           | 31.45%         |
| 3   | Meta-data and visual | 54.74%           | 68.12%         |
| 4   | Visual only          | 81.15%           | 89.77%         |
| 5   | Meta-data and visual | 73.83%           | 82.68%         |

3.2 Runs Description in FDSI Task

Table 2 reports the experimental results of our method for the FDSI task. In total, we submitted 5 different runs for 7 different target locations, each represented by patches of satellite images covering regions affected by flooding. We used a different binarization threshold for each run, with the same model, in order to find the optimal balance between false-positive and false-negative pixels in the segmented images. The threshold values were selected based on a visual analysis of the segmentation results, aiming to maximize the variability of the detected flooding areas. In all runs, the best results are obtained for location 03, which has the best ground visibility without clouds and proper lighting with strong light reflections from the water surface in the flooded areas. Overall, the best results are obtained by runs 3 and 4, with a mean IoU of 0.83. For the new location (07), runs 3 and 4 also obtain the best results.

Table 2: Evaluation of our approach for the Flood Detection in Satellite Images (FDSI) task: mean IoU per location, with the binarization threshold of each run in parentheses.

| Run (Thresh.) | 01   | 02   | 03   | 04   | 05   | 06   | Overall | 07 (new) |
|---------------|------|------|------|------|------|------|---------|----------|
| 1 (0.78)      | 0.79 | 0.81 | 0.88 | 0.78 | 0.75 | 0.80 | 0.82    | 0.73     |
| 2 (0.94)      | 0.77 | 0.78 | 0.86 | 0.74 | 0.72 | 0.78 | 0.80    | 0.70     |
| 3 (0.50)      | 0.79 | 0.82 | 0.88 | 0.79 | 0.76 | 0.81 | 0.83    | 0.74     |
| 4 (0.35)      | 0.79 | 0.82 | 0.87 | 0.79 | 0.77 | 0.80 | 0.83    | 0.74     |
| 5 (0.12)      | 0.78 | 0.80 | 0.86 | 0.78 | 0.77 | 0.78 | 0.81    | 0.73     |
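As an illustration of the evaluation and threshold selection in this sub-task, the sketch below binarizes probabilistic masks at the threshold values listed in Table 2 and reports the mean IoU for each setting. It is our own example of the metric, not the official task scorer, and the input shapes and variable names are assumptions.

```python
# Illustrative IoU evaluation with a binarization-threshold sweep.
# prob_masks / gt_masks shapes and the scoring loop are assumptions.
import numpy as np

def iou(pred_bin, gt_bin):
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred_bin, gt_bin).sum()
    union = np.logical_or(pred_bin, gt_bin).sum()
    return inter / union if union > 0 else 1.0

def threshold_sweep(prob_masks, gt_masks, thresholds=(0.12, 0.35, 0.50, 0.78, 0.94)):
    """prob_masks, gt_masks: lists of (H, W) arrays for one location.
    Returns the mean IoU obtained with each binarization threshold."""
    results = {}
    for t in thresholds:
        scores = [iou(p > t, g > 0) for p, g in zip(prob_masks, gt_masks)]
        results[t] = float(np.mean(scores))
    return results
```

A per-run threshold would then be chosen by balancing false-positive against false-negative pixels, as described above.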
4 CONCLUSION AND FUTURE WORK

This paper has provided a detailed description of the methods proposed by UTAOS for the MediaEval 2017 Multimedia Satellite task. During the experimental evaluation of sub-task 1 (DIRSM), we noticed that visual information is more useful than meta-data for the retrieval of disaster images. For sub-task 2 (FDSI), we rely on a Generative Adversarial Network, with the best results obtained in runs 3 and 4. Based on the experiments conducted in this work, we believe that a proper fusion of social media information and satellite data can provide a better overall picture of a natural disaster.

REFERENCES

[1] Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Nicola Conci, Pål Halvorsen, and Francesco De Natale. 2017. JORD: A System for Collecting Information and Monitoring Natural Disasters by Linking Social Media with Satellite Imagery. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing. ACM, 12.

[2] Kashif Ahmad, Michael Riegler, Ans Riaz, Nicola Conci, Duc-Tien Dang-Nguyen, and Pål Halvorsen. 2017. The JORD System: Linking Sky and Social Multimedia Data to Natural Disasters. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. ACM, 461–465.

[3] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual enrichment of remote-sensed events with social media streams. In Proceedings of the 2016 ACM Multimedia Conference. ACM, 1077–1081.

[4] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.

[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009). IEEE, 248–255.

[6] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11, 1 (2009), 10–18.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.

[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.

[9] Konstantin Pogorelov, Michael Riegler, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Carsten Griwodz, Peter Thelin Schmidt, and Pål Halvorsen. 2017. Efficient disease detection in gastrointestinal videos: global features versus neural networks. Multimedia Tools and Applications (2017), 1–33.

[10] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[11] Jaemin Son, Sang Jun Park, and Kyu-Hwan Jung. 2017. Retinal Vessel Segmentation in Fundoscopic Images with Generative Adversarial Networks. arXiv preprint arXiv:1706.09318 (2017).

[12] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.

[13] Ronald R. Yager and Dimitar P. Filev. 1999. Induced ordered weighted averaging operators. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29, 2 (1999), 141–150.

[14] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems. 487–495.