=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_52
|storemode=property
|title=Multi-Modal Machine Learning for Floods Detection in News, Social Media and Satellite Sequences
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_52.pdf
|volume=Vol-2670
|authors=Kashif Ahmad,Konstantin Pogorelov,Mohib Ullah,Michael Riegler,Nicola Conci,Johannes Langguth,Ala Al-Fuqaha
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AhmadPURCLA19
}}
==Multi-Modal Machine Learning for Floods Detection in News, Social Media and Satellite Sequences==
Multi-Modal Machine Learning for Flood Detection in News, Social Media and Satellite Sequences

Kashif Ahmad¹, Konstantin Pogorelov², Mohib Ullah³, Michael Riegler⁴, Nicola Conci⁵, Johannes Langguth², Ala Al-Fuqaha¹

¹ Hamad Bin Khalifa University, Doha, Qatar
² Simula Research Laboratory, Norway
³ Norwegian University of Science and Technology, Norway
⁴ Simula Metropolitan Center for Digitalisation and Kristiania University College, Norway
⁵ University of Trento, Italy

{kahmad,aalfuqaha}@hbku.edu.qa, mohib.ullah@ntnu.no, nicola.conci@unitn.it, {konstantin,michael,langguth}@simula.no

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT

In this paper we present our methods for the MediaEval 2019 Multimedia Satellite Task, which aims to extract complementary information associated with adverse events from social media and satellites. For the first challenge, we propose a framework jointly utilizing colour, object and scene-level information to predict whether the topic of an article containing an image is a flood event or not. Visual features are combined using early and late fusion techniques, achieving average F1-scores of 82.63, 82.40, 81.40 and 76.77. For the multi-modal flood level estimation, we rely on both visual and textual information, achieving average F1-scores of 58.48 and 46.03, respectively. Finally, for flood detection in time-based satellite image sequences we used a combination of classical computer-vision and machine learning approaches, achieving an average F1-score of 58.82.

1 INTRODUCTION

When natural disasters occur, instant access to relevant information may be crucial to mitigate losses in terms of property and human lives, and may result in a speedy recovery [12]. In this regard, social media and remotely sensed information have proved very effective [1, 3, 12]. Similar to the 2017 [5] and 2018 [6] versions of the task, the MediaEval 2019 Multimedia Satellite task [4] aims to combine information from the two complementary sources, namely social media and satellites.

This paper provides a detailed description of the methods proposed by team UTAOS for the MediaEval 2019 Multimedia Satellite challenge. The challenge consists of three parts, namely (i) News Image Topic Disambiguation (NITD), (ii) Multimodal Flood Level Estimation (MFLE) and (iii) City-centered Satellite Sequences (CCSS).

The first two tasks (NITD and MFLE) are based on social media data and aim to (a) predict whether the topic of the article containing the image was a water-related natural-disaster event or not, and (b) build a binary classifier that predicts whether or not the image contains at least one person standing in water above the knee. In the CCSS task, the participants are provided with a set of sequences of satellite images depicting a certain city over a certain length of time, and they need to propose and develop a framework able to determine whether or not a flooding event was ongoing in that city at that time.

2 PROPOSED APPROACH

2.1 Methodology for the NITD task

Considering the diversity of the content covered by natural disaster-related images, and based on our previous experience [2], we utilize a diversified set of visual features including colour, texture, object and scene-level features. The object and scene-level features are extracted through three different Convolutional Neural Network (CNN) models, namely AlexNet [9], VggNet [13] and ResNet [8], pre-trained on the ImageNet dataset [7] and the Places dataset [15]. The models pre-trained on ImageNet capture object-level information, while the ones pre-trained on the Places dataset extract scene-level information. For feature extraction from all models, we use the Caffe toolbox (https://github.com/BVLC/caffe). For colour and texture features we rely on the LIRE open source library [10], which we used to extract joint composite descriptor (JCD) features from the images.

In order to combine the features, we use both early and late fusion techniques. For early fusion, the feature vectors are concatenated. For late fusion, two different techniques are used, namely (i) simple averaging and (ii) a Particle Swarm Optimization (PSO) based weighting scheme. The basic motivation behind PSO-based fusion is to assign merit-based weights to the deep models. For classification purposes, we rely on Support Vector Machines (SVMs) in all of the submitted fusion runs. Moreover, to deal with the class imbalance problem, we use an ensemble of differently re-sampled data sets, where five different models are trained using all the samples of the rare class and n differing samples of the abundant class.
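As a rough illustration of these two steps, the following is a minimal sketch (not the authors' code) of weighted late fusion over per-model SVM scores and of the re-sampling ensemble. The weight values, array shapes and function names are assumptions; in the PSO run, the weights would be optimized on the development set rather than fixed.

```python
import numpy as np
from sklearn.svm import SVC

def late_fusion(score_lists, weights):
    """Weighted average of per-model decision scores.
    In the PSO-based run, `weights` would be tuned on the dev set."""
    scores = np.stack(score_lists)            # shape: (n_models, n_samples)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize the weights
    return (w[:, None] * scores).sum(axis=0)  # fused score per sample

def imbalance_ensemble(X_rare, X_abundant, n_models=5, seed=0):
    """Train n_models SVMs, each on all rare-class samples plus a
    different random subset of the abundant class (equal size assumed)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X_abundant), size=len(X_rare), replace=False)
        X = np.vstack([X_rare, X_abundant[idx]])
        y = np.concatenate([np.ones(len(X_rare)), np.zeros(len(X_rare))])
        models.append(SVC(kernel="linear").fit(X, y))
    return models
```

Averaging the decision scores of the five ensemble members (e.g. via `late_fusion` with equal weights) would then give the final prediction for each image.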
2.2 Methodology for the MFLE task

For the MFLE task, we proposed two different solutions exploiting both visual and textual information. For visual-feature-based flood level estimation, we proposed a two-step framework. In the first step, an ensemble of binary image classifiers trained on deep visual features, extracted through AlexNet pre-trained on the ImageNet and Places datasets, is used to differentiate between flooded and non-flooded images. In the second step, we rely on tracking techniques [14], for which an open source library, namely OpenPose (https://www.learnopencv.com/tag/openpose/), has been used to detect and extract body keypoints of the people in the flood-related images. Subsequently, the generated coordinates are analyzed to identify images containing at least one person standing in water above knee height, by checking the knee joints at the corresponding indices in the keypoint files generated for each person.
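OpenPose writes one JSON file per image with 2-D keypoints per detected person; in its BODY_25 layout the right and left knees are keypoints 10 and 13. Below is a minimal sketch of the kind of knee-joint check described above. The water-line estimate `water_y` is a placeholder assumption: the paper does not specify how the water surface position is determined.

```python
import json

R_KNEE, L_KNEE = 10, 13  # knee indices in OpenPose's BODY_25 layout

def person_in_water_above_knee(keypoint_file, water_y, conf_thr=0.3):
    """Return True if any detected person has a confidently detected knee
    below the water line `water_y` (image y grows downwards), i.e. the
    knee is submerged. `water_y` must come from a separate water-surface
    estimate, which the paper does not detail."""
    with open(keypoint_file) as f:
        data = json.load(f)
    for person in data.get("people", []):
        kp = person["pose_keypoints_2d"]  # flat list: x0, y0, c0, x1, y1, c1, ...
        for idx in (R_KNEE, L_KNEE):
            x, y, c = kp[3 * idx : 3 * idx + 3]
            if c >= conf_thr and y > water_y:
                return True
    return False
```

Applied to the OpenPose output of each flood-positive image, any file for which such a check returns True would be labeled as above-knee flooding.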
For text analysis, on the other hand, we employed two methods, namely (i) a Bag-of-Words (BoW) model and (ii) an LSTM network. Before applying the methods, the data was pre-processed for tokenization and removal of punctuation.
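A minimal sketch of the BoW pipeline, assuming scikit-learn; the choice of a linear SVM as the classifier is an assumption, as the paper does not name one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Bag-of-Words over the article text; CountVectorizer's default analyzer
# lowercases, tokenizes, and drops punctuation, covering the
# preprocessing described above.
bow_clf = make_pipeline(CountVectorizer(), LinearSVC())

# texts: list of article strings; labels: 1 = person in water above knee
# bow_clf.fit(texts, labels)
# predictions = bow_clf.predict(test_texts)
```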
2.3 Methodology for the CCSS task

For the CCSS task, we first tried to employ a recurrent convolutional neural network architecture designed for change detection in multi-spectral satellite imagery (ReCNN) [11]. This network was originally designed to solve a task very similar to the CCSS task, and the results reported by the ReCNN authors are promising. However, despite high expectations, ReCNN was not able to achieve better-than-random performance in detecting changes caused by flooding. Our assumption is that this was caused by the "real" nature of the dataset provided in the CCSS task: images were taken in different seasons, are often partially or fully covered by clouds, and sometimes exhibit a noticeable pixel offset between each other.

After a series of unsuccessful experiments, we decided to use a classical image processing and analysis approach with multi-stage processing using simple operations. First, we mask out from further analysis all cloud-covered image areas by applying a simple threshold function; the reference threshold value is computed per image by averaging the values of pixels located in uniformly white-colored areas. The same masking is performed for dark, underexposed areas and areas with missing imaging data. Next, we scale the images down to a uniform size of 128 × 128 pixels to reduce noise and soften the influence of image shifting. Then, the scaled images are converted into the hue-saturation-value (HSV) color space, and further analysis is performed on the HSV bands. Using the same thresholding methodology, we mask out pixels with too-low and too-high saturation (S) and value (V) channel values. The resulting masks are filtered with a median filter and processed with a dilation filter. The resulting images are compared in sequential pairs within the non-masked-out regions using grey-level co-occurrence matrix (GLCM) texture features. The final flood-presence decision is made using a random tree classifier.
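A condensed sketch of this masking-and-texture pipeline, assuming OpenCV and scikit-image as the toolchain (the paper does not name its implementation). All threshold values are placeholders, and only a single GLCM property is computed here where the paper may use several; the resulting per-pair features would be fed to a tree classifier such as scikit-learn's ExtraTreeClassifier.

```python
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def preprocess(img_bgr, s_lo=20, s_hi=240, v_lo=20, v_hi=240):
    """Downscale to 128x128, convert to HSV, and build a visibility mask
    excluding cloud-covered and underexposed pixels (threshold values
    are placeholders, not those used in the paper)."""
    small = cv2.resize(img_bgr, (128, 128), interpolation=cv2.INTER_AREA)
    hsv = cv2.cvtColor(small, cv2.COLOR_BGR2HSV)
    s, v = hsv[..., 1], hsv[..., 2]
    mask = ((s > s_lo) & (s < s_hi) & (v > v_lo) & (v < v_hi)).astype(np.uint8)
    mask = cv2.medianBlur(mask, 5)                      # remove speckle
    mask = cv2.dilate(mask, np.ones((3, 3), np.uint8))  # grow visible region
    return hsv, mask

def texture_change(hsv_a, hsv_b, mask_a, mask_b):
    """GLCM contrast difference between two frames over jointly
    visible (non-masked-out) pixels."""
    joint = (mask_a & mask_b).astype(bool)
    contrasts = []
    for hsv in (hsv_a, hsv_b):
        v = np.where(joint, hsv[..., 2], 0).astype(np.uint8)  # zero masked pixels
        glcm = graycomatrix(v, distances=[1], angles=[0], levels=256)
        contrasts.append(graycoprops(glcm, "contrast")[0, 0])
    return abs(contrasts[0] - contrasts[1])
```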
3 RESULTS AND ANALYSIS

Table 1: Evaluation of our proposed approaches for (a) NITD and (b) MFLE tasks in terms of F1-scores.

(a) NITD
Run   | Dev. Set | Test Set
Run 1 | 81.08    | 82.63
Run 2 | 79.43    | 82.40
Run 3 | 77.77    | 81.40
Run 4 | 75.70    | 76.77

(b) MFLE
Run   | Dev. Set | Test Set
Run 1 | 61.00    | 58.48
Run 2 | 56.71    | 46.03
Run 4 | 57.56    | 44.91

3.1 Runs Description in the NITD Task

For NITD, we submitted a total of four runs. In run 1, we used the PSO-based weight optimization method to assign weights to each model on a merit basis. For run 2, the deep models are treated equally by assigning equal weights to all models. In run 3, we added colour-based features to our pool of feature descriptors in a late fusion method where the scores of all models are simply added to obtain the final prediction. Run 4 is based on early fusion, where the deep features are simply concatenated for training SVMs. Table 1a provides the experimental results of our proposed solutions for the NITD task on both the development and test sets. Overall, the best results are obtained with PSO-based late fusion, which shows the advantage of merit-based late fusion of the models. On the other hand, the lowest F1-score is obtained with early fusion. Moreover, the colour-based features did not contribute positively to the performance of the framework. This might be due to the fact that the JCD feature is very compressed and does not contain much information that the fusion algorithm could exploit.

3.2 Runs Description in the MFLE Task

For the MFLE task, we submitted two mandatory runs and one optional run. The first run is based on visual information, where a two-phase approach has been proposed for flood level estimation, starting with deep-feature-based classification of flooded and non-flooded images, followed by human body keypoint detection via the OpenPose library in the flood-related images. Our second and third runs are based on textual information, where Bag-of-Words (BoW) and LSTM-based techniques are used for article classification, respectively. Table 1b shows the experimental results of our solutions for the MFLE task on both the development and test sets. Overall, better results are obtained with visual information. Moreover, BoW features produce slightly better results than the LSTM-based approach.

3.3 Runs Description in the CCSS Task

For the CCSS task we submitted the mandatory run only. Evaluation performed by the task organizers showed an F1-score of 58.82% for flood detection on the provided test set. The relatively high performance of our simple detection approach can be explained by the aggressive image masking technique, which allows us to compare only clearly visible areas. However, our own evaluation shows that our approach is not able to distinguish correctly between image changes caused by flooding and those caused by seasonal vegetation growth.

4 CONCLUSIONS AND FUTURE WORK

This year, the Multimedia Satellite task introduced new and important challenges, including image-based news topic disambiguation (NITD), multi-modal flood level estimation in social media content (MFLE) and predicting a flood event in a set of sequences of satellite images of a certain city over a certain length of time (CCSS). For the NITD task, we mainly relied on ensembles of classifiers trained on deep features extracted through several pre-trained deep models, as well as global features (GF). During the experiments, we observed that object and scene-level features complement each other when jointly utilized in a proper way. Moreover, deep features proved more effective than GF. For the MFLE task, we used both textual and visual information, where better results were obtained with visual information. However, textual and visual information can complement each other. In the future, we aim to analyze the task with more advanced early and late fusion techniques to better utilize the multi-modal information. Furthermore, we plan to use more complex GF. For the CCSS task, we used a combination of computer-vision and machine learning approaches. To further improve results, we will continue investigating recurrent CNN and GAN-based approaches in combination with classical image processing algorithms.

REFERENCES

[1] Kashif Ahmad, Konstantin Pogorelov, Michael Riegler, Olga Ostroukhova, Pål Halvorsen, Nicola Conci, and Rozenn Dahyot. 2019. Automatic detection of passable roads after floods in remote sensed and social media data. Signal Processing: Image Communication 74 (2019), 110–118.
[2] Kashif Ahmad, Amir Sohail, Nicola Conci, and Francesco De Natale. 2018. A comparative study of global and deep features for the analysis of user-generated natural disaster related images. In 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP). IEEE, 1–5.
[3] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual enrichment of remote-sensed events with social media streams. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, 1077–1081.
[4] Benjamin Bischke, Patrick Helber, Erkan Basar, Simon Brugman, Zhengyu Zhao, and Konstantin Pogorelov. 2019. The Multimedia Satellite Task at MediaEval 2019: Flood Severity Estimation. In Proc. of the MediaEval 2019 Workshop (Oct. 27-29, 2019). Sophia Antipolis, France.
[5] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[6] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018: Emergency Response for Flooding Events. In Proc. of the MediaEval 2018 Workshop (Oct. 29-31, 2018). Sophia Antipolis, France.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[10] Mathias Lux, Michael Riegler, Pål Halvorsen, Konstantin Pogorelov, and Nektarios Anagnostopoulos. 2016. LIRE: open source visual information retrieval. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 30.
[11] Lichao Mou, Lorenzo Bruzzone, and Xiao Xiang Zhu. 2018. Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery. IEEE Transactions on Geoscience and Remote Sensing 57, 2 (2018), 924–935.
[12] Naina Said, Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Laiq Hassan, Nasir Ahmad, and Nicola Conci. 2019. Natural disasters detection in social media and satellite imagery: a survey. Multimedia Tools and Applications (17 Jul 2019). https://doi.org/10.1007/s11042-019-07942-1
[13] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[14] Mohib Ullah and Faouzi Alaya Cheikh. 2018. A directed sparse graphical model for multi-target tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1816–1823.
[15] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems. 487–495.