Multimedia Analysis Techniques for Flood Detection Using Images, Articles and Satellite Imagery

Stelios Andreadis, Marios Bakratsas, Panagiotis Giannakeris, Anastasia Moumtzidou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
Centre for Research & Technology Hellas - Information Technologies Institute, Greece
{andreadisst,mbakratsas,giannakeris,moumtzid,heliasgj,stefanos,ikom}@iti.gr

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
This paper presents the algorithms that the CERTH-ITI team implemented to tackle three tasks related to the problem of flood severity estimation, using satellite images and online media content. Deep Convolutional Neural Networks were deployed to classify articles as flood event-related based on their images, but also to detect flooding events in satellite sequences. Remote sensing indices play a key role in the machine learning approach to identifying changes between satellite images, while visual and textual features were exploited to estimate whether an image shows people standing in flooded areas.

1 INTRODUCTION
News websites now play a crucial role in the field of public information, having turned into a rich and open source of articles and images that cover numerous events. At the same time, the high availability of satellite data provides an alternative source of imagery. This data can be exploited in the domain of natural disasters, e.g. to detect a flooding incident or to estimate the severity of a flood. Several ongoing H2020 projects follow this direction: beAWARE [3] includes the analysis of visual and textual information for disaster forecasting and management, while EOPEN [10] involves Earth Observation and social media data in flood risk monitoring.

The Multimedia Satellite Task is a challenge of MediaEval that consists of the following subtasks. News Image Topic Disambiguation (NITD) entails an image classifier that is able to identify whether or not an image belongs to a flood-related article. Multimodal Flood Level Estimation (MFLE) calls for a classifier that receives visual and/or textual information from articles and predicts whether or not an image contains people standing in water above the knee. Finally, City-Centered Satellite Sequences (CCSS) asks participants to detect a flooding incident using sequences of satellite images. For further details on the subtasks and the respective data sets, the reader is referred to [1].

The next section presents the algorithms proposed by the CERTH-ITI team for each subtask, followed by the results of their evaluation and a short discussion with conclusions.

2 APPROACH

2.1 News Image Topic Disambiguation (NITD)
We aim to classify news articles' topics judging from the images that appear in them. One challenge of this task is that in these images the flooded areas may be completely out of view. Even more challenging are the instances where a flooded area is clearly shown in the image but the article's topic is not relevant to a flood event. Also, in some instances water is present but not in the context of floods (e.g. a beach). In order to examine the performance of state-of-the-art image classification techniques [11] on this task, we deploy a Deep Convolutional Neural Network (DCNN) trained on the full development set ("CNN2019"). Another DCNN, trained on the MediaEval 2017 development set ("CNN2017") [2], is also tested here in order to evaluate a straight flood/non-flood image classifier and compare the two approaches. We adopt the VGG architecture pre-trained on the Places365 dataset [13] in both cases. The weights of this model are carefully optimized to extract features for scene recognition, which is a suitable starting point for our objective [8]. In order to fine-tune the network, 5-fold cross-validation was performed so as to determine how many of the final layers to freeze and at which epoch to stop the training. The setting with the highest average accuracy was fine-tuning all fully-connected layers for 35 epochs. The development set is heavily biased towards negative samples (nearly 7 times more negative images), so we chose to oversample the positive images to balance it.

2.2 Multimodal Flood Level Estimation (MFLE)
The estimation of flood level involves checking whether or not an image contains people standing in water above the knee, and it is realized by applying machine learning techniques to visual and textual information.

Regarding the visual information, a 22-layer GoogleNet network was fine-tuned with the dimension of the classification layer set to 345 [9], equal to the number of SIN TRECVID concepts. Then a set of five concepts was considered as relevant for locating people ("Adult", "Person", "Two_People") and water ("River", "Waterscape_Waterfront"). The probabilities of each concept appearing in each image were used as input to a binary Support Vector Machine (SVM) classifier.

Regarding the textual information, we followed a well-established approach in text classification called word2vec [7] that considers word embeddings. In general, word embeddings build on the idea that similar words tend to occur together and have a similar context (e.g. football and basketball are linked to sports), and they are computed with Deep Neural Networks (DNN) [4]. Eventually, a binary SVM classifier is trained on the word2vec text representations.

Finally, a simple late fusion approach was followed in order to consider both visual and textual information: the outputs of the above two modules are combined to decide the fused prediction. If the outputs of the two binary SVM classifiers coincide, their common label defines the label of the fused module; otherwise, only the output of the visual module is considered.
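A minimal sketch of the late fusion rule just described (NumPy-based; the function and array names are ours, not from the authors' implementation):

```python
import numpy as np

def late_fusion(y_visual: np.ndarray, y_textual: np.ndarray) -> np.ndarray:
    """Late fusion of the two binary SVM outputs: where the visual and
    textual predictions coincide, keep the common label; otherwise fall
    back to the visual prediction."""
    return np.where(y_visual == y_textual, y_visual, y_visual)
```

Written this way, the rule makes its consequence obvious: because disagreements also resolve to the visual label, the fused output is always identical to the visual output, which matches the identical "Visual" and "Visual & Textual" scores in Table 1.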
2.3 City-Centered Satellite Sequences (CCSS)
The first approach to detecting flood events in satellite sequences used a deep learning model trained on two different datasets of three-channel images representing the differences between two days within an event. The first dataset was created by combining the Red-Green-Blue (B04-B03-B02) bands and the second by combining the Red-SWIR-NIR (B04-B11-B08) bands. The three bands were stacked and converted to JPEG. Within each event, the unique differences between its days were calculated. Next, networks pre-trained on ImageNet [6] were fine-tuned in order to learn the features of our dataset. The last pooling layer was replaced with a densely-connected NN layer with a softmax activation function and 2 outputs. The following parameters were considered: (i) evaluation of the Adam [5] and SGD optimizers, and (ii) evaluation of learning rates 0.1, 0.01 and 0.001. Batch size was set to 32.

Table 1: CERTH-ITI results in all tasks

Run submissions                      Dev set F1-Score (%)   Test set F1-Score (%)
News Image Topic Disambiguation
  CNN2019                                95.98                  90.20
  CNN2017                                78.46                  88.73
Multimodal Flood Level Estimation
  Visual                                 47.3                   64.33
  Textual                                35.2                   57.62
  Visual & Textual                       47.3                   64.33
City-Centered Satellite Sequences
  MNDWI γ=2.1, ratio=0.05                83.54                  76.47
  VGG16 Red-Green-Blue                   60.57                  70.58
  VGG16 Red-SWIR-NIR                     60.74                  70.58
  VGG19 Red-SWIR-NIR                     60.74                  70.58
  MNDWI water masks γ=2                  57.53                  54.41
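The band-stacking and day-differencing preprocessing of the first CCSS approach can be sketched as follows. The min-max scaling to 8-bit for JPEG export and the shift of the signed difference back to [0, 255] are our assumptions; the helper names are illustrative, not the authors' code:

```python
import numpy as np

def stack_bands(red, swir, nir):
    """Stack three Sentinel-2 bands into one three-channel image and
    scale it to 8-bit so it can be exported as JPEG (illustrative sketch)."""
    img = np.dstack([red, swir, nir]).astype(np.float32)
    lo, hi = img.min(), img.max()
    return ((img - lo) / max(hi - lo, 1e-9) * 255).astype(np.uint8)

def day_difference(day_a, day_b):
    """Pixel-wise difference between two days of an event; the signed
    difference is shifted back into [0, 255] for the DCNN input."""
    diff = day_b.astype(np.int16) - day_a.astype(np.int16)
    return ((diff + 255) // 2).astype(np.uint8)
```

Within an event, `day_difference` would be applied to every unique pair of consecutive acquisitions before fine-tuning the ImageNet-pretrained network on the resulting images.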
An additional change detection approach, based on the MNDWI [12] remote sensing water index, was also implemented. Within each event, the MNDWI differences of consecutive days were calculated. For each difference image, the outliers were estimated as follows: pixel values that fall within [m − γσ, m + γσ], where m and σ are the mean and standard deviation of the difference image, denote no change. A minimum water_ratio of changed pixels needs to be exceeded to characterize the image as changed (i.e. flooded). The method was applied on the dev set to identify the optimum values for γ and water_ratio.

As a third approach, outlier detection was also performed on water body masks, produced by zero thresholding of the MNDWI index. Counting the water pixels on each day of an event generated a time series of integers. Then a Z-score was calculated for each point as (x − m) / σ, where x is the value of the point and m and σ are the mean and standard deviation of all points in the time series. If a point exceeded a threshold γ, it was considered an outlier and thus the complete sequence of images was classified as an event.
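The two MNDWI-based outlier tests can be sketched as follows. The default γ and water_ratio mirror the tuned values reported in Table 1; the helpers are illustrative rather than the authors' code:

```python
import numpy as np

def mndwi(green, swir):
    """MNDWI = (Green - SWIR) / (Green + SWIR), after Xu [12]."""
    g, s = green.astype(np.float64), swir.astype(np.float64)
    return (g - s) / np.maximum(g + s, 1e-9)

def flooded_by_differencing(mndwi_day1, mndwi_day2, gamma=2.1, water_ratio=0.05):
    """Second approach: pixels of the MNDWI difference outside
    [m - gamma*sigma, m + gamma*sigma] count as changed; the pair is
    flagged as flooded when the changed fraction exceeds water_ratio."""
    diff = mndwi_day2 - mndwi_day1
    m, s = diff.mean(), diff.std()
    changed = np.abs(diff - m) > gamma * s
    return bool(changed.mean() > water_ratio)

def flooded_by_water_count(mndwi_days, gamma=2.0):
    """Third approach: count water pixels (MNDWI > 0) per day and flag
    the sequence if any day's Z-score exceeds gamma."""
    counts = np.array([(d > 0).sum() for d in mndwi_days], dtype=np.float64)
    z = (counts - counts.mean()) / max(counts.std(), 1e-9)
    return bool(np.any(np.abs(z) > gamma))
```

Both tests reduce a sequence to a single flood/no-flood decision, so they can be scored directly against the event-level CCSS annotations.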
3 RESULTS AND ANALYSIS
The complete results on the dev set and the test set for all three subtasks of the Multimedia Satellite Task can be seen in Table 1, where it is evident that the DCNN approaches in NITD and the image differencing technique in CCSS stand out. In detail:

NITD. Examining the errors, we observe that the article classifier mainly produces False Positives and very few False Negatives. Many of the FP cases actually show flooded areas, although the article topic is not related to a flood event. On the test set, the 2019 model performs better than the 2017 model, reaching an accuracy of 90.2%. We hypothesise that it performs better because it has learnt correlations beyond the obvious: a flooded area in an image is a strong sign of flood relevancy in the article, but the presence of certain groups of people, such as authorities or politicians, may also be a positive flag. This is expected to hold true especially when the training and the test set are taken from a single event where the same people appear frequently in the news articles.

MFLE. The exploitation of visual information reaches a ∼65% F1-score, due to the significant number of FPs, since the concept detection focused on the identification of humans and water and was not restricted to images of people standing in water above the knee. The textual features performed slightly worse than the visual ones, while the fusion of visual and textual features performed equally to the visual ones, which is easily explained by the description of the fusion approach above.

CCSS. Detecting outliers on the differences of consecutive MNDWI images achieved a 76.47% F1-score. The image differencing technique proved adequate to detect changes related to flood events, using the γ and minimum water_ratio values that were calculated on the annotated dev set. The DCNN provided decent results (70.58%), showing its ability to learn flood patterns even with a small training set. On the other hand, outlier detection on water masks, using the MNDWI index and setting γ to 2, did not achieve a high F1-score (54.41%), possibly because all the remote sensing information was reduced to a binary mask.

4 DISCUSSION AND OUTLOOK
Through its participation in the Multimedia Satellite challenge, the CERTH-ITI team had the opportunity to examine various methodologies for the problem of flood detection. Results for the NITD task indicate that it is possible to classify flood event articles with good accuracy using either a generic flood detector or by annotating a task-specific dataset. However, the second approach looks more promising when dealing with articles concerning a single event. Results of the MFLE task show that visual features perform better than textual ones, but they could be further improved if a segmentation step were applied on top of the proposed approach to recognise whether the water covers people below the knee. Finally, results of the CCSS task demonstrate the ability of the combined method of image differencing and the MNDWI water index to detect flood events, showing better robustness with balanced FP and FN rates compared to the DCNN approach, whereas the three extra layers of VGG19 do not show any impact on the learning process.

ACKNOWLEDGMENTS
This work was supported by the EC-funded projects H2020-700475-beAWARE and H2020-776019-EOPEN.

REFERENCES
[1] Benjamin Bischke, Patrick Helber, Simon Brugman, Erkan Basar, Zhengyu Zhao, Martha Larson, and Konstantin Pogorelov. 2019. The Multimedia Satellite Task at MediaEval 2019: Estimation of Flood Severity. In Proc. of the MediaEval 2019 Workshop (Oct. 27-29, 2019). Sophia Antipolis, France.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] H2020 DRS. 2017-2020. beAWARE project. https://beaware-project.eu/
[4] Moshe Hazoom. 2018. Word2Vec For Phrases. Learning Embeddings For More Than One Word. https://bit.ly/32mDMNH
[5] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. 1097–1105.
[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[8] Anastasia Moumtzidou, Panagiotis Giannakeris, Stelios Andreadis, Athanasios Mavropoulos, Georgios Meditskos, Ilias Gialampoukidis, Konstantinos Avgerinakis, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2018. A Multimodal Approach in Estimating Road Passability Through a Flooded Area Using Social Media and Satellite Images. In Proc. of the MediaEval 2018 Workshop.
[9] Nikiforos Pittaras, Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras. 2017. Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks. In International Conference on Multimedia Modeling. Springer, 102–114.
[10] H2020 EO RIA. 2017-2020. EOPEN project. https://eopen-project.eu/
[11] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
[12] Hanqiu Xu. 2006. Modification of Normalised Difference Water Index (NDWI) to Enhance Open Water Features in Remotely Sensed Imagery. International Journal of Remote Sensing 27, 14 (2006), 3025–3033.
[13] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.