Multimedia Analysis Techniques for Flood Detection Using Images, Articles and Satellite Imagery

Stelios Andreadis, Marios Bakratsas, Panagiotis Giannakeris, Anastasia Moumtzidou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
Centre for Research & Technology Hellas - Information Technologies Institute, Greece
{andreadisst,mbakratsas,giannakeris,moumtzid,heliasgj,stefanos,ikom}@iti.gr

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
This paper presents the algorithms that the CERTH-ITI team implemented to tackle three tasks related to the problem of flood severity estimation, using satellite images and online media content. Deep Convolutional Neural Networks were deployed to classify articles as flood event-related based on their images, but also to detect flooding events in satellite sequences. Remote sensing indices play a key role in the machine learning approach to identifying changes between satellite images, while visual and textual features were exploited to estimate whether an image shows people standing in flooded areas.

1 INTRODUCTION
News websites now play a crucial role in the field of public information, having turned into a rich and open source of articles and images that cover numerous events. At the same time, the high availability of satellite data provides an alternative source of imagery. This data can be exploited in the domain of natural disasters, e.g. to detect a flooding incident or to estimate the severity of a flood. Several ongoing H2020 projects follow this direction: beAWARE [3] includes the analysis of visual and textual information for disaster forecasting and management, while EOPEN [10] involves Earth Observation and social media data in flood risk monitoring.

The Multimedia Satellite Task is a challenge of MediaEval that consists of the following subtasks. News Image Topic Disambiguation (NITD) entails an image classifier that is able to identify whether or not an image belongs to a flood-related article. Multimodal Flood Level Estimation (MFLE) calls for a classifier that receives visual and/or textual information from articles and predicts whether or not an image contains people standing in water above the knee. Finally, City-Centered Satellite Sequences (CCSS) asks participants to detect a flooding incident using sequences of satellite images. For further details on the subtasks and the respective data sets, the reader is referred to [1].

The next section presents the algorithms proposed by the CERTH-ITI team for each subtask, followed by the results of their evaluation and a short discussion with conclusions.

2 APPROACH

2.1 News Image Topic Disambiguation (NITD)
We aim to classify news articles' topics judging from the images that appear in them. One challenge of this task is that in these images the flooded areas may be completely out of view. Even more challenging are the instances where a flooded area is clearly shown in the image but the article's topic is not relevant to a flood event. Also, in some instances water is present but not in the context of floods (e.g. a beach). In order to examine the performance of state-of-the-art image classification techniques [11] on this task, we deploy a Deep Convolutional Neural Network (DCNN) trained on the full development set ("CNN2019"). Another DCNN, trained on the MediaEval 2017 development set ("CNN2017") [2], is also tested here in order to evaluate a straight flood/non-flood image classifier and compare the two approaches. We adopt the VGG architecture pre-trained on the Places365 dataset [13] in both cases. The weights of this model are carefully optimized to extract features for scene recognition, which is a suitable starting point for our objective [8]. In order to fine-tune the network, 5-fold cross-validation was performed so as to determine how many of the final layers to freeze and at which epoch to stop the training. The setting with the highest average accuracy was fine-tuning all fully-connected layers for 35 epochs. The development set is heavily biased towards negative samples (nearly 7 times more negative images), so we chose to oversample the positive images to balance it.

2.2 Multimodal Flood Level Estimation (MFLE)
The estimation of flood level involves checking whether or not an image contains people standing in water above the knee, and it is realized by applying machine learning techniques to visual and textual information.

Regarding the visual information, a 22-layer GoogleNet network was fine-tuned with the dimension of the classification layer set to 345 [9], equal to the number of SIN TRECVID concepts. Then a set of five concepts was considered as relevant for locating people ("Adult", "Person", "Two_People") and water ("River", "Waterscape_Waterfront"). The probabilities of each concept appearing in each image were used as input to a binary Support Vector Machine (SVM) classifier.

Regarding the textual information, we followed a well-established approach in text classification called word2vec [7] that considers word embeddings. In general, word embeddings build on the idea that similar words tend to occur together and have a similar context (e.g. football and basketball are linked to sports), and they are computed with Deep Neural Networks (DNN) [4]. Eventually, a binary SVM classifier is trained on the word2vec text representations.

Finally, a simple late fusion approach was followed in order to consider both visual and textual information: the outputs of the above two modules are combined to decide the fused prediction. If the outputs of the two binary SVM classifiers coincide, their common label defines the label of the fused module; otherwise, only the output of the visual module is considered.
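A minimal sketch of the late fusion rule just described (NumPy-based; the function and array names are ours, not from the authors' implementation):

```python
import numpy as np

def late_fusion(y_visual: np.ndarray, y_textual: np.ndarray) -> np.ndarray:
    """Late fusion of the two binary SVM outputs: where the visual and
    textual predictions coincide, keep the common label; otherwise fall
    back to the visual prediction."""
    return np.where(y_visual == y_textual, y_visual, y_visual)
```

Written this way, the rule makes its consequence obvious: because disagreements also resolve to the visual label, the fused output is always identical to the visual output, which matches the identical "Visual" and "Visual & Textual" scores in Table 1.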
2.3 City-Centered Satellite Sequences (CCSS)
The first approach to detecting flood events in satellite sequences used a deep learning model trained on two different datasets of three-channel images representing the differences between two days within an event. The first dataset was created by combining the Red-Green-Blue (B04-B03-B02) bands and the second by combining the Red-SWIR-NIR (B04-B11-B08) bands. The three bands were stacked and converted to JPEG. Within each event, the unique differences between its days were calculated. Next, networks pre-trained on ImageNet [6] were fine-tuned in order to learn the features of our dataset. The last pooling layer was replaced with a densely-connected NN layer with a softmax activation function and 2 outputs. The following parameters were considered: (i) evaluation of the Adam [5] and SGD optimizers, and (ii) evaluation of learning rates 0.1, 0.01 and 0.001. Batch size was set to 32.

Table 1: CERTH-ITI results in all tasks

Run submissions                      Dev set F1-Score (%)   Test set F1-Score (%)
News Image Topic Disambiguation
  CNN2019                                95.98                  90.20
  CNN2017                                78.46                  88.73
Multimodal Flood Level Estimation
  Visual                                 47.3                   64.33
  Textual                                35.2                   57.62
  Visual & Textual                       47.3                   64.33
City-Centered Satellite Sequences
  MNDWI γ=2.1, ratio=0.05                83.54                  76.47
  VGG16 Red-Green-Blue                   60.57                  70.58
  VGG16 Red-SWIR-NIR                     60.74                  70.58
  VGG19 Red-SWIR-NIR                     60.74                  70.58
  MNDWI water masks γ=2                  57.53                  54.41
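The band-stacking and day-differencing preprocessing of the first CCSS approach can be sketched as follows. The min-max scaling to 8-bit for JPEG export and the shift of the signed difference back to [0, 255] are our assumptions; the helper names are illustrative, not the authors' code:

```python
import numpy as np

def stack_bands(red, swir, nir):
    """Stack three Sentinel-2 bands into one three-channel image and
    scale it to 8-bit so it can be exported as JPEG (illustrative sketch)."""
    img = np.dstack([red, swir, nir]).astype(np.float32)
    lo, hi = img.min(), img.max()
    return ((img - lo) / max(hi - lo, 1e-9) * 255).astype(np.uint8)

def day_difference(day_a, day_b):
    """Pixel-wise difference between two days of an event; the signed
    difference is shifted back into [0, 255] for the DCNN input."""
    diff = day_b.astype(np.int16) - day_a.astype(np.int16)
    return ((diff + 255) // 2).astype(np.uint8)
```

Within an event, `day_difference` would be applied to every unique pair of consecutive acquisitions before fine-tuning the ImageNet-pretrained network on the resulting images.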
An additional change detection approach, based on the MNDWI [12] remote sensing water index, was also implemented. Within each event, the MNDWI differences of consecutive days were calculated. For each difference image, the outliers were estimated as follows: pixel values that fall within [m − γσ, m + γσ], where m and σ are the mean and standard deviation of the difference image, denote no change. A minimum water_ratio of changed pixels needs to be exceeded to characterize the image as changed (i.e. flooded). The method was applied on the dev set to identify the optimum values for γ and water_ratio.

As a third approach, outlier detection was also performed on water body masks, produced by zero thresholding of the MNDWI index. Counting the water pixels on each day of an event generated a time series of integers. Then a Z-score was calculated for each point as (x − m) / σ, where x is the value of the point and m and σ are the mean and standard deviation of all points in the time series. If a point exceeded a threshold γ, it was considered an outlier and thus the complete sequence of images was classified as an event.
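The two MNDWI-based outlier tests can be sketched as follows. The default γ and water_ratio mirror the tuned values reported in Table 1; the helpers are illustrative rather than the authors' code:

```python
import numpy as np

def mndwi(green, swir):
    """MNDWI = (Green - SWIR) / (Green + SWIR), after Xu [12]."""
    g, s = green.astype(np.float64), swir.astype(np.float64)
    return (g - s) / np.maximum(g + s, 1e-9)

def flooded_by_differencing(mndwi_day1, mndwi_day2, gamma=2.1, water_ratio=0.05):
    """Second approach: pixels of the MNDWI difference outside
    [m - gamma*sigma, m + gamma*sigma] count as changed; the pair is
    flagged as flooded when the changed fraction exceeds water_ratio."""
    diff = mndwi_day2 - mndwi_day1
    m, s = diff.mean(), diff.std()
    changed = np.abs(diff - m) > gamma * s
    return bool(changed.mean() > water_ratio)

def flooded_by_water_count(mndwi_days, gamma=2.0):
    """Third approach: count water pixels (MNDWI > 0) per day and flag
    the sequence if any day's Z-score exceeds gamma."""
    counts = np.array([(d > 0).sum() for d in mndwi_days], dtype=np.float64)
    z = (counts - counts.mean()) / max(counts.std(), 1e-9)
    return bool(np.any(np.abs(z) > gamma))
```

Both tests reduce a sequence to a single flood/no-flood decision, so they can be scored directly against the event-level CCSS annotations.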
3 RESULTS AND ANALYSIS
The complete results on the dev set and the test set for all three subtasks of the Multimedia Satellite Task can be seen in Table 1, where it is evident that the DCNN approaches in NITD and the image differencing technique in CCSS stand out. In detail:

NITD. Examining the errors, we observe that the article classifier mainly produces False Positives and very few False Negatives. Many of the FP cases actually show flooded areas, although the article topic is not related to a flood event. On the test set, the 2019 model performs better than the 2017 model, reaching an accuracy of 90.2%. We hypothesise that it performs better because it has learnt correlations beyond the obvious: a flooded area in an image is a strong sign of flood relevancy in the article, but the presence of certain groups of people, such as authorities or politicians, may also be a positive flag. This is expected to hold true especially when the training and the test set are taken from a single event where the same people appear frequently in the news articles.

MFLE. The exploitation of visual information reaches a ∼65% F1-score, due to the significant number of FPs, since the concept detection focused on the identification of humans and water and was not restricted to images of people standing in water above the knee. The textual features performed slightly worse than the visual ones, while the fusion of visual and textual features performed equally to the visual ones, which is easily explained by the description of the fusion approach above.

CCSS. Detecting outliers on the differences of consecutive MNDWI images achieved a 76.47% F1-score. The image differencing technique proved adequate to detect changes related to flood events, using the γ and minimum water_ratio values that were calculated on the annotated dev set. The DCNN provided decent results (70.58%), showing its ability to learn flood patterns even with a small training set. On the other hand, outlier detection on water masks, using the MNDWI index and setting γ to 2, did not achieve a high F1-score (54.41%), possibly because all the remote sensing information was reduced to a binary mask.

4 DISCUSSION AND OUTLOOK
Through its participation in the Multimedia Satellite challenge, the CERTH-ITI team had the opportunity to examine various methodologies for the problem of flood detection. Results for the NITD task indicate that it is possible to classify flood event articles with good accuracy using either a generic flood detector or by annotating a task-specific dataset. However, the second approach looks more promising when dealing with articles concerning a single event. Results of the MFLE task show that visual features perform better than textual ones, but they could be further improved if a segmentation step were applied on top of the proposed approach to recognise whether the water covers people below the knee. Finally, results of the CCSS task demonstrate the ability of the combined method of image differencing and the MNDWI water index to detect flood events, showing better robustness with balanced FP and FN rates compared to the DCNN approach, whereas the three extra layers of VGG19 do not show any impact on the learning process.

ACKNOWLEDGMENTS
This work was supported by the EC-funded projects H2020-700475-beAWARE and H2020-776019-EOPEN.

REFERENCES
[1] Benjamin Bischke, Patrick Helber, Simon Brugman, Erkan Basar, Zhengyu Zhao, Martha Larson, and Konstantin Pogorelov. 2019. The Multimedia Satellite Task at MediaEval 2019: Estimation of Flood Severity. In Proc. of the MediaEval 2019 Workshop (Oct. 27-29, 2019). Sophia Antipolis, France.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] H2020 DRS. 2017-2020. beAWARE project. https://beaware-project.eu/
[4] Moshe Hazoom. 2018. Word2Vec For Phrases. Learning Embeddings For More Than One Word. https://bit.ly/32mDMNH
[5] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. 1097–1105.
[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[8] Anastasia Moumtzidou, Panagiotis Giannakeris, Stelios Andreadis, Athanasios Mavropoulos, Georgios Meditskos, Ilias Gialampoukidis, Konstantinos Avgerinakis, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2018. A Multimodal Approach in Estimating Road Passability Through a Flooded Area Using Social Media and Satellite Images. In Proc. of the MediaEval 2018 Workshop.
[9] Nikiforos Pittaras, Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras. 2017. Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks. In International Conference on Multimedia Modeling. Springer, 102–114.
[10] H2020 EO RIA. 2017-2020. EOPEN project. https://eopen-project.eu/
[11] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
[12] Hanqiu Xu. 2006. Modification of Normalised Difference Water Index (NDWI) to Enhance Open Water Features in Remotely Sensed Imagery. International Journal of Remote Sensing 27, 14 (2006), 3025–3033.
[13] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.