Flood Severity Estimation from Online News Images and
Multi-Temporal Satellite Images using Deep Neural Networks

Benjamin Bischke (1,2), Simon Brugman (3), Patrick Helber (1,2)
(1) German Research Center for Artificial Intelligence (DFKI), Germany
(2) TU Kaiserslautern, Germany
(3) Radboud University, Netherlands

{benjamin.bischke,patrick.helber}@dfki.de
{simon.brugman}@cs.ru.nl

Copyright 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT
This paper describes our approaches for flood severity estimation as part of
our participation in the Multimedia Satellite Task at MediaEval 2019. We use
state-of-the-art deep neural networks for image classification, object
detection and human pose estimation in order to estimate the water level from
online news images. On the multi-temporal city-centered satellite sequences,
we show that derived water indices, which are often used for flood detection,
can be learned with neural networks. By relying on recurrent networks, we want
to advance the state of the art in flood impact assessment by motivating the
use of models that are well known in computer vision but not often used by
remote sensing researchers.

1 INTRODUCTION
Many approaches in emergency response for flooding events are based on
satellite imagery and focus on flood extent mapping. In this work, we study
the enrichment of satellite imagery with complementary information from online
news by focusing on flood severity estimation. We first consider the task of
water level estimation during flooding events. Such information is
particularly important for emergency response, but at the same time difficult
to extract from satellite imagery. Reasons for this are the need for a
high-resolution elevation model and fast access to satellite imagery. The
latter is often difficult to establish, since images from optical sensors can
often not be used due to the presence of clouds and adverse constellations of
non-geostationary satellites at particular points in time. Additionally, we
study flood severity estimation by exploiting multi-temporal satellite images,
which are increasingly available nowadays. While many approaches in the past
are based on indices and pre-defined thresholds that work well for particular
regions of the world, we look at new methods from deep learning that are able
to detect changes in multi-temporal images that are attributed to flooding
events. Our approaches build upon the dataset that was released by the
Multimedia Satellite Task 2019 [1].

2 APPROACH

2.1 Multimodal Flood Level Estimation
For the estimation of the water level from online news images, we use a
multi-stage approach. In the first step, we use a convolutional neural network
(CNN) to classify images with respect to the two classes of flood-related and
non-flood-related images. For this, we use a ResNet18 [6] pre-trained on
ImageNet [3], which we fine-tuned on the images of the MFLE dataset. In the
following step, we only consider those images that were classified as
flood-related. We use an object detector to identify persons and a classifier
to determine if the water level is above or below the knee of the detected
person. As object detector, we use Faster R-CNN [7] with a ResNet101 [6] as
backbone, which was pre-trained on the Pascal VOC 2007 dataset [4]. We apply
the model to the filtered images and crop patches of persons. For each
extracted patch, we compute a feature vector that reflects the body pose of
the depicted person. The motivation for using the body pose as a feature
vector for estimating the water level is the following: if the knees or lower
body parts are occluded by water, this is reflected in the feature vector by
missing coordinates for these body joints or by a very low confidence. We use
OpenPose [2] for pose estimation and compute as feature vector the normalized
coordinates of the predicted body joints as well as the corresponding
confidence scores of the model. If the image crop depicts more than one
person, we select the one that is most centered. To finally classify the crops
into water level above or below knee level, we trained a Support Vector
Machine classifier with a radial basis function kernel. If at least one crop
is classified as above knee level, we also assign this label to the original
image; otherwise we continue with the next person patch.
   Evaluations on our internal validation dataset revealed that classifying
the body pose as a proxy for the water level leads to high recall but low
precision. This is because the lower legs are often occluded by other objects
(e.g. other persons, cars, boats) that are not water. In order to reduce the
number of false positives, we extract the lower part of the person crop and
classify this region into the two classes water and non-water. This water
detector is based on a ResNet18 [6] model, which we fine-tuned on small
patches of persons occluded by water and by objects other than water.

2.2 City-centered satellite sequences
The satellite images of the CCSS dataset are already pre-processed and
atmospherically corrected. However, so that the images can be processed with
standard deep learning frameworks, we multiply the pixel values in all bands
by a factor of 1e-4 to map the values from 16-bit integers to floating-point
numbers and normalize each band with its mean and standard deviation. We
suppressed incomplete images in a sequence, i.e. those for which the tag
FULL-DATA-COVERAGE equals false.
   For classifying the change in the images that is attributed to flooding
events, we use a sequence classification approach with LSTM models. Since we
are dealing with images, we employ a
Convolutional LSTM (ConvLSTM) [9] to learn the temporal dependencies between
the images. The ConvLSTM uses 32 hidden units and is trained on sequences of
variable lengths. We use the pre-trained network ResNet18 as encoder to
extract from the raw images the feature maps before the average pooling layer
and pass these feature maps to the ConvLSTM. Since ResNet18 was trained on
images with only three channels, we only pass the RGB bands of the Sentinel-2
satellite imagery to the network.
   In the second step, we also experiment with adding two convolutional layers
before the input of the ResNet18 that compress the 12 input channels to 3
channels using 2D convolutions. To this end, we upsample all 20 and 60 meter
bands of the Sentinel-2 images of the dataset to the resolution of the 10
meter bands via bilinear interpolation and perform a channel-wise stacking.
   Our third approach builds on the observation that the remote sensing
community has been using indices [5], such as the Normalized Difference Water
Index (NDWI), while other researchers use Convolutional Neural Networks (CNNs)
for these tasks [8]. Using indices has the benefit that the transformed bands
can be visualised and interpreted by humans, at the cost of having to be
selected and optimized by experts for the task at hand. CNNs do not offer this
benefit, but they can be trained with labelled data. We unify both approaches
and propose a network where the indices are represented as layers in the CNN.
The architecture consists of two convolutional layers with 1x1 convolutional
kernels, with a log and an exp function as activation functions after these
layers respectively. In this architecture, there is an analytical solution for
finding the weights that correspond to popular indices such as NDWI, NDVI,
ARVI and NDRE. We use two of these layers as well as the activation functions
as an alternative for the second approach to convert the 12 input channels to
3 channels.

3 RESULTS AND ANALYSIS
The development sets for all subtasks are split into an internal train and
validation set with a 70/30 ratio. We make the source code for both subtasks
available at https://github.com/bbischke/MMSat19Submission.

3.1 Multimodal Flood Level Estimation
For the MFLE subtask, we submitted the following runs: (1) classification
using the multi-stage pipeline, (2) same as (1) but without the final water
classifier, (3) using random guessing with the distribution of the development
set. We can see in Table 1 that the multi-stage approach performs better than
random guessing by a clear margin. We can also see that the water classifier
adds only a minor contribution to the pose-based water level classifier. Since
we are using a multi-stage system, we want to quantify the influence of the
different classifiers and perform an ablation study on the internal validation
set. Our results show the following insights: when considering those images
that have the label 'above knee level' as the relevant class, we can see that
the flood classifier results in a high recall and high precision. Similarly,
for the person detection with Faster R-CNN [7], we obtain a high recall and
high precision. For the classification with OpenPose, however, the recall is
high but the precision is low. There are multiple reasons for this. We
observed that (1) the pose estimation fails in certain conditions, e.g. for
women in a skirt, and (2) the predictions of OpenPose contain errors due to
reflections on the water surface. By looking at the failure cases, we
additionally noticed that it is important to filter out persons detected by
Faster R-CNN [7] that are not standing or that are only partially visible
close to the image border.

Table 1: We report the F1-scores of the MFLE task for the test set and our
internal validation set. The water patch classifier (run 2) has only a small
impact compared to run 1.

            Run 1     Run 2     Random    Dev. Dist.
Dev. set    73.09%    74.38%    59.51%    51.42%
Test set    74.27%    74.82%    59.17%    51.80%

3.2 City-centered satellite sequences
For the CCSS subtask, we submitted the following runs: (1) using a ConvLSTM
with RGB bands as input, (2) using a ConvLSTM with all 12 bands as input and
two 1x1 convolutions that reduce 12 to 3 bands, (3) same as (2) but with log
and exp as activation functions. As baselines, we compare the approaches
against (4) random guessing and (5) random guessing with the distribution of
the development set. In Table 2, we report the scores for all five runs. We
can see that all runs based on the ConvLSTM yield high scores for both sets.
Additionally, we can see that the score for RGB (run 1) is slightly better on
the development set than the other runs, while all bands (run 2) and all bands
reduced with the internally learned indices (run 3) resulted in the highest
scores on the test set. Since the scores for the first three runs are all very
high, we will extend this work in the future with an additional test set.

Table 2: We report the F1-scores of the CCSS task for the test set and our
internal validation set. The ConvLSTM performs better than our baselines by a
clear margin.

            Run 1     Run 2     Run 3     Random    Dev. Dist.
Dev. set    93.82%    91.35%    92.59%    51.85%    58.88%
Test set    92.10%    93.50%    96.29%    49.32%    56.10%

4 CONCLUSION & FUTURE WORK
Summarizing this work, we presented for the MFLE task an approach based on
state-of-the-art computer vision models for water level estimation from online
images. In this approach we employed the model OpenPose [2] and showed how
existing approaches can be used to support disaster response. Nevertheless, we
also identified limitations and future directions to consider (reflections,
skirts, persons on image borders). For the second subtask, we showed that
ConvLSTMs are a powerful model to detect changes of a particular class in
multi-temporal satellite imagery. Additionally, we explored the possibility of
representing traditional remote sensing indices directly with neural networks.
We will follow up on this idea in the future, as such models can be very
helpful to combine insights of remote sensing (indices) with recent advances
in deep learning.

ACKNOWLEDGMENTS
This work was supported by the BMBF project DeFuseNN (01IW17002) and the
NVIDIA AI Lab (NVAIL) program.
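
The index layers described in Section 2.2 can be illustrated on a single pixel, since a 1x1 convolution acts on the band vector of each pixel independently. The following Python sketch is not taken from the released code. It assumes a hypothetical two-band pixel (a, b) with a > b > 0 so that both logarithms are defined, and it sets the weights `W1` and `W2` by hand (all names are chosen for illustration only) so that the stack conv, log, conv, exp reproduces a normalized-difference index of the form (a - b) / (a + b), the form shared by indices such as NDWI and NDVI. How pixels with a negative numerator are handled is not specified in the text and is left out here.

```python
import math
from typing import List, Sequence


def conv1x1(pixel: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Bias-free 1x1 convolution on one pixel: one dot product per output channel."""
    return [sum(w * v for w, v in zip(row, pixel)) for row in weights]


def index_layer(pixel: Sequence[float],
                w1: Sequence[Sequence[float]],
                w2: Sequence[Sequence[float]]) -> List[float]:
    """conv1x1 -> log -> conv1x1 -> exp, evaluated for a single pixel."""
    hidden = [math.log(v) for v in conv1x1(pixel, w1)]
    return [math.exp(v) for v in conv1x1(hidden, w2)]


# Hand-set weights (assumption for illustration): the first layer forms the
# numerator and denominator, the second subtracts their logarithms.
W1 = [[1.0, -1.0],   # a - b
      [1.0, 1.0]]    # a + b
W2 = [[1.0, -1.0]]   # log(a - b) - log(a + b)


def normalized_difference(a: float, b: float) -> float:
    """Reference formula (a - b) / (a + b) for comparison."""
    return (a - b) / (a + b)


if __name__ == "__main__":
    a, b = 0.6, 0.2  # hypothetical reflectance values with a > b > 0
    print(index_layer([a, b], W1, W2)[0], normalized_difference(a, b))
```

Because exp(log(a - b) - log(a + b)) equals (a - b) / (a + b), the index is one particular weight setting of this layer stack, while training is free to move the weights away from it.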


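The pose features and the image-level decision rule of Section 2.1 can be sketched in a similar way. This is a minimal Python illustration under stated assumptions, not the submitted implementation: each joint is assumed to be an (x, y, confidence) triple in crop pixel coordinates, undetected joints are assumed to be reported as (0, 0, 0), "most centered" is interpreted as the smallest distance between the mean of the detected joints and the crop center, and all function names are hypothetical. The SVM that consumes the feature vector is omitted.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]  # (x, y, confidence), assumed layout


def pose_features(joints: Sequence[Joint], crop_w: float, crop_h: float) -> List[float]:
    """Flatten joints into [x / crop_w, y / crop_h, confidence, ...]."""
    features: List[float] = []
    for x, y, conf in joints:
        features.extend([x / crop_w, y / crop_h, conf])
    return features


def most_centered(people: Sequence[Sequence[Joint]],
                  crop_w: float, crop_h: float) -> Sequence[Joint]:
    """Select the person whose detected joints lie closest to the crop center."""
    cx, cy = crop_w / 2.0, crop_h / 2.0

    def distance_to_center(joints: Sequence[Joint]) -> float:
        visible = [(x, y) for x, y, conf in joints if conf > 0.0]
        if not visible:
            return float("inf")  # a person without detected joints is never preferred
        mean_x = sum(x for x, _ in visible) / len(visible)
        mean_y = sum(y for _, y in visible) / len(visible)
        return ((mean_x - cx) ** 2 + (mean_y - cy) ** 2) ** 0.5

    return min(people, key=distance_to_center)


def image_is_above_knee(crop_predictions: Sequence[bool]) -> bool:
    """An image is labeled 'above knee level' if at least one person crop is."""
    return any(crop_predictions)
```

Occluded knees or ankles then show up in the feature vector as zero coordinates with zero confidence, which is the signal the classifier is meant to pick up.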
REFERENCES
 [1] Benjamin Bischke, Patrick Helber, Simon Brugman, Erkan Basar,
     Zhengyu Zhao, Martha Larson, and Konstantin Pogorelov. 2019. The
     Multimedia Satellite Task at MediaEval 2019. In Working Notes
     Proceedings of the MediaEval 2019. MediaEval Benchmark (MediaEval-2019),
     October 27-29. Sophia Antipolis, France.
 [2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime
     Multi-Person 2D Pose Estimation using Part Affinity Fields. In CVPR.
 [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
     2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE
     conference on computer vision and pattern recognition. IEEE, 248–255.
 [4] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn,
     and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC)
     challenge. International journal of computer vision 88, 2 (2010),
     303–338.
 [5] Bo-Cai Gao. 1996. NDWI—A normalized difference water index for
     remote sensing of vegetation liquid water from space. Remote sensing
     of environment 58, 3 (1996), 257–266.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     residual learning for image recognition. In Proceedings of the IEEE
     conference on computer vision and pattern recognition. 770–778.
 [7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster
     R-CNN: Towards real-time object detection with region proposal
     networks. In Advances in neural information processing systems. 91–99.
 [8] Tim GJ Rudner, Marc Rußwurm, Jakub Fil, Ramona Pelich, Benjamin
     Bischke, Veronika Kopačková, and Piotr Biliński. 2019. Multi3Net:
     Segmenting Flooded Buildings via Fusion of Multiresolution,
     Multisensor, and Multitemporal Satellite Imagery. In Proceedings of the
     AAAI Conference on Artificial Intelligence, Vol. 33. 702–709.
 [9] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin
     Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A
     machine learning approach for precipitation nowcasting. In Advances
     in neural information processing systems. 802–810.