Flood Severity Estimation from Online News Images and Multi-Temporal Satellite Images using Deep Neural Networks

Benjamin Bischke1,2, Simon Brugman3, Patrick Helber1,2
1 German Research Center for Artificial Intelligence (DFKI), Germany
2 TU Kaiserslautern, Germany
3 Radboud University, Netherlands
{benjamin.bischke,patrick.helber}@dfki.de, {simon.brugman}@cs.ru.nl

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
This paper describes our approaches for flood severity estimation in our participation in the Multimedia Satellite Task at MediaEval 2019. We use state-of-the-art deep neural networks for image classification, object detection and human pose estimation in order to estimate the water level from online news images. On the multi-temporal city-centered satellite sequences, we show that derived water indices, which are often used for flood detection, can be learned with neural networks. By relying on recurrent networks, we aim to advance the state of the art in flood impact assessment by motivating the use of models that are well known in computer vision but rarely used by remote sensing researchers.

1 INTRODUCTION
Many approaches in emergency response for flooding events are based on satellite imagery and focus on flood extent mapping. In this work, we study the enrichment of satellite imagery with complementary information from online news by focusing on flood severity estimation. We first consider the task of water level estimation during flooding events. Such information is particularly important for emergency response, but at the same time difficult to extract from satellite imagery. The reasons for this are the need for a high-resolution elevation model and for fast access to satellite imagery. The latter is often difficult to establish, since images from optical sensors frequently cannot be used due to the presence of clouds and adverse constellations of non-geostationary satellites at particular points in time.

Additionally, we study flood severity estimation by exploiting multi-temporal satellite images, which are increasingly available. While many past approaches are based on indices and pre-defined thresholds that work well for particular regions of the world, we look at new methods from deep learning that are able to detect changes in multi-temporal images that are attributed to flooding events. Our approaches build upon the dataset that was released by the Multimedia Satellite Task 2019 [1].

2 APPROACH

2.1 Multimodal Flood Level Estimation
For the estimation of the water level from online news images, we use a multi-stage approach. In the first step, we use a convolutional neural network (CNN) to classify images into the two classes flood-related and non-flood-related. For this, we use a ResNet18 [6] pre-trained on ImageNet [3], which we fine-tuned on the images of the MFLE dataset. In the following step, we only consider those images that were classified as flood-related. We use an object detector to identify persons and a classifier to determine whether the water level is above or below the knee of the detected person. As object detector, we use Faster R-CNN [7] with a ResNet101 [6] backbone, which was pre-trained on the Pascal VOC 2007 dataset [4]. We apply the model to the filtered images and crop patches of persons.

For each extracted patch, we compute a feature vector that reflects the body pose of the depicted person. The motivation for using the body pose as a feature vector for estimating the water level is the following: if the knees or lower body parts are occluded by water, this is reflected in the feature vector, either by missing coordinates for these body joints or by very low confidence scores. We use Openpose [2] for pose estimation and use as feature vector the normalized coordinates of the predicted body joints together with the corresponding confidence scores of the model. If a crop depicts more than one person, we select the one that is most centered. To finally classify the crops into water level above or below knee level, we trained a Support Vector Machine classifier with a radial basis function kernel. If at least one crop is classified as above knee level, we assign this label to the original image as well; otherwise we continue with the next person patch.

Evaluations on our internal validation dataset revealed that classifying the body pose as a proxy for the water level leads to high recall but low precision. This is because the lower legs are often occluded by other objects (e.g. other persons, cars, boats) that are not water. In order to reduce the number of false positives, we extract the lower part of the person crop and classify this region into the two classes water and non-water. This water detector is based on a ResNet18 [6] model, which we fine-tuned on small patches of persons occluded by water and by other objects.

2.2 City-centered satellite sequences
The satellite images of the CCSS dataset are already pre-processed and atmospherically corrected. However, so that the images can be processed with standard deep learning frameworks, we multiply the pixel values in all bands by a factor of 1e-4 to map the 16-bit values to floating-point numbers, and normalize each band with its mean and standard deviation. We discarded incomplete images in a sequence, i.e. images whose tag FULL-DATA-COVERAGE equals false. For classifying the change in the images that is attributed to flooding events, we use a sequence classification approach with LSTM models.
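The band preprocessing described above (scaling by 1e-4, per-band normalization, and dropping images without full data coverage) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name preprocess_sequence, the nested-list image layout image[band][pixel] and the example statistics are hypothetical, and we assume that the band mean and standard deviation were computed on the already scaled values.

```python
from typing import List, Sequence

# Hypothetical layout: image[band][pixel], with the pixels of a band flattened.
Image = List[List[float]]


def preprocess_sequence(
    images: Sequence[Image],
    full_coverage: Sequence[bool],
    band_mean: Sequence[float],
    band_std: Sequence[float],
    scale: float = 1e-4,
) -> List[Image]:
    """Drop incomplete images, then scale and normalize every band."""
    # Keep only images whose full-data-coverage flag is true.
    kept = [img for img, ok in zip(images, full_coverage) if ok]
    # Scale the raw 16-bit values to floats, then standardize per band.
    return [
        [
            [(value * scale - mean) / std for value in band]
            for band, mean, std in zip(img, band_mean, band_std)
        ]
        for img in kept
    ]


# Toy sequence: three images with two bands and three pixels each.
sequence = [[[2000, 2000, 2000], [3000, 3000, 3000]] for _ in range(3)]
flags = [True, False, True]  # the second image is incomplete
batch = preprocess_sequence(sequence, flags, band_mean=[0.1, 0.2], band_std=[0.05, 0.1])
```

In the toy call, the second image is removed because its flag is false, so the resulting sequence is shorter than the input. This is one reason why the sequence model has to cope with variable lengths.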
Since we are dealing with images, we employ a Convolutional LSTM (ConvLSTM) [9] to learn the temporal dependencies between the images. The ConvLSTM uses 32 hidden units and is trained on sequences of variable length. We use the pre-trained network ResNet18 as encoder to extract the feature maps before the average pooling layer from the raw images, and pass these feature maps to the ConvLSTM. Since ResNet18 was trained on images with only three channels, we only pass the RGB bands of the Sentinel-2 satellite imagery to the network.

As a second approach, we experiment with adding two convolutional layers before the input of the ResNet18 that compress the 12 input channels to 3 channels using 2D convolutions. To this end, we upsample all 20 and 60 meter bands of the Sentinel-2 images of the dataset to the resolution of the 10 meter bands via bilinear interpolation and stack the bands channel-wise.

Our third approach builds on the observation that the remote sensing community has been using indices [5], such as the Normalized Difference Water Index (NDWI), while other researchers use Convolutional Neural Networks (CNNs) for these tasks [8]. Using indices has the benefit that the transformed bands can be visualised and interpreted by humans, at the cost of having to be selected and optimized by experts for the task at hand. CNNs do not offer this interpretability, but can be trained with labelled data. We unify both approaches and propose a network where the indices are represented as layers in the CNN. The architecture consists of two convolutional layers with 1x1 convolutional kernels, followed by a log and an exp function as activation function, respectively. In this architecture, there is an analytical solution for the weights that correspond to popular indices such as NDWI, NDVI, ARVI and NDRE. We use two of these layers together with the activation functions as an alternative to the second approach for converting the 12 input channels to 3 channels.

3 RESULTS AND ANALYSIS
The development sets for all subtasks are split into an internal train and validation set with a 70/30 ratio. We make the source code for both subtasks available at https://github.com/bbischke/MMSat19Submission.

3.1 Multimodal Flood Level Estimation
For the MFLE subtask, we submitted the following runs: (1) classification using the multi-stage pipeline, (2) same as (1) but without the final water classifier, (3) random guessing with the distribution of the development set. We can see in Table 1 that the multi-stage approach performs better than random guessing (73.09% vs. 59.51% F1 on the development set). We can also see that the final water classifier contributes little on top of the pose-based water level classifier.

Table 1: F1-scores of the MFLE task for the test set and our internal validation set. The water patch classifier has only a small impact (run 1 vs. run 2).

            Run 1    Run 2    Random   Dev. Dist.
Dev. set    73.09%   74.38%   59.51%   51.42%
Test set    74.27%   74.82%   59.17%   51.80%

Since we use a multi-stage system, we want to quantify the influence of the individual classifiers and therefore perform an ablation study on the internal validation set. Our results show the following: when considering the images with the label 'above knee level' as the relevant class, the flood classifier achieves high recall and high precision. Similarly, for the person detection with Faster R-CNN [7], we obtain high recall and high precision. For the classification with Openpose, however, the recall is high but the precision is low. There are multiple reasons for this. We observed that (1) the pose estimation fails in certain conditions, e.g. for women wearing a skirt, and (2) Openpose makes prediction errors due to reflections on the water surface. By looking at the failure cases, we additionally noticed that it is important to filter out some of the persons detected by Faster R-CNN [7], e.g. persons that are not standing or partially visible persons close to the image border.

3.2 City-centered satellite sequences
For the CCSS subtask, we submitted the following runs: (1) a ConvLSTM with the RGB bands as input, (2) a ConvLSTM with all 12 bands as input and two 1x1 convolutions that reduce the 12 bands to 3, (3) same as (2) but with log and exp as activation functions. As baselines, we compare the approaches against (4) random guessing and (5) random guessing with the distribution of the development set. In Table 2, we report the scores for all five runs. We can see that all runs based on the ConvLSTM yield high scores on both sets. Additionally, the score for RGB (run 1) is slightly better on the development set than the other runs, while all bands (run 2) and all bands reduced with the internally learned indices (run 3) resulted in higher scores on the test set, with run 3 scoring highest. Since the scores for the first three runs are all very high, we will extend this work in the future with an additional test set.

Table 2: F1-scores of the CCSS task for the test set and our internal validation set. All ConvLSTM runs score above 91%, while both baselines stay below 59%.

            Run 1    Run 2    Run 3    Random   Dev. Dist.
Dev. set    93.82%   91.35%   92.59%   51.85%   58.88%
Test set    92.10%   93.50%   96.29%   49.32%   56.10%

4 CONCLUSION & FUTURE WORK
Summarizing this work, we presented for the MFLE task an approach based on state-of-the-art computer vision models for water level estimation from online images. In this approach we employed the model Openpose [2] and showed how existing approaches can be used to support disaster response. Nevertheless, we also identified limitations and future directions to consider (reflections, skirts, persons on image borders). For the second subtask, we showed that ConvLSTMs are a powerful model for detecting changes of a particular class in multi-temporal satellite imagery. Additionally, we explored the possibility of representing traditional remote sensing indices directly with neural networks. We will follow up on this idea in the future, as such models can be very helpful for combining insights from remote sensing (indices) with recent advances in deep learning.

ACKNOWLEDGMENTS
This work was supported by the BMBF project DeFuseNN (01IW17002) and the NVIDIA AI Lab (NVAIL) program.

The 2019 Multimedia Satellite Task, MediaEval'19, 27-29 October 2019, Sophia Antipolis, France

REFERENCES
[1] Benjamin Bischke, Patrick Helber, Simon Brugman, Erkan Basar, Zhengyu Zhao, Martha Larson, and Konstantin Pogorelov. 2019. The Multimedia Satellite Task at MediaEval 2019. In Working Notes Proceedings of the MediaEval 2019. MediaEval Benchmark (MediaEval-2019), October 27-29. Sophia Antipolis, France.
[2] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In CVPR.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[4] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88, 2 (2010), 303–338.
[5] Bo-Cai Gao. 1996. NDWI—A normalized difference water index for remote sensing of vegetation liquid water from space. Remote Sensing of Environment 58, 3 (1996), 257–266.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[8] Tim GJ Rudner, Marc Rußwurm, Jakub Fil, Ramona Pelich, Benjamin Bischke, Veronika Kopačková, and Piotr Biliński. 2019. Multi3Net: Segmenting Flooded Buildings via Fusion of Multiresolution, Multisensor, and Multitemporal Satellite Imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 702–709.
[9] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems. 802–810.
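
SUPPLEMENTARY SKETCH: INDEX LAYER
The index layer of Section 2.2 (two 1x1 convolutions, followed by log and exp respectively) can be checked numerically for a generic normalized difference (a - b) / (a + b), the form shared by indices such as NDVI. The sketch below is one possible reading of that architecture and not the released implementation: the names conv1x1, index_layer, W1 and W2 are hypothetical, the convolutions are assumed to be bias-free, and a 1x1 convolution is written as a per-pixel matrix-vector product over channels. The first layer forms the numerator and denominator, log turns the quotient into a difference, the second layer takes that difference, and exp maps it back.

```python
import math
from typing import List, Sequence


def conv1x1(pixel: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Bias-free 1x1 convolution at one pixel: each output channel is a
    weighted sum of the input channels."""
    return [sum(w * v for w, v in zip(row, pixel)) for row in weight]


def index_layer(pixel, w1, w2) -> List[float]:
    """conv1x1 -> log -> conv1x1 -> exp, applied to one pixel."""
    hidden = [math.log(v) for v in conv1x1(pixel, w1)]
    return [math.exp(v) for v in conv1x1(hidden, w2)]


# Hand-set weights for (a - b) / (a + b), with channel 0 = a and channel 1 = b.
W1 = [[1.0, -1.0],  # numerator   a - b
      [1.0, 1.0]]   # denominator a + b
W2 = [[1.0, -1.0]]  # log(numerator) - log(denominator)


def normalized_difference(a: float, b: float) -> float:
    return (a - b) / (a + b)


# Toy reflectance pairs with a > b, so that the logarithm is defined.
pixels = [(0.30, 0.10), (0.45, 0.20), (0.12, 0.08)]
results = [
    (index_layer([a, b], W1, W2)[0], normalized_difference(a, b))
    for a, b in pixels
]
```

Under this reading the identity only holds where the numerator is positive. For pixels with a <= b the logarithm is undefined, so a practical layer would need an offset or a different band ordering; how this case is handled is not specified in the text above.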