=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_53
|storemode=property
|title=Flood Level Estimation from News Articles and Flood Detection from Satellite Image Sequences
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_53.pdf
|volume=Vol-2670
|authors=Yu Feng,Shumin Tang,Hao Cheng,Monika Sester
|dblpUrl=https://dblp.org/rec/conf/mediaeval/FengTCS19
}}
==Flood Level Estimation from News Articles and Flood Detection from Satellite Image Sequences==
Yu Feng, Shumin Tang, Hao Cheng, Monika Sester
Institute of Cartography and Geoinformatics, Leibniz University Hannover, Germany
{yu.feng,hao.cheng,monika.sester}@ikg.uni-hannover.de, shumin.tang@outlook.com

ABSTRACT

This paper presents the solutions of team EVUS-ikg for the Multimedia Satellite Task at MediaEval 2019. We addressed two of the subtasks, namely multimodal flood level estimation (MFLE) and city-centered satellite sequences (CCSS). For MFLE, a two-step approach was proposed, which first retrieves flood-relevant images based on global deep features and then detects severe flood images based on self-defined distance features extracted from human body keypoints and semantic segments. For CCSS, a neural network combining a CNN and an LSTM was used to detect floods in satellite image sequences. Both methods achieved good performance on the test set, which shows great potential for improving current flood monitoring applications.

1 INTRODUCTION

Floods, among the most destructive natural disasters, endanger people's safety and property. Satellite images are one of the most frequently used data sources for flood mapping. However, they alone do not provide enough evidence to estimate local flood severity. Crowdsourcing, a rapidly developing method of data acquisition, has proved beneficial for this purpose: flood-relevant posts can be retrieved from social media with image and text classifiers trained with deep neural networks (e.g. [8, 11]). However, the information retrieved so far is mostly plain evidence of flooding; further details, such as flood severity, are still desired for many emergency response applications.

In the previous Multimedia Satellite Tasks (MMSat) at the MediaEval benchmarking initiative, several tasks have been proposed regarding flood detection from satellite and social media data. MMSat'18 [3] provided binary labels indicating road passability for tweets with photos, which can be regarded as an early step towards extracting local flood severity information. Our solution [9], which simply used early fusion of several pre-trained CNN features, achieved average performance compared with the other teams. In MMSat'19 [1], the subtask multimodal flood level estimation (MFLE) goes one step further: it aims to extract only those news articles that report a severe flood situation, based on textual and visual information. As for the satellite data, most previous research applied semantic segmentation to indicate which pixels are water. To confirm whether an area is actually flooded, an extra water boundary is always needed for comparison, and differences can be caused not only by floods but also by mapping errors or seasonal change. In MMSat'19 [1], sequences of satellite images are provided for a binary classification of the appearance of flooding events, which could be a more reliable data source.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

2 APPROACH

In MFLE, corresponding image and text pairs were annotated with binary labels, which indicate whether the image contains at least one person standing in water above the knee. For run 1, where only visual information is allowed, a two-step approach was proposed: a first classifier was trained to extract flood-relevant images, and a second classifier was then used to detect, among these relevant images, those containing people standing in water above the knee. For the first classifier, we concatenated the features extracted from four CNN models, namely InceptionV3 [15], DenseNet201 [10] and InceptionResNetV2 [14] pre-trained on ImageNet, and VGG16 pre-trained on Places365 [16]. Then, we trained a classifier on these features with Xgboost [7].
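To make this first stage concrete, the sketch below shows early fusion of global deep features followed by an Xgboost classifier, assuming the Keras applications API and the xgboost package. The classifier hyperparameters and image size are illustrative, and since the Places365 weights for VGG16 are distributed separately from Keras, ImageNet weights stand in for them here.

```python
import numpy as np
import xgboost as xgb
from tensorflow.keras.applications import (DenseNet201, InceptionResNetV2,
                                           InceptionV3, VGG16)

def build_extractors():
    # include_top=False with global average pooling yields one global
    # feature vector per image and per backbone.
    return [
        InceptionV3(weights="imagenet", include_top=False, pooling="avg"),
        DenseNet201(weights="imagenet", include_top=False, pooling="avg"),
        InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg"),
        # The paper uses VGG16 pre-trained on Places365; those weights are
        # not bundled with Keras, so ImageNet weights stand in here.
        VGG16(weights="imagenet", include_top=False, pooling="avg"),
    ]

def extract_features(images, extractors):
    # images: float array (n, H, W, 3), resized and scaled as the
    # backbones expect (per-model preprocessing omitted for brevity).
    feats = [m.predict(images, verbose=0) for m in extractors]
    return np.concatenate(feats, axis=1)  # early fusion by concatenation

# Hypothetical usage with (n, 224, 224, 3) images and 0/1 relevance labels:
# clf = xgb.XGBClassifier(n_estimators=300)
# clf.fit(extract_features(train_images, build_extractors()), train_labels)
```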
Subsequently, all images predicted as positive are processed with OpenPose [5], pre-trained on the Microsoft COCO dataset [13], for multi-person body keypoint detection, and with DeeplabV3+ [6], pre-trained on ADE20K [17], for semantic segmentation. During this step, images without persons, or in which no person is adjacent to a ground or water segment, were directly marked as negative. Afterwards, we detected the water line based on the body keypoints and segments, following the steps shown in Figure 1. Finally, the vertical pixel distance from each keypoint to the water line is divided by the body length to obtain relative distances. These distances were used as features to represent the relationship between the water and a single person. We then assigned each image's annotation to all persons in the image and trained a second binary classifier with Xgboost. For images with multiple persons, the image is considered "positive" if at least one of the persons is predicted as "positive" by this model.
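The distance feature can be illustrated with a minimal sketch. It assumes the keypoints arrive as (x, y) pixel coordinates from OpenPose and reduces the water line to a single image row, a deliberate simplification of the convex-hull procedure of Figure 1; the function name and the numbers are illustrative.

```python
import numpy as np

def relative_distances(keypoints, water_y, body_length):
    """Relative vertical distance of each keypoint to the water line.

    keypoints: (K, 2) array of OpenPose (x, y) pixel coordinates, with
      y growing downwards from the top of the image.
    water_y: row index of the estimated water line (a scalar here; the
      paper derives the line from the segments via the steps in Figure 1).
    body_length: person height in pixels, used for normalization.
    Positive values lie above the water line; negative values lie
    below it, i.e. submerged.
    """
    return (water_y - keypoints[:, 1]) / body_length

# Hypothetical example: a knee keypoint at row 430 with the water line
# at row 410 gives a negative value, the cue the second Xgboost
# classifier can associate with "standing in water above the knee".
# feats = relative_distances(np.array([[220, 430], [230, 300]]), 410.0, 180.0)
```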
For run 2, where only textual information is allowed, we used a TextCNN model [12] with fasttext [4] word embeddings, the same as our solution for MMSat'18 [9]. For run 3, where visual and textual information are fused, only the articles predicted as "positive" by both the run 1 and run 2 models are considered "positive" by the fused model. In run 4, we introduced extra data for the vision-based model: since MMSat'17 [2] provided binary labeled images indicating flood relevancy, we trained the first-stage classifier of run 1 on this augmented dataset.

In CCSS, the sequences are collected from 12-band Sentinel-2 satellite images with date and time. As a pre-processing step, we normalized each band individually by calculating its Z-score (subtracting the mean and then dividing by the standard deviation) and then clipped the normalized image to the range from -1 to 1. We used DenseNet121 as a feature extractor and then connected the features in the temporal direction using an LSTM with 32 cells (Figure 2). We used a many-to-many LSTM, where we required an output for each input image individually. The weights of the DenseNet were initialized with weights pre-trained on ImageNet. Since the image sequence lengths vary but do not exceed 24 layers, we padded each sequence to 24 layers with zero tensors of the same size and generated a mask indicating which layers are padded. We then excluded these padded images when calculating the categorical cross-entropy loss and the accuracy metric. During training, we used only the RGB channels with a reduced image size of 256 × 256. Of the 267 sequences available in the devset, we used 170 for training, 43 for validation and 54 as an internal test set. Since some images in a sequence may be broken or incomplete, the field FULL-DATA-COVERAGE in the timeseries files helped us to filter out such images.
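One possible Keras realization of this sequence model, including the per-band normalization, is sketched below. The DenseNet121 backbone, the 32-cell many-to-many LSTM, the 24-step zero padding and the 256 × 256 RGB input follow the description above; the optimizer, the two-class softmax output and the exact masking mechanics are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

SEQ_LEN, H, W = 24, 256, 256  # sequences zero-padded to 24 RGB frames

def normalize_band(band):
    # Per-band Z-score, then clipping to [-1, 1], as in the pre-processing.
    z = (band - band.mean()) / (band.std() + 1e-8)
    return np.clip(z, -1.0, 1.0)

def build_model():
    # Shared DenseNet121 applied to every time step, followed by a
    # many-to-many LSTM that emits one prediction per frame.
    cnn = DenseNet121(weights="imagenet", include_top=False, pooling="avg",
                      input_shape=(H, W, 3))
    seq_in = layers.Input(shape=(SEQ_LEN, H, W, 3))
    feats = layers.TimeDistributed(cnn)(seq_in)             # (batch, 24, 1024)
    hidden = layers.LSTM(32, return_sequences=True)(feats)  # many-to-many
    out = layers.TimeDistributed(
        layers.Dense(2, activation="softmax"))(hidden)      # flood / no flood
    model = models.Model(seq_in, out)
    # A (batch, SEQ_LEN) sample_weight with zeros at padded positions
    # removes those frames from the loss and the weighted accuracy.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  weighted_metrics=["accuracy"])
    return model

# mask: 1.0 for real frames, 0.0 for padding; y one-hot, shape (batch, 24, 2)
# model.fit(x, y, sample_weight=mask, batch_size=2, epochs=10)
```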
We trained this model with different annotation settings. Each sequence is annotated with a binary label (hereinafter called seq-label), and the field FLOODING in the timeseries file of each sequence provides labels for the individual layers (hereinafter called layer-labels). Our initial inspection of the training data showed a strong correspondence between the layer-labels and the seq-labels: only sequences whose layer-labels are all negative carry a negative seq-label. Thus, in run 1, we simply used the layer-labels to train the model. Subsequently, we tested different pseudo labels generated from the seq-labels. Run 2 used a repetition of the seq-label (i.e. if the seq-label is 1, the layer-labels are all 1; if the seq-label is 0, the layer-labels are all 0). We also observed a strong pattern in the positively labeled sequences: the first half of the layer-labels are negative while the latter half are positive. We followed this pattern to generate pseudo layer-labels in run 3, where a seq-label of 1 yields the layer-labels [0, 0, 0, 1, 1, 1] for a sequence of 6 images. For comparison, instead of a many-to-many LSTM, run 4 applied a many-to-one LSTM, where the model is optimized only on the seq-labels.

Figure 1: Steps for feature extraction (adapted on an image under CC BY-NC-SA 2.0): (1) overlay of semantic segmentation and body keypoints; (2) valid area created by body keypoints and image bottom boundary points using a convex hull; (3) extraction of the valid connecting boundary; (4) abstraction with the height of the lowest point; (5) extraction of the distance feature.

Figure 2: Model for subtask city-centered satellite sequences (input images → DenseNet121 feature generator → feature vectors → LSTM sequence learning → predicted label per layer).

Table 1: Evaluation on multimodal flood level estimation

Macro-avg. F1-score   Run 1    Run 2    Run 3    Run 4
Development set       73.99%   52.96%   75.56%   73.23%
Test set              68.16%   48.86%   67.27%   68.28%

Table 2: Evaluation on city-centered satellite sequences

Micro-avg. F1-score   Run 1    Run 2    Run 3    Run 4
Development set       97.00%   98.50%   97.75%   68.16%
Test set              92.65%   94.12%   97.06%   60.29%

3 RESULTS AND DISCUSSION

For MFLE (Table 1), our image-based approach achieves a macro-averaged F1-score of 68.16% on the test set. The text-based method performs significantly worse than the image-based model, and combining textual and visual information did not improve the F1-score in our case. In run 4, the F1-score improved slightly through the additional images introduced for training the flood relevance classifier. We further examined the failure cases. They fall into three types, namely failed pose detection, failed semantic segmentation and failed water level estimation; most of them are caused by drawing a wrong water line, which leads to many false positive detections. For CCSS (Table 2), comparing the first three runs, which use a many-to-many LSTM, with the many-to-one LSTM of run 4, the improvement is obvious. Regarding the different annotation settings, run 3 achieved the best performance, where the pseudo labels followed the annotation patterns of the layer-labels. Run 2 also achieved better performance than run 1, which indicates that exact layer-labels may not be necessary to predict whether a sequence contains a flood.

4 CONCLUSIONS AND OUTLOOK

In this paper, separate solutions have been proposed for the subtasks MFLE and CCSS. Both models solve their tasks properly according to their performance on the test set. For MFLE, the water line estimation can be further improved in order to reduce false positive detections. For CCSS, the robustness of the model should be tested on further events.

ACKNOWLEDGMENTS

This work is supported by the project TransMiT (BMBF, 033W105A). The computational resources were provided by ICAML (BMBF, 01IS17076).

REFERENCES

[1] Benjamin Bischke, Patrick Helber, Simon Brugman, Erkan Basar, Zhengyu Zhao, Martha Larson, and Konstantin Pogorelov. 2019. The Multimedia Satellite Task at MediaEval 2019: Estimation of Flood Severity. In Proc. of the MediaEval 2019 Workshop (Oct. 27-29, 2019). Sophia Antipolis, France.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018: Emergency Response for Flooding Events. In Proc. of the MediaEval 2018 Workshop (Oct. 29-31, 2018). Sophia Antipolis, France.
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[5] Z. Cao, T. Simon, S. Wei, and Y. Sheikh. 2017. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1302–1310.
[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 801–818.
[7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[8] Yu Feng and Monika Sester. 2018. Extraction of Pluvial Flood Relevant Volunteered Geographic Information (VGI) by Deep Learning from User Generated Texts and Photos. ISPRS International Journal of Geo-Information 7, 2 (2018), 39.
[9] Yu Feng, Sergiy Shebotnov, Claus Brenner, and Monika Sester. 2018. Ensembled Convolutional Neural Network Models for Retrieving Flood Relevant Tweets. In Proc. of the MediaEval 2018 Workshop (Oct. 29-31, 2018). Sophia Antipolis, France.
[10] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Xiao Huang, Cuizhen Wang, Zhenlong Li, and Huan Ning. 2018. A Visual–Textual Fused Approach to Automated Tagging of Flood-Related Tweets during a Flood Event. International Journal of Digital Earth (2018), 1–17.
[12] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October 25-29, 2014, Doha, Qatar. 1746–1751.
[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision. Springer, 740–755.
[14] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In AAAI, Vol. 4. 12.
[15] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
[16] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.
[17] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2017. Scene Parsing through ADE20K Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.