=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_39
|storemode=property
|title=Data-Driven Flood Detection using Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_39.pdf
|volume=Vol-1984
|authors=Keiller Nogueira,Samuel Fadel,Ícaro Dourado,Rafael Werneck,Javier Muñoz,Otávio Penatti,Rodrigo Calumby,Lin Li,Jefersson Dos Santos,Ricardo Torres
|dblpUrl=https://dblp.org/rec/conf/mediaeval/NogueiraFDWMPCL17
}}
==Data-Driven Flood Detection using Neural Networks==
Keiller Nogueira¹, Samuel G. Fadel², Ícaro C. Dourado², Rafael de O. Werneck², Javier A. V. Muñoz², Otávio A. B. Penatti³, Rodrigo T. Calumby²ʼ⁴, Lin Tzy Li²ʼ³, Jefersson A. dos Santos¹, Ricardo da S. Torres²

¹ Universidade Federal de Minas Gerais (UFMG), ² University of Campinas (Unicamp), ³ SAMSUNG R&D Institute Brazil, ⁴ University of Feira de Santana

[keiller.nogueira,jefersson]@dcc.ufmg.br, [samuel.fadel,icaro.dourado,rafael.werneck,lintzyli,rtorres]@ic.unicamp.br, jalvarm.acm@gmail.com, o.penatti@samsung.com, rtcalumby@ecomp.uefs.br

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT

This paper describes the approaches used by our team (MultiBrasil) for the Multimedia Satellite Task at MediaEval 2017. For both disaster image retrieval and flood detection in satellite images, we employ neural networks for end-to-end learning. Specifically, for the first subtask, we exploit Convolutional Networks and Relation Networks while, for the latter, dilated Convolutional Networks were employed.

1 INTRODUCTION

Natural disaster monitoring is a fundamental task for creating prevention strategies, as well as for helping authorities to act in the control of damages. In its first appearance at MediaEval, the Multimedia Satellite Task [3] focuses on the monitoring of flooding events, which are considered the most harmful and costly type of natural disaster in the world [8]. The task is subdivided into two subtasks: (a) Disaster Image Retrieval from Social Media (DIRSM), which deals with flooding events in data (visual and textual) crawled from social media; and (b) Flood-Detection in Satellite Images (FDSI), which refers to segmenting flooded regions in satellite images.

2 DISASTER IMAGE RETRIEVAL (DIRSM)

For the DIRSM task, we employed Convolutional Networks (CNN) [6] to deal with visual features. For textual features, we applied Relation Networks (RN) [9] and traditional methods (as baselines) such as Bag of Words (BoW) and bigrams. The recently proposed RN is a neural network designed to take into account the relationship between pairs of objects during training. An RN consists of two neural networks, f and g, whose parameters are learned jointly.

In runs 1, 2, 3, and 4, we used neural networks and trained them for classification, with the positive class being a flooding event. In those runs, the final ranking was created by sorting the test set with respect to the classification score, from highest to lowest. Thus, ideally, images with flooding events should have higher scores and appear first in the ranking. None of the runs used additional datasets.

Run 1: This run, which focuses only on visual data, employed GoogLeNet [11] pre-trained on the ImageNet dataset. We fine-tuned the network using the whole training set, replacing the original last layer by a new one containing two neurons, which correspond to the two classes: "flooding" and "non-flooding".

Run 2: A relevant portion of the available metadata are tags and descriptions that are not necessarily well-written sentences. In addition, the amount of available data also discourages the use of recent neural networks designed for learning from text data, as many of them are large architectures based on convolutional or recurrent neural networks, thus requiring larger datasets of structured sentences. With this in mind, we hypothesized that the co-occurrence of words is still valuable evidence of a flooding event and is easier to learn from than structured sentences. For runs 2 and 3, which use text data, we extracted a set of words from each image's metadata and used a neural network to learn from that set how to classify whether or not an image describes a flooding event. This set is the union of the set of words extracted from the description and the set of words occurring in tags. To obtain the set of words from the description, we remove any HTML, non-letter symbols, and stop words, and then apply a standard stemming procedure [2]. We designed an RN to learn from the sets of words for run 2. In order to create a representation for words, we built a word dictionary from all words in the training data, assigning an integer ID to each word. The first layer of the network is a fully connected layer that takes one-hot encoded vectors representing words as input (the ID of the word is the index of the '1' value) and outputs a 32-dimensional vector. Using this strategy, vector representations improve as learning goes on, while not requiring large word datasets.
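As an illustration, the following is a minimal sketch of the text pipeline described above, assuming NLTK [2] for stop-word removal and stemming; the exact cleaning rules, the function names, and the handling of tag words are our assumptions rather than the authors' released code.

<pre>
# Hypothetical sketch of the run 2 text pipeline: metadata -> word set -> integer IDs.
import re
from nltk.corpus import stopwords   # assumes the NLTK stop-word corpus [2]
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOP = set(stopwords.words('english'))

def word_set(description, tags):
    # Remove HTML tags and non-letter symbols from the description, ...
    text = re.sub(r'<[^>]+>', ' ', description)
    text = re.sub(r'[^A-Za-z ]', ' ', text).lower()
    # ... drop stop words and stem; then take the union with the tag words.
    words = {STEMMER.stem(w) for w in text.split() if w not in STOP}
    return words | {t.lower() for t in tags}

def encode(words, vocab):
    # Map each word to the integer ID assigned when building the training
    # dictionary; unseen words are dropped, since the vocabulary is limited
    # to the training data (see Section 5).
    return [vocab[w] for w in words if w in vocab]
</pre>

Note that, since multiplying a one-hot vector by a weight matrix selects a single row of that matrix, the fully connected first layer over one-hot inputs is equivalent to a 32-dimensional embedding lookup (e.g., torch.nn.Embedding in PyTorch).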
Run 3: Since we had access to both images and their metadata, we used an architecture that incorporates a CNN for the image data and an RN, similar to the one used for run 2, for the metadata. The CNN is a ResNet-18 [7] pre-trained on ImageNet, but its last layer was replaced by a fully-connected layer of 512 units. The f network of the RN uses the same architecture from run 2, except that its last layer is replaced by a fully-connected layer of 256 units. Then, the outputs of both the CNN and the RN are concatenated into a single vector, followed by a fully-connected layer of 512 units and finally a single sigmoid unit for classification. The network is then trained as a whole, with no specific tuning for handling the pre-trained weights.
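A minimal PyTorch sketch of this fusion network follows, under our reading of the text; the rn_branch module (standing in for the run 2 network truncated at 256 units) and the activation after the 512-unit layer are assumptions.

<pre>
# Hypothetical sketch of the run 3 fusion network (image CNN + metadata RN).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FusionNet(nn.Module):
    def __init__(self, rn_branch):
        super().__init__()
        # ResNet-18 [7] pre-trained on ImageNet, last layer replaced (512 units).
        self.cnn = resnet18(pretrained=True)
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, 512)
        # RN branch from run 2, assumed to output a 256-dimensional vector.
        self.rn = rn_branch
        # Concatenation -> 512-unit layer -> single sigmoid unit (ReLU assumed).
        self.head = nn.Sequential(nn.Linear(512 + 256, 512), nn.ReLU(),
                                  nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, image, word_ids):
        fused = torch.cat([self.cnn(image), self.rn(word_ids)], dim=1)
        return self.head(fused)  # probability of a flooding event
</pre>

The whole module can then be trained end-to-end with a binary cross-entropy loss, consistent with the statement that no specific tuning was used for the pre-trained weights.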
Run 4: We proposed an alternative solution to run 1. We split the training images into 5 disjoint subsets, which are combined by taking 4 out of 5, covering all possible combinations of the 5 sets (similar to a 5-fold cross-validation). This process results in 5 distinct combined sets (each one composed of 4 subsets), which are used to fine-tune 5 independent GoogLeNets. In the prediction phase, we average the scores of being a flooding event given by each network and then rank the test set with them.

Run 5: We used a metadata-based approach based on an IR ranking solution, ranking test samples by their estimation of being flood. That estimation comes from a metric evaluation over a ranked list built for each test sample, computed as follows. Let D = {d1, d2, ...} be the dev set, and T = {t1, t2, ...} be the test set. Also let ⟨M, F⟩ be a pair of a representation model M and a distance function F, in which M is applied over a sample s (the query, from an IR perspective) and produces M(s), and F is applied over a pair of samples previously modeled by M, so that F(M(s1), M(s2)) = f_{s1,s2} corresponds to the distance of s1 to s2 with respect to M and F. With ⟨M, F⟩ and a test sample t, we generate a ranked list r(t) that contains up to |D| pairs ⟨di, f_{t,di}⟩, where di ∈ D, sorted by f_{t,di}.

We tested several ⟨M, F⟩ pairs, then selected the ones that performed best on the dev set. For the three best pairs, we produce three ranked lists for each sample. We use a graph-based rank-aggregation technique to produce a unified ranked list. By applying the same procedure to the dev samples as well, as if they were also queries, we generate graphs that combine their ranked lists; thus we end up with a graph for every sample. Given a test graph, we compare it to the dev graphs and produce a final ranked list. A graph-based dissimilarity function [12] is used to compare test and dev graphs. Given the set of ranked lists produced for each test sample, we estimate 'how much flood' a test sample contains using the NDCG@K measure [5]. The final submission file contains the test samples sorted decreasingly by this estimation. We chose K by evaluating within the dev set, picking K = 7, which maximized the effectiveness. The three best ⟨M, F⟩ pairs were ⟨RF, WGU⟩, ⟨bigrams-TF, cosine⟩, and ⟨BoW-TF, cosine⟩, where TF is a term-frequency weighting function, RF (relative frequency) is a graph-based text representation model [10], and WGU [12] is a graph-based dissimilarity function.

3 FLOOD-DETECTION (FDSI)

For the FDSI task, we employed CNNs with dilated (or à-trous) convolutions [4]. Unlike standard CNNs, networks composed of this type of convolution learn the given task by processing the input without downsampling it. This is only possible because dilated convolutions allow gaps (or "holes") inside their filters, which represents a great advantage in terms of computational processing, as well as in terms of learning, given that internal feature maps do not lose resolution (and information).

In this subtask, we proposed 4 CNNs. The most important one, which was exploited in all runs, is composed of 6 dilated convolution layers and a final fully-connected one, which is responsible for the classification. There are no pooling or normalization layers inside the network. The first two convolutions have 5 × 5 filters with dilation rate 1. Convolutions 3 and 4 have 4 × 4 filters but a larger rate of 2. Finally, the last convolutions have smaller filters (3 × 3) but a larger dilation rate of 4. In a pre-processing stage, we normalized the images using the mean and standard deviation of each image band.
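The following PyTorch sketch shows one way to realize this architecture; the channel widths are assumptions (the paper does not state them), and we implement the final "fully-connected" classification layer as a 1 × 1 convolution, a common dense-prediction equivalent. Padding is chosen so every layer preserves the input resolution, as the text requires.

<pre>
# Hypothetical sketch of the 6-layer dilated CNN [4] used across the FDSI runs.
import torch.nn as nn

def dilated_cnn(in_bands=3, width=64, n_classes=2):
    # (kernel size, dilation rate) for the six dilated convolutions:
    # 5x5 rate 1, 4x4 rate 2, 3x3 rate 4, as described above.
    specs = [(5, 1), (5, 1), (4, 2), (4, 2), (3, 4), (3, 4)]
    layers, c_in = [], in_bands
    for k, d in specs:
        pad = d * (k - 1) // 2  # keeps the feature-map resolution
        layers += [nn.Conv2d(c_in, width, k, padding=pad, dilation=d), nn.ReLU()]
        c_in = width            # no pooling or normalization layers
    layers.append(nn.Conv2d(width, n_classes, kernel_size=1))  # per-pixel classifier
    return nn.Sequential(*layers)
</pre>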
Run 1: We trained the aforementioned CNN using overlapping patches of size 25 × 25 extracted from all training images. In the prediction phase, we also extracted overlapping patches, with the same resolution, from the testing images and averaged the probabilities output by the network.

Run 2: We processed the images exactly as in run 1 but using a larger patch, with 50 × 50 pixels, which tends to aggregate more context that could improve the learning process.

Run 3: We combined the features extracted from several distinct CNNs using a linear SVM. Specifically, the SVM receives as input features extracted from the CNNs trained in runs 1, 2, and 5, as well as from: (i) a dilated CNN with pooling layers (which, however, do not reduce the resolution, given the padding), and (ii) two networks based on SegNet [1], which use deconvolution layers.

Run 4: We combined all networks presented in run 3 using a majority voting scheme.

Run 5: We trained a specific dilated CNN (using patches of 25 × 25) for each of the six locations, i.e., we had one network specialized for each location. The prediction is similar to run 1, except for the use of each CNN in its respective location. For the new locations, we combined the features extracted from each CNN using a linear SVM, just like in run 3.
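As an illustration of the patch-based prediction used in runs 1 and 2, a sketch follows; the stride and the border handling are assumptions (the paper only states that overlapping patches were used and that probabilities were averaged), and net_prob is a hypothetical wrapper returning per-pixel flood probabilities for one patch.

<pre>
# Hypothetical sketch of averaging predictions over overlapping patches (runs 1-2).
import numpy as np

def predict_image(image, net_prob, patch=25, stride=5):
    h, w = image.shape[:2]
    acc = np.zeros((h, w))   # accumulated flood probabilities
    cnt = np.zeros((h, w))   # number of patches covering each pixel
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            p = net_prob(image[y:y + patch, x:x + patch])  # (patch, patch) array
            acc[y:y + patch, x:x + patch] += p
            cnt[y:y + patch, x:x + patch] += 1
    # Pixels near the border may be covered by fewer (or, with a coarse stride,
    # no) patches; real code would pad the image or clamp the last window.
    return acc / np.maximum(cnt, 1)  # averaged map; threshold to get the mask
</pre>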
4 RESULTS & DISCUSSION

Table 1 presents our results for the DIRSM subtask. The best result considering AP@480 was achieved by run 3 (95.84%), the neural network solution that combines textual and visual data. However, considering MAP@[50,100,250,480], the visual-only approach that combines the results of 5 fine-tuned GoogLeNets (run 4, 91.59%) stood out. In both cases, the results of the neural networks surpassed by far those yielded by the IR approach (run 5).

Table 1: Average Precision (%) at 480 and Mean Average Precision (MAP) (%) at cut-offs 50, 100, 250, and 480 (DIRSM).

                       Run 1   Run 2   Run 3   Run 4   Run 5
AP@480                 74.60   76.71   95.84   82.06   54.31
MAP@[50,100,250,480]   87.88   62.53   85.63   91.59   41.13

Our results for the FDSI subtask (Table 2) indicate that the solution that extracts features using several CNNs and combines them with an SVM (run 3) produced the best results, both for test items from locations seen in the training set and for new locations.

Table 2: Mean Intersection over Union (%) (FDSI).

                 Run 1   Run 2   Run 3   Run 4   Run 5
Same locations   87.64   86.56   88.23   78.06   87.93
New locations    82.53   80.25   84.10   49.80   84.10

5 FINAL REMARKS & FUTURE WORK

As future work, for the DIRSM subtask, we intend to: (i) combine different networks (VGG, AlexNet, ResNet, DenseNet) and (ii) use RNs with open-vocabulary models, as the current approach has its vocabulary limited to the training data. For the FDSI subtask, we intend to: (i) explore different learning algorithms as post-processing, and (ii) combine distinct networks.

ACKNOWLEDGMENTS

We thank FAPESP, FAPEMIG, CNPq, and CAPES.

REFERENCES

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2015. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. (2015). Preprint at http://arxiv.org/abs/1511.00561.
[2] Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.
[3] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2016. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. (2016). Preprint at http://arxiv.org/abs/1606.00915.
[5] M.-L. Fernández and G. Valiente. 2001. A graph distance metric combining maximum common subgraph and minimum common supergraph. Pattern Recognition Letters 22, 6 (2001), 753-758.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[7] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770-778. https://doi.org/10.1109/CVPR.2016.90
[8] Sandro Martinis, André Twele, and Stefan Voigt. 2009. Towards operational near real-time flood detection using a split-based automatic thresholding procedure on high resolution TerraSAR-X data. Natural Hazards and Earth System Sciences 9, 2 (2009), 303-314.
[9] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. A simple neural network module for relational reasoning. (June 2017). Preprint at http://arxiv.org/abs/1706.01427.
[10] Adam Schenker, Horst Bunke, Mark Last, and Abraham Kandel. 2005. Graph-Theoretic Techniques for Web Content Mining. World Scientific Publishing Co., Inc., NJ, USA.
[11] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In Computer Vision and Pattern Recognition (CVPR).
[12] W. D. Wallis, P. Shoubridge, M. Kraetz, and D. Ray. 2001. Graph distances using graph union. Pattern Recognition Letters 22, 6 (2001), 701-704.