                Data-Driven Flood Detection using Neural Networks
                     Keiller Nogueira1 , Samuel G. Fadel2 , Ícaro C. Dourado2 , Rafael de O. Werneck2 ,
                     Javier A. V. Muñoz2 , Otávio A. B. Penatti3 , Rodrigo T. Calumby2,4 , Lin Tzy Li2,3 ,
                                     Jefersson A. dos Santos1 , Ricardo da S. Torres2
                            1 Universidade Federal de Minas Gerais (UFMG), 2 University of Campinas (Unicamp),
                                          3 SAMSUNG R&D Institute Brazil, 4 University of Feira de Santana

     [keiller.nogueira,jefersson]@dcc.ufmg.br,[samuel.fadel,icaro.dourado,rafael.werneck,lintzyli,rtorres]@ic.unicamp.br,
                          jalvarm.acm@gmail.com,o.penatti@samsung.com,rtcalumby@ecomp.uefs.br

ABSTRACT
This paper describes the approaches used by our team (MultiBrasil) for the Multimedia Satellite Task at MediaEval 2017. For both disaster image retrieval and flood detection in satellite images, we employ neural networks for end-to-end learning. Specifically, for the first subtask, we exploit Convolutional Networks and Relation Networks, while dilated Convolutional Networks were employed for the latter.

1 INTRODUCTION
Natural disaster monitoring is a fundamental task for creating prevention strategies, as well as for helping authorities act to control damages. In its first appearance at MediaEval, the Multimedia Satellite Task [3] focuses on the monitoring of flooding events, which are considered the most harmful and costly type of natural disaster in the world [8]. The task is subdivided into two subtasks: (a) Disaster Image Retrieval from Social Media (DIRSM), which deals with flooding events in data (visual and textual) crawled from social media; and (b) Flooding-Detection in Satellite Images (FDSI), which addresses the segmentation of flooding regions in satellite images.

2 DISASTER IMAGE RETRIEVAL (DIRSM)
For the DIRSM subtask, we employed Convolutional Networks (CNN) [6] to deal with visual features. For textual features, we applied Relation Networks (RN) [9] and, as baselines, traditional methods such as Bag of Words (BoW) and bigrams. The recently proposed RN is a neural network designed to take into account the relationship between pairs of objects during training. An RN consists of two neural networks, f and g, whose parameters are learned jointly.
In runs 1, 2, 3, and 4, we used neural networks and trained them for classification, with the positive class being a flooding event. In those runs, the final ranking was created by sorting the test set with respect to the classification score, from highest to lowest. Thus, ideally, images with flooding events should have higher scores and appear first in the ranking. None of the runs used additional datasets.
Run 1: This run, which focuses only on visual data, employed GoogLeNet [11] pre-trained on the ImageNet dataset. We fine-tuned the network using the whole training set, replacing the original last layer by a new one containing two neurons, which correspond to the two classes: “flooding” and “non-flooding”.
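For illustration, a minimal PyTorch sketch of this fine-tuning setup follows; the optimizer, learning rate, and loss are our assumptions, as the paper does not report them:

```python
import torch
import torch.nn as nn
from torchvision import models

# GoogLeNet pre-trained on ImageNet, with its last layer replaced by
# a two-neuron classifier ("flooding" vs. "non-flooding").
model = models.googlenet(pretrained=True, aux_logits=False)
model.fc = nn.Linear(model.fc.in_features, 2)

# Assumed training setup: cross-entropy over the two classes,
# fine-tuning all weights on the whole training set.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```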
Run 2: A relevant portion of the available metadata consists of tags and descriptions that are not necessarily well-written sentences. In addition, the amount of available data discourages the use of recent neural networks designed for learning from text, as many of them are large architectures based on convolutional or recurrent neural networks, thus requiring larger datasets of structured sentences. With this in mind, we hypothesized that the co-occurrence of words is still valuable evidence of a flooding event and is easier to learn from than structured sentences. For runs 2 and 3, which use text data, we extracted a set of words from each image's metadata and used a neural network to learn, from that set, whether an image describes a flooding event. This set is the union of the words extracted from the description and the words occurring in the tags. To obtain the words from the description, we remove any HTML, non-letter symbols, and stop words, and then apply a standard stemming procedure [2].
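A sketch of this preprocessing using NLTK [2]; the exact tokenization and regular expressions are our assumptions, and tag words are taken as-is since the paper describes the cleaning pipeline only for descriptions:

```python
import re
from nltk.corpus import stopwords   # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP = set(stopwords.words('english'))
STEM = PorterStemmer()

def words_from(text):
    """Strip HTML tags and non-letter symbols, drop stop words, stem."""
    text = re.sub(r'<[^>]+>', ' ', text)        # remove HTML
    text = re.sub(r'[^a-zA-Z]+', ' ', text)     # keep letters only
    return {STEM.stem(w) for w in text.lower().split() if w not in STOP}

def word_set(description, tags):
    # The set fed to the network is the union of description and tag words.
    return words_from(description) | set(tags)
```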
We designed an RN to learn from these sets of words for run 2. In order to create a representation for words, we built a word dictionary from all words in the training data, assigning an integer ID to each word. The first layer of the network is a fully connected layer that takes one-hot encoded vectors representing words as input (the ID of the word is the index of the ‘1’ value) and outputs a 32-dimensional vector. Using this strategy, the vector representations improve as learning goes on, while not requiring large word datasets.
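Since the inputs are one-hot, this first fully connected layer is equivalent to a learned embedding lookup; a minimal sketch, where the vocabulary size is a placeholder for whatever the training dictionary yields:

```python
import torch
import torch.nn as nn

vocab_size = 10000  # assumption: size of the training-set word dictionary

# A fully connected layer applied to one-hot vectors selects one of its
# weight columns, so it can be implemented as an embedding table.
word_embedding = nn.Embedding(vocab_size, 32)

word_ids = torch.tensor([5, 42, 7])   # integer IDs from the dictionary
vectors = word_embedding(word_ids)    # -> 3 x 32 learned word vectors
```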
Run 3: Since we had access to both images and their metadata, we used an architecture that incorporates a CNN for the image data and an RN, similar to the one used for run 2, for the metadata. The CNN is a ResNet-18 [7] pre-trained on ImageNet, but its last layer was replaced by a fully-connected layer of 512 units. The f network of the RN uses the same architecture as in run 2, except that its last layer is replaced by a fully-connected layer of 256 units. The outputs of the CNN and the RN are then concatenated into a single vector, followed by a fully-connected layer of 512 units and, finally, a single sigmoid unit for classification. The network is trained as a whole, with no specific tuning for handling the pre-trained weights.
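A sketch of this two-branch architecture; the RN branch is abstracted into a module producing the 256-dimensional vector described above, and any layer detail beyond those stated in the text (e.g., the activation after fusion) is an assumption:

```python
import torch
import torch.nn as nn
from torchvision import models

class FusionNet(nn.Module):
    def __init__(self, rn_branch):
        super().__init__()
        # Image branch: ResNet-18 pre-trained on ImageNet, last layer
        # replaced by a 512-unit fully-connected layer.
        self.cnn = models.resnet18(pretrained=True)
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, 512)
        # Text branch: the RN from run 2, assumed to output 256 units.
        self.rn = rn_branch
        # Fusion: concatenation -> 512-unit layer -> single sigmoid unit.
        self.fuse = nn.Sequential(
            nn.Linear(512 + 256, 512),
            nn.ReLU(),  # assumption: activation not stated in the paper
            nn.Linear(512, 1),
            nn.Sigmoid(),
        )

    def forward(self, image, word_ids):
        joint = torch.cat([self.cnn(image), self.rn(word_ids)], dim=1)
        return self.fuse(joint)  # probability of "flooding"
```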
Run 4: We proposed an alternative to the solution of run 1. We split the training images into 5 disjoint subsets, which are combined by taking 4 out of 5, covering all possible combinations of the 5 sets (similar to a 5-fold cross-validation). This process results in 5 distinct combined sets (each one composed of 4 subsets), which are used to fine-tune 5 independent GoogLeNets. In the prediction phase, we average the flooding scores given by each network and then rank the test set with them.
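This scheme is essentially 5-fold bagging over the training set; a sketch of forming the five 4-subset unions and averaging scores at prediction time, where train_network and predict_scores are hypothetical helpers:

```python
import numpy as np
from sklearn.model_selection import KFold

def ensemble_rank(train_images, train_labels, test_images,
                  train_network, predict_scores):
    """Fine-tune 5 networks, each on 4 of the 5 disjoint subsets,
    then rank the test set by the averaged flooding scores."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # assumption
    scores = []
    for train_idx, _ in kfold.split(train_images):
        # train_idx covers 4 of the 5 subsets; the held-out fold is unused.
        net = train_network(train_images[train_idx], train_labels[train_idx])
        scores.append(predict_scores(net, test_images))
    mean_scores = np.mean(scores, axis=0)
    return np.argsort(-mean_scores)  # highest flooding score first
```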
Run 5: We used a metadata-based approach built on an IR ranking solution, which ranks test samples by their estimated likelihood
of depicting a flooding event. That estimation comes from a metric evaluated over a ranked list built for each test sample, computed as follows. Let D = {d1, d2, ...} be the dev set and T = {t1, t2, ...} be the test set. Also let ⟨M, F⟩ be a pair of a representation model M and a distance function F, in which M is applied over a sample s (the query, from an IR perspective) and produces M(s), and F is applied over a pair of samples previously modeled by M, so that F(M(s1), M(s2)) = f(s1, s2) corresponds to the distance of s1 to s2 with respect to M and F. With ⟨M, F⟩ and a test sample t, we generate a ranked list r(t) that contains up to |D| pairs ⟨di, f(t, di)⟩, where di ∈ D, sorted by f(t, di).
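A small sketch of how one ranked list r(t) can be produced for a given ⟨M, F⟩ pair (function names are ours):

```python
def ranked_list(t, dev_set, M, F):
    """Rank dev samples by their distance to test sample t under <M, F>."""
    q = M(t)                                    # model the query
    pairs = [(d, F(q, M(d))) for d in dev_set]  # distance of t to each d_i
    pairs.sort(key=lambda p: p[1])              # sorted by f(t, d_i)
    return pairs
```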
We tested several ⟨M, F⟩ pairs, then selected the ones that performed best on the dev set. For the three best pairs, we produce three ranked lists per sample. We use a graph-based rank-aggregation technique to produce a unified ranked list. By applying the same procedure to the dev samples as well, as if they were also queries, we generate graphs that combine their ranked lists; thus, we end up with a graph for every sample. Given a test graph, we compare it to the dev graphs and produce a final ranked list. A graph-based dissimilarity function [12] is used to compare test and dev graphs. Given the set of ranked lists produced for each test sample, we estimate “how much flood” a test sample depicts using the NDCG@K measure [5]. The final submission file contains the test samples sorted in decreasing order of this estimation. We chose K by evaluating within the dev set, picking K = 7, which maximized effectiveness.
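A sketch of this NDCG@K estimation, under the assumption that the relevance of each position in a test sample's final ranked list is the binary flooding label of the corresponding dev sample; the exact DCG variant used is also an assumption:

```python
import math

def ndcg_at_k(relevances, k=7):
    """NDCG@K of a ranked list given a relevance value per position."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# 'How much flood' estimate of a test sample from its final ranked list:
# flood_score = ndcg_at_k([dev_label(d) for d, _ in final_list], k=7)
```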
The three best ⟨M, F⟩ pairs were ⟨RF, WGU⟩, ⟨bigrams-TF, cosine⟩, and ⟨BoW-TF, cosine⟩, where TF is a weighting function, RF (relative frequency) is a graph-based text representation model [10], and WGU [12] is a graph-based dissimilarity function.

3 FLOOD-DETECTION (FDSI)
For the FDSI subtask, we employed CNNs with dilated (or à trous) convolutions [4]. Unlike standard CNNs, networks composed of this type of convolution learn the given task by processing the input without downsampling it. This is possible because dilated convolutions allow gaps (or “holes”) inside their filters, which represents a great advantage in terms of computational processing as well as learning, given that the internal feature maps do not lose resolution (and, hence, information).
In this subtask, we proposed 4 CNNs. The most important one, which was exploited in all runs, is composed of 6 dilated convolution layers and a final fully-connected layer, which is responsible for the classification. There are no pooling or normalization layers inside the network. The first two convolutions have 5 × 5 filters with dilation rate 1. Convolutions 3 and 4 have 4 × 4 filters with a larger dilation rate of 2. Finally, the last convolutions have smaller filters (3 × 3) but an even larger dilation rate of 4. In a pre-processing stage, we normalized the images using the mean and standard deviation of each image band.
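A sketch of this architecture; the number of filters per layer, the activations, and the 3-band input are assumptions (the paper only fixes kernel sizes and dilation rates), and the final classification layer is written as a 1 × 1 convolution, the fully-convolutional equivalent of a per-pixel fully-connected layer. Padding is set to dilation × (kernel − 1) / 2 so that no layer changes the spatial resolution:

```python
import torch.nn as nn

def conv_block(cin, cout, k, dilation):
    # padding = dilation * (k - 1) / 2 keeps the spatial resolution intact
    pad = dilation * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=pad, dilation=dilation),
        nn.ReLU(),  # assumption: activation not stated in the paper
    )

# Six dilated convolutions, no pooling or normalization, plus a final
# 1x1 convolution acting as the per-pixel classification layer.
dilated_cnn = nn.Sequential(
    conv_block(3, 64, 5, dilation=1),   # conv 1: 5x5 filters, rate 1
    conv_block(64, 64, 5, dilation=1),  # conv 2: 5x5 filters, rate 1
    conv_block(64, 64, 4, dilation=2),  # conv 3: 4x4 filters, rate 2
    conv_block(64, 64, 4, dilation=2),  # conv 4: 4x4 filters, rate 2
    conv_block(64, 64, 3, dilation=4),  # conv 5: 3x3 filters, rate 4
    conv_block(64, 64, 3, dilation=4),  # conv 6: 3x3 filters, rate 4
    nn.Conv2d(64, 2, 1),                # flooding / non-flooding per pixel
)
```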
Run 1: We trained the aforementioned CNN using overlapping patches of size 25 × 25 extracted from all training images. In the prediction phase, we also extracted overlapping patches of the same resolution from the testing images and averaged the probabilities output by the network wherever patches overlap.
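A numpy sketch of this prediction stage: overlapping patches are classified and the per-pixel probabilities are averaged where patches overlap (the stride, and hence the amount of overlap, is an assumption):

```python
import numpy as np

def predict_image(image, predict_patch, patch=25, stride=12):
    """Average per-pixel flooding probabilities over overlapping patches.
    predict_patch maps a (patch, patch, bands) array to a (patch, patch)
    probability map; for simplicity, border pixels not covered by a full
    patch keep probability 0."""
    h, w = image.shape[:2]
    probs = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            probs[y:y+patch, x:x+patch] += predict_patch(
                image[y:y+patch, x:x+patch])
            counts[y:y+patch, x:x+patch] += 1
    return probs / np.maximum(counts, 1)
```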
Run 2: We processed the images exactly as in run 1 but using a larger patch of 50 × 50 pixels, which tends to aggregate more context and could improve the learning process.
Run 3: We combined the features extracted from several distinct CNNs using a linear SVM. Specifically, the SVM receives as input features extracted from the CNNs trained in runs 1, 2, and 5, as well as from: (i) a dilated CNN with pooling layers (which, however, do not reduce the resolution, given the padding); and (ii) two networks based on SegNet [1] that use deconvolution layers.
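A sketch of this feature-level fusion with a linear SVM; the feature extractors are hypothetical stand-ins for the networks of runs 1, 2, and 5 plus the two extra architectures, each assumed to yield one feature vector per pixel:

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_with_svm(feature_extractors, train_pixels, train_labels):
    """Train a linear SVM on concatenated per-pixel CNN features.
    Each extractor maps the pixels to an (n_pixels, dim) feature array."""
    X = np.hstack([fx(train_pixels) for fx in feature_extractors])
    svm = LinearSVC()
    return svm.fit(X, train_labels)

# Prediction: concatenate the same features for the test pixels and call
# svm.predict to obtain the flooding mask.
```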
Run 4: We combined all networks presented in run 3 using a majority voting scheme.

Run 5: We trained a specific dilated CNN (using patches of 25 × 25) for each of the six locations, i.e., we had one network specialized for each location. The prediction is similar to run 1, except that each CNN is used in its respective location. For the new location, we combined the features extracted from each CNN using a linear SVM, just as in run 3.

4 RESULTS & DISCUSSION
Table 1 presents our results for the DIRSM subtask. The best result in terms of AP@480 was achieved by run 3 (95.84%), the neural-network solution that combines textual and visual data. However, considering MAP@[50,100,250,480], the visual-only approach that combines the results of 5 fine-tuned GoogLeNets (run 4, 91.59%) stood out. In both cases, the results of the neural networks surpassed by far those yielded by the IR approach (run 5).

Table 1: Average Precision (%) at 480 and Mean Average Precision (MAP) (%) at cut-offs 50, 100, 250, and 480 (DIRSM).

                         Run 1   Run 2   Run 3   Run 4   Run 5
  AP@480                 74.60   76.71   95.84   82.06   54.31
  MAP@[50,100,250,480]   87.88   62.53   85.63   91.59   41.13

Our results for the FDSI subtask (Table 2) indicate that the solution that extracts features using several CNNs and combines them with an SVM (run 3) produced the best results, both for test items from locations seen in the training set and for new locations.

Table 2: Mean Intersection over Union (%) (FDSI).

                   Run 1   Run 2   Run 3   Run 4   Run 5
  Same locations   87.64   86.56   88.23   78.06   87.93
  New locations    82.53   80.25   84.10   49.80   84.10

5 FINAL REMARKS & FUTURE WORK
As future work, for the DIRSM subtask, we intend to: (i) combine different networks (VGG, AlexNet, ResNet, DenseNet); and (ii) use RNs with open-vocabulary models, as the current approach has its vocabulary limited to the training data. For the FDSI subtask, we intend to: (i) explore different learning algorithms as post-processing; and (ii) combine distinct networks.

ACKNOWLEDGMENTS
We thank FAPESP, FAPEMIG, CNPq, and CAPES.

REFERENCES
 [1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2015. Seg-
     net: A deep convolutional encoder-decoder architecture for image
     segmentation. (2015). Preprint at http://arxiv.org/abs/1511.00561.
 [2] Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language
     Processing with Python. O’Reilly Media Inc.
 [3] Benjamin Bischke, Patrick Helber, Christian Schulze, Venkat Srinivasan, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
 [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Mur-
     phy, and Alan L Yuille. 2016. Deeplab: Semantic image segmentation
     with deep convolutional nets, atrous convolution, and fully connected
     crfs. (2016). Preprint at http://arxiv.org/abs/1606.00915.
 [5] M.-L. Fernández and G. Valiente. 2001. A graph distance metric combin-
     ing maximum common subgraph and minimum common supergraph.
     Pattern Recognition Letters 22, 6 (2001), 753–758.
 [6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep
     Learning. MIT Press.
 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
 [8] Sandro Martinis, André Twele, and Stefan Voigt. 2009. Towards oper-
     ational near real-time flood detection using a split-based automatic
     thresholding procedure on high resolution TerraSAR-X data. Natural
     Hazards and Earth System Sciences 9, 2 (2009), 303–314.
 [9] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski,
     Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. A simple
     neural network module for relational reasoning. (June 2017). Preprint
     at http://arxiv.org/abs/1706.01427.
[10] Adam Schenker, Horst Bunke, Mark Last, and Abraham Kandel. 2005.
     Graph-Theoretic Techniques for Web Content Mining. World Scientific
     Publishing Co., Inc., NJ, USA.
[11] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
     Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
     Rabinovich. 2015. Going Deeper with Convolutions. In Computer
     Vision and Pattern Recognition (CVPR).
[12] W. D. Wallis, P. Shoubridge, M. Kraetzl, and D. Ray. 2001. Graph distances using graph union. Pattern Recognition Letters 22, 6 (2001), 701–704.