        Multi-modal Deep Learning Approach for Flood Detection
Laura Lopez-Fuentes 1,2,3, Joost van de Weijer 2, Marc Bolaños 4, Harald Skinnemoen 3
1 University of the Balearic Islands, Palma, Spain
2 Autonomous University of Barcelona, Barcelona, Spain
3 AnsuR Technologies, Oslo, Norway
4 Universitat de Barcelona, Barcelona, Spain

l.lopez@uib.es, joost@cvc.uab.es, marc.bolanos@ub.edu, harald@ansur.no

Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
In this paper we propose a multi-modal deep learning approach to detect floods in social media posts. Such posts normally contain metadata and/or visual information, and we use both sources of information for the detection. The model combines a Convolutional Neural Network, which extracts the visual features, with a bidirectional Long Short-Term Memory network, which extracts semantic features from the textual metadata. We validate the method on images extracted from Flickr that contain both visual information and metadata, and compare the results obtained when using both modalities, visual information only, or metadata only. This work has been done in the context of the MediaEval Multimedia Satellite Task.

Figure 1: Example of an image with evidence of flood and an image with no evidence of flood, together with the associated metadata considered relevant for the task. (a) Classified as containing evidence of flood. Metadata: image title: "Floods in Walton on Thames"; image description: "Most of those houses looked like they had been flooded."; tags: none. (b) Classified as not containing evidence of flood. Metadata: image title: "The closest we have got to the flooding disaster"; image description: none; tags: "freefolk". Note that although the second image has no evidence of flood, it also contains water and the word "flood" in the metadata, which makes the classification harder.

1 INTRODUCTION
The growth in smartphone ownership and the almost omnipresent access to the Internet have driven the rapid growth of social networks such as Twitter and Instagram, where sharing comments and pictures has become part of our daily lives. Using the vast amount of data from social media to extract valuable information is currently a hot topic [8]. In this work we focus on extracting information that facilitates the task of emergency responders during floods. Images posted by citizens during a flood can be essential for emergency responders to gain situational awareness. However, given the tremendous amount of information posted on social networks, the search for flood-related content needs to be automated. Therefore, in this work we propose an algorithm for the retrieval of flood-related posts. As stated in [4], algorithms for flood detection have received little attention in the field of computer vision. There are two major trends in this direction: algorithms based on satellite images [5–7] and algorithms based on on-ground images [3]. In this work we focus on on-ground images taken by humans in the flooded regions and posted on social networks, and therefore accompanied by metadata. To the best of our knowledge, there is no published previous work on multi-modal flood detection. However, combining image and text features has recently received great attention for tasks such as image captioning, multimedia retrieval and visual question answering (VQA). The work presented in this paper is inspired by the VQA model presented in [1].

2 DATA
The dataset used in this work was introduced for the MediaEval 2017 Multimedia Satellite Task [?] and contains 6600 images extracted from the YFCC100M dataset [10], each classified as having evidence of flood or not. All the images are associated with metadata, from which we take into account for this task the photo title, the description and the user tags, if available. In Figure 1 we show an example of an image classified as having evidence of flood and one classified as not having evidence of flood, together with the metadata that we keep for the analysis of both.

Moreover, since the initial dataset was unbalanced, with approximately 60% of the images showing no evidence of flood, we downloaded the top three results from the Google Similar Images search engine, using as queries the dataset images classified as having evidence of flood. We then manually removed incorrect results, ending up with 989 extra images with evidence of flood. We do not have the corresponding metadata for these images, so they only influence the visual part of the algorithm.
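The paper does not specify exactly how the title, description and tags are combined into the textual input. The following is a minimal sketch, assuming the fields are simply concatenated and tokenized with standard Keras utilities; the example posts are taken from Figure 1, while the field layout and sequence length are illustrative assumptions.

```python
# Hypothetical sketch: turning a post's metadata (title, description, tags)
# into a padded token sequence for the text branch. The concatenation of
# fields and MAX_LEN are assumptions, not the authors' exact choices.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 30  # assumed maximum metadata length in tokens

posts = [
    {"title": "Floods in Walton on Thames",
     "description": "Most of those houses looked like they had been flooded.",
     "tags": []},
    {"title": "The closest we have got to the flooding disaster",
     "description": "", "tags": ["freefolk"]},
]

# Concatenate the available metadata fields into one string per post.
texts = [" ".join([p["title"], p["description"], " ".join(p["tags"])]).strip()
         for p in posts]

tokenizer = Tokenizer()              # builds the word index for the embedding
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x_text = pad_sequences(sequences, maxlen=MAX_LEN)
print(x_text.shape)                  # (num_posts, MAX_LEN)
```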
3 APPROACH
In this section we discuss the deep learning algorithm designed for the task of retrieving flood evidence from social media posts. The problem is approached under a probabilistic framework. As explained in Section 2, the posts contain an image and/or metadata. To extract rich visual information we apply the convolutional InceptionV3 network, using weights pre-trained on ImageNet [2], and fine-tune the last inception module of the network. For the metadata we use a word embedding to represent the textual information in a continuous space and feed it to a bidirectional LSTM. The word embedding is initialized with GloVe [9] vectors, which we fine-tune on our metadata. Finally, we concatenate the image and text features, followed by a fully connected layer and a softmax classifier that gives the final probability of the sample containing relevant information about a flood. In Figure 2 we show a sketch of the multimodal system, which can also be applied using only one of the modalities.

Figure 2: Visual representation of the proposed algorithm.
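A minimal Keras sketch of how the two branches could be assembled is given below. Layer sizes, vocabulary size and sequence length are illustrative assumptions rather than the authors' exact configuration, and the GloVe matrix is a random placeholder for the real pre-trained vectors.

```python
# Minimal sketch of the two-branch model (illustrative sizes, not the
# authors' exact configuration).
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.initializers import Constant

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 300, 30   # assumed values

# Visual branch: InceptionV3 pre-trained on ImageNet; most layers are frozen
# and only the last block is left trainable (a rough stand-in for fine-tuning
# the last inception module).
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
for layer in cnn.layers[:-30]:
    layer.trainable = False
image_in = layers.Input(shape=(299, 299, 3), name="image")
visual_feat = cnn(image_in)

# Textual branch: word embedding initialised with (placeholder) GloVe vectors
# and fine-tuned, followed by a bidirectional LSTM over the metadata tokens.
glove_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM))
text_in = layers.Input(shape=(MAX_LEN,), name="metadata")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                            embeddings_initializer=Constant(glove_matrix),
                            trainable=True)(text_in)
text_feat = layers.Bidirectional(layers.LSTM(256))(embedded)

# Fusion: concatenate both modalities, fully connected layer, softmax over
# {evidence of flood, no evidence of flood}.
merged = layers.Concatenate()([visual_feat, text_feat])
hidden = layers.Dense(512, activation="relu")(merged)
output = layers.Dense(2, activation="softmax", name="flood_prob")(hidden)

model = Model(inputs=[image_in, text_in], outputs=output)
model.summary()
```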

4 EXPERIMENTS
We have divided the development set into training (3960 images, plus the 989 extra flood images) and validation (1320 images). As optimizer we have chosen RMSProp [11], which uses the magnitude of recent gradients to normalize the gradients, and we set an initial learning rate of 0.001.

Since the dataset does not contain a very large amount of training data, it is easy to run into overfitting problems. To avoid this, we have used the validation set to determine when to stop training: training is stopped when the performance on the validation set stops increasing, or starts decreasing, over the last two epochs. We have then used that number of epochs to retrain the system on the combined training and validation sets. We have followed this procedure for all the experiments.
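As a rough illustration of this procedure (not the authors' released code), the sketch below reuses the `model`, `MAX_LEN` and input conventions from the architecture sketch above; the dummy arrays merely stand in for the actual training and validation splits.

```python
# Sketch of the training procedure: RMSProp with initial learning rate 0.001,
# early stopping when validation performance does not improve for two epochs.
# The arrays below are placeholders for the real training/validation splits.
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import RMSprop

MAX_LEN = 30  # must match the text-branch input length used in the model
x_img_tr, x_txt_tr = np.zeros((8, 299, 299, 3)), np.zeros((8, MAX_LEN))
y_tr = np.eye(2)[np.random.randint(0, 2, 8)]
x_img_val, x_txt_val, y_val = x_img_tr.copy(), x_txt_tr.copy(), y_tr.copy()

model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Stop when validation performance has not improved over the last two epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=2)
history = model.fit([x_img_tr, x_txt_tr], y_tr,
                    validation_data=([x_img_val, x_txt_val], y_val),
                    epochs=50, batch_size=32, callbacks=[early_stop])

# The selected number of epochs would then be used to retrain the model
# from scratch on the combined training + validation data.
n_epochs = len(history.history["loss"])
```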
We have trained the system in four different configurations: 1) images and metadata as input, 2) only images as input, 3) only metadata as input, and 4) images and metadata in addition to the extra images obtained from Google Similar Images. The results of these four experiments on the test set are given in Table 1. The system has been evaluated as a retrieval task: every post in the test set is assigned a probability of containing evidence of flood, and the posts are ranked from higher to lower probability. In the first column of Table 1 we report the Average Precision (AP) over the first 480 retrieved posts; in the second column we report the mean of the average precision evaluated over the first 50, 100, 250 and 480 posts.
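A small self-contained sketch of one common formulation of this metric is given below, with placeholder arrays standing in for the predicted probabilities and the ground truth; the official task scoring may differ in normalization details.

```python
# Average Precision at a cutoff over a ranking by predicted flood probability.
# `probs` and `labels` are placeholders for the real test-set predictions.
import numpy as np

def average_precision_at_k(labels, probs, k):
    """Mean of precision@i over the relevant posts within the top-k ranking."""
    order = np.argsort(probs)[::-1][:k]        # rank by descending probability
    ranked = np.asarray(labels)[order]
    hits, precisions = 0, []
    for i, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

n_test = 1000                                  # placeholder test-set size
probs = np.random.rand(n_test)                 # placeholder predictions
labels = np.random.randint(0, 2, size=n_test)  # placeholder ground truth

ap_480 = average_precision_at_k(labels, probs, 480)
mean_ap = np.mean([average_precision_at_k(labels, probs, k)
                   for k in (50, 100, 250, 480)])
print(f"AP@480: {ap_480:.4f}, mean AP over cutoffs: {mean_ap:.4f}")
```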
Table 1: Average Precision (AP) on the test set over the first 480 posts retrieved as flood, and the mean AP at the 50, 100, 250 and 480 cutoffs.

                                            AP @ 480 (%)    Mean AP @ 50, 100, 250 and 480 (%)
    Metadata only                               67.54                     70.16
    Image only                                  61.58                     66.38
    Metadata and image                          81.60                     83.96
    Metadata and image with extra images        68.40                     75.96

5 RESULTS AND ANALYSIS
As can be seen in Table 1, the metadata we have selected for the task is certainly relevant for retrieving flood-related images from social network posts, reaching over 70% mean average precision over the four retrieval cutoffs. Since the classification of the posts as containing evidence of flood or not was done manually using only the images, the image information alone should be sufficient for the retrieval problem. However, the performance of the algorithm using only the image drops to 66% mean average precision over the different cutoffs. This shows that, although images should be more discriminative for this task, the metadata analysis gives better performance, due to the difficulty of processing images compared to text. There is a clear improvement when combining both types of information, reaching almost 84% mean average precision over the cutoffs, which shows that the metadata and the image complement each other quite well. Surprisingly, when training the system with the extra images, the mean AP drops to 76%. Since the extra images were manually inspected to make sure that no noisy images were added to the dataset, we suspect that the result degrades when adding images without metadata, as this configuration performs the weakest among all experiments; however, this should be studied further before drawing additional conclusions.

6 DISCUSSION AND OUTLOOK
In this paper we have proposed a multi-modal deep learning approach to retrieve posts from social networks containing valuable information about floods. The system can work using only visual information, only text, or a combination of both. It has been shown that combining both types of information greatly improves the performance of the system. For future work it would be interesting to check whether other types of metadata could also provide useful information for the task, for example the location or the time at which the image was taken, since some regions and seasons are more prone to flooding. It would also be interesting to study why adding more images to the training set worsened the performance of the system, and how well the system generalizes to images outside of the dataset.

ACKNOWLEDGMENTS
This work was partially supported by the Spanish Grants TIN2016-75404-P AEI/FEDER, UE, TIN2014-52072-P, TIN2016-79717-R, TIN2013-42795-P and the European Commission H2020 I-REACT project no. 700256. Laura Lopez-Fuentes benefits from the NAERINGSPHD fellowship of the Norwegian Research Council under the collaboration agreement Ref.3114 with the UIB. Marc Bolaños benefits from the FPU fellowship.


REFERENCES
 [1] Marc Bolaños, Álvaro Peris, Francisco Casacuberta, and Petia Radeva.
     2017. VIBIKNet: Visual bidirectional kernelized network for visual
     question answering. In Iberian Conference on Pattern Recognition and
     Image Analysis. Springer, 372–380.
 [2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
     2009. ImageNet: A large-scale hierarchical image database. In Computer
     Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.
     IEEE, 248–255.
 [3] CL Lai, JC Yang, and YH Chen. 2007. A real time video processing
     based surveillance system for early fire and flood detection. In Instru-
     mentation and Measurement Technology Conference Proceedings, 2007.
     IMTC 2007. IEEE. IEEE, 1–6.
 [4] Laura Lopez-Fuentes, Joost van de Weijer, Manuel González-Hidalgo,
     Harald Skinnemoen, and Andrew D. Bagdanov. 2017. Review on
     Computer Vision Techniques in Emergency Situations. arXiv preprint
     arXiv:1708.07455 (2017).
 [5] Sandro Martinis. 2010. Automatic near real-time flood detection in
     high resolution X-band synthetic aperture radar satellite data using
     context-based classification on irregular graphs. Ph.D. Dissertation.
     LMU Munich.
 [6] David C Mason, Ian J Davenport, Jeffrey C Neal, Guy J-P Schumann,
     and Paul D Bates. 2012. Near real-time flood detection in urban and
     rural areas using high-resolution synthetic aperture radar images.
     Geoscience and Remote Sensing, IEEE Transactions on 50, 8 (2012), 3041–
     3052.
 [7] David C Mason, Rainer Speck, Bernard Devereux, Guy JP Schumann,
     Jeffrey C Neal, and Paul D Bates. 2010. Flood detection in urban areas
     using TerraSAR-X. Geoscience and Remote Sensing, IEEE Transactions
     on 48, 2 (2010), 882–894.
 [8] Michael Mathioudakis and Nick Koudas. 2010. Twittermonitor: trend
     detection over the twitter stream. In Proceedings of the 2010 ACM
     SIGMOD International Conference on Management of data. ACM, 1155–
     1158.
 [9] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014.
     GloVe: Global vectors for word representation. In EMNLP, Vol. 14.
     1532–1543.
[10] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde,
     Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M:
     The new data in multimedia research. Commun. ACM 59, 2 (2016),
     64–73.
[11] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RMSProp,
     COURSERA: Neural networks for machine learning. University of
     Toronto, Tech. Rep (2012).