<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-modal Deep Learning Approach for Flood Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Lopez-Fuentes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joost van de Weijer</string-name>
          <email>joost@cvc.uab.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Bolaños</string-name>
          <email>marc.bolanos@ub.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Skinnemoen</string-name>
          <email>harald@ansur.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AnsuR Technologies</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Autonomous University of Barcelona</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat de Barcelona</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of the Balearic Islands</institution>
          ,
          <addr-line>Palma</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we propose a multi-modal deep learning approach to detect floods in social media posts. Social media posts normally contain metadata and/or visual information, which we use to detect floods. The model is based on a Convolutional Neural Network, which extracts visual features, and a bidirectional Long Short-Term Memory network, which extracts semantic features from the textual metadata. We validate the method on images extracted from Flickr that contain both visual information and metadata, and compare the results of using both modalities, visual information only, and metadata only. This work has been done in the context of the MediaEval Multimedia Satellite Task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The growth in smartphone ownership and near-ubiquitous
access to the Internet have fueled the rapid growth of social
networks such as Twitter and Instagram, where sharing comments and
pictures has become part of our daily lives. Extracting valuable
information from the vast amount of social media data is currently a hot
topic [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this work we focus on extracting
information to facilitate the task of emergency responders during
floods. Images posted by citizens during a flood can be
essential for emergency responders to gain situational awareness.
However, given the tremendous amount of information posted on
social networks, it is necessary to automate the search for
relevant flood-related information. Therefore, in this work
we propose an algorithm for the retrieval of flood-related posts.
As stated in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], algorithms for flood detection have received little
attention in the field of computer vision. There exist two major
trends in this direction: algorithms based on satellite images [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5–7</xref>
        ]
and algorithms based on on-ground images [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this work we
focus on on-ground images taken by people in flooded
regions and posted on social networks, which therefore carry
metadata. To the best of our knowledge, there is no published
previous work on multi-modal flood detection. However, combining
image and text features has recently received great attention for
solving tasks such as image captioning, multimedia retrieval and visual
question answering (VQA). The work presented in this paper was
inspired by the VQA model presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>DATA</title>
      <p>
        The dataset used in this work was introduced for the MediaEval 2017
Multimedia Satellite Task [? ], and contains 6600 images extracted
from the YFCC100M dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which have been classified as containing evidence of a flood or not.
Two example images from the dataset: (a) classified as containing
evidence of flood, with metadata image title "Floods in Walton on
Thames", image description: none, tags: "freefolk"; and (b) classified
as not containing evidence of flood, with metadata image title "The
closest we have got to the flooding disaster", image description "Most
of those houses looked like they had been flooded.", tags: none.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>
        In this section we will discuss the deep learning algorithm design
for the task of flood evidence retrieval in social media posts. The
problem will be approached under a probabilistic framework. As
explained in Section 2, the posts contain an image and/or
metadata. To extract rich visual information we apply the
InceptionV3 convolutional network, using weights pre-trained on ImageNet
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and fine-tune the last inception module of the network. For the
metadata we use a word embedding to represent the textual
information in a continuous space and feed it to a bidirectional LSTM.
The word embedding is initialized using GloVe [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] vectors, which
we fine-tune on our metadata. Finally, we concatenate the image
and text features and pass them through a fully connected layer and a
softmax classifier, which outputs the probability that the sample
contains relevant information about a flood. In Figure 2 we show a
sketch of the multimodal system, which can also be applied using only
one of the modalities.
      </p>
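      <p>The fusion step described above can be sketched in plain NumPy, assuming the per-modality features have already been extracted. This is a minimal illustration, not the paper's implementation: the function name fuse_and_classify, the toy dimensions, and the random weights are all illustrative.</p>

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_classify(img_feat, txt_feat, W, b):
    """Late fusion: concatenate the two modality feature vectors,
    apply a fully connected layer, and return the two class
    probabilities (index 1 = 'contains evidence of flood')."""
    fused = np.concatenate([img_feat, txt_feat])  # (d_img + d_txt,)
    logits = W @ fused + b                        # (2,)
    return softmax(logits)

# Toy dimensions; the real model would use e.g. 2048-d InceptionV3
# features and the final biLSTM state as txt_feat.
rng = np.random.default_rng(0)
img_feat = rng.standard_normal(8)
txt_feat = rng.standard_normal(4)
W = rng.standard_normal((2, 12)) * 0.1  # fully connected weights
b = np.zeros(2)

p = fuse_and_classify(img_feat, txt_feat, W, b)
print(p)  # two probabilities summing to 1
```

      <p>In the actual model the weights of the fully connected layer are learned jointly with the fine-tuned InceptionV3 and biLSTM branches.</p>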
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTS</title>
      <p>
        We have divided the development set into training (3960 + 989 extra
flood images) and validation (1320) subsets. As optimizer we have
chosen RMSProp [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which uses the magnitude of recent gradients
to normalize the current gradient, and set an initial learning rate of 0.001.
      </p>
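      <p>For reference, the RMSProp update keeps a running average of squared gradients and divides each step by its square root. A minimal single-parameter sketch; the decay rate and epsilon are common defaults, not values taken from the paper:</p>

```python
def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update for a single scalar parameter: `cache` is an
    exponential moving average of squared gradients, and the step is
    normalized by its square root (the magnitude of recent gradients)."""
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (cache ** 0.5 + eps)
    return param, cache

# Minimize f(x) = x**2 (gradient 2*x) from x = 1.0, using the paper's
# initial learning rate of 0.001.
x, cache = 1.0, 0.0
for _ in range(2000):
    x, cache = rmsprop_step(x, 2 * x, cache)
print(x)  # close to the minimum at 0
```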
      <p>Since the dataset does not contain a very large amount of training
data, overfitting is a common problem. To avoid it, we have used the
validation set to determine when to stop training: training is stopped
when the performance on the validation set stops increasing or starts
decreasing over the last two epochs. We have then retrained the system
on the combined training and validation sets for that number of epochs.
We have followed this procedure for all the experiments.</p>
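      <p>The stopping rule above can be sketched as a small helper (best_epoch is a hypothetical name; the two-epoch patience follows the text):</p>

```python
def best_epoch(val_scores, patience=2):
    """Return the 1-based epoch with the best validation score,
    halting once the score has not improved for `patience` epochs."""
    best, best_ep, since_improve = float("-inf"), 0, 0
    for ep, score in enumerate(val_scores, start=1):
        if score > best:
            best, best_ep, since_improve = score, ep, 0
        else:
            since_improve += 1
            if since_improve >= patience:
                break  # validation performance stopped increasing
    return best_ep

# Validation score peaks at epoch 3, then decreases for two epochs.
print(best_epoch([0.60, 0.68, 0.71, 0.70, 0.69]))  # -> 3
```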
      <p>We have trained the system in four different configurations: 1)
images and metadata as input, 2) only images as input, 3) only metadata
as input, and 4) images and metadata plus the extra images obtained
from Google Similar Images. The results of these four experiments on
the test set are given in Table 1. The system has been evaluated as a
retrieval task: every post in the test set is assigned a probability of
containing evidence of flood, and the posts are ranked from highest to
lowest probability. The first column of Table 1 shows the Average
Precision (AP) over the first 480 retrieved posts; the second column
shows the mean of the average precision evaluated at cutoffs of 50,
100, 250 and 480 posts.</p>
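      <p>The evaluation measure can be sketched as a standard average-precision-at-k computation over the ranked list. This follows one common definition, normalizing by the number of relevant posts found within the cutoff; the toy ranking is illustrative:</p>

```python
def average_precision_at_k(relevant, k):
    """AP@k for a ranked list: `relevant[i]` is True if the post at
    rank i+1 truly contains evidence of flood. Normalizes by the
    number of relevant posts found within the cutoff."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(relevant[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / i  # precision at this rank
    return precision_sum / hits if hits else 0.0

def mean_ap_over_cutoffs(relevant, cutoffs=(50, 100, 250, 480)):
    """Mean of AP@k over the cutoffs used in the task."""
    return sum(average_precision_at_k(relevant, k) for k in cutoffs) / len(cutoffs)

# Toy ranking of four posts: hits at ranks 1, 2 and 4.
ranked = [True, True, False, True]
print(average_precision_at_k(ranked, 4))  # (1/1 + 2/2 + 3/4) / 3
```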
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>As can be seen in Table 1, the metadata we have selected for the
task is certainly relevant for retrieving flood-related images in
social network posts, reaching over 70% mean average precision over the
four retrieval cutoffs. Since the posts were manually labeled as
containing evidence of flood or not using only the images, the image
information alone should be sufficient for the retrieval problem.
However, the performance of the algorithm using only the image drops
to 66% mean average precision over the different cutoffs. This shows
that, although images should be more discriminative for this task, the
difficulty of processing images compared to text means that the
metadata analysis gives better performance. There is also a clear
improvement when combining both types of information, reaching almost
84% mean average precision over the cutoffs, which shows that the
metadata and the image complement each other quite well. Surprisingly,
when training the system with the extra images, the mean AP drops to
76%. Since these images were manually inspected to make sure no noisy
images were added to the dataset, we suspect the result degrades
because the extra images lack metadata, as the image-only configuration
performs weakest among all experiments; however, this should be studied
further before drawing additional conclusions.</p>
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>In this paper we have proposed a multi-modal deep learning
approach to retrieve posts from social networks containing valuable
information about floods. The system can work using only visual
information, only text or combining both types of information.</p>
      <p>It has been shown that combining both types of information
greatly improves the performance of the system. For future work it
would be interesting to check whether other types of metadata could
also provide useful information for the task, for example the location
or time at which the image was taken, since some regions and seasons
are more prone to flooding. It would also be interesting to study why
adding more images to the training set worsened the performance of the
system, and how well the system generalizes to images outside the
dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by the Spanish Grants
TIN2016-75404-P AEI/FEDER, UE, TIN2014-52072-P, TIN2016-79717-R,
TIN2013-42795-P and the European Commission H2020 I-REACT project
no. 700256. Laura Lopez-Fuentes benefits from the NAERINGSPHD
fellowship of the Norwegian Research Council under the
collaboration agreement Ref.3114 with the UIB. Marc Bolaños benefits from
the FPU fellowship.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Marc</given-names>
            <surname>Bolaños</surname>
          </string-name>
          , Álvaro Peris, Francisco Casacuberta, and
          <string-name>
            <given-names>Petia</given-names>
            <surname>Radeva</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>VIBIKNet: Visual bidirectional kernelized network for visual question answering</article-title>
          .
          <source>In Iberian Conference on Pattern Recognition and Image Analysis</source>
          . Springer,
          <fpage>372</fpage>
          -
          <lpage>380</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          , Wei Dong, Richard Socher,
          <string-name>
            <given-names>Li-Jia</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          .
          <source>In Computer Vision and Pattern Recognition, 2009 (CVPR 2009), IEEE Conference on</source>
          . IEEE,
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>CL</given-names>
            <surname>Lai</surname>
          </string-name>
          , JC Yang, and YH Chen
          .
          <year>2007</year>
          .
          <article-title>A real time video processing based surveillance system for early fire and flood detection</article-title>
          .
          <source>In Instrumentation and Measurement Technology Conference Proceedings, 2007 (IMTC 2007), IEEE</source>
          . IEEE,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Lopez-Fuentes</surname>
          </string-name>
          , Joost van de Weijer, Manuel González-Hidalgo,
          <string-name>
            <given-names>Harald</given-names>
            <surname>Skinnemoen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andrew D.</given-names>
            <surname>Bagdanov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Review on Computer Vision Techniques in Emergency Situations</article-title>
          .
          <source>arXiv preprint arXiv:1708.07455</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sandro</given-names>
            <surname>Martinis</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Automatic near real-time flood detection in high resolution X-band synthetic aperture radar satellite data using context-based classification on irregular graphs</article-title>
          .
          <source>Ph.D. Dissertation</source>
          . LMU Munich.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>David C</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian J</given-names>
            <surname>Davenport</surname>
          </string-name>
          , Jeffrey C Neal,
          <string-name>
            <given-names>Guy J-P</given-names>
            <surname>Schumann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paul D</given-names>
            <surname>Bates</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Near real-time flood detection in urban and rural areas using high-resolution synthetic aperture radar images</article-title>
          .
          <source>IEEE Transactions on Geoscience and Remote Sensing 50, 8</source>
          (
          <year>2012</year>
          ),
          <fpage>3041</fpage>
          -
          <lpage>3052</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>David C</given-names>
            <surname>Mason</surname>
          </string-name>
          , Rainer Speck, Bernard Devereux, Guy JP Schumann, Jeffrey C Neal, and
          <string-name>
            <given-names>Paul D</given-names>
            <surname>Bates</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Flood detection in urban areas using TerraSAR-X</article-title>
          .
          <source>IEEE Transactions on Geoscience and Remote Sensing 48, 2</source>
          (
          <year>2010</year>
          ),
          <fpage>882</fpage>
          -
          <lpage>894</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Mathioudakis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nick</given-names>
            <surname>Koudas</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Twittermonitor: trend detection over the twitter stream</article-title>
          .
          <source>In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM</source>
          ,
          <fpage>1155</fpage>
          -
          <lpage>1158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          , Vol.
          <volume>14</volume>
          .
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Bart</given-names>
            <surname>Thomee</surname>
          </string-name>
          , David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and
          <string-name>
            <given-names>Li-Jia</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>YFCC100M: The new data in multimedia research</article-title>
          .
          <source>Commun. ACM 59</source>
          ,
          <issue>2</issue>
          (
          <year>2016</year>
          ),
          <fpage>64</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Tijmen</given-names>
            <surname>Tieleman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Lecture 6.5-RMSProp, COURSERA: Neural networks for machine learning</article-title>
          .
          <source>University of Toronto, Tech. Rep.</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>