<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection of Flooding Events in Social Multimedia and Satellite Imagery using Deep Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Benjamin Bischke</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Research Center for Artificial Intelligence (DFKI)</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Kaiserslautern</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents the solution of the DFKI team for the Multimedia Satellite Task at MediaEval 2017. In our approach, we strongly relied on deep neural networks. The results show that the fusion of visual and textual features extracted by deep networks can be effectively used to retrieve social multimedia reports which provide direct evidence of flooding. Additionally, we extend existing network architectures for semantic segmentation to incorporate RGB and Infrared (IR) channels into the model. Our results show that IR information is of vital importance for the detection of flooded areas in satellite imagery.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Satellite imagery has become increasingly accessible in recent
years. Programs such as Copernicus from ESA and Landsat
from NASA facilitate this development by providing public and
free access to the data. Large-scale datasets such as the
EuroSAT dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or the ImageCLEF Remote dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have emerged
from these programs and form the foundation for deeper
analysis of remotely sensed data. One major problem when analyzing
satellite imagery is the sparsity of data for particular locations over
time. Publicly available satellites are mostly non-stationary and
require several days to revisit the same location. To overcome
this problem, recent work leverages advances in social
multimedia analysis and combines the two data sources [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Bischke
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] demonstrated a system for the contextual enrichment of
remotely sensed events in satellite imagery by leveraging
contemporary content from social media. Similarly, the work by Ahmad
et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] crawled and linked social media data about technological
and environmental disasters to satellite imagery.
      </p>
      <p>
        Building upon these developments and putting a stronger focus
on flooding events, Bischke et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] released the Multimedia
Satellite Task at MediaEval 2017. The goal of this benchmarking task is
to augment events that are present in satellite images with social
media reports in order to provide a more comprehensive view of
the event. The task is divided into two subtasks: (1) the Disaster
Image Retrieval from Social Media task aims to retrieve
social media reports that provide direct evidence of a flooding event;
(2) the Flood Detection in Satellite Images task aims to identify regions in
satellite images which are affected by flooding.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Disaster Image Retrieval from Social Media</title>
      <p>In this section, we present our solution for the first subtask by
considering the visual and textual modalities as well as their fusion. For each
modality, we train a Support Vector Machine (SVM) with a
radial basis function (RBF) kernel on the two classes flooding and no
flooding. We obtain the ranked list of relevant social media reports
by computing the distance to the decision boundary of the SVM.
The features which we use for classifier training are discussed
in detail in the following sections.</p>
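      <p>To make the ranking step concrete, the following is a minimal sketch of such an SVM-based ranking with scikit-learn; the feature matrices and variable names are illustrative, not part of the original system.</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC

# X_train, y_train: precomputed feature vectors and labels
# (1 = flooding, 0 = no flooding); X_test: features of unseen reports.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Signed distance to the decision boundary; larger values indicate
# stronger evidence of flooding. Sorting yields the ranked list.
scores = clf.decision_function(X_test)
ranking = np.argsort(-scores)
      </preformat>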
      <p>
        1.1.1 Visual Features. Motivated by the recent advances of
Convolutional Neural Networks (CNNs) in learning high-level
representations of image content, we apply a CNN to obtain a semantic
feature representation of the images. In particular, we use the
pre-trained DeepSentiBank network [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with the X-ResNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] architecture.
X-ResNet is an extension of ResNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with cross-residual
connections to predict multiple related tasks. We extract the internal
representation of X-ResNet’s anptask_pool5 layer, resulting in a
1000-dimensional feature vector for each image. Compared to CNNs
pre-trained on ImageNet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], this approach has two advantages: (1)
DeepSentiBank was trained to predict adjective-noun pairs (ANPs).
Unlike ImageNet pre-trained models, this allows the network not only
to rely on information about object classes but also to extract
details about the image scene through adjectives (e.g., wet road, damaged
building, stormy clouds). (2) The domain shift of DeepSentiBank is
smaller compared to that of ImageNet pre-trained models. DeepSentiBank
was trained on the Visual Sentiment Ontology (VSO) dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
which contains Flickr images similar to the dataset provided by the
task organizers. Such images often include more scenic information,
whereas images from ImageNet mainly contain objects.
      </p>
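      <p>A sketch of this feature-extraction step in PyTorch is shown below. The model loader is hypothetical (the pre-trained DeepSentiBank X-ResNet checkpoint is not distributed with this paper); only the use of a forward hook on the anptask_pool5 layer reflects the described approach.</p>
      <preformat>
import torch

# load_deepsentibank_xresnet() is a hypothetical loader for the
# pre-trained DeepSentiBank X-ResNet model.
model = load_deepsentibank_xresnet()
model.eval()

features = {}
def hook(module, inputs, output):
    # flatten the pooled activation into one vector per image
    features["anptask_pool5"] = output.flatten(start_dim=1)

model.anptask_pool5.register_forward_hook(hook)

with torch.no_grad():
    model(image_batch)                  # image_batch: (N, 3, H, W)
vectors = features["anptask_pool5"]     # shape (N, 1000)
      </preformat>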
      <p>1.1.2 Metadata Features. For retrieval based only on the
metadata of social media reports, we relied on the tags given by users.
We observed that relying only on the presence of single words
such as ’flooding’ or ’flood’ is not sufficient and introduces many
irrelevant social media reports. We therefore combine individual
tags to obtain a document representation for each report.</p>
      <p>
        In the first preprocessing step, we remove numbers and convert
all tags to lowercase. We then train a Word2Vec model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (with
200 dimensions) on the user tags. For each social media report, we
average the word vectors to obtain a document representation. In
order to incorporate the importance of each word into the document
representation, we additionally weight each word embedding with
the term frequency-inverse document frequency (TF-IDF) of the
corresponding word. The intuition behind this approach is
straightforward: document vectors containing semantically
similar concepts (’flood’, ’river’, ’damage’) should point in a similar
direction in the embedding space, in contrast to documents with
word vectors of different concepts (’flood’, ’book’, ’desk’, ’drink’).
      </p>
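      <p>A minimal sketch of this weighting scheme, assuming the tags of each report are available as a pre-tokenized list (library versions and variable names are illustrative):</p>
      <preformat>
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# tag_docs: one list of preprocessed tags per social media report
w2v = Word2Vec(sentences=tag_docs, vector_size=200, min_count=1)

# IDF values per tag; the analyzer passes the tag lists through unchanged
tfidf = TfidfVectorizer(analyzer=lambda doc: doc)
tfidf.fit(tag_docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def document_vector(tags):
    # TF-IDF-weighted average of the word embeddings of a report's tags
    vecs = []
    for t in set(tags):
        if t in w2v.wv and t in idf:
            tf = tags.count(t) / len(tags)
            vecs.append(tf * idf[t] * w2v.wv[t])
    return np.mean(vecs, axis=0) if vecs else np.zeros(200)
      </preformat>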
      <p>1.1.3 Visual-Textual Fused Features. We extract the visual and
textual feature representations using the two approaches
described above. The two modalities are fused by concatenating the
feature vectors, resulting in a 1200-dimensional vector per report.</p>
    </sec>
    <sec id="sec-3">
      <title>Flood Detection in Satellite Imagery</title>
      <p>In this section, we explain our approach for the segmentation of
flooded areas in satellite images using deep neural networks.</p>
      <p>1.2.1 Pre-Processing. Before feeding the satellite data to the
networks, we perform a location-based normalization step. The
goal of this step is to remove the location bias caused by local
differences in vegetation, lighting conditions and
atmospheric distortions. For each location, we compute the mean
pixel value of each RGB and IR channel and subtract this value
from the corresponding channels of all images belonging to the same
location. The pixel values in the original satellite images are encoded
in a 16-bit number format, which turned out to be problematic for
many frameworks. To overcome this, we additionally scale the pixel
values channel-wise to the range of 0 to 255 based on the per-channel
minimum and maximum.</p>
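      <p>A sketch of this normalization, assuming all patches of one location are stacked into a single NumPy array (the array layout is an assumption of this sketch):</p>
      <preformat>
import numpy as np

def normalize_location(images):
    # images: float array of shape (N, H, W, 4) holding the RGB and IR
    # channels of all patches from one location (16-bit values as float)
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    images = images - mean                 # remove the per-location mean

    # channel-wise min/max scaling into the range [0, 255]
    lo = images.min(axis=(0, 1, 2), keepdims=True)
    hi = images.max(axis=(0, 1, 2), keepdims=True)
    return (images - lo) / (hi - lo + 1e-8) * 255.0
      </preformat>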
      <p>1.2.2 Network Architectures. We propose three different
network architectures for the segmentation problem. All networks take
the original image patch (320 x 320 pixels) as input
and predict classification labels at the pixel level.</p>
      <p>
        In our first approach, we use a fully convolutional network (FCN)
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] whose encoder has an architecture similar to VGG13 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We remove
the fully connected layers and attach an up-sampling layer with
bilinear interpolation to scale the down-sampled feature maps back to
the original image size. An additional convolutional layer is used to
predict the class labels for each pixel, and classification probabilities
are obtained by squashing the network output through a softmax
layer. Since the first input layer of VGG13 expects a tensor with
three channels, we only pass the RGB information of the
satellite data into this network. In the second network, we expand
the previous architecture by changing the input of the first layer to
four channels, allowing the network to incorporate IR information
into the prediction. In the third network, we investigate a more
complex decoder: we use the second network as base model and
replace the up-sampling layer with a reversed version of the VGG13
encoder as decoder. A sketch of the first two networks is given below.
      </p>
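      <p>The following PyTorch sketch illustrates the first two networks; the exact layer configuration of our models may differ, and class and function names are ours. Setting in_channels to 4 yields the RGB+IR variant; the third network would replace the interpolation step with a mirrored VGG13 decoder.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

def vgg_block(cin, cout, n=2):
    # n conv+ReLU layers followed by 2x2 max pooling, as in VGG13
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class FloodFCN(nn.Module):
    # VGG13-style encoder with a bilinear up-sampling decoder;
    # in_channels=3 uses RGB only, in_channels=4 adds the IR channel.
    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            vgg_block(in_channels, 64), vgg_block(64, 128),
            vgg_block(128, 256), vgg_block(256, 512), vgg_block(512, 512))
        self.classifier = nn.Conv2d(512, num_classes, 1)

    def forward(self, x):
        size = x.shape[2:]                     # (320, 320) input patches
        x = self.classifier(self.encoder(x))   # down-sampled class scores
        x = F.interpolate(x, size=size, mode="bilinear",
                          align_corners=False)
        return F.log_softmax(x, dim=1)         # per-pixel log-probabilities
      </preformat>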
      <p>1.2.3 Network Training. In order to train the above-described
networks from scratch, we extend the dataset using data
augmentation. Every image patch is flipped (left-right and up-down) and
rotated at 90-degree intervals, yielding 8 augmentations per image
patch. All networks are trained end-to-end with stochastic gradient
descent using the negative log-likelihood loss, a learning rate of
0.01 and a weight decay of 0.0005. A sketch of the augmentation and
training loop follows below.</p>
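      <p>A sketch of the eight-fold augmentation and the training loop described above; model and train_loader are placeholders for one of the three networks and a loader over (patch, mask) pairs.</p>
      <preformat>
import torch

def eightfold(t):
    # 4 rotations x {identity, left-right flip} = 8 variants; apply the
    # same transform to the image patch and to its segmentation mask.
    variants = []
    for k in range(4):
        rot = torch.rot90(t, k, dims=(-2, -1))
        variants += [rot, torch.flip(rot, dims=(-1,))]
    return variants

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            weight_decay=0.0005)
criterion = torch.nn.NLLLoss()  # expects per-pixel log-probabilities

model.train()
for patches, masks in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(patches), masks)  # masks: (N, 320, 320) labels
    loss.backward()
    optimizer.step()
      </preformat>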
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>
        The results for the first subtask are shown in Table 1. Run 1 is
based only on visual information, Run 2 only on metadata, and Run
3 on the fusion of both modalities as described in Section 1.1. It
can be seen that relying on visual information achieves a higher
Average Precision (AP) than relying on metadata only. At the same
time, the fusion of both modalities further improves the
retrieval accuracy by 1.7%. Run 4 uses only visual features from an
ImageNet pre-trained ResNet152 model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Compared to these ImageNet-based features (Run 4), the
DeepSentiBank (X-ResNet) features of Run 1 perform significantly better.
      </p>
      <p>Table 2 contains the results of the second subtask for unseen
satellite images covering the same and new locations relative to the
development set. Each of the three runs corresponds to one of the three
networks described in Section 1.2.2. Comparing the IoU of the last two
networks to the first one (Run 1) shows that the IoU increases by more
than 10%. This illustrates the importance of the IR channel for the
detection of flooded areas in satellite data. The comparison of the
last two networks against each other (Run 2 vs. Run 3) shows
a minor improvement in AP (0.1% for same and 4% for
new locations). The APs of all runs on new locations demonstrate
that the networks generalize to new places.</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>In this paper, we presented our approach for the Multimedia
Satellite Task 2017 at MediaEval. One major insight is the importance
of a multi-modal fusion of text and visual content for the retrieval
of social multimedia. In our approach, we analyzed different CNN
features and showed that DeepSentiBank X-ResNet can be used
to obtain a powerful image representation. In the second subtask
of the challenge, we applied segmentation networks to satellite
imagery to extract flooded regions. Our results show that
incorporating IR information is very important. For future work, we would
like to extend the satellite imagery with active radar data (Synthetic
Aperture Radar), which can "look" through clouds. We plan to
use the results of this work in the future for the monitoring and
prediction of flooding events.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors would like to thank NVIDIA for support within the
NVAIL program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Kashif</given-names>
            <surname>Ahmad</surname>
          </string-name>
          , Michael Riegler, Ans Riaz, Nicola Conci, Duc-Tien Dang-Nguyen, and
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The JORD System: Linking Sky and Social Multimedia Data to Natural Disasters</article-title>
          .
          <source>In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. ACM</source>
          ,
          <fpage>461</fpage>
          -
          <lpage>465</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Helbert</given-names>
            <surname>Arenas</surname>
          </string-name>
          , Md Bayzidul Islam, and
          <string-name>
            <given-names>Josiane</given-names>
            <surname>Mothe</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the ImageCLEF 2017 Population Estimation (Remote) Task</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Bischke</surname>
          </string-name>
          , Damian Borth,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Schulze</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Dengel</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Contextual enrichment of remote-sensed events with social media streams</article-title>
          .
          <source>In Proceedings of the 2016 ACM on Multimedia Conference. ACM</source>
          ,
          <fpage>1077</fpage>
          -
          <lpage>1081</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Bischke</surname>
          </string-name>
          , Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and
          <string-name>
            <given-names>Damian</given-names>
            <surname>Borth</surname>
          </string-name>
          .
          <source>The Multimedia Satellite Task at MediaEval</source>
          <year>2017</year>
          :
          <article-title>Emergency Response for Flooding Events</article-title>
          .
          <source>In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017)</source>
          . Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Damian</given-names>
            <surname>Borth</surname>
          </string-name>
          , Rongrong Ji, Tao Chen, Thomas Breuel, and
          <string-name>
            <surname>Shih-Fu Chang</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Large-scale visual sentiment ontology and detectors using adjective noun pairs</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Multimedia. ACM</source>
          ,
          <fpage>223</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tao</given-names>
            <surname>Chen</surname>
          </string-name>
          , Damian Borth, Trevor Darrell, and
          <string-name>
            <surname>Shih-Fu Chang</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1410.8586</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          , Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009)</source>
          . IEEE,
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Helber</surname>
          </string-name>
          , Benjamin Bischke, Andreas Dengel, and
          <string-name>
            <given-names>Damian</given-names>
            <surname>Borth</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification</article-title>
          .
          <source>arXiv preprint arXiv:1709.00029</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Brendan</given-names>
            <surname>Jou</surname>
          </string-name>
          and
          <string-name>
            <surname>Shih-Fu Chang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Cross Residual Learning for Multitask Visual Recognition</article-title>
          .
          <source>In ACM Multimedia. Amsterdam</source>
          , The Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Long</surname>
          </string-name>
          , Evan Shelhamer, and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>3431</fpage>
          -
          <lpage>3440</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Alan</given-names>
            <surname>Woodley</surname>
          </string-name>
          , Shlomo Geva, Richi Nayak, and Timothy Campbell.
          <year>2016</year>
          .
          <article-title>Introducing the Sky and the Social Eye</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2016 Workshop</source>
          , Vol.
          <volume>1739</volume>
          . CEUR Workshop Proceedings.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>