<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual and textual analysis of social media and satellite images for flood detection @ multimedia satellite task MediaEval 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantinos Avgerinakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Moumtzidou</string-name>
          <email>moumtzid@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stelios Andreadis</string-name>
          <email>andreadisst@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanouil Michail</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilias Gialampoukidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanos Vrochidis</string-name>
          <email>stefanos@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Research &amp; Technology Hellas - Information Technologies Institute</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper presents the algorithms that the CERTH team deployed to tackle two disaster recognition tasks: Disaster Image Retrieval from Social Media (DIRSM) and Flood Detection in Satellite Images (FDSI). Visual and textual analysis, as well as late fusion of their similarity scores, were applied to social media images, while colour analysis of the RGB and near-infrared channels of satellite images was performed to discriminate flooded from non-flooded scenes. A Deep Convolutional Neural Network (DCNN), DBpedia Spotlight and combMAX fusion were implemented to tackle DIRSM, while Mahalanobis distance-based classification and morphological post-processing were applied to FDSI.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
<p>Security, surveillance and, more specifically, disaster prediction and
classification from social media and satellite sources have attracted
considerable interest in computer science over the last decade. The
unobtrusive and abundant nature of these data makes them one of
the most valuable sources from which to deduce an early warning or an
identification of an ongoing or imminent disaster.</p>
      <p>
        The Multimedia Satellite task is a MediaEval challenge that
comprises two subtasks: (a) Disaster Image Retrieval from Social
Media (DIRSM) and (b) Flood Detection in Satellite Images (FDSI).
DIRSM provides a large collection of social media images (from the
YFCC100M dataset) together with their Flickr metadata, while FDSI
comprises a large number of four-channel satellite images (three RGB
channels and one near-infrared channel) from PlanetLabs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Both tasks
ask the participants to leverage any available technology to
determine whether a flood event occurs in the provided test
data. As far as visual data are concerned, a flood event is considered
to occur when an image shows an "unexpected high water level in
industrial, residential, commercial and agricultural areas". The reader is
referred to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for further information about the contest and
the provided data.
      </p>
      <p>
        In this work, CERTH presents its algorithms for the DIRSM and FDSI
subtasks. For flood recognition in images, CERTH uses the output of
the last pooling layer of a trained GoogleNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as a global keyframe
representation and trains an SVM classifier to recognize images that
are related to a flooding event. Textual information is also extracted
from the metadata of the social media images by using the
DBpedia Spotlight annotation tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The two modalities are
fused with a novel multimodal approach that combines non-linear
graph-based fusion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with combMAX scoring. For the FDSI subtask,
CERTH performs Mahalanobis distance classification followed by several
morphological and adaptive filters, so as to separate flood from
non-flood areas inside the satellite image scene.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Flood detection from social media (DIRSM)</title>
      <p>Social media were crawled in this task so as to acquire images and
text about flood scenarios. For that purpose, two modalities were
deployed and fused with a non-linear graph-based fusion approach.</p>
      <p>
        The first modality concerns visual analysis and, more
specifically, flood detection inside image samples by adopting a Deep
Convolutional Neural Network (DCNN) framework. GoogleNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
was trained on 5055 ImageNet concepts, and the output of the last
pooling layer, with dimension 1024, was used as a global keyframe
representation. The provided development set was then split
into two subsets and used to train an SVM classifier and to select its
optimal parameters: t (the kernel type) and g (the gamma of the
kernel function). The best results were achieved for t = 1
(polynomial kernel) and g = 0.5. The test environment that CERTH built
included the evaluation of the precomputed features provided by
the Multimedia-Satellite challenge (i.e. acc, gabor, fcth, jcd, cedd,
eh, sc, cl, and tamura) and of DCNN features produced by
the Places205-GoogLeNet network by fusing the features of
convolutional layers 3a and 3b. SVM classifiers were trained on
all of these features, and the results showed that the proposed DCNN
feature significantly outperformed most of them.
      </p>
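      <p>A minimal sketch of this classification step is given below: a polynomial-kernel SVM (LIBSVM's t = 1 with g = 0.5, as reported above) is trained on 1024-dimensional pooling-layer descriptors. The scikit-learn API and the random stand-in features and labels are illustrative assumptions; the paper does not specify the implementation used.</p>

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in 1024-D pooling-layer descriptors (random, for illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 1024))          # global keyframe descriptors
y_train = np.array([0, 1] * 100)                # 1 = flood, 0 = non-flood

# Polynomial kernel (LIBSVM t = 1) with gamma = 0.5, as reported in the text.
clf = SVC(kernel="poly", gamma=0.5, probability=True)
clf.fit(X_train, y_train)

# Flood-relevance scores used to rank the test images.
X_test = rng.normal(size=(10, 1024))
scores = clf.predict_proba(X_test)[:, 1]
```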
      <p>
        The second modality concerns the detection of flood-related text
in social media metadata. For that purpose, DBpedia Spotlight [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
was adapted to detect flood, water and related keyphrases that
were acquired from the training set metadata (i.e. title, description,
user tags). A disambiguation algorithm followed, comparing the
aforementioned phrases with the collection using Jaccard
similarities. The similarity scores of the two modalities were then combined
with a late fusion approach that uses non-linear graph-based
techniques (random walk, diffusion-based) in a weighted
non-linear way [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The top-l multimodal objects are filtered with
respect to textual concepts, leading to l × l similarity matrices S1, S2
and query-based l × 1 similarity vectors s1 and s2. More specifically,
10 positive examples were selected from the training set as queries
so as to acquire 10 ranked lists, and combMAX late fusion was used
to obtain the final list of flood-relevant multimodal objects. The
overall block diagram of this approach is depicted in Fig. 1.
      </p>
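      <p>The two textual building blocks above, Jaccard similarity over metadata tokens and combMAX fusion of per-query ranked lists, can be sketched as follows. The token lists and scores are toy values for illustration.</p>

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    a, b = set(a), set(b)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def comb_max(ranked_lists):
    """combMAX late fusion: each object keeps its maximum score
    across all per-query ranked lists, then objects are re-ranked."""
    fused = {}
    for scores in ranked_lists:
        for obj, s in scores.items():
            fused[obj] = max(fused.get(obj, float("-inf")), s)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: metadata tokens vs. flood-related keyphrases.
meta = "heavy flood water in a residential area".split()
sim = jaccard(meta, ["flood", "water"])   # 2 shared tokens out of 7 total

# Two hypothetical per-query ranked lists fused with combMAX.
ranking = comb_max([{"img1": 0.9, "img2": 0.2},
                    {"img1": 0.1, "img2": 0.8}])
```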
    </sec>
    <sec id="sec-4">
      <title>Flood detection from satellite images (FDSI)</title>
      <p>
        Satellite images were collected from PlanetLabs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] so that we could
evaluate our localization algorithm in real-case scenarios.
Localization is based on a Mahalanobis classification framework and on
post-processing morphological operations.
      </p>
      <p>Mahalanobis distances with stratified covariance estimates were
computed to train our classifier by randomly selecting 10000
samples (RGB and infrared pixels) from each of the 7 sets of satellite images,
leading to a final population of 70000 samples. Linear, diagonal
linear, quadratic and diagonal quadratic discriminant functions
were also computed, but Mahalanobis distances achieved the
highest classification results. For every image of the testing set, all pixels
of the image were extracted, creating a four-dimensional (R, G, B, NIR)
testing set of 102400 samples (320 pixels × 320 pixels)
per image. The final outcome was a binary mask that denoted 1 for
flooded pixels and 0 for non-flooded ones.</p>
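      <p>A minimal NumPy sketch of the per-pixel Mahalanobis decision rule is shown below. The class means and covariances here are synthetic stand-ins; in the actual system they would be estimated from the 70000 training samples.</p>

```python
import numpy as np

def mahalanobis_classify(pixels, class_means, class_covs):
    """Assign each 4-D (R, G, B, NIR) pixel to the class whose mean
    is closest under the Mahalanobis distance."""
    dists = []
    for mu, cov in zip(class_means, class_covs):
        inv = np.linalg.inv(cov)
        d = pixels - mu
        # Squared Mahalanobis distance d^T * inv(cov) * d per pixel.
        dists.append(np.einsum("ij,jk,ik->i", d, inv, d))
    return np.argmin(np.stack(dists), axis=0)   # here, 1 = flood class

# Toy two-class setup with synthetic statistics.
rng = np.random.default_rng(1)
mu_dry = np.array([120.0, 110.0, 90.0, 80.0])   # non-flood mean
mu_wet = np.array([60.0, 70.0, 110.0, 30.0])    # flood mean
cov = np.eye(4) * 25.0
px = np.vstack([rng.normal(mu_dry, 5.0, (50, 4)),
                rng.normal(mu_wet, 5.0, (50, 4))])
mask = mahalanobis_classify(px, [mu_dry, mu_wet], [cov, cov])
```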
      <p>Post-processing was then applied to the acquired binary masks
in order to eliminate erroneous areas that resulted from the noisy
nature of the dataset. A global filter was first applied to the
binary mask so as to discard masks whose population of flood-denoted pixels,
as a whole, did not surpass 5% of the image size. Similarly, a local
filter followed, eliminating the connected components of
flood-denoted areas that did not surpass a size of 10 pixels. Image
dilation and erosion were finally applied around each pixel and its
surrounding area (a circular cell with a radius of 4 pixels) to eliminate
small areas that were falsely denoted as flood while
preserving the larger ones.</p>
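      <p>The three post-processing stages can be sketched with SciPy as below. We read the final dilation/erosion step as a morphological opening with a circular structuring element; the thresholds match the text, but the exact implementation is an assumption.</p>

```python
import numpy as np
from scipy import ndimage

def postprocess(mask, global_frac=0.05, min_cc=10, radius=4):
    """Clean a binary flood mask: discard it entirely if the flooded
    fraction is below global_frac, remove connected components smaller
    than min_cc pixels, then open with a circular structuring element."""
    if global_frac * mask.size > mask.sum():
        return np.zeros_like(mask)                       # global filter
    labels, n = ndimage.label(mask)                      # local filter
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    keep = np.isin(labels, 1 + np.flatnonzero(sizes >= min_cc))
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = radius * radius >= x * x + y * y              # circular cell
    return ndimage.binary_opening(keep, structure=disk)

# Toy mask: one plausible flooded region plus one isolated noise pixel.
mask = np.zeros((40, 40), dtype=bool)
mask[5:25, 5:25] = True     # 400 px, 25% of the image
mask[35, 35] = True         # isolated false positive
cleaned = postprocess(mask)
```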
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>Social media results for flood situations (DIRSM) are gathered in
Table 1. Two retrieval approaches were used: (a) a single-cutoff scheme
that returns the top-480 most similar samples and (b) a multiple-cutoff
scheme that combines the results of 4 different thresholds, equal to
50, 100, 250 and 480, by averaging their scores so as to produce a
final list.</p>
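      <p>The exact averaging in the multiple-cutoff scheme is not fully specified above; one plausible reading, sketched below, credits an item for every cutoff whose top-k list contains it and averages these credits, so that items ranked highly under all cutoffs float to the top.</p>

```python
def multi_cutoff(ranked, cutoffs=(50, 100, 250, 480)):
    """Average per-cutoff relevance: an item gains 1/len(cutoffs)
    for each cutoff whose top-k list contains it."""
    scores = {}
    for k in cutoffs:
        for item in ranked[:k]:
            scores[item] = scores.get(item, 0.0) + 1.0 / len(cutoffs)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy ranking with small cutoffs for illustration.
fused = multi_cutoff(["a", "b", "c", "d"], cutoffs=(2, 4))
```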
      <p>It is obvious that multiple cutoffs worked better than a single one.
Furthermore, we can observe that the visual modality surpassed the
textual one by far; this is mainly attributed to the fact that some
keywords related to flood and water may appear in several
irrelevant contexts, leading text retrieval to very low accuracy
rates. Fusion is also affected by the low performance of the textual
modality and cannot leverage or complement the visual information
in the final decision, leading to lower accuracy rates than the visual
modality alone.</p>
      <p>Results on satellite images (FDSI) are presented in Table 2.
The accuracy rates are quite diverse: we obtained
very high rates in some locations, such as loc01 and loc03, while
others, such as loc04 and loc05, were too low. From our point
of view, this is attributed to the colour nature of the data in these
areas: in the former, the separation of water was clear, while in
the latter, non-flood areas had a colour similar to the flooded ones.
Furthermore, the ground-truth masks marked some non-flood pixels
as flood and, since our algorithm is pixel-wise, these were
misclassified as positive samples, leading to poorly performing models.
Overall, our classifier led to a 74.67% localization accuracy rate.</p>
    </sec>
    <sec id="sec-6">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>The Multimedia Satellite challenge gave us the opportunity to test our
algorithms in real-case disaster scenarios. Social media and satellite
sources proved extremely valuable and helped us separate flood
scenarios from others. The high average precision rate that the visual
features achieved shows that the computer vision community can
become ever more helpful in disaster detection, and it is now clear that
it can overcome the ambiguity that text can introduce into the final
decision. On the other hand, satellite images proved quite noisy and
require deeper investigation in the future.</p>
      <p>As future work, we plan to adopt deeper techniques from
the literature to recognize and discriminate places from each
other, and we also plan to investigate hybrid representations that
combine shallow with deep features so as to achieve even higher
precision rates in the visual part of the system. Text approaches
should undoubtedly be revised and tailored to disaster-related
scenarios, while fusion approaches that include “semantic filtering”
stages based on textual concepts will also be revisited. Regarding FDSI,
we plan to build a shallow/deep representation scheme that
leverages both texture (i.e. LBP) and deep features so as to learn to
separate flood from non-flood areas even more effectively.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is supported by beAWARE project, partially funded by
the European Commission (H2020-700475).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Bischke</surname>
          </string-name>
          , Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and
          <string-name>
            <given-names>Damian</given-names>
            <surname>Borth</surname>
          </string-name>
          .
          <source>The Multimedia Satellite Task at MediaEval</source>
          <year>2017</year>
          :
          <article-title>Emergency Response for Flooding Events</article-title>
          .
          <source>In Proc. of the MediaEval 2017 Workshop (Sept</source>
          .
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2017</year>
          ). Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Daiber</surname>
          </string-name>
          , Max Jakob, Chris Hokamp, and
          <string-name>
            <surname>Pablo</surname>
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Mendes</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Improving Efficiency and Accuracy in Multilingual Entity Extraction</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics).</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ilias</given-names>
            <surname>Gialampoukidis</surname>
          </string-name>
          , Anastasia Moumtzidou, Dimitris Liparas, Theodora Tsikrika, Stefanos Vrochidis, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Multimedia retrieval based on non-linear graph-based fusion and partial least squares regression</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          (
          <year>2017</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Wei Liu, Yangqing Jia,
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott E.</given-names>
            <surname>Reed</surname>
          </string-name>
          , Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>In CVPR. IEEE Computer Society</source>
          , 1-
          <fpage>9</fpage>
          . http://dblp.uni-trier.de/db/conf/cvpr/ cvpr2015.html#SzegedyLJSRAEVR15
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>Planet Team</string-name>
          .
          <year>2017</year>
          .
          <article-title>Planet Application Program Interface: In Space for Life on Earth</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>