<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Retrieving Social Flooding Images Based on Multimodal Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhengyu Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents the participation of the RU-DS team at the MediaEval 2017 Multimedia Satellite Task. We design a system for retrieving social images that show direct evidence of flooding events using a multimodal approach based on visual features from images and the corresponding metadata. Specifically, we implement preprocessing operations including image cropping and test-set pre-filtering based on image color complexity or textual metadata, as well as re-ranking for fusion. Tests on the YFCC100M-Dataset show that the fusion-based approach outperforms the methods based on only visual features or metadata.</p>
      </abstract>
      <kwd-group>
        <kwd>Pre-processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Recent advances in satellite imagery and the popularity of social media
are opening up a new interdisciplinary area for earth monitoring,
especially for natural disasters. The objective of the MediaEval 2017
Multimedia Satellite Task is to enrich satellite information with
multimodal social media for a more comprehensive view of flooding
events [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We participate in the Disaster Image Retrieval from
Social Media subtask, which requires us to retrieve social images that
show direct evidence of flooding events. Previous work [
        <xref ref-type="bibr" rid="ref1 ref4 ref5">1, 4, 5</xref>
        ]
addresses a similar challenge by leveraging visual and textual
content from social media to enrich remote-sensed events in satellite
imagery. In this paper, we investigate the use of visual
features and textual metadata for image representation, and
propose a fusion method based on test-set pre-filtering and list
re-ranking.
      </p>
    </sec>
    <sec id="sec-2">
      <title>PROPOSED APPROACH</title>
      <p>Table 1 describes the approaches used for our three
runs, which involve three different parts: pre-processing, feature
extraction and fusion strategy. For the first run (Visual), we apply
image cropping and test-set pre-filtering based on color complexity,
and use an SVM classifier on visual features to rank the images in
descending order of the output decision values. For the second run
(Text), we rank the images by searching for flood-related keywords
in the metadata without any pre-processing. Finally, for the third run
(Fusion), we develop a three-step approach: first the Run 2 system for
pre-filtering, then the Run 1 system for ranking, and finally the Run
2 system again for re-ranking.</p>
    </sec>
    <sec id="sec-3">
      <title>Visual Features</title>
      <p>We have investigated nine conventional visual descriptors provided
by the task organizers on the dev-set using an SVM classifier, and
found the CEDD feature, which incorporates color and texture
information, achieved the best performance.</p>
      <p>Table 1: Approaches used for the three runs. Run 1: Visual features. Run 2: Text features. Run 3: Fusion, with pre-filtering, visual+text features, and re-ranking.</p>
      <sec id="sec-3-5">
        <p>
          The approach of our Visual run is based on the insight that the
parts of the image containing the body of flood water are more important
than other parts for flood retrieval. Because the body of flood water is
usually located in the lower part of a flooding image [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we try to
extract this part from each test image. Experiments on the dev-set show
that removing the top 60% of the image as well as 10% on each
side yields an accuracy improvement of 4.5%.
Moreover, using cropped images saves computation time during feature
extraction and eliminates interference from the sky region.
        </p>
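        <p>The cropping step described above can be sketched as follows. This is a minimal illustration rather than the implementation used for the runs: the function name crop_lower_center, the representation of an image as a list of pixel rows, and the integer rounding of the crop boundaries are assumptions made for the example.</p>
        <preformat>
```python
def crop_lower_center(image, top_pct=60, side_pct=10):
    """Drop the top `top_pct` percent of the rows and `side_pct` percent
    of the columns on each side.

    `image` is a list of rows; each row is a list of pixels of any type.
    Boundaries are rounded down to whole pixels (an assumption).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    top = height * top_pct // 100    # index of the first row that is kept
    side = width * side_pct // 100   # columns dropped on the left and on the right
    return [row[side:width - side] for row in image[top:]]


# Example: a 10 x 10 image whose pixels are their (row, column) coordinates.
image = [[(r, c) for c in range(10)] for r in range(10)]
cropped = crop_lower_center(image)
print(len(cropped), len(cropped[0]))  # 4 8
```
        </preformat>
        <p>In this example only the lower 40% of the rows and the central 80% of the columns remain, so features are extracted from 32% of the original pixels.</p>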
        <p>Another insight that we use is the observation that
flood regions are visually homogeneous. We exploit this insight by
computing the color complexity of the cropped images. Color
complexity is defined as
<disp-formula><tex-math>\text{Color complexity} = \frac{N_h}{S},</tex-math></disp-formula>
where N<sub>h</sub> is the number of hues in the HSV image and S is the
area of the image, i.e. the number of pixels. As shown in Fig. 1, the
cropped non-flooding images tend to have higher color complexity
than the flooding ones. We set an empirical threshold, T = 0.05,
so as to remove a good number of non-flooding images but few
flooding ones. The removed images are ranked in ascending
order of color complexity at the end of the final list.</p>
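        <p>A minimal sketch of the color complexity measure and the threshold-based pre-filtering is given below. Several details are assumptions, since the text does not specify them: hues are quantized into 360 integer bins, an image is a flat list of RGB pixels, an image is removed when its complexity is strictly greater than the threshold, and the names color_complexity and prefilter are hypothetical.</p>
        <preformat>
```python
import colorsys


def color_complexity(pixels, hue_bins=360):
    """Return N_h / S for a flat list of (r, g, b) pixels with channels in 0..255.

    N_h is the number of distinct quantized hues and S the number of pixels.
    The quantization into `hue_bins` bins is an assumption.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    hues = set()
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hues.add(int(h * hue_bins) % hue_bins)
    return len(hues) / len(pixels)


def prefilter(images, threshold=0.05):
    """Split image names into (kept, removed).

    Images whose complexity exceeds `threshold` are removed; the removed
    names are returned in ascending order of color complexity.
    """
    scored = [(color_complexity(px), name) for name, px in images.items()]
    kept = [name for score, name in scored if threshold >= score]
    removed = [name for score, name in sorted(scored) if score > threshold]
    return kept, removed


# Toy images: one homogeneous, two with four clearly different hues.
four_hues = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
images = {
    "water": [(30, 60, 120)] * 100,  # 1 hue over 100 pixels
    "street": four_hues * 5,         # 4 hues over 20 pixels
    "market": four_hues * 10,        # 4 hues over 40 pixels
}
kept, removed = prefilter(images)
print(kept, removed)
```
        </preformat>
        <p>The removed list can be appended to the end of the final ranking as is, because it is already sorted from low to high color complexity.</p>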
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Textual Metadata Features</title>
      <p>To rank the images, we search for the flood-related keywords "flood(s)",
"flooding" and "flooded" in the three main fields "User_tags", "Title" and
"Description" of the accompanying metadata.
Table 2 shows the relationship between keyword occurrence and
relevance, reflected by the precision scores on the dev-set images,
where 1 indicates that flood-related keywords are present in a
specific field, and 0 means they are not. We use "x x x" in the first three
columns to indicate eight general conditions and two special ones
(one condition per row except for the header row). The last two
columns show the corresponding retrieval precision scores for each
condition.</p>
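      <p>The "x x x" encoding of keyword occurrence can be illustrated with a short sketch. Whole-word, case-insensitive matching is an assumption (the text does not state how keywords are matched), and the function name condition and the dictionary-based metadata record are hypothetical.</p>
      <preformat>
```python
import re

# Matches "flood", "floods", "flooding" and "flooded" as whole words.
# Whole-word, case-insensitive matching is an assumption.
FLOOD_RE = re.compile(r"\bflood(s|ing|ed)?\b", re.IGNORECASE)
FIELDS = ("User_tags", "Title", "Description")


def condition(metadata):
    """Return the "x x x" pattern: 1 if a flood keyword occurs in the field, else 0."""
    flags = ["1" if FLOOD_RE.search(metadata.get(f) or "") else "0" for f in FIELDS]
    return " ".join(flags)


example = {
    "User_tags": "river,flood,2013",
    "Title": "After the storm",
    "Description": "The street was flooded.",
}
print(condition(example))  # 1 0 1
```
      </preformat>
      <p>Each image then falls into exactly one of the eight general conditions, from "0 0 0" to "1 1 1".</p>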
      <p>
        Overall, we find that keywords in the "User_tags"
field are the most helpful, and keywords in the "Title" field are less
reliable. Also, "Description" tends to give misleading information.
Furthermore, because the ground truth defines images showing
"unexpected high water levels in industrial, residential, commercial
and agricultural areas" as positives [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the conditions "1+water - -"
and "- flooded -" (where the presence of water bodies is implied)
are more likely to be positive.
      </p>
      <p>To create the final result list for Run 2, we concatenate
the sublists retrieved by each of the eight general conditions above, in
descending order of precision score. Within each sublist,
we place the images that also meet either of the two special conditions,
"1+water - -" or "- flooded -", at the top.</p>
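      <p>The construction of the Run 2 list can be sketched as a single stable sort. The precision values below are placeholders chosen for illustration, not the scores from Table 2, and the function name rank_by_metadata and the input layout are hypothetical.</p>
      <preformat>
```python
def rank_by_metadata(items, precision):
    """Order image ids by the dev-set precision of their condition, highest first.

    `items` maps image id to (condition, is_special), where is_special marks
    images that also meet "1+water - -" or "- flooded -". `precision` maps
    each condition to its dev-set precision. Within a sublist, special images
    come first; remaining ties keep their input order (sorted is stable).
    """
    def key(image_id):
        cond, is_special = items[image_id]
        return (-precision[cond], not is_special)

    return sorted(items, key=key)


# Placeholder precision scores (NOT the values reported in Table 2).
precision = {"1 1 1": 0.9, "1 0 0": 0.8, "0 1 0": 0.5, "0 0 0": 0.1}
items = {
    "a": ("0 0 0", False),
    "b": ("1 0 0", False),
    "c": ("1 1 1", False),
    "d": ("1 0 0", True),
    "e": ("0 1 0", True),
}
print(rank_by_metadata(items, precision))  # ['c', 'd', 'b', 'e', 'a']
```
      </preformat>
      <p>Sorting on the pair (negated precision, not special) is equivalent to concatenating the per-condition sublists and moving the special images to the top of each one.</p>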
    </sec>
    <sec id="sec-5">
      <title>Feature Fusion</title>
      <p>In this section, we describe our fusion strategy based on pre-filtering
and re-ranking using both visual and metadata information. First,
we rank all the images that meet the conditions "0 0 0" and "0
0 1" using our metadata-based system to generate sublist 2,
which forms the end of the final list because, as shown in
Table 2, these images are very unlikely to be positive. Then, the
rest of the images are fed into our visual-based system and ranked
in descending order of decision value to generate sublist 1.
Finally, we re-rank the images in sublist 1 whose decision values
are non-positive using our metadata-based system again.</p>
      <p>Table 3: Results for Run 1, Run 2, and Run 3 (column: mAP @ (50, 100, 250, 480)).</p>
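      <p>The three-step fusion described in this section can be sketched as follows. The function name fuse, its arguments, and the toy text_rank ordering are hypothetical; in particular, text_rank stands in for the metadata-based system, and the priority table used in the example is an arbitrary placeholder.</p>
      <preformat>
```python
def fuse(image_ids, condition, svm_score, text_rank):
    """Three-step fusion: text pre-filtering, visual ranking, text re-ranking.

    condition:  image id -> "x x x" keyword pattern
    svm_score:  image id -> SVM decision value (higher means more flood-like)
    text_rank:  function that orders a list of image ids by metadata
    """
    unlikely = {"0 0 0", "0 0 1"}
    # Step 1: images with no keyword in tags or title form sublist 2 (end of list).
    sublist2 = text_rank([i for i in image_ids if condition[i] in unlikely])
    rest = [i for i in image_ids if condition[i] not in unlikely]
    # Step 2: rank the remaining images by descending decision value (sublist 1).
    sublist1 = sorted(rest, key=lambda i: svm_score[i], reverse=True)
    # Step 3: keep the positive part; re-rank the non-positive part by metadata.
    positive = [i for i in sublist1 if svm_score[i] > 0]
    non_positive = text_rank([i for i in sublist1 if 0 >= svm_score[i]])
    return positive + non_positive + sublist2


# Toy data; the priority order is a placeholder for the metadata-based ranking.
condition = {"a": "1 0 0", "b": "0 0 0", "c": "0 1 0", "d": "0 0 1", "e": "1 1 0"}
svm_score = {"a": 0.7, "b": 0.9, "c": -0.2, "d": 0.1, "e": -0.5}
priority = {"1 1 0": 0, "1 0 0": 1, "0 1 0": 2, "0 0 1": 3, "0 0 0": 4}


def text_rank(ids):
    return sorted(ids, key=lambda i: priority[condition[i]])


print(fuse(list("abcde"), condition, svm_score, text_rank))
# ['a', 'e', 'c', 'd', 'b']
```
      </preformat>
      <p>In the example, image b has the highest decision value but still ends up last because it is pre-filtered by its metadata, while e overtakes c during re-ranking despite its lower decision value.</p>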
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND DISCUSSION</title>
      <p>Table 3 presents the official results for our three submitted runs
on the test-set. The third run achieves the best
performance for both evaluation metrics. We also observe that the
retrieval process benefits from fusing visual and metadata
information. Specifically, implementing test-set pre-filtering based on
flood-related keywords in our metadata-based approach leads to
considerably better performance than our visual-based approach
based on color complexity. Further, the visual-based approach
performs better than the metadata-based one in the
conditions other than "0 0 0" and "0 0 1". The reason for this effect could be
that some images mentioning flooding in the metadata are relevant
to flooding, but do not visually depict any floodwater. Such images
are labeled as negatives in the ground truth.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION AND OUTLOOK</title>
      <p>In this paper, we presented an approach for retrieving images
showing evidence of flooding events based on visual and textual
(metadata) information. The final results show that using both visual and textual
features outperforms using either feature individually.</p>
      <p>During the exploratory experiments that led to our Run 1 Visual
approach, we tried first dividing the cropped image into blocks
before the other steps, and then computing the final score for an
image from the scores of its blocks. This approach did not
achieve better performance. This may be because most blocks taken
from a homogeneous region in non-flooding images are more likely
to be regarded as a body of flood water when there is no contribution from
a global feature that captures information such as the white line on
the road or the boats on the river.</p>
      <p>In the future, we will try segmentation algorithms to extract
the body of flood water more accurately and develop better visual
descriptors to differentiate the body of flood water from other water
bodies in non-flooding images in the large-scale dataset. We will
also explore the word relations between user tags to avoid
mistaken decisions when the flood depicted in the image does not consist
of water. Finally, for the fusion strategy, methods of combining the
feature vectors of different modalities will be explored.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is partially supported by the China Scholarship Council
(201706250044).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Bischke</surname>
          </string-name>
          , Damian Borth,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Schulze</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Dengel</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Contextual Enrichment of Remote-Sensed Events with Social Media Streams</article-title>
          .
          <source>In ACM Multimedia Conference</source>
          <year>2016</year>
          . ACM,
          <fpage>1077</fpage>
          -
          <lpage>1081</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Bischke</surname>
          </string-name>
          , Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and
          <string-name>
            <given-names>Damian</given-names>
            <surname>Borth</surname>
          </string-name>
          .
          <article-title>The Multimedia Satellite Task at MediaEval 2017: Emergence Response for Flooding Events</article-title>
          .
          <source>In Proc. of the MediaEval 2017 Workshop</source>
          (Sept. 13-15,
          <year>2017</year>
          ). Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Paulo</given-names>
            <surname>Vinicius Koerich Borges</surname>
          </string-name>
          , Joceli Mayer, and
          <string-name>
            <given-names>Ebroul</given-names>
            <surname>Izquierdo</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>A Probabilistic Model for Flood Detection in Video Sequences</article-title>
          .
          <source>In IEEE International Conference on Image Processing</source>
          <year>2008</year>
          . IEEE,
          <fpage>13</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Takeshi</given-names>
            <surname>Sakaki</surname>
          </string-name>
          , Makoto Okazaki, and
          <string-name>
            <given-names>Yutaka</given-names>
            <surname>Matsuo</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Earthquake shakes twitter users: Real-time event detection by social sensors</article-title>
          .
          <source>In Proceedings of the 19th International Conference on World Wide Web</source>
          . ACM,
          <fpage>851</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jie</given-names>
            <surname>Yin</surname>
          </string-name>
          , Andrew Lampert, Mark Cameron, Bella Robinson, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Power</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Using social media to enhance emergency situation awareness</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>27</volume>
          ,
          <issue>6</issue>
          (November
          <year>2012</year>
          ),
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>