<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Domain-based Late-Fusion for Disaster Image Retrieval from Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Minh-Son Dao</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Quang-Nhat-Minh Pham</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duc-Tien Dang-Nguyen</string-name>
          <email>duc-tien.dang-nguyen@dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin City University</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>FPT Technology Research Institute, FPT University</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
<p>We introduce a domain-specific and late-fusion algorithm to cope with the challenge raised in the MediaEval 2017 Multimedia Satellite Task. Several known techniques are integrated based on domain-specific criteria: late fusion, tuning, ensemble learning, object detection using deep learning, and temporal-spatial event confirmation. Experimental results show that the proposed algorithm can overcome the main challenges of properly discriminating water levels in different areas as well as considering different types of flooding events.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>This paper presents a method specially built to address subtask 1 of the MediaEval 2017 Multimedia Satellite Task [2]. We propose an ensemble learning and tuning method for Disaster Image Retrieval from Social Media in order to overcome the challenge of using restricted visual features and metadata, as well as to increase classification accuracy by using data from external resources. Details of the methodology are given in Section 2, results on this task are reported and discussed in Section 3, and conclusions and future work are stated in the last section.</p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY</title>
      <p>The following subsections discuss each run required by the organizer. We formalize the problem as an ensemble learning and tuning task, in which we use the visual features provided by the organizer for each image. These visual features are used with supervised learners to create classifiers. Our visual-based method includes three components: late fusion, tuning, and ensemble learning, designed as follows:
• Stage 1: a set of classifiers, one per feature type, is created by using supervised learners (SLs) whose outputs are reported in regression form (i.e. the output variable takes continuous values). The late-fusion technique is applied to these outputs to form the multimodal feature combination (MFC), as follows: MFC = Σ_{i=1}^{N} (w_i ∗ SL_i), where Σ_{i=1}^{N} w_i = 1.
• Stage 2: bagging is applied to these features and learners to increase the accuracy of the system. A multiple data set (MDS) is created by randomly dividing the training data set into m non-intersecting folds (DS_m). At the end of this stage, a combined classifier, namely the combined learner (CL), is created.
• Stage 3: after running a testing phase on these data sets, a new training data set for tuning is created and learned by using a specific supervised learner and a certain subset of feature types. The output of this stage is called the tuning learner (TL). Subsection 2.1.1 describes how to establish this tuning data set. The idea behind this stage is to apply boosting and bagging techniques to create a tuning classifier for samples that fall on the wrong side of the hyperplane of a previous classifier.
• Stage 4: the ensemble learner (EL) is created as EL = w1 ∗ TL + w2 ∗ CL, where w1 + w2 = 1.</p>
      <p>
        2.1.1 Creating a tuning data set.
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Let PR_k = {pred_i, gt_i}_{i=1:N} be the output set obtained when applying the CL from Stage 2 to DS_k, where pred_i and gt_i denote the predicted value and the ground-truth label of the i-th image, respectively. A descending sort is applied to PR_k so that the image with the largest predicted value is at the top. Here, the predicted label of the i-th image is calculated as label_i = f(pred_i, threshold).
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) For each i-th image whose ground-truth label gt_i and predicted label label_i do not match, collect its k nearest-neighbour images whose gt_j equals label_j (i.e. collect k true-positive and true-negative neighbours) by taking k/2 samples from position (i − 1) down to 1 and from position (i + 1) up to N, respectively.
      </p>
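      <p>The tuning-set construction above can be sketched as follows. This is a minimal Python illustration; the function and variable names, the 0.5 threshold, and the list-based bookkeeping are our own assumptions rather than the authors' code:</p>

```python
def make_tuning_set(preds, gts, k=10, threshold=0.5):
    """Sketch of the tuning-set creation in subsection 2.1.1.

    preds: regression scores from the combined learner (CL),
    gts:   0/1 ground-truth labels for the same images.
    Returns (misclassified index, correctly classified neighbour indices).
    """
    # Sort by predicted value, descending: largest prediction on top.
    order = sorted(range(len(preds)), key=lambda i: preds[i], reverse=True)
    labels_gt = [gts[i] for i in order]
    # label_i = f(pred_i, threshold): a simple thresholding assumption.
    labels_pred = [1 if preds[i] >= threshold else 0 for i in order]

    tuning = []
    for i, (gt, pred) in enumerate(zip(labels_gt, labels_pred)):
        if gt == pred:
            continue  # only misclassified samples seed the tuning set
        # k/2 correctly classified neighbours from position i-1 down to 1 ...
        above = [j for j in range(i - 1, -1, -1)
                 if labels_gt[j] == labels_pred[j]][: k // 2]
        # ... and k/2 from position i+1 up to N.
        below = [j for j in range(i + 1, len(order))
                 if labels_gt[j] == labels_pred[j]][: k // 2]
        tuning.append((order[i], [order[j] for j in above + below]))
    return tuning
```

      <p>For example, with four images scored [0.9, 0.8, 0.3, 0.1] and labels [1, 0, 0, 0], only the second image is misclassified, and its correctly classified neighbours form the tuning entry.</p>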
    </sec>
    <sec id="sec-3">
      <title>Meta-data-based Image Retrieval</title>
      <p>
        We formalize image retrieval as a text categorization task, in which we extract textual features from the meta-data associated with each image. The textual features are used as basic features for training a feed-forward neural network (FFNN) with a single hidden layer. Our meta-data-based classification method includes three components: pre-processing, feature extraction, and neural-network training, described as follows:
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Pre-processing: we clean the text data by removing hyperlinks, image paths, and image names, and we remove all URLs in the text; these steps are performed with regular expressions. After that, we perform word tokenization and remove all punctuation. For a user tag containing multiple words, we join the words in the tag to form a phrase and treat that phrase as a single word. We use the nltk toolkit [1] for word tokenization.
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) Textual Feature Extraction: for a text categorization task, bag-of-words features are basic and straightforward. We use only bag-of-words features and represent an image as an n-dimensional sparse vector, where n is the maximum number of words in the vocabulary extracted from the training data set. A feature is activated if the meta-data of the image contains the corresponding word in the vocabulary. In our experiments, we use n = 10,000. We extract features from three attributes of the meta-data: the title, description, and tags of an image.
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) Neural Network Architecture: we use a feed-forward neural network with one hidden layer containing 128 units. In training, we use a batch size of 20 and apply drop-out with a drop-out coefficient of 0.5. The output layer is a softmax layer, so the final network outputs the probability that an image is related to a flood or not. We adopt the Keras framework [4] to build the neural network.
      </p>
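      <p>The meta-data pipeline above can be sketched as follows. The bag-of-words vectorizer follows the description directly; for the network, the hidden activation, optimizer, and loss are not specified in the paper, so relu, adam, and categorical cross-entropy are assumptions:</p>

```python
from collections import Counter

def build_vocab(texts, n=10000):
    """Vocabulary of (at most) the n most frequent words in the training meta-data."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(n))}

def bow_vector(text, vocab):
    """Binary bag-of-words: a feature is activated if the image's
    meta-data (title, description, tags) contains the vocabulary word."""
    vec = [0] * len(vocab)
    for w in text.split():
        if w in vocab:
            vec[vocab[w]] = 1
    return vec

def build_model(n=10000):
    """FFNN with one 128-unit hidden layer, drop-out 0.5, and a softmax
    output, built with Keras [4]; training would use batch_size=20."""
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    model = Sequential([
        Dense(128, activation="relu", input_shape=(n,)),  # relu is an assumption
        Dropout(0.5),
        Dense(2, activation="softmax"),  # P(flood), P(not flood)
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```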
    </sec>
    <sec id="sec-4">
      <title>Visual-metadata-based Image Retrieval</title>
      <p>The method used for visual-based image retrieval, described in subsection 2.1, is utilized for this task. All features and supervised learners of the visual-based and metadata-based methods are reused.</p>
    </sec>
    <sec id="sec-5">
      <title>Visual-metadata-external-resource-based Image Retrieval</title>
      <p>We observe that the water texture and colour, and the spatial relation between a water area and its surrounding area, can lead to misclassification when using the proposed method with the limited visual features provided by the organizer. Besides, the content of the metadata and its associated image are not always synchronized with respect to the event they describe. Hence, we propose a domain-specific algorithm to overcome these obstacles.</p>
      <p>We utilize Faster R-CNN [6] with a pre-computed object detection model running on the TensorFlow platform [3] to generate a bag of words containing objects that are semantically related to flooded areas, especially in industrial, residential, commercial, and agricultural areas. Moreover, we use the location and time information described in the metadata to confirm whether a flood really happened by checking against weather databases that can be freely accessed on the Internet (e.g. Europe area1, America area2, and Australia area3). This task can be done by reusing the method introduced in [5].</p>
      <p>The former component, namely the syn-content (SC) model, is used to give higher weights to image-metadata pairs that share similar content. The latter component, namely the spatio-temporal (ST) model, is used to strengthen the accuracy of the meta-data-based model. These components are used to re-label samples and create the tuning data set in Stage 3.
1 https://www.eswd.eu/
2 http://www.weather.gov/, https://water.usgs.gov/floods/reports/
3 http://www.bom.gov.au/climate/data/</p>
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENTAL RESULTS</title>
      <p>
        We use the data set and evaluation metrics provided by the organizer [2]. All parameters used by our approach are set as follows:
• RUN 1: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Stage 1: we use the feature set {CED, EH, JC} and SVM (SVM-Type: eps-regression, SVM-Kernel: radial, cost: 1, gamma: 0.006944444, epsilon: 0.1) to create the SLs; the weights w_i are all set to 1/3, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) Stage 2: we divide the development set into m = 10 non-intersecting data sets, (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) Stage 3: k is set to 10, and a random forest (RF) is used to create the TL (type of random forest: regression, number of trees: 500, number of variables tried at each split: 48) with the JC feature as the input, and (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) Stage 4: we set w1 = 0.4 and w2 = 0.6.
• RUN 2: we perform 5-fold cross-validation on the development set and report the average scores over the 5 folds. For testing, we train the model on the full development set and use the trained model for prediction on the test images.
• RUN 3 and RUN 4: we use the pre-computed object detection model without changing any parameters [6][3]. We also reuse the methods from RUN 1 and RUN 2 with the same setup.
      </p>
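      <p>The RUN 1 weights instantiate the two weighted sums of Section 2 (Stage 1 with w_i = 1/3 and Stage 4 with w1 = 0.4, w2 = 0.6). A small sketch, where the function name and the example scores are our own illustrative assumptions:</p>

```python
def late_fusion(scores, weights):
    """Weighted late fusion: sum_i w_i * s_i, with the weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(weights, scores))

# Stage 1 (RUN 1): three feature-wise SVM scores (CED, EH, JC), w_i = 1/3.
mfc = late_fusion([0.9, 0.6, 0.3], [1 / 3, 1 / 3, 1 / 3])  # hypothetical scores
# Stage 4 (RUN 1): EL = 0.4 * TL + 0.6 * CL.
el = late_fusion([0.7, 0.5], [0.4, 0.6])                   # hypothetical TL, CL
```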
      <p>
        For runs 1 and 2, there is a big gap between the results on the development set and on the test set. Possible explanations are that (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) the data distribution of the development set is different from that of the test set, and/or (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) there are too many non-synchronized pairs of text and image content. In the current work, we did not deal with inflections such as “flood” and “flooded”, so the feature space is quite sparse. For run 3, the fusion of visual-based and metadata-based outputs improves the accuracy of flood image retrieval by around 3%. This shows that the two different approaches can compensate for each other's weaknesses. For run 4, there is a significant improvement in accuracy when using the SC model and ST information, whereas the visual-metadata-based method (e.g. run 3) reaches its limit.
      </p>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSIONS AND FUTURE WORKS</title>
      <p>We introduced a domain-based late-fusion method to retrieve flood images from social media. Although the results at this stage are acceptable, many things can still be improved, especially overcoming the non-synchronized content between images and their metadata, and finding suitable features and/or learners to distinguish water/flooded areas by their colour, texture, and spatial relation with surrounding areas.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Ewan Klein, and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Natural Language Processing with Python</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Bischke</surname>
          </string-name>
          , Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and
          <string-name>
            <given-names>Damian</given-names>
            <surname>Borth</surname>
          </string-name>
          .
          <source>The Multimedia Satellite Task at MediaEval</source>
          <year>2017</year>
          :
          <article-title>Emergency Response for Flooding Events</article-title>
          .
          <source>In Proc. of the MediaEval 2017 Workshop (Sept</source>
          .
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>2017</year>
          ). Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Xinlei</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Abhinav</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>An Implementation of Faster RCNN with Study for Region Sampling</article-title>
          .
          <source>arXiv preprint arXiv:1702.02138</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>François</given-names>
            <surname>Chollet</surname>
          </string-name>
          and others.
          <year>2015</year>
          . Keras. https://github.com/fchollet/keras.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Minh-Son</given-names>
            <surname>Dao</surname>
          </string-name>
          , Giulia Boato,
          <string-name>
            <given-names>Francesco G.B.</given-names>
            <surname>De Natale</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Truc-Vien</given-names>
            <surname>Nguyen</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Jointly Exploiting Visual and Non-visual Information for Event-related Social Media Retrieval</article-title>
          .
          <source>In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval (ICMR '13)</source>
          . ACM, New York, NY, USA,
          <fpage>159</fpage>
          -
          <lpage>166</lpage>
          . https://doi.org/10.1145/2461466.2461494
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , Kaiming He,
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems (NIPS).</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>