A Domain-based Late-Fusion for Disaster Image Retrieval from Social Media
ELEDIA@UTB and The 2017 Multimedia Satellite Task: Emergency Response for Flooding Events

Minh-Son Dao^1, Quang-Nhat-Minh Pham^2, Duc-Tien Dang-Nguyen^3
^1 ELEDIA@UTB Lab., Universiti Teknologi Brunei, Brunei
^2 FPT Technology Research Institute, FPT University, Hanoi, Vietnam
^3 Dublin City University, Dublin, Ireland
minh_son@utb.edu.bn, minhpqn2@fe.edu.vn, duc-tien.dang-nguyen@dcu.ie

ABSTRACT
We introduce a domain-specific, late-fusion algorithm to cope with the challenges raised in the MediaEval 2017 Multimedia Satellite Task. Several known techniques are integrated based on domain-specific criteria: late fusion, tuning, ensemble learning, object detection using deep learning, and temporal-spatial-based event confirmation. Experimental results show that the proposed algorithm can overcome the main challenges of properly discriminating water levels in different areas and of handling different types of flooding events.

1 INTRODUCTION
This paper presents a method built specifically for subtask 1 of the MediaEval 2017 Multimedia Satellite Task [2]. We propose an ensemble learning and tuning method for Disaster Image Retrieval from Social Media that copes with the restriction to the visual features and metadata provided by the task, and increases classification accuracy by using data from external resources. The methodology is described in Section 2, results are reported and discussed in Section 3, and conclusions and future work are stated in the last section.

2 METHODOLOGY
The following subsections discuss each run required by the organizers.

2.1 Visual-based Image Retrieval
We formalize the problem as an ensemble learning and tuning task, in which we use the visual features provided by the organizers for each image. These visual features are used with supervised learners to create classifiers. Our visual-based method includes three components: late fusion, tuning, and ensemble learning, designed as follows (a code sketch of the full pipeline, including the tuning-set construction of Section 2.1.1, is given at the end of this subsection):

• Stage 1: a set of classifiers, one per feature type, is created using supervised learners (SLs) whose outputs are reported in regression form (i.e. the output variable takes continuous values). Late fusion is applied to these outputs to form the multimodal feature combination (MFC): MFC = Σ_{i=1..N} (w_i * SL_i), where Σ_i w_i = 1.
• Stage 2: bagging is applied to these features and learners to increase the accuracy of the system. A multiple data set (MDS) is created by randomly dividing the training data into m non-intersecting folds DS_1, ..., DS_m. At the end of this stage, a combined classifier is created, namely the combined learner (CL).
• Stage 3: after a testing phase on these data sets, a new training data set for tuning is created and learned with a specific supervised learner and a certain subset of feature types. The output of this stage is called the tuning learner (TL). Subsection 2.1.1 describes how this tuning data set is established. The idea behind this stage is to apply boosting and bagging techniques to create a tuning classifier for samples that fall on the wrong side of the hyperplane of a previous classifier.
• Stage 4: the ensemble learner (EL) is created as EL = w1 * TL + w2 * CL, where w1 + w2 = 1.

2.1.1 Creating a tuning data set.
(1) Let PR_k = {(pred_i, gt_i)}_{i=1..N} be the output set obtained by applying the CL from Stage 2 to DS_k, where pred_i and gt_i denote the predicted value and the ground-truth label of the i-th image, respectively. PR_k is sorted in descending order of predicted value, so that the image with the largest predicted value is on top. The predicted label of the i-th image is then label_i = f(pred_i, threshold).
(2) For each i-th image whose ground-truth label gt_i and predicted label label_i do not match, collect its k nearest-neighbour images whose gt_j equals label_j (i.e. collect k true-positive and true-negative neighbours), taking k/2 samples from position (i-1) down to 1 and k/2 samples from position (i+1) up to N, respectively.
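To make the four stages and the tuning-set construction concrete, here is a minimal Python sketch of the procedure described above. It is only an illustration under stated assumptions: the 0.5 threshold used for label_i = f(pred_i, threshold), the scikit-learn-style predict() interface, and all function names (late_fusion, make_folds, build_tuning_set, ensemble_predict) are ours, not taken from the paper.

```python
import numpy as np

def late_fusion(learner_outputs, weights):
    """Stage 1: weighted late fusion of per-feature regressor outputs.
    MFC = sum_i w_i * SL_i, with sum_i w_i = 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * out for w, out in zip(weights, learner_outputs))

def make_folds(n_samples, m, seed=0):
    """Stage 2: split the training indices into m non-intersecting folds."""
    rng = np.random.RandomState(seed)
    return np.array_split(rng.permutation(n_samples), m)

def build_tuning_set(preds, gts, k):
    """Stage 3 (Sec. 2.1.1): for every misclassified image in the list sorted by
    predicted value, collect k correctly classified neighbours around it."""
    order = np.argsort(-preds)                  # descending by predicted value
    preds, gts = preds[order], gts[order]
    labels = (preds >= 0.5).astype(int)         # assumed threshold of 0.5
    wrong = np.where(labels != gts)[0]          # samples on the wrong side
    correct = np.where(labels == gts)[0]        # samples predicted correctly
    tuning_idx = []
    for i in wrong:
        above = correct[correct < i][-(k // 2):]   # k/2 neighbours towards position 1
        below = correct[correct > i][:k // 2]      # k/2 neighbours towards position N
        tuning_idx.extend([i, *above, *below])
    # map positions in the sorted list back to indices in the original set
    return order[np.unique(tuning_idx).astype(int)]

def ensemble_predict(tl, cl, X, w1=0.4, w2=0.6):
    """Stage 4: EL = w1 * TL + w2 * CL."""
    return w1 * tl.predict(X) + w2 * cl.predict(X)
```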
2.2 Meta-data-based Image Retrieval
We formalize image retrieval as a text categorization task, in which we extract textual features from the meta-data associated with each image. The textual features are used as basic features for training a feed-forward neural network (FFNN) with a single hidden layer. Our metadata-based classification method includes three components: pre-processing, feature extraction, and neural network training, described as follows (a code sketch follows the list):

(1) Pre-processing: we clean the text data by removing hyperlinks, image paths, image names, and all URLs, using regular expressions. After that, we perform word tokenization and remove all punctuation. For a user tag containing multiple words, we join the words in the tag into a phrase and treat that phrase as a single word. We use the nltk toolkit [1] for word tokenization.
(2) Textual feature extraction: for a text categorization task, bag-of-words features are basic and straightforward. We use bag-of-words features only and represent an image as an n-dimensional sparse vector, where n limits the number of words in the vocabulary extracted from the training data. A feature is activated if the meta-data of the image contains the corresponding word in the vocabulary. In our experiments we use n = 10,000, and we extract features from three meta-data attributes: the title, description, and tags of an image.
(3) Neural network architecture: we use a feed-forward neural network with one hidden layer of 128 units. The network is trained with batch size 20 and drop-out with coefficient 0.5. The output layer is a softmax layer, so that the network outputs the probability that an image is related to flooding or not. We adopted the Keras framework [4] to build the neural network.
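A minimal Python/Keras sketch of this metadata classifier is given below, using the stated hyper-parameters (vocabulary size 10,000, one 128-unit hidden layer, dropout 0.5, softmax output, batch size 20). The hidden-layer activation, optimizer, loss, number of epochs, and helper names (preprocess, to_bow, build_model) are our assumptions; the paper does not specify them.

```python
import re
import numpy as np
from nltk.tokenize import word_tokenize   # requires: nltk.download('punkt')
from keras.models import Sequential
from keras.layers import Dense, Dropout

VOCAB_SIZE = 10000  # n = 10,000 as in Sec. 2.2

def preprocess(text):
    """Remove URLs/paths, tokenize, and drop punctuation (Sec. 2.2, step 1)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalnum()]

def to_bow(tokens, vocab):
    """Binary bag-of-words: a feature is 1 if the word occurs in the metadata."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] = 1.0
    return vec

def build_model(input_dim=VOCAB_SIZE):
    """FFNN with one 128-unit hidden layer, dropout 0.5, softmax output."""
    model = Sequential()
    model.add(Dense(128, activation="relu", input_dim=input_dim))  # activation assumed
    model.add(Dropout(0.5))
    model.add(Dense(2, activation="softmax"))   # P(flood), P(not flood)
    model.compile(optimizer="adam",             # optimizer/loss are assumptions
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training (X: bag-of-words matrix, y: one-hot labels), batch size 20 as reported:
# model = build_model(); model.fit(X, y, batch_size=20, epochs=10)
```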
2.3 Visual-metadata-based Image Retrieval
The method used for visual-based image retrieval, described in subsection 2.1, is also utilized for this task. All features and supervised learners of the visual-based and metadata-based methods are reused.

2.4 Visual-metadata-external-resource-based Image Retrieval
We observe that the texture and colour of water, and the spatial relation between a water area and its surrounding area, can lead to misclassification when the proposed method is used with only the limited visual features provided by the organizers. Moreover, the metadata content and its associated image do not always agree on the meaning of an event. Hence, we propose a domain-specific algorithm to overcome these obstacles.
We utilize Faster R-CNN [6] with a pre-computed object detection model running on the TensorFlow platform [3] to generate a bag of words containing objects that are semantically related to flooded areas, especially in industrial, residential, commercial, and agricultural areas. Moreover, we use the location and time information described in the metadata to confirm whether a flood really happened by checking freely accessible weather databases on the Internet (e.g. for Europe https://www.eswd.eu/, for America http://www.weather.gov/ and https://water.usgs.gov/floods/reports/, and for Australia http://www.bom.gov.au/climate/data/). This can be done by reusing the method introduced in [5].
The former component, namely the syn-content model (SC), gives higher weights to image-metadata pairs that share similar content. The latter component, namely the spatio-temporal model (ST), strengthens the accuracy of the metadata-based model. These two components are used to re-label the data and create a tuning data set in Stage 3.

3 EXPERIMENTAL RESULTS
We use the data set and evaluation metrics provided by the organizers [2]. The parameters of our approach are set as follows (a sketch of the RUN 1 learner configuration is given after this list):

• RUN 1: (1) Stage 1: we use the feature set {CED, EH, JC} and SVM (SVM type: eps-regression, kernel: radial, cost: 1, gamma: 0.006944444, epsilon: 0.1) to create the SLs; the weights w_i are all set to 1/3. (2) Stage 2: we divide the development set into m = 10 non-intersecting data sets. (3) Stage 3: k is set to 10, and a random forest (RF) is used to create the TL (type: regression, number of trees: 500, number of variables tried at each split: 48) with the JC feature as input. (4) Stage 4: we set w1 = 0.4 and w2 = 0.6.
• RUN 2: we perform 5-fold cross-validation on the development set and report the average scores over the 5 folds. For testing, we train the model on the whole development set and use the trained model for prediction on the test images.
• RUN 3 and RUN 4: we use the pre-computed object detection model without changing any parameter [6][3], and we reuse the methods of RUN 1 and RUN 2 with the same setup.
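For illustration, the reported RUN 1 learner settings roughly map onto the following scikit-learn configuration. The reported parameter names (eps-regression, cost, number of variables tried at each split) suggest the R packages e1071 and randomForest, so this Python equivalent is our own approximation rather than the authors' actual setup.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Stage 1 supervised learners (one per visual feature: CED, EH, JC):
# eps-regression SVM with radial (RBF) kernel, cost 1, the reported gamma, epsilon 0.1.
svr = SVR(kernel="rbf", C=1.0, gamma=0.006944444, epsilon=0.1)

# Stage 3 tuning learner: regression random forest with 500 trees and
# 48 variables tried at each split (mtry in R terms), trained on the JC feature.
rf = RandomForestRegressor(n_estimators=500, max_features=48)

# Stage 4 ensemble weights as reported for RUN 1.
W1, W2 = 0.4, 0.6
```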
Table 1 shows the evaluation results on the development and test data.

Table 1: Evaluation Results on Development and Test Data

         Dev. Set           Test Set
Run      AP@480   MAP       AP@480   MAP
Run 1    82.17    88.84     77.62    87.87
Run 2    83.4     87.8      57.07    57.12
Run 3    85.78    92.86     85.41    90.39
Run 4    92.53    98.38     90.69    97.36

For runs 1 and 2, there is a large gap between the results on the development set and on the test set. Possible explanations are that (1) the data distribution of the development set differs from that of the test set, and/or (2) there is too much mismatch between the text and image contents. In the current work we did not handle inflections such as "flood" and "flooded", so the feature space is quite sparse. For run 3, fusing the visual-based and metadata-based outputs improves the accuracy of flood image retrieval by around 3%, which shows that the two approaches can compensate for each other's weaknesses. For run 4, there is a significant improvement in accuracy when the SC model and ST information are used, whereas the visual-metadata-based method (run 3) has reached its limit.

4 CONCLUSIONS AND FUTURE WORKS
We introduced a domain-based late-fusion method to retrieve flood images from social media. Although the results at this stage are acceptable, many things can still be improved, especially handling the mismatch between image content and metadata content, and finding suitable features and/or learners to distinguish water/flooded areas by their colour, texture, and spatial relation with surrounding areas.

REFERENCES
[1] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] Xinlei Chen and Abhinav Gupta. 2017. An Implementation of Faster RCNN with Study for Region Sampling. arXiv preprint arXiv:1702.02138 (2017).
[4] François Chollet and others. 2015. Keras. https://github.com/fchollet/keras. (2015).
[5] Minh-Son Dao, Giulia Boato, Francesco G.B. De Natale, and Truc-Vien Nguyen. 2013. Jointly Exploiting Visual and Non-visual Information for Event-related Social Media Retrieval. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval (ICMR '13). ACM, New York, NY, USA, 159-166. https://doi.org/10.1145/2461466.2461494
[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NIPS).