 Visual and textual analysis of social media and satellite images
 for flood detection @ multimedia satellite task MediaEval 2017
                   Konstantinos Avgerinakis1 , Anastasia Moumtzidou1 , Stelios Andreadis1 ,
              Emmanouil Michail1 , Ilias Gialampoukidis1 , Stefanos Vrochidis1 , Ioannis Kompatsiaris1
                          1 Centre for Research & Technology Hellas - Information Technologies Institute, Greece

                                                  koafgeri@iti.gr,moumtzid@iti.gr,andreadisst@iti.gr
                                               michem@iti.gr,heliasgj@iti.gr,stefanos@iti.gr,ikom@iti.gr

ABSTRACT
This paper presents the algorithms that the CERTH team deployed in order to tackle disaster recognition tasks, and more specifically Disaster Image Retrieval from Social Media (DIRSM) and Flood Detection in Satellite Images (FDSI). Visual and textual analysis, as well as late fusion of their similarity scores, were deployed on social media images, while colour analysis of the RGB and near-infrared channels of satellite images was performed in order to discriminate flooded from non-flooded images. A Deep Convolutional Neural Network (DCNN), DBpedia Spotlight and combMAX were employed to tackle DIRSM, while Mahalanobis distance-based classification and morphological post-processing were applied to deal with FDSI.

Figure 1: Block diagram of our multimodal retrieval system


1    INTRODUCTION
Security, surveillance and, more specifically, disaster prediction and classification from social media and satellite sources have raised a lot of interest in computer science over the last decade. The unobtrusive and abundant nature of these data has rendered them one of the most valuable sources for extracting early warnings or identifying an ongoing or imminent disaster.
   The Multimedia Satellite task is a MediaEval challenge that comprises two subtasks: (a) Disaster Image Retrieval from Social Media (DIRSM) and (b) Flood Detection in Satellite Images (FDSI). DIRSM provides a great amount of social media images (YFCC100M dataset) and their metadata (Flickr), while FDSI comprises a large amount of 4-channel satellite images, 3 channels for the RGB spectrum and 1 for the near-infrared, from PlanetLabs [5]. Both tasks ask the participants to leverage any available technology to determine whether a flood event occurs in the provided test data. As far as visual data are concerned, a flood event is considered when an image shows an "unexpected high water level in industrial, residential, commercial and agricultural areas". The reader is referred to [1] for further information about the contest and the provided data.
   In this work, CERTH presents its algorithms for the DIRSM and FDSI subtasks. For flood recognition in images, CERTH uses the output of the last pooling layer of a trained GoogleNet [4] as a global keyframe representation and trains an SVM classifier to recognize images that are related to a flooding event. Textual information is also retrieved by leveraging the metadata of the social media images with the DBpedia Spotlight annotation tool [2]. Both of these modalities are fused with a novel multimodal approach which combines non-linear graph-based fusion [3] with combMAX scoring. For the FDSI subtask, CERTH performs a Mahalanobis distance classification followed by several morphological and adaptive filters, so as to separate flood from non-flood areas inside the satellite image scene.

Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

2    APPROACH
2.1    Flood detection from social media (DIRSM)
Social media were crawled in this task so as to acquire images and text about flood scenarios. For this purpose, two modalities were deployed and fused with a non-linear graph-based fusion approach.
   The first modality concerned visual analysis, and more specifically flood detection inside image samples, by adopting a Deep Convolutional Neural Network (DCNN) framework. GoogleNet [4] was trained on 5055 ImageNet concepts, and the output of the last pooling layer, with dimension 1024, was used as a global keyframe representation. The provided development set was then split into two subsets and used to train an SVM classifier and define its optimal parameters: t (the kernel type) and g (the gamma in the kernel function). The best results were achieved for t = 1 (polynomial kernel) and g = 0.5. The test environment that CERTH built included the evaluation of the precomputed features provided by the Multimedia Satellite challenge (i.e. acc, gabor, fcth, jcd, cedd, eh, sc, cl, and tamura) and DCNN features that were produced from the Places205-GoogLeNet network by fusing the features from convolutional layers 3a and 3b. SVM classifiers were trained for all of these features and the results showed that the proposed DCNN feature significantly outperformed most of them.
   The second modality concerns the detection of flood-related text in social media metadata. For this purpose, DBpedia Spotlight [2]


was adapted so as to detect flood, water and related keyphrases that were acquired from the training set metadata (i.e. title, description, user tags). A disambiguation algorithm followed, comparing the aforementioned phrases with the collection using Jaccard similarities. The similarity scores of the two modalities were then combined with a late fusion approach that uses non-linear graph-based techniques (random walk, diffusion-based) in a weighted non-linear way [3]. The top-l multimodal objects are filtered with respect to textual concepts, leading to l × l similarity matrices S1, S2 and query-based l × 1 similarity vectors s1 and s2. More specifically, 10 positive examples were selected from the training set as queries, so as to acquire 10 ranked lists, and combMAX late fusion was used to obtain the final list of flood-relevant multimodal objects. The overall block diagram of this approach is depicted in Fig. 1.

Table 1: CERTH results in DIRSM task

   Modalities    single cutoff    several cutoffs
   Visual        78.82%           92.27%
   Textual       36.15%           39.90%
   Fusion        68.57%           83.37%

Table 2: CERTH results in FDSI task

   loc01     loc02     loc03     loc04     loc05     loc06     loc07
   81.71%    68.33%    82.08%    47.01%    45.84%    64.92%    56.27%
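The keyphrase matching and late fusion steps of Section 2.1 can be sketched in Python (a minimal illustration; the token sets and per-query score dictionaries are hypothetical stand-ins for the actual metadata and the 10 ranked lists, not the exact implementation):

```python
def jaccard(a, b):
    """Jaccard similarity between two keyphrase/token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def comb_max(score_lists):
    """combMAX late fusion: an item's fused score is the maximum
    similarity it obtains across the per-query ranked lists."""
    fused = {}
    for scores in score_lists:            # one {item: score} dict per query
        for item, s in scores.items():
            fused[item] = max(s, fused.get(item, float('-inf')))
    # Final ranked list: items sorted by fused score, best first.
    return sorted(fused.items(), key=lambda kv: -kv[1])
```

For instance, fusing the hypothetical lists {'a': 0.9, 'b': 0.2} and {'b': 0.8, 'c': 0.5} ranks a first (0.9), then b (0.8), then c (0.5).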

2.2    Flood detection from satellite images (FDSI)
Satellite images were collected from PlanetLabs [5] so that we could evaluate our localization algorithm in real-case scenarios. Localization is based on a Mahalanobis classification framework and post-processing morphological operations.
   Mahalanobis distances with stratified covariance estimates were computed to train our classifier, by randomly selecting 10000 samples (RGB and near-infrared pixels) from each of the 7 sets of satellite images, leading to a final population of 70000 samples. Linear, diagonal linear, quadratic and diagonal quadratic discriminant functions were also computed, but Mahalanobis distances achieved the highest classification results. For every image of the testing set, all pixels were extracted, creating a four-dimensional (R, G, B, NIR) testing set consisting of 102400 samples (320 × 320 pixels) per image. The final outcome was a binary mask that denoted 1 for flooded pixels and 0 for non-flooded ones.
   Post-processing was then deployed on the acquired binary masks, in order to eliminate erroneous areas that resulted from the noisy nature of the dataset. A global filter was initially deployed on the binary mask, so as to eliminate populations of flood-denoted pixels that, as a whole, did not surpass 5% of the image size. Similarly, a local filter followed, so as to eliminate connected components of flood-denoted areas that did not surpass a size of 10 pixels. Image dilation and erosion were finally applied around each pixel and its surrounding area (circular cell with a radius of 4 pixels) to eliminate small areas that were falsely denoted as flood, while simultaneously preserving the larger ones.

3    RESULTS AND ANALYSIS
Social media results for flood situations (DIRSM) are gathered in Table 1. Two retrieval approaches were used: (a) a single cutoff scheme that returns the top-480 most similar samples, and (b) a multiple cutoff scheme that combines the results from 4 different thresholds (50, 100, 250 and 480) by averaging their scores so as to produce a final list.
   It is obvious that multiple cutoffs worked better than a single one. Furthermore, we can observe that the visual modality surpassed the textual one by far; this is mainly attributed to the fact that some keywords related to flood and water may be found in several irrelevant contexts, leading text retrieval to very low accuracy rates. Fusion is also affected by the low performance of the textual modality, which cannot leverage or complement the visual information in the final deduction, leading to lower accuracy rates than the visual modality alone.
   Results for satellite images (FDSI) are presented in Table 2. The accuracy rates are quite diverse, as we acquired very high rates in some locations, such as loc01 and loc03, while others, such as loc04 and loc05, were too low. From our point of view, this is attributed to the colour nature of the data in these areas: in the former the separation of water was clear, while in the latter non-flood areas had a colour similar to the flooded ones. Furthermore, the groundtruth masks included some non-flood pixels as flood and, as our algorithm is pixel-wise, these were misclassified as positive samples, leading to poorly performing models. Overall, our classifier led to a 74.67% localization accuracy rate.

4    DISCUSSION AND OUTLOOK
The Multimedia Satellite challenge gave us the opportunity to test our algorithms in real-case disaster scenarios. Social media and satellite sources proved extremely valuable and helped us separate flood scenarios from others. The high average precision rate that the visual features achieved proves that the computer vision community can become ever more helpful in disaster detection, and it is now clear that it can surpass the ambiguity that text can introduce in the final decision. On the other hand, satellite images proved quite noisy and require deeper investigation in the future.
   As future work, we plan to adopt deeper techniques from the literature to recognize and discriminate places from each other, while we also plan to investigate hybrid representations that combine shallow with deep features so as to achieve even higher precision rates in the visual part of the system. Text approaches should undoubtedly be revised and tailored to disaster-related scenarios, while fusion approaches that consider “semantic filtering” stages based on textual concepts will also be revised. Regarding FDSI, we plan to build a shallow/deep representation scheme that will leverage both texture (i.e. LBP) and deep features so as to learn to separate flood from non-flood areas even more effectively.

ACKNOWLEDGMENTS
This work is supported by the beAWARE project, partially funded by the European Commission (H2020-700475).
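The FDSI pipeline of Section 2.2, pixel-wise Mahalanobis classification followed by morphological post-processing, can be sketched as follows (a minimal NumPy/SciPy illustration; the two-class setup, the function names and the exact ordering of the clean-up filters are our assumptions, not the paper's exact stratified-covariance implementation):

```python
import numpy as np
from scipy import ndimage

def fit_mahalanobis(X, y):
    """Per-class mean and inverse covariance from labelled pixel samples
    (X: n x 4 array of R, G, B, NIR values; y: 0/1 flood labels)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.linalg.inv(np.cov(Xc, rowvar=False)))
    return params

def classify(X, params):
    """Assign each pixel the class with the smallest Mahalanobis distance."""
    classes = sorted(params)
    dists = []
    for c in classes:
        mu, inv_cov = params[c]
        d = X - mu
        # Squared Mahalanobis distance for every row of X at once.
        dists.append(np.einsum('ij,jk,ik->i', d, inv_cov, d))
    return np.array(classes)[np.argmin(np.stack(dists), axis=0)]

def postprocess(mask, global_frac=0.05, min_cc=10, radius=4):
    """Clean a binary flood mask: global 5% filter, removal of connected
    components under 10 pixels, then erosion/dilation with a circular cell."""
    mask = mask.astype(bool)
    # Global filter: discard the mask if flood pixels cover < 5% of the image.
    if mask.sum() < global_frac * mask.size:
        return np.zeros_like(mask)
    # Local filter: drop connected components smaller than min_cc pixels.
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    keep = np.isin(labels, np.flatnonzero(sizes >= min_cc) + 1)
    # Circular structuring element; erosion then dilation removes small
    # false positives while preserving the larger flooded areas.
    yy, xx = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    selem = (yy ** 2 + xx ** 2) <= radius ** 2
    return ndimage.binary_dilation(ndimage.binary_erosion(keep, selem), selem)
```

Here `fit_mahalanobis` plays the role of the training step on the 70000 sampled pixels, while `postprocess` applies the 5% global filter, the 10-pixel local filter and the radius-4 circular dilation/erosion described in Section 2.2.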


REFERENCES
[1] Benjamin Bischke, Patrick Helber, Christian Schulze, Venkat Srinivasan, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[2] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013.
    Improving Efficiency and Accuracy in Multilingual Entity Extraction.
    In Proceedings of the 9th International Conference on Semantic Systems
    (I-Semantics).
[3] Ilias Gialampoukidis, Anastasia Moumtzidou, Dimitris Liparas,
    Theodora Tsikrika, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2017.
    Multimedia retrieval based on non-linear graph-based fusion and par-
    tial least squares regression. Multimedia Tools and Applications (2017),
    1–21.
[4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E.
    Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and
    Andrew Rabinovich. 2015. Going deeper with convolutions. In CVPR.
    IEEE Computer Society, 1–9. http://dblp.uni-trier.de/db/conf/cvpr/
    cvpr2015.html#SzegedyLJSRAEVR15
[5] Planet Team. 2017. Planet Application Program Interface: In Space for
    Life on Earth. (2017).