=Paper= {{Paper |id=Vol-2283/MediaEval_18_paper_46 |storemode=property |title=A Multimodal Approach in Estimating Road Passability Through a Flooded Area Using Social Media and Satellite Images |pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_46.pdf |volume=Vol-2283 |authors=Anastasia Moumtzidou,Panagiotis Giannakeris,Stelios Andreadis,Athanasios Mavropoulos,Georgios Meditskos,Ilias Gialampoukidis,Konstantinos Avgerinakis,Stefanos Vrochidis,Ioannis Kompatsiaris |dblpUrl=https://dblp.org/rec/conf/mediaeval/MoumtzidouGAMMG18 }} ==A Multimodal Approach in Estimating Road Passability Through a Flooded Area Using Social Media and Satellite Images== https://ceur-ws.org/Vol-2283/MediaEval_18_paper_46.pdf
A multimodal approach in estimating road passability through a
     flooded area using social media and satellite images
      Anastasia Moumtzidou1 , Panagiotis Giannakeris1 , Stelios Andreadis1 , Athanasios Mavropoulos1 ,
       Georgios Meditskos1 , Ilias Gialampoukidis1 , Konstantinos Avgerinakis1 , Stefanos Vrochidis1 ,
                                          Ioannis Kompatsiaris1
                                                              1 CERTH-ITI, Greece

                    {moumtzid,giannakeris,andreadisst,mavrathan,gmeditsk,heliasgj,koafger,stefanos,ikom}@iti.gr

ABSTRACT

This paper presents the algorithms that the CERTH-ITI team deployed to tackle flood detection and road passability estimation from social media and satellite data. Computer vision and deep learning techniques are combined to analyze social media and satellite images, while word2vec is used to analyze textual data. Multimodal fusion is also deployed in the CERTH-ITI framework, at both the early and the late stage, by combining deep representation features in the former and semantic logic in the latter, so as to provide a deeper and more meaningful understanding of flood events.

Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

1 INTRODUCTION

The high popularity of social media around the world and the large streams of satellite data that are openly available can be considered useful sources in the case of natural disasters, such as floods, hurricanes and fires. Several H2020 projects, such as beAWARE [2] and EOPEN [3], already apply their technologies to one or both of these kinds of sources to extract knowledge and assist civil protection agencies in monitoring a flood event and maintaining a holistic view of an area during an emergency.

Evidence and road passability recognition is performed on the “Flood classification for social multimedia” dataset of the Multimedia Satellite Task 2018, which contains a list of tweets with images from the three big hurricane events of 2017, while flood detection in satellite images from the same events is performed on the compiled satellite dataset [6]. The CERTH contribution involves the implementation of recent computer vision and deep learning techniques, which analyze social media and satellite images in order to identify evidence of flooded regions and perform road passability classification. CERTH also deploys deep learning algorithms for textual recognition, while low- and high-level fusion is performed on the acquired textual and visual data in order to leverage both contexts and obtain a more meaningful flood classification outcome.

Flood detection from social media images was also presented in our previous work [5], which took part in last year’s MediaEval Satellite task, and more recently in [7]. For textual classification a more recent approach is adopted, based on the word2vec [9] representation, which uses novel deep neural network (NN) architectures, namely the Continuous Bag-of-Words (CBOW) and Skip-gram models, to produce word embeddings (i.e., representations of words from a given vocabulary as vectors in a low-dimensional space). Semantic networks and lexical knowledge bases, such as WordNet [4] and ConceptNet [1], provide useful, multilingual representations of and interconnections among terms, named entities, concepts and relations. They can be used for word-sense disambiguation and context enrichment, creating graph-based semantic interpretations by linking candidate meanings, such as the outputs of visual analysis, with lexical resources, thus forming semantic signatures. Regarding the analysis of satellite images, our approach is based on applying a DCNN to satellite data, following e.g. [10], which uses Sentinel-1 imagery for oil spill identification.

2 APPROACH

2.1 Analyzing social media images

Two separate Deep Convolutional Neural Networks (DCNNs) were trained and evaluated in order to carry out the ’evidence’ and ’passability’ analysis levels respectively, and the VGG architecture [13] was adopted in both in order to extract deep features of the images in a holistic manner. The first model looks for evidence in the images and classifies them as relevant or non-relevant in the context of road passability. Thereafter, any images that pass the first check are fed to the second model, which classifies them as showing passable or non-passable roads.

During the learning phase, we initialized our models with the previously learned weights of a VGG architecture trained on the Places365 scene recognition dataset [15]. Furthermore, 5 splits of the MediaEval 2018 development set were made in order to perform cross-validation and select the best epoch at which to stop training. The best results were obtained after 6 epochs for the ’evidence’ model and 15 epochs for the ’passability’ model.
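To make the two-stage pipeline concrete, a minimal tf.keras sketch is given below; the checkpoint name, the pooling choice and the compile settings are illustrative assumptions rather than our exact training code.

```python
# Sketch of the two-stage cascade: an 'evidence' model filtering
# relevant images, followed by a 'passability' model. Checkpoint name
# and pooling choice are assumptions; only the cascade logic and the
# Places365 initialization follow the text above.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg_classifier():
    # VGG16 backbone with a 2-way softmax head; in our runs the weights
    # were initialized from a VGG model trained on Places365.
    base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                       pooling="avg",
                                       input_shape=(224, 224, 3))
    out = layers.Dense(2, activation="softmax")(base.output)
    model = models.Model(base.input, out)
    # model.load_weights("places365_vgg16.h5")  # hypothetical checkpoint
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

evidence_model = build_vgg_classifier()     # best epoch: 6
passability_model = build_vgg_classifier()  # best epoch: 15

def classify_image(x):
    """x: preprocessed image batch of shape (1, 224, 224, 3)."""
    if np.argmax(evidence_model.predict(x)) == 0:  # 0 = non-relevant
        return "non-relevant"
    passable = np.argmax(passability_model.predict(x)) == 1
    return "passable" if passable else "non-passable"
```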
2.2 Textual analysis of social media

The textual analysis initially involves preprocessing the given text by applying tokenization, stop word removal and word stemming, and then representing the text with the word2vec [9] method. Parameter selection was performed both for the vector dimension (i.e. 50, 100, 200, 300, 400, 500, 600) and for the window size (i.e. 2, 3, 4). Furthermore, we exploited a set of Twitter posts that had been collected within the scope of the beAWARE project.

Before finalizing the content of the corpus, we tried to filter out tweets that are irrelevant to actual flood events. First, we removed texts in which the keyword “flooding” is used metaphorically, by defining phrases often met on Twitter, e.g. “flooding my timeline”. Next, we removed all texts in which words of hateful communication or sexual intent appear, based on a list of dirty words available online. The final step of the textual analysis involves feeding the text feature vector as input to a classifier (i.e. SVM, Naïve Bayes or Random Forests), which is then tuned.
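A minimal sketch of this pipeline is shown below, assuming gensim for word2vec and scikit-learn for the classifier; the toy corpus, labels and tuning grid are placeholders, not the beAWARE data.

```python
# Sketch of the textual pipeline: preprocessing, per-tweet word2vec
# averaging, and a tunable classifier. Requires the NLTK 'punkt' and
# 'stopwords' data; corpus and labels below are toy placeholders.
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

stemmer, stops = PorterStemmer(), set(stopwords.words("english"))

def preprocess(text):
    # Tokenization, stop word removal and stemming.
    return [stemmer.stem(t) for t in word_tokenize(text.lower())
            if t.isalpha() and t not in stops]

tweets = ["Flooded road near the bridge, cars cannot pass",
          "Sunny day, flooding my timeline with photos",
          "Water level rising fast, street is impassable",
          "The river looks calm again this morning"]
labels = np.array([1, 0, 1, 0])  # 1 = relevant to road passability
tokens = [preprocess(t) for t in tweets]

# One (dimension, window) setting from the grid explored above.
w2v = Word2Vec(tokens, vector_size=300, window=3, min_count=1)

def tweet_vector(toks):
    vecs = [w2v.wv[t] for t in toks if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

X = np.stack([tweet_vector(t) for t in tokens])
clf = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=2).fit(X, labels)
```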


2.3 Early fusion of visual and textual features

A multimodal analysis approach is also explored here, by combining the information provided in the text with that of the accompanying images of social media tweets.

A novel scheme was designed in order to fuse deep CNN visual features with text features and produce a single feature vector per tweet, which is then used for classification. For the extraction of the feature vector from the text, we followed the same procedure as described in the previous section. As far as the visual analysis is concerned, the activations of the last fully connected layer of the VGG network were chosen as the feature extractor; the feature map there is a 4096-dimensional vector. Our scheme closely follows the bi-modal stacked autoencoder of [14], with the addition of an extra fully connected layer attached to the DCNN framework, used for classifying the fused feature vector.

tanh activations are used for all the hidden layers, while linear activations are used for the output reconstruction layers so that the network is able to accurately reconstruct input features of arbitrary range. Stochastic gradient descent (SGD) is used for the optimization, with a learning rate of 0.01 and momentum equal to 0.9. A separate model was trained for 5000 epochs.
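The sketch below illustrates this fusion network in tf.keras under stated assumptions: the hidden-layer widths are illustrative, since the description above fixes only the activations, the optimizer settings and the extra classification layer.

```python
# Sketch of the bi-modal stacked autoencoder used for early fusion,
# following [14]: tanh hidden layers, linear reconstruction outputs,
# and an extra softmax layer on the fused code for classification.
# Hidden-layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

visual_in = layers.Input(shape=(4096,))  # last VGG fully connected layer
text_in = layers.Input(shape=(300,))     # word2vec tweet vector

# Modality-specific encoders, then a shared fused representation.
v = layers.Dense(1024, activation="tanh")(visual_in)
t = layers.Dense(128, activation="tanh")(text_in)
fused = layers.Dense(256, activation="tanh")(layers.concatenate([v, t]))

# Linear reconstruction heads so inputs of arbitrary range can be rebuilt.
v_rec = layers.Dense(4096, activation="linear", name="v_rec")(
    layers.Dense(1024, activation="tanh")(fused))
t_rec = layers.Dense(300, activation="linear", name="t_rec")(
    layers.Dense(128, activation="tanh")(fused))

# Extra fully connected layer classifying the fused feature vector.
cls = layers.Dense(2, activation="softmax", name="cls")(fused)

model = models.Model([visual_in, text_in], [v_rec, t_rec, cls])
model.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss={"v_rec": "mse", "t_rec": "mse",
          "cls": "sparse_categorical_crossentropy"})
# model.fit([X_vis, X_txt], [X_vis, X_txt, y], epochs=5000)
```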
2.4 Flood detection and semantic enrichment

The semantic event fusion is based on the concepts annotating the images. Specifically, each image is annotated with concepts from a predefined pool of 345 concepts (the TRECVID SIN concepts), and each concept is accompanied by a score that indicates the probability that it appears in the image. To obtain these scores, we used a DCNN that was trained according to the 22-layer GoogLeNet architecture on the ImageNet 2011 dataset with 5055 categories. Then, we fine-tuned the network on the 345 concepts by using the extension strategy proposed in [12].

In an effort to semantically enrich the context of the predefined concept pool, we mapped each concept to WordNet and ConceptNet resources. For each WordNet term t_w, we create a vector with the synsets that belong to the hierarchy of hypernyms of t_w (up to the third level). For each ConceptNet term t_c, the pertinent vectors contain all the terms in the knowledge graph that are considered relevant to t_c with a plausibility score above 80%. These vectors are then used to semantically enrich the annotations derived from the visual analysis, by adding semantically relevant concepts.
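A sketch of how such enrichment vectors could be built is given below, assuming NLTK’s WordNet interface and the public ConceptNet REST API; mapping the 80% plausibility score to the 0.8 weight cutoff, and the endpoint details, are assumptions.

```python
# Sketch of the concept-enrichment step: hypernyms up to the third
# level from WordNet, plus related ConceptNet terms above a relevance
# threshold. The /related endpoint and the 0.8 cutoff are assumptions.
import requests
from nltk.corpus import wordnet as wn  # requires NLTK 'wordnet' data

def wordnet_vector(term, max_depth=3):
    """Synsets in the hypernym hierarchy of `term`, up to depth 3."""
    expansion = set()
    for synset in wn.synsets(term):
        frontier = [synset]
        for _ in range(max_depth):
            frontier = [h for s in frontier for h in s.hypernyms()]
            expansion.update(s.name() for s in frontier)
    return expansion

def conceptnet_vector(term, threshold=0.8):
    """Related ConceptNet terms whose weight exceeds the threshold."""
    url = f"http://api.conceptnet.io/related/c/en/{term}?limit=50"
    related = requests.get(url).json().get("related", [])
    return {r["@id"].split("/")[-1] for r in related
            if r.get("weight", 0) > threshold}

# Enrich an image annotation such as the TRECVID SIN concept 'river'.
enriched = wordnet_vector("river") | conceptnet_vector("river")
```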
                                                                           opportunity to test and enhance its algorithms in computer vision,
                                                                           textual analysis and semantic fusion in realistic datasets. The results
2.5    Road passability from satellite images                              of these challenge highlight to us that DCNN can provide very
In order to classify satellite images to the class “road passability” we   meaningful results for flood detection in both tasks and especially
built models by using a pretrained ResNet-50 [8] DCNN. ResNet-             when visual context is taken under consideration. We plan to deploy
50 uses residual functions to help add considerable stability to           more sophisticated fusion techniques that will be able to leverage
deep networks, and its input are 224x224 images. Then we fine-             the low level (text, visual) information in a more efficient way.
tuned it by removing the last pooling layer and attached a new
pooling layer with a softmax activation function with size 2. The
NN was trained on 1000 images and validated on the remaining
                                                                           ACKNOWLEDGMENTS
437 images. It should be noted that several experiments were run           This work was supported by EC-funded projects H2020-700475-
in order to find the best performing model. The parameters that            beAWARE and H2020-776019-EOPEN.
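As a reference for this setup, a minimal tf.keras version is sketched below; the global-average pooling choice and the data-loading stubs are assumptions, while the optimizer, learning rate, batch size, epochs and loss follow the values reported above.

```python
# Sketch of the satellite model: pretrained ResNet-50 with a new
# pooling layer and a 2-way softmax head. The pooling type and the
# data pipeline are assumptions; the hyperparameters follow the best
# configuration reported above.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

base = tf.keras.applications.ResNet50(weights="imagenet",
                                      include_top=False,
                                      input_shape=(224, 224, 3))
x = layers.GlobalAveragePooling2D()(base.output)  # new pooling layer
out = layers.Dense(2, activation="softmax")(x)    # passable / not
model = models.Model(base.input, out)

model.compile(optimizer=optimizers.SGD(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Trained on 1000 images, validated on the remaining 437, e.g.:
# model.fit(train_x, train_y, validation_data=(val_x, val_y),
#           batch_size=10, epochs=15)
```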
Table 1: Evaluation Results

Run submissions                                     Averaged F1-Score (%)

Flood classification for social multimedia
  Visual                                            66.65
  Textual                                           30.17
  Early fusion                                      66.43
  Semantic Enrichment (ConceptNet)                  55.12
  Semantic Enrichment (WordNet)                     54.48

Road passability estimation from satellite images
  Visual                                            56.45

3 RESULTS AND ANALYSIS

The evaluation results of our runs are summarized in Table 1. It should be noted that our system was tested using embeddings based not only on the word2vec model (predictive) but also on the GloVe [11] model (count-based). The latter, while it performed quite well on our problem, did not manage to outclass the word2vec results. The most probable reason is that, in order to perform optimally, GloVe needs to be trained on more data than what is available in our dataset. The small size of the texts is also an issue for the text classification part. As far as the visual analysis is concerned, it is safe to assume that some of the most difficult samples were pictures that did not contain any civilians or vehicles inside flooded roads, as features of large water bodies are not informative enough to provide information about water depth. At a higher level, we can see that the visual component surpassed all the others, including the early and late fusion of the low-level data. This suggests that the visual indications provide more meaningful and less ambiguous information than the text that accompanies a tweet.

4 DISCUSSION AND OUTLOOK

Our participation in the Multimedia Satellite Task gave CERTH the opportunity to test and enhance its algorithms for computer vision, textual analysis and semantic fusion on realistic datasets. The results of this challenge highlight that DCNNs can provide very meaningful results for flood detection in both tasks, especially when the visual context is taken into consideration. We plan to deploy more sophisticated fusion techniques that will be able to leverage the low-level (text, visual) information in a more efficient way.

ACKNOWLEDGMENTS

This work was supported by the EC-funded projects H2020-700475-beAWARE and H2020-776019-EOPEN.
REFERENCES
 [1] ConceptNet - An open, multilingual knowledge graph.
     http://conceptnet.io.
 [2] H2020, beAWARE project. https://beaware-project.eu/.
 [3] H2020, EOPEN project. https://eopen-project.eu/.
 [4] Princeton WordNet 3.1. http://wordnet-rdf.princeton.edu.
 [5] Konstantinos Avgerinakis, Anastasia Moumtzidou, Stelios Andreadis,
     Emmanouil Michail, Ilias Gialampoukidis, Stefanos Vrochidis, and
     Ioannis Kompatsiaris. 2017. Visual and textual analysis of social
     media and satellite images for flood detection @ multimedia satellite
     task MediaEval 2017. In Working Notes Proceedings of the MediaEval
     2017 Workshop, Dublin, Ireland, September 13-15.
 [6] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and
     Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018:
     Emergency Response for Flooding Events. In Proc. of the MediaEval
     2018 Workshop (Oct. 29-31, 2018). Sophia Antipolis, France.
 [7] Panagiotis Giannakeris, Konstantinos Avgerinakis, Anastasios
     Karakostas, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2018. People
     and vehicles in danger: A fire and flood detection system in social
     media. In IEEE Image, Video, and Multidimensional Signal Processing
     (IVMSP) Workshop.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     residual learning for image recognition. In Proceedings of the IEEE
     conference on computer vision and pattern recognition. 770–778.
 [9] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff
     Dean. 2013. Distributed representations of words and phrases and
     their compositionality. In Advances in neural information processing
     systems. 3111–3119.
[10] Georgios Orfanidis, Konstantinos Ioannidis, Konstantinos Avgerinakis,
     Stefanos Vrochidis, and Ioannis Kompatsiaris. 2018. A Deep Neural
     Network for Oil Spill Semantic Segmentation in SAR Images. In ICIP.
     IEEE, 3773–3777.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning.
     2014. GloVe: Global Vectors for Word Representation. In Empirical
     Methods in Natural Language Processing (EMNLP). 1532–1543. http:
     //www.aclweb.org/anthology/D14-1162
[12] Nikiforos Pittaras, Foteini Markatopoulou, Vasileios Mezaris, and Ioan-
     nis Patras. 2017. Comparison of fine-tuning and extension strategies
     for deep convolutional neural networks. In International Conference
     on Multimedia Modeling. Springer, 102–114.
[13] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-
     lutional networks for large-scale image recognition. arXiv preprint
     arXiv:1409.1556 (2014).
[14] Pengfei Zhang, Xiaoping Ma, Wenyu Zhang, Shaowei Lin, Huilin
     Chen, Arthur Lee Yirun, and Gaoxi Xiao. 2015. Multimodal fusion
     for sensor data using stacked autoencoders. In Intelligent Sensors,
     Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth
     International Conference on. IEEE, 1–2.
[15] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio
     Torralba. 2017. Places: A 10 million Image Database for Scene Recog-
     nition. IEEE Transactions on Pattern Analysis and Machine Intelligence
     (2017).