=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_46
|storemode=property
|title=A Multimodal Approach in Estimating Road Passability Through a Flooded Area Using Social Media and Satellite Images
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_46.pdf
|volume=Vol-2283
|authors=Anastasia Moumtzidou,Panagiotis Giannakeris,Stelios Andreadis,Athanasios Mavropoulos,Georgios Meditskos,Ilias Gialampoukidis,Konstantinos Avgerinakis,Stefanos Vrochidis,Ioannis Kompatsiaris
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MoumtzidouGAMMG18
}}
==A Multimodal Approach in Estimating Road Passability Through a Flooded Area Using Social Media and Satellite Images==
Anastasia Moumtzidou, Panagiotis Giannakeris, Stelios Andreadis, Athanasios Mavropoulos, Georgios Meditskos, Ilias Gialampoukidis, Konstantinos Avgerinakis, Stefanos Vrochidis, Ioannis Kompatsiaris
CERTH-ITI, Greece
{moumtzid,giannakeris,andreadisst,mavrathan,gmeditsk,heliasgj,koafger,stefanos,ikom}@iti.gr

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT

This paper presents the algorithms that the CERTH-ITI team deployed to tackle flood detection and road passability estimation from social media and satellite data. Computer vision and deep learning techniques are combined to analyze social media and satellite images, while word2vec is used to analyze textual data. Multimodal fusion is also deployed in the CERTH-ITI framework, both at an early and a late stage, by combining deep representation features in the former and semantic logic in the latter, so as to provide a deeper and more meaningful understanding of flood events.

1 INTRODUCTION

The high popularity of social media around the world and the large streams of openly available satellite data can be considered useful sources in the case of natural disasters, such as floods, hurricanes and fires. Several H2020 projects, such as beAWARE [2] and EOPEN [3], already apply their technologies to one or both of these kinds of sources to extract knowledge and assist civil protection agencies in monitoring a flood event and maintaining a holistic view of an area during an emergency.

Evidence and road passability recognition is performed on the "Flood classification for social multimedia" dataset of the Multimedia Satellite Task 2018, which contains a list of tweets with images for the three big hurricane events of 2017, while flood detection in satellite images from the same events is performed on the compiled satellite dataset [6].

The CERTH contribution involves the implementation of recent computer vision and deep learning techniques, which analyze social media and satellite images to identify evidence of flooded regions and perform road passability classification. CERTH also deploys deep learning algorithms for textual recognition, while low- and high-level fusion is performed on the acquired textual and visual data in order to leverage both contexts and obtain a more meaningful flood classification outcome.

Flood detection from social media images has also been presented in our previous work [5], which took part in last year's MediaEval Satellite Task competition, and more recently in [7]. For textual classification a more recent approach is adopted, based on the word2vec [9] representation, which includes the use of novel architectures and models for producing word embeddings (i.e. representations of words from a given vocabulary as vectors in a low-dimensional space) based on deep neural networks (NN), namely the Continuous Bag-of-Words (CBOW) and the Skip-gram models. Semantic networks and lexical knowledge bases, such as WordNet [4] and ConceptNet [1], provide useful, multilingual representations and interconnections among terms, named entities, concepts, and relations. They can be used for word-sense disambiguation and context enrichment, creating graph-based semantic interpretations by linking candidate meanings, such as the outputs of visual analysis, with lexical resources, thereby creating semantic signatures. Regarding the analysis of satellite images, our approach is based on applying DCNNs to satellite data, following e.g. [10], which uses Sentinel-1 imagery for oil spill identification.

2 APPROACH

2.1 Analyzing social media images

Two separate Deep Convolutional Neural Networks (DCNN) were trained and evaluated in order to carry out each of the 'evidence' and 'passability' analysis levels, while the VGG architecture [13] was adopted in both of them for extracting deep features of images in a holistic manner. The first model seeks evidence in the images and classifies them as relevant or non-relevant in the context of road passability. Thereafter, any images that pass the first check are fed to the second model, which classifies them as showing passable or non-passable roads.

During the learning phase, we initialized our models with the previously learned weights of a VGG architecture acquired from the Places365 scene recognition dataset [15]. Furthermore, 5 splits of the MediaEval 2018 development set were made so as to perform cross-validation and select the best epoch at which to stop training our models. The best parameters were 6 epochs for the 'evidence' model and 15 epochs for the 'passability' model.
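As an illustration of the cascade described above, the following is a minimal sketch in Keras/TensorFlow. ImageNet weights stand in for the Places365 initialization used in the paper, and all layer sizes, names and hyperparameters are illustrative rather than the authors' exact configuration.

```python
# Minimal sketch of the two-stage 'evidence' -> 'passability' cascade.
# Assumptions: Keras/TensorFlow, ImageNet weights instead of Places365,
# and illustrative layer sizes; not the authors' exact setup.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

def build_binary_vgg():
    """VGG backbone with a fresh 2-way softmax head."""
    base = VGG16(weights="imagenet", include_top=False,
                 input_shape=(224, 224, 3))
    x = Flatten()(base.output)
    x = Dense(4096, activation="relu")(x)
    out = Dense(2, activation="softmax")(x)
    model = Model(base.input, out)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

evidence_model = build_binary_vgg()     # relevant vs. non-relevant
passability_model = build_binary_vgg()  # passable vs. non-passable

def classify(images):
    """Cascade inference on a preprocessed (N, 224, 224, 3) float array:
    only images that show road-passability evidence reach stage two."""
    has_evidence = evidence_model.predict(images).argmax(axis=1) == 1
    results = np.full(len(images), -1)          # -1: no evidence found
    if has_evidence.any():
        passable = passability_model.predict(images[has_evidence]).argmax(axis=1)
        results[has_evidence] = passable        # 0/1: non-passable/passable
    return results
```

One consequence of the cascade design is that the passability model is only ever applied to images for which evidence was detected, mirroring the two analysis levels of the task.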
2.2 Textual analysis of social media

The textual analysis initially involves a preprocessing step applied to the given text, consisting of tokenization, stop word removal, and word stemming, followed by text representation using the word2vec [9] method. Parameter selection was deployed both for the vector dimension (i.e. 50, 100, 200, 300, 400, 500, 600) and the window size (i.e. 2, 3, 4). Furthermore, we exploited a set of Twitter posts that had been collected within the scope of the beAWARE project.

Before finalizing the content of the corpus, we filtered out tweets that are irrelevant to actual flood events. First, we removed texts in which the keyword "flooding" is used metaphorically, by defining phrases often met on Twitter, e.g. "flooding my timeline". Next, we removed all texts in which words of hateful communication or sexual intent appear, based on a list of dirty words available online. The final step of the textual analysis involves serving the text feature vector as input to a classifier (i.e. SVM, Naïve Bayes or Random Forests), which is then tuned.
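A minimal sketch of this pipeline, assuming gensim (4.x API) for word2vec and scikit-learn for the classifier; the toy corpus, stop-word list and hyperparameters are illustrative only:

```python
# Minimal sketch of the text pipeline: preprocess -> word2vec -> SVM.
# Assumptions: gensim 4.x and scikit-learn; toy corpus and labels.
import numpy as np
from gensim.models import Word2Vec
from nltk.stem import PorterStemmer
from sklearn.svm import SVC

STOPWORDS = {"the", "a", "an", "is", "my", "in", "of", "and", "to"}  # toy list
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, remove stop words, stem."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]

tweets = ["Flooded road near the river, cars stuck",
          "Water flooding the highway, do not drive",
          "This song is flooding my timeline"]
labels = [1, 1, 0]  # 1: flood-relevant, 0: metaphorical use (filtered out)

corpus = [preprocess(t) for t in tweets]
# vector_size and window were grid-searched in the paper (50..600, 2..4)
w2v = Word2Vec(corpus, vector_size=100, window=3, min_count=1)

def embed(tokens):
    """Represent a tweet as the mean of its word vectors."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.stack([embed(c) for c in corpus])
clf = SVC().fit(X, labels)   # the classifier choice/tuning is grid-searched
```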
2.3 Early fusion of visual and textual features

A multimodal analysis approach is also explored here, by combining the information provided in the text and the accompanying images of social media tweets. A novel scheme was designed in order to fuse deep CNN visual features and text features and produce a single feature per tweet, which is then used for classification purposes. For the extraction of the feature vector from the text, we followed the same procedure as described in the previous section. As for the visual analysis, the activations of the last fully connected layer of the VGG network were chosen as the feature extractor; the feature map there is a 4096-dimensional vector.

Our scheme closely follows the bi-modal stacked autoencoder of [14], but with the addition of an extra fully connected layer attached to the DCNN framework, used for classifying the fused feature vector. For all the hidden layers tanh activations are used, and for the output reconstruction layers linear activations are used, so that the network is able to accurately reconstruct input features of arbitrary range. Stochastic gradient descent (SGD) is used for the optimization, with a learning rate of 0.01 and momentum equal to 0.9. A separate model was trained for 5000 epochs.
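A minimal sketch of such a fusion network in Keras/TensorFlow; the encoder sizes and the 300-dimensional text input are assumptions, not the authors' exact configuration:

```python
# Minimal sketch of a bi-modal autoencoder with an extra classification
# head, in the spirit of [14]. Assumptions: Keras/TensorFlow, illustrative
# hidden sizes, 300-d text vectors.
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

visual_in = Input(shape=(4096,))   # VGG fully connected activations (2.3)
text_in = Input(shape=(300,))      # word2vec text vector (2.2)

# Per-modality encoders with tanh activations, as in the paper
v = Dense(1024, activation="tanh")(visual_in)
t = Dense(128, activation="tanh")(text_in)
fused = Dense(256, activation="tanh")(Concatenate()([v, t]))  # shared code

# Linear reconstruction heads, so arbitrary-range inputs can be recovered
visual_rec = Dense(4096, activation="linear", name="visual_rec")(fused)
text_rec = Dense(300, activation="linear", name="text_rec")(fused)
# Extra fully connected layer that classifies the fused feature
label = Dense(2, activation="softmax", name="label")(fused)

model = Model([visual_in, text_in], [visual_rec, text_rec, label])
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss={"visual_rec": "mse", "text_rec": "mse",
                    "label": "sparse_categorical_crossentropy"})
# model.fit([X_vis, X_txt], [X_vis, X_txt, y], epochs=5000)
```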
2.4 Flood detection and semantic enrichment

The semantic event fusion is based on the use of concepts annotating the images. Specifically, each image is annotated with concepts from a predefined pool of 345 concepts (TRECVID SIN concepts), and each concept is accompanied by a score that indicates the probability that it appears in the image. To obtain such scores, we used a DCNN that was trained according to the 22-layer GoogLeNet architecture on the ImageNet 2011 dataset for 5055 categories. Then, we fine-tuned the network on the 345 concepts by using the extension strategy proposed in [12].

In an effort to semantically enrich the context of the predefined concept pool, we mapped each concept to WordNet and ConceptNet resources. For each WordNet term tw, we create a vector with the synsets that belong to the hierarchy of hypernyms of tw (up to the third level). For each ConceptNet term tc, the pertinent vectors contain all the terms in the knowledge graph that are considered relevant to tc with a plausibility score above 80%. These vectors are then used to semantically enrich the annotations derived by visual analysis, by adding semantically relevant concepts (a sketch of this enrichment step is given after Section 2.5).

2.5 Road passability from satellite images

In order to classify satellite images with respect to road passability, we built models using a pretrained ResNet-50 [8] DCNN. ResNet-50 uses residual functions that add considerable stability to deep networks, and its input is 224x224 images. We then fine-tuned it by removing the last pooling layer and attaching a new pooling layer followed by a softmax output of size 2. The network was trained on 1000 images and validated on the remaining 437 images.

Several experiments were run in order to find the best-performing model. The parameters that were tuned were the learning rate, the batch size and the optimizer function. The number of epochs was set to 15 and the loss function considered was the sparse categorical cross-entropy. To evaluate the performance of the different networks we considered accuracy as the evaluation metric, and the results showed that the parameters of the best-performing network were the following: SGD as the optimizer function, 0.001 as the learning rate and 10 as the batch size.
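A minimal sketch of the fine-tuning setup described above, assuming Keras/TensorFlow; data loading is omitted, and the use of GlobalAveragePooling2D is one plausible reading of the paper's "new pooling layer":

```python
# Minimal sketch of ResNet-50 fine-tuning for 2-class road passability.
# Assumptions: Keras/TensorFlow; GlobalAveragePooling2D as the new pooling
# layer; data loading omitted.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3))
pooled = GlobalAveragePooling2D()(base.output)  # replaces original pooling
out = Dense(2, activation="softmax")(pooled)    # passable / non-passable
model = Model(base.input, out)

# Best configuration reported in the paper: SGD, lr 0.001, batch size 10
model.compile(optimizer=SGD(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=15, batch_size=10,
#           validation_data=(X_val, y_val))
```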
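Returning to the semantic enrichment of Section 2.4, the following is a minimal sketch of the WordNet hypernym expansion, assuming NLTK with the WordNet corpus installed; the ConceptNet side would query its public API and is omitted here:

```python
# Minimal sketch of the WordNet enrichment of Section 2.4: for a term,
# collect the synsets in its hypernym hierarchy up to the third level.
# Assumption: NLTK with the WordNet corpus (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def hypernym_vector(term, max_depth=3):
    """Return the set of ancestor synsets of every sense of `term`,
    walking the hypernym hierarchy up to `max_depth` levels."""
    enriched = set()
    for synset in wn.synsets(term):
        frontier = [synset]
        for _ in range(max_depth):
            frontier = [h for s in frontier for h in s.hypernyms()]
            enriched.update(frontier)
    return enriched

# e.g. hypernym_vector("flood") collects ancestor synsets for all senses
# of "flood", which are then added as semantically relevant concepts.
```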
Table 1: Evaluation Results

| Run submission | Averaged F1-Score (%) |
|---|---|
| Flood classification for social multimedia | |
| Visual | 66.65 |
| Textual | 30.17 |
| Early fusion | 66.43 |
| Semantic Enrichment (ConceptNet) | 55.12 |
| Semantic Enrichment (WordNet) | 54.48 |
| Road passability estimation from satellite images | |
| Visual | 56.45 |

3 RESULTS AND ANALYSIS

The results of our run submissions are shown in Table 1. It should be noted that our system was tested using embeddings based not only on the word2vec model (predictive), but also on the GloVe [11] model (count-based). The latter, while it performed quite well on our problem, did not manage to outperform the word2vec results. The most probable reason is that, in order to perform optimally, GloVe needs to be trained on more data than is available in our dataset. The short length of the texts is also an issue for the text classification part. As for the visual analysis, it is safe to assume that some of the most difficult samples were pictures that did not contain any civilians or vehicles on flooded roads, as features of large water bodies are not informative enough to provide information about water depth. At a higher level, we can see that the visual component outperformed all the others, including the early and late fusion of the low-level data. This implies that visual indications can provide more meaningful and less ambiguous information than the text that accompanies images on Twitter.

4 DISCUSSION AND OUTLOOK

Our participation in the Multimedia Satellite Task gave CERTH the opportunity to test and enhance its algorithms for computer vision, textual analysis and semantic fusion on realistic datasets. The results of this challenge highlight that DCNNs can provide very meaningful results for flood detection in both tasks, especially when the visual context is taken into consideration. We plan to deploy more sophisticated fusion techniques that will be able to leverage the low-level (text, visual) information in a more efficient way.

ACKNOWLEDGMENTS

This work was supported by the EC-funded projects H2020-700475-beAWARE and H2020-776019-EOPEN.

REFERENCES

[1] ConceptNet - An open, multilingual knowledge graph. http://conceptnet.io.
[2] H2020 beAWARE project. https://beaware-project.eu/.
[3] H2020 EOPEN project. https://eopen-project.eu/.
[4] Princeton WordNet 3.1. http://wordnet-rdf.princeton.edu.
[5] Konstantinos Avgerinakis, Anastasia Moumtzidou, Stelios Andreadis, Emmanouil Michail, Ilias Gialampoukidis, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2017. Visual and textual analysis of social media and satellite images for flood detection @ multimedia satellite task MediaEval 2017. In Working Notes Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, 13–15.
[6] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2018 Workshop (Oct. 29-31, 2018), Sophia Antipolis, France.
[7] Panagiotis Giannakeris, Konstantinos Avgerinakis, Anastasios Karakostas, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2018. People and vehicles in danger - A fire and flood detection system in social media. In IEEE Image, Video, and Multidimensional Signal Processing (IVMSP) Workshop.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
[9] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
[10] Georgios Orfanidis, Konstantinos Ioannidis, Konstantinos Avgerinakis, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2018. A Deep Neural Network for Oil Spill Semantic Segmentation in SAR Images. In ICIP. IEEE, 3773–3777.
[11] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. http://www.aclweb.org/anthology/D14-1162
[12] Nikiforos Pittaras, Foteini Markatopoulou, Vasileios Mezaris, and Ioannis Patras. 2017. Comparison of fine-tuning and extension strategies for deep convolutional neural networks. In International Conference on Multimedia Modeling. Springer, 102–114.
[13] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[14] Pengfei Zhang, Xiaoping Ma, Wenyu Zhang, Shaowei Lin, Huilin Chen, Arthur Lee Yirun, and Gaoxi Xiao. 2015. Multimodal fusion for sensor data using stacked autoencoders. In 2015 IEEE Tenth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP). IEEE, 1–2.
[15] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.