=Paper= {{Paper |id=Vol-2670/MediaEval_19_paper_52 |storemode=property |title=Multi-Modal Machine Learning for Floods Detection in News, Social Media and Satellite Sequences |pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_52.pdf |volume=Vol-2670 |authors=Kashif Ahmad,Konstantin Pogorelov,Mohib Ullah,Michael Riegler,Nicola Conci,Johannes Langguth,Ala Al-Fuqaha |dblpUrl=https://dblp.org/rec/conf/mediaeval/AhmadPURCLA19 }} ==Multi-Modal Machine Learning for Floods Detection in News, Social Media and Satellite Sequences== https://ceur-ws.org/Vol-2670/MediaEval_19_paper_52.pdf
     Multi-Modal Machine Learning for Flood Detection in News,
               Social Media and Satellite Sequences
Kashif Ahmad 1, Konstantin Pogorelov 2, Mohib Ullah 3, Michael Riegler 4, Nicola Conci 5, Johannes Langguth 2, Ala Al-Fuqaha 1
1 Hamad Bin Khalifa University, Doha, Qatar, 2 Simula Research Laboratory, Norway, 3 Norwegian University of Science and Technology, Norway, 4 Simula Metropolitan Center for Digitalisation and Kristiania University College, Norway, 5 University of Trento, Italy
{kahmad,aalfuqaha}@hbku.edu.qa, mohib.ullah@ntnu.no, nicola.conci@unitn.it, {konstantin,michael,langguth}@simula.no

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT
In this paper we present our methods for the MediaEval 2019 Multimedia Satellite Task, which aims to extract complementary information associated with adverse events from social media and satellite imagery. For the first challenge, we propose a framework jointly utilizing colour, object and scene-level information to predict whether the topic of an article containing an image is a flood event or not. Visual features are combined using early and late fusion techniques, achieving average F1-scores of 82.63, 82.40, 81.40 and 76.77. For the multi-modal flood level estimation, we rely on both visual and textual information, achieving average F1-scores of 58.48 and 46.03, respectively. Finally, for flood detection in time-based satellite image sequences, we use a combination of classical computer vision and machine learning approaches, achieving an average F1-score of 58.82.

1 INTRODUCTION
When natural disasters occur, instant access to relevant information can be crucial for mitigating the loss of property and human lives, and can contribute to a speedy recovery [12]. In this regard, social media and remotely sensed information have proved very effective [1, 3, 12]. Similar to the 2017 [5] and 2018 [6] editions of the task, the MediaEval 2019 Multimedia Satellite task [4] aims to combine information from two complementary sources, namely social media and satellites.
   This paper provides a detailed description of the methods proposed by team UTAOS for the MediaEval 2019 Multimedia Satellite challenge. The challenge consists of three parts, namely (i) News Image Topic Disambiguation (NITD), (ii) Multimodal Flood Level Estimation (MFLE) and (iii) City-centered Satellite Sequences (CCSS).
   The first two tasks (NITD and MFLE) are based on social media data and aim to (a) predict whether the topic of the article containing a given image was a water-related natural-disaster event or not, and (b) build a binary classifier that predicts whether or not an image shows at least one person standing in water above the knee.
   In the CCSS task, the participants are provided with a set of sequences of satellite images depicting a certain city over a certain length of time, and they need to propose and develop a framework able to determine whether or not a flooding event was ongoing in that city at that time.

2 PROPOSED APPROACH
2.1 Methodology for the NITD task
Considering the diversity of the content covered by natural disaster-related images, and based on our previous experience [2], we utilize a diversified set of visual features including colour, texture, object and scene-level features. The object and scene-level features are extracted through three different Convolutional Neural Network (CNN) models, namely AlexNet [9], VggNet [13] and ResNet [8], pre-trained on the ImageNet dataset [7] and the Places dataset [15]. The models pre-trained on ImageNet capture object-level information, while the ones pre-trained on the Places dataset extract scene-level information. For feature extraction from all models, we use the Caffe toolbox (https://github.com/BVLC/caffe). For colour and texture features we rely on the LIRE open source library [10], which we use to extract Joint Composite Descriptor (JCD) features from the images.
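As an illustration of this feature-extraction step, the minimal sketch below uses torchvision's ImageNet-pretrained ResNet as a stand-in for the Caffe models used in our pipeline (a Places-pretrained network would be handled analogously for scene-level features); it is illustrative, not our exact configuration.

```python
# Illustrative stand-in for the Caffe-based extraction described above:
# penultimate-layer activations of an ImageNet-pretrained ResNet-50 serve
# as a 2048-d object-level descriptor for one image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classifier, keep pooled features
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(path):
    """Return a 2048-d deep feature vector for the image at `path`."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0).numpy()
```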
   In order to combine the features, we use both early and late fusion techniques. For early fusion, the feature vectors are concatenated. For late fusion, two different techniques are used, namely (i) simple averaging and (ii) a Particle Swarm Optimization (PSO) based technique. The basic motivation behind PSO-based fusion is to assign merit-based weights to the deep models. For classification purposes, we rely on Support Vector Machines (SVMs) in all of the submitted fusion runs. Moreover, to deal with the class imbalance problem, we use an ensemble of classifiers trained on differently re-sampled data sets, where five different models are trained using all the samples of the rare class and n differing samples of the abundant class.
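The following is a minimal sketch of the PSO-based weight search, assuming each of the k models produces SVM decision scores for the development-set images (an array of shape (k, n)); the swarm size, iteration count and inertia/acceleration constants here are illustrative, not the values we tuned.

```python
# Minimal PSO over fusion weights: maximize dev-set F1 of the weighted sum
# of per-model SVM decision scores (positive fused score => positive class).
import numpy as np
from sklearn.metrics import f1_score

def fused_f1(weights, scores, labels):
    """F1-score of the weighted score sum; scores has shape (k, n_images)."""
    fused = np.tensordot(weights, scores, axes=1)
    return f1_score(labels, (fused > 0).astype(int))

def pso_weights(scores, labels, particles=30, iters=100, w=0.7, c1=1.5, c2=1.5):
    k = scores.shape[0]
    pos = np.random.rand(particles, k)         # particle positions = weight vectors
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fused_f1(p, scores, labels) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = np.random.rand(2, particles, 1)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)     # keep weights in [0, 1]
        fit = np.array([fused_f1(p, scores, labels) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest                               # merit-based weight per model
```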
2.2 Methodology for the MFLE task
For the MFLE task, we propose two different solutions, exploiting visual and textual information respectively. For visual features based flood level estimation, we propose a two-step framework: as a first step, an ensemble of binary image classifiers trained on deep visual features, extracted through AlexNet pre-trained on the ImageNet and Places datasets, is used to differentiate between flooded and non-flooded images. In the second step, we rely on tracking techniques [14], for which an open source library, namely OpenPose (https://www.learnopencv.com/tag/openpose/), is used to draw and extract body keypoints of the people in the flood-related images. Subsequently, the generated coordinates are analyzed to identify the images containing at least one person standing in water above knee height, by checking the knee joints at the corresponding indices of the keypoints extracted for each person.
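The sketch below illustrates one plausible form of this check, under stated assumptions: OpenPose was run with JSON output and the BODY_25 keypoint layout (hips at indices 9 and 12, knees at 10 and 13), and a person whose hips are detected while both knees are not is taken to be submerged above the knee; the decision rule and confidence threshold are illustrative, not our exact criteria.

```python
# Sketch of the knee-joint check on OpenPose JSON output (BODY_25 layout);
# each keypoint is an (x, y, confidence) triple in a flat list, and a missing
# joint is reported with confidence 0. Threshold and rule are assumptions.
import json

R_HIP, L_HIP, R_KNEE, L_KNEE = 9, 12, 10, 13  # BODY_25 keypoint indices
CONF = 0.3                                    # illustrative confidence threshold

def conf(kp, idx):
    """Confidence of keypoint `idx` in the flat [x0, y0, c0, x1, ...] list."""
    return kp[3 * idx + 2]

def person_in_water_above_knee(json_path):
    """True if any detected person has visible hips but no visible knees,
    which we interpret here as legs submerged above knee height."""
    with open(json_path) as f:
        people = json.load(f)["people"]
    for person in people:
        kp = person["pose_keypoints_2d"]
        hips_visible = conf(kp, R_HIP) > CONF or conf(kp, L_HIP) > CONF
        knees_hidden = conf(kp, R_KNEE) < CONF and conf(kp, L_KNEE) < CONF
        if hips_visible and knees_hidden:
            return True
    return False
```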
   For text analysis, on the other hand, we employ two methods, namely (i) a Bag-of-Words (BoW) model and (ii) an LSTM network. Before applying these methods, the data is pre-processed with tokenization and removal of punctuation.
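The following sketch shows a BoW text classifier of the kind used in our textual runs, assuming the article texts and binary labels are available as Python lists; the tokenization pattern and the choice of a linear SVM are illustrative rather than our exact configuration.

```python
# Minimal BoW text-classification sketch: CountVectorizer handles lowercasing
# and tokenization (the token pattern keeps word characters, which also drops
# punctuation), and a linear SVM predicts the binary flood label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

bow_clf = make_pipeline(
    CountVectorizer(lowercase=True, token_pattern=r"\b\w+\b"),
    LinearSVC(),
)

# Toy data for illustration only.
train_texts = ["flood waters rise in the city", "sunny day at the market"]
train_labels = [1, 0]
bow_clf.fit(train_texts, train_labels)
print(bow_clf.predict(["streets flooded after heavy rain"]))
```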
2.3 Methodology for the CCSS task
For the CCSS task, we first tried to employ a recurrent convolutional neural network architecture designed for change detection in multispectral satellite imagery (ReCNN) [11]. This network was originally designed to solve a task very similar to the CCSS task, and the results reported by its authors are promising. However, despite high expectations, ReCNN was not able to achieve sufficient, better-than-random performance in detecting changes caused by flooding. Our assumption is that this was caused by the "real-world" nature of the dataset provided in the CCSS task: images were taken in different seasons, are often partially or fully covered by clouds, and sometimes have a noticeable pixel offset between each other. After a series of unsuccessful experiments, we decided to use a classical image processing and analysis approach with multi-stage processing based on simple operations. First, we mask out from further analysis all cloud-covered image areas by applying a simple threshold function; the reference threshold value is computed per image by averaging the values of the pixels located in monotonically white-colored areas. The same masking is performed for dark, underexposed areas and areas with missing imaging data. Next, we scale the images down to a uniform size of 128 × 128 pixels to reduce noise and soften the influence of image shifting. Then, the scaled images are converted into the hue-saturation-value (HSV) color space, and further analysis is performed on the HSV bands. Using the same thresholding methodology, we mask out pixels with too-low and too-high saturation (S) and value (V) channel values. The resulting masks are filtered with a median filter and processed by a dilation filter. The resulting images are compared in sequential pairs within non-masked-out regions using grey-level co-occurrence matrix (GLCM) texture features. The final flooding presence detection is made using a random tree classifier.
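A condensed sketch of this pipeline is given below, under stated assumptions: images are handled with OpenCV and scikit-image, all threshold values are fixed illustrative placeholders rather than the per-image estimates described above, and scikit-learn's RandomForestClassifier stands in for the random tree classifier.

```python
# Condensed sketch of the CCSS pipeline: mask unusable pixels, downscale,
# compare consecutive frames with GLCM texture features, classify the change.
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # 'greycomatrix' in older versions
from sklearn.ensemble import RandomForestClassifier

SIZE = 128  # uniform working resolution

def prep(bgr):
    """Downscale to the working resolution and convert to HSV."""
    return cv2.cvtColor(cv2.resize(bgr, (SIZE, SIZE)), cv2.COLOR_BGR2HSV)

def usable(hsv):
    """True where a pixel is neither cloud-bright, too dark, nor of extreme
    saturation; fixed thresholds here replace the per-image estimates."""
    s, v = hsv[..., 1], hsv[..., 2]
    bad = ((v < 30) | (v > 225) | (s < 20) | (s > 235)).astype(np.uint8) * 255
    bad = cv2.dilate(cv2.medianBlur(bad, 5), np.ones((5, 5), np.uint8))
    return bad == 0

def pair_features(bgr_a, bgr_b):
    """GLCM texture statistics of the V band for two consecutive frames,
    computed only over pixels clearly visible in both; returns their change."""
    a, b = prep(bgr_a), prep(bgr_b)
    ok = usable(a) & usable(b)
    feats = []
    for hsv in (a, b):
        v = np.where(ok, hsv[..., 2], 0).astype(np.uint8)  # zero masked pixels
        glcm = graycomatrix(v, distances=[1], angles=[0], levels=256, symmetric=True)
        feats.extend(graycoprops(glcm, p)[0, 0] for p in ("contrast", "homogeneity"))
    f = np.asarray(feats)
    return np.abs(f[:2] - f[2:])  # texture change between the two frames

# One feature row per consecutive image pair trains the final classifier
# (RandomForestClassifier as a stand-in for the random tree classifier):
clf = RandomForestClassifier(n_estimators=100)
# clf.fit(np.vstack([pair_features(a, b) for a, b in train_pairs]), train_labels)
```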
3 RESULTS AND ANALYSIS
3.1 Runs Description in NITD Task
For NITD, we submitted a total of four runs. In run 1, we used the PSO-based weight optimization method to assign weights to each model on a merit basis. In run 2, the deep models are treated equally by assigning equal weights to all models. In run 3, we added colour-based features to our pool of feature descriptors in a late fusion method where the scores of all models are simply added to obtain the final prediction. Our run 4 is based on early fusion, where the deep features are simply concatenated for training the SVMs. Table 1a provides the experimental results of our proposed solutions for the NITD task on both the development and test sets. Overall, the best results are obtained with PSO-based late fusion, which shows the advantage of merit-based late fusion of the models. On the other hand, the lowest F1-score is obtained with early fusion. Moreover, the colour-based features did not contribute positively to the performance of the framework. This might be due to the fact that the JCD feature is very compressed and does not contain much information that the fusion algorithm could exploit.

Table 1: Evaluation of our proposed approaches for (a) NITD and (b) MFLE tasks in terms of F1-scores.

(a) NITD
Run      Dev. Set    Test Set
Run 1    81.08       82.63
Run 2    79.43       82.40
Run 3    77.77       81.40
Run 4    75.70       76.77

(b) MFLE
Run      Dev. Set    Test Set
Run 1    61.00       58.48
Run 2    56.71       46.03
Run 4    57.56       44.91

3.2 Runs Description in MFLE Task
For the MFLE task, we submitted two mandatory runs and one optional run. The first run is based on visual information, where a two-phase approach is used for flood level estimation, starting with deep features based classification of flooded and non-flooded images, followed by human body keypoint detection via the OpenPose library in the flood-related images. Our second and third runs are based on textual information, where BoW and LSTM-based techniques are used for article classification, respectively. Table 1b shows the experimental results of our solutions for the MFLE task on both the development and test sets. Overall, better results are obtained with visual information. Moreover, BoW features produce slightly better results than the LSTM-based approach.

3.3 Runs Description in CCSS Task
For the CCSS task, we submitted the mandatory run only. The evaluation performed by the task organizers showed an F1-score of 58.82% for flooding detection on the provided test set. The relatively high performance of our simple detection approach can be explained by the aggressive image masking technique used, which allows us to compare only clearly visible areas. However, our own evaluation shows that the approach is not able to distinguish correctly between image changes caused by flooding and those caused by seasonal vegetation growth.

4 CONCLUSIONS AND FUTURE WORK
This year, the Multimedia Satellite task introduced new and important challenges, including image-based news topic disambiguation (NITD), multi-modal flood level estimation in social media content (MFLE) and the prediction of flood events in sequences of satellite images of a certain city over a certain length of time (CCSS). For the NITD task, we mainly relied on ensembles of classifiers trained on deep features extracted through several pre-trained deep models, as well as on global features (GF). During the experiments, we observed that the object and scene-level features complement each other when jointly utilized in a proper way. Moreover, deep features proved more effective than GF. For the MFLE task, we used both textual and visual information, where better results were obtained with visual information. However, textual and visual information can complement each other; in the future, we aim to analyze the task with more advanced early and late fusion techniques to better utilize the multi-modal information. Furthermore, we plan to use more complex GF. For the CCSS task, we used a combination of classical computer vision and machine learning approaches. To improve the results in the future, we will continue investigating recurrent CNN and GAN-based approaches in combination with classical image processing algorithms.
REFERENCES
[1] Kashif Ahmad, Konstantin Pogorelov, Michael Riegler, Olga Ostroukhova, Pål Halvorsen, Nicola Conci, and Rozenn Dahyot. 2019. Automatic detection of passable roads after floods in remote sensed and social media data. Signal Processing: Image Communication 74 (2019), 110–118.
[2] Kashif Ahmad, Amir Sohail, Nicola Conci, and Francesco De Natale. 2018. A comparative study of global and deep features for the analysis of user-generated natural disaster related images. In 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP). IEEE, 1–5.
[3] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual enrichment of remote-sensed events with social media streams. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, 1077–1081.
[4] Benjamin Bischke, Patrick Helber, Erkan Basar, Simon Brugman, Zhengyu Zhao, and Konstantin Pogorelov. 2019. The Multimedia Satellite Task at MediaEval 2019: Flood Severity Estimation. In Proc. of the MediaEval 2019 Workshop (Oct. 27-29, 2019). Sophia Antipolis, France.
[5] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[6] Benjamin Bischke, Patrick Helber, Zhengyu Zhao, Jens de Bruijn, and Damian Borth. 2018. The Multimedia Satellite Task at MediaEval 2018: Emergency Response for Flooding Events. In Proc. of the MediaEval 2018 Workshop (Oct. 29-31, 2018). Sophia-Antipolis, France.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[10] Mathias Lux, Michael Riegler, Pål Halvorsen, Konstantin Pogorelov, and Nektarios Anagnostopoulos. 2016. LIRE: open source visual information retrieval. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 30.
[11] Lichao Mou, Lorenzo Bruzzone, and Xiao Xiang Zhu. 2018. Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery. IEEE Transactions on Geoscience and Remote Sensing 57, 2 (2018), 924–935.
[12] Naina Said, Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Laiq Hassan, Nasir Ahmad, and Nicola Conci. 2019. Natural disasters detection in social media and satellite imagery: a survey. Multimedia Tools and Applications (17 Jul 2019). https://doi.org/10.1007/s11042-019-07942-1
[13] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[14] Mohib Ullah and Faouzi Alaya Cheikh. 2018. A directed sparse graphical model for multi-target tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1816–1823.
[15] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems. 487–495.