 Detection of Flooding Events in Social Multimedia and Satellite
              Imagery using Deep Neural Networks

                                Benjamin Bischke1,2 Prakriti Bhardwaj1,2 Aman Gautam1,2
                                   Patrick Helber1,2 Damian Borth2 Andreas Dengel1,2
                                         1 University of Kaiserslautern, Germany
                          2 German Research Center for Artificial Intelligence (DFKI), Germany

                       {Benjamin.Bischke, Patrick.Helber, Damian.Borth, Andreas.Dengel}@dfki.de
                                       {p_bhardwaj14, a_gautam14}@cs.uni-kl.de
Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper presents the solution of the DFKI team for the Multimedia Satellite Task at MediaEval 2017. In our approach, we strongly relied on deep neural networks. The results show that the fusion of visual and textual features extracted by deep networks can be effectively used to retrieve social multimedia reports which provide direct evidence of flooding. Additionally, we extend existing network architectures for semantic segmentation to incorporate RGB and Infrared (IR) channels into the model. Our results show that IR information is of vital importance for the detection of flooded areas in satellite imagery.
1 INTRODUCTION
Satellite imagery has become increasingly accessible in recent years. Programs such as Copernicus from ESA and Landsat from NASA facilitate this development by providing public and free access to the data. Large-scale datasets such as the EuroSAT dataset [9] or the ImageCLEFremote dataset [2] have emerged from these programs and form the foundation for a deeper analysis of remotely sensed data. One major problem when analyzing satellite imagery is the sparsity of data for particular locations over time. Publicly available satellites are mostly non-stationary and require several days to revisit the same locations. To overcome this problem, recent work leverages the advances of social multimedia analysis and combines the two data sources [14]. Bischke et al. [3] demonstrated a system for the contextual enrichment of remotely sensed events in satellite imagery by leveraging contemporary content from social media. Similarly, Ahmad et al. [1] crawled and linked social media data about technological and environmental disasters to satellite imagery.
   Building upon these developments and putting a stronger focus on flooding events, Bischke et al. [4] released the Multimedia Satellite Task at MediaEval 2017. The goal of this benchmarking task is to augment events that are present in satellite images with social media reports in order to provide a more comprehensive view of the event. The task is divided into two subtasks: (1) Disaster Image Retrieval from Social Media aims to retrieve social media reports that provide direct evidence of a flooding event. (2) Flood Detection in Satellite Images aims to identify regions in satellite images which are affected by flooding.
1.1 Disaster Image Retrieval from Social Media
In this section, we present our solution for the first subtask, considering the visual and textual modalities as well as their fusion. For all modalities, we train a Support Vector Machine (SVM) with a radial basis function (RBF) kernel on the two classes flooding and no flooding. We obtain the ranked list of relevant social media reports by computing the distance to the decision boundary of the SVM. The features which we used for the classifier training are discussed in detail in the following subsections.
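This retrieval step can be summarized with a minimal scikit-learn sketch (illustrative, not the authors' original code); the feature matrices and labels are random placeholders standing in for the representations described in Sections 1.1.1-1.1.3:

```python
# Minimal sketch of the retrieval step: an RBF-SVM is trained on
# "flooding" vs. "no flooding" feature vectors and reports are ranked
# by their signed distance to the decision boundary.
import numpy as np
from sklearn.svm import SVC

def rank_reports(X_train, y_train, X_test):
    clf = SVC(kernel="rbf", gamma="scale")       # two classes: 1 = flooding, 0 = no flooding
    clf.fit(X_train, y_train)
    scores = clf.decision_function(X_test)       # signed distance to the decision boundary
    return np.argsort(-scores), scores           # most relevant reports first

# Toy usage with random placeholder features (1200-d fused vectors, cf. Section 1.1.3).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 1200))
y_train = rng.integers(0, 2, size=100)
ranking, scores = rank_reports(X_train, y_train, rng.normal(size=(20, 1200)))
```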
   1.1.1 Visual Features. Motivated by the recent advances of Convolutional Neural Networks (CNNs) in learning high-level representations of image content, we apply a CNN to obtain a semantic feature representation of each image. In particular, we use the pre-trained DeepSentiBank network [6] with the X-ResNet [10] architecture. X-ResNet is an extension of ResNet [8] with cross-residual connections to predict multiple related tasks. We extract the internal representation of X-ResNet’s anptask_pool5 layer, resulting in a 1000-dimensional feature vector for each image. Compared to CNNs pre-trained on ImageNet [7], this approach has two advantages: (1) DeepSentiBank was trained to predict adjective noun pairs (ANPs). Unlike ImageNet pre-trained models, this allows the model not only to rely on information about object classes but also to capture details about the image scene through adjectives (e.g., wet road, damaged building, stormy clouds). (2) The domain shift of DeepSentiBank is smaller than that of ImageNet pre-trained models. DeepSentiBank was trained on the Visual Sentiment Ontology (VSO) dataset [5], which contains Flickr images similar to the dataset provided by the task organizers. Such images often include more scenic information, whereas images from ImageNet mainly contain objects.
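The following sketch illustrates this feature-extraction step, i.e. pulling a fixed-length descriptor from an intermediate layer of a pre-trained CNN. Since the DeepSentiBank X-ResNet weights are not distributed with torchvision, an ImageNet-trained ResNet is used here purely as a stand-in:

```python
# Illustration of extracting a fixed-length CNN descriptor per image.
# The paper uses DeepSentiBank's X-ResNet (anptask_pool5 layer, 1000-d);
# an ImageNet ResNet serves as a stand-in here.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()        # drop the classifier, keep the pooled descriptor
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(x).squeeze(0)    # 2048-d here; 1000-d for the anptask_pool5 layer
```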
   1.1.2 Metadata Features. For retrieval based only on the metadata of social media reports, we rely on the tags given by users. We observed that relying on the presence of single words such as ’flooding’ or ’flood’ alone is not sufficient and introduces many irrelevant social media reports. We therefore combine individual tags to obtain a document representation for each report.
   In the first preprocessing step, we remove numbers and convert all tags to lowercase. We then train a Word2Vec model [12] (with 200 dimensions) on the user tags. For each social media report, we average the word vectors to obtain a document representation. In order to incorporate the importance of each word into the document representation, we additionally weight each word embedding with the term frequency-inverse document frequency (TF-IDF) of the corresponding word. The intuition behind this approach is straightforward: document vectors containing semantically similar concepts (’flood’, ’river’, ’damage’) should point in a similar direction in the embedding space, in contrast to documents with word vectors of different concepts (’flood’, ’book’, ’desk’, ’drink’).
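A minimal sketch of this document representation with gensim and scikit-learn (illustrative, not the authors' original code); the corpus-level IDF is used here as a simple stand-in for the per-report TF-IDF weight described above:

```python
# Sketch of the tag-based document representation: train Word2Vec on the
# tag lists and average the word vectors, weighting each tag by its IDF.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

tag_lists = [["flood", "river", "damage"], ["book", "desk", "drink"]]   # toy user tags

w2v = Word2Vec(sentences=tag_lists, vector_size=200, min_count=1)       # 200-d tag embeddings
tfidf = TfidfVectorizer(analyzer=lambda tags: tags).fit(tag_lists)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def document_vector(tags):
    # average of IDF-weighted word vectors (simplification of the
    # per-report TF-IDF weighting described in the text)
    vecs = [idf.get(t, 1.0) * w2v.wv[t] for t in tags if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

doc_vec = document_vector(["flood", "river"])                            # 200-d report vector
```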
   1.1.3 Visual-Textual Fused Features. We extract the visual and textual feature representations using the two approaches described above. The two modalities are fused by concatenating the feature vectors, resulting in a 1200-dimensional vector.
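As a small illustration, the fusion is a plain concatenation of the two vectors (dimensions as stated above):

```python
# Fusion by concatenation (sketch): the 1000-d visual descriptor and the 200-d
# textual document vector are stacked into one 1200-d feature vector per report.
import numpy as np

visual = np.zeros(1000)                      # e.g. X-ResNet anptask_pool5 features
textual = np.zeros(200)                      # e.g. TF-IDF-weighted Word2Vec document vector
fused = np.concatenate([visual, textual])    # shape (1200,), used as SVM input
```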
1.2 Flood Detection in Satellite Imagery
In this section, we explain our approach for the segmentation of flooded areas in satellite images using deep neural networks.

   1.2.1 Pre-Processing. Before feeding the satellite data to the networks, we perform a location-based normalization step. The goal of this step is to remove a location bias caused by local changes in the images due to different vegetation, lighting conditions and atmospheric distortions. For each location, we compute the mean pixel value of each RGB and IR channel and subtract this value from the corresponding channels of all images belonging to the same location. The pixel values of the original satellite images are encoded in a 16-bit format, which turned out to be problematic for many frameworks. To overcome this, we additionally scale the pixel values channel-wise from their minimum and maximum to the range of 0 to 255.
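A minimal NumPy sketch of this normalization, assuming the patches of one location are stacked into an array of shape (N, H, W, 4) with RGB+IR channels:

```python
# Sketch of the location-based normalization and 16-bit -> [0, 255] rescaling.
import numpy as np

def normalize_location(patches: np.ndarray) -> np.ndarray:
    """patches: uint16 array of shape (N, H, W, 4) for one location (RGB + IR)."""
    x = patches.astype(np.float32)
    x -= x.mean(axis=(0, 1, 2), keepdims=True)       # subtract per-channel location mean
    mins = x.min(axis=(0, 1, 2), keepdims=True)      # channel-wise min/max scaling
    maxs = x.max(axis=(0, 1, 2), keepdims=True)
    return (x - mins) / (maxs - mins + 1e-8) * 255.0

patches = np.random.randint(0, 2**16, size=(10, 320, 320, 4), dtype=np.uint16)
normalized = normalize_location(patches)             # values now in [0, 255]
```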
   1.2.2 Network Architectures. We propose three different network architectures for the segmentation problem. All networks take the original image patch size (320 x 320 pixels) as input and predict classification labels at the pixel level.
   In our first approach, we use a fully convolutional network (FCN) [11] which has an architecture similar to VGG13 [13]. We remove the fully connected layers and attach an up-sampling layer with bilinear interpolation to scale the down-sampled feature maps back to the original image size. An additional convolutional layer is used to predict the class labels for each pixel, and classification probabilities are obtained by squashing the network output through a softmax layer. Since the first layer of VGG13 expects an input tensor with three channels, we only pass the RGB information of the satellite data into this network. In the second network, we expand the previous architecture by changing the input of the first layer to four channels, allowing the network to incorporate the IR information into the prediction. In the third network, we investigate more complex decoders: we use the second network as base model and replace the up-sampling layer with a reversed version of the VGG13 encoder acting as decoder.
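A PyTorch sketch of the second network (RGB+IR input, bilinear up-sampling, 1x1 convolution for per-pixel class scores) is given below; the exact layer configuration is an assumption and not the authors' original implementation:

```python
# Sketch of a VGG13-style FCN with a 4-channel (RGB + IR) input.
import torch
import torch.nn as nn
from torchvision.models import vgg13

class FloodFCN(nn.Module):
    def __init__(self, in_channels: int = 4, num_classes: int = 2):
        super().__init__()
        self.encoder = vgg13(weights=None).features                 # convolutional part of VGG13
        self.encoder[0] = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)  # accept RGB + IR
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)            # per-pixel class scores

    def forward(self, x):                                            # x: (B, 4, 320, 320)
        size = x.shape[-2:]
        feats = self.encoder(x)                                      # down-sampled feature maps
        logits = nn.functional.interpolate(self.classifier(feats), size=size,
                                           mode="bilinear", align_corners=False)
        return logits.log_softmax(dim=1)                             # per-pixel log-probabilities

out = FloodFCN()(torch.randn(1, 4, 320, 320))                        # -> (1, 2, 320, 320)
```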
   1.2.3 Network Training. In order to train the networks described above from scratch, we extend the dataset using data augmentation. Every image patch is flipped (left-right and up-down) and rotated at 90-degree intervals, yielding 8 augmentations per image patch. All networks are trained end-to-end with stochastic gradient descent using the negative log-likelihood loss, a learning rate of 0.01 and a weight decay of 0.0005.
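The augmentation and training setup can be summarized with the following sketch (illustrative; the tiny placeholder model can be replaced by one of the segmentation networks above):

```python
# Sketch of the 8-fold flip/rotation augmentation and the SGD training setup
# (lr 0.01, weight decay 0.0005, negative log-likelihood loss).
import torch

def augment(patch: torch.Tensor, mask: torch.Tensor):
    """Yield the 8 flip/rotation variants of a (C, H, W) patch and its (H, W) mask."""
    for flip in (False, True):
        p = torch.flip(patch, dims=[2]) if flip else patch
        m = torch.flip(mask, dims=[1]) if flip else mask
        for k in range(4):                                   # 0, 90, 180, 270 degrees
            yield torch.rot90(p, k, dims=(1, 2)), torch.rot90(m, k, dims=(0, 1))

# Placeholder model; substitute e.g. the FloodFCN sketch from Section 1.2.2.
model = torch.nn.Sequential(torch.nn.Conv2d(4, 2, kernel_size=1), torch.nn.LogSoftmax(dim=1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0005)
criterion = torch.nn.NLLLoss()

def train_step(images: torch.Tensor, masks: torch.Tensor) -> float:
    # images: (B, 4, 320, 320) float, masks: (B, 320, 320) long
    optimizer.zero_grad()
    loss = criterion(model(images), masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```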
Table 1: Average Precision at 480 and the mean of Average Precisions at different cutoffs for the first subtask (DIRSM).

                               Run 1    Run 2    Run 3    Run 4
   AP@480                      86.64    63.41    90.45    74.08
   MAP@[50,100,150,240,480]    95.71    77.64    97.40    64.50

Table 2: Intersection over Union (IoU) for the second subtask (FDSI). The results are listed for unseen patches covering (i) the same locations as in the dev-set and (ii) new locations.

                      Run 1    Run 2    Run 3
   Same locations     73.56    84.27    84.36
   New locations      69.32    70.87    74.13

2 EXPERIMENTS AND RESULTS
The results for the first subtask are shown in Table 1. Run 1 is based only on visual information, Run 2 only on metadata and Run 3 on the fusion of both modalities as described in Section 1.1. It can be seen that relying on visual information achieves a higher Average Precision (AP) than relying on metadata only. At the same time, the fusion of both modalities further improves the retrieval accuracy by 1.7%. Run 4 uses only visual features from an ImageNet pre-trained ResNet152 model [8]. Compared to these ImageNet features, the DeepSentiBank (X-ResNet) features of Run 1 perform significantly better.
   Table 2 contains the results of the second subtask for unseen satellite images covering the same locations as in the development set as well as new locations. Each of the three runs corresponds to one of the three networks described in Section 1.2.2. Comparing the IoU of the last two networks to the first one (Run 1) shows that the IoU increases by more than 10%. This illustrates the importance of the IR channel for the detection of flooded areas in satellite data. The comparison of the last two networks against each other (Run 2 vs. Run 3) shows a minor improvement of the IoU (0.1% for the same locations and 4% for new locations). The IoU values of all runs on new locations demonstrate that the networks generalize to new places.
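For reference, the IoU reported in Table 2 can be computed per class from binary masks; a minimal sketch for the flooded class:

```python
# Intersection over Union between binary prediction and ground-truth masks.
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0                                   # both masks empty: perfect agreement
    return np.logical_and(pred, target).sum() / union

print(iou(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])))  # -> 0.5
```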
3 CONCLUSION
In this paper, we presented our approach for the Multimedia Satellite Task at MediaEval 2017. One major insight is the importance of a multi-modal fusion of textual and visual content for the retrieval of social multimedia. In our approach, we analyzed different CNN features and showed that DeepSentiBank X-ResNet can be used to obtain a powerful image representation. In the second subtask of the challenge, we applied segmentation networks to satellite imagery to extract flooded regions. Our results show that incorporating IR information is very important. For future work, we would like to extend the satellite imagery with active radar data (Synthetic Aperture Radar), which can "look" through the clouds. We plan to use the results of this work for the monitoring and prediction of flooding events.

ACKNOWLEDGMENTS
The authors would like to thank NVIDIA for support within the NVAIL program.
REFERENCES
 [1] Kashif Ahmad, Michael Riegler, Ans Riaz, Nicola Conci, Duc-Tien
     Dang-Nguyen, and Pål Halvorsen. 2017. The JORD System: Linking
     Sky and Social Multimedia Data to Natural Disasters. In Proceedings
     of the 2017 ACM on International Conference on Multimedia Retrieval.
     ACM, 461–465.
 [2] Helbert Arenas, Md Bayzidul Islam, and Josiane Mothe. 2017. Overview
     of the ImageCLEF 2017 Population Estimation (Remote) Task. (2017).
 [3] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas
     Dengel. 2016. Contextual enrichment of remote-sensed events with
     social media streams. In Proceedings of the 2016 ACM on Multimedia
     Conference. ACM, 1077–1081.
 [4] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan
     Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite
     Task at MediaEval 2017: Emergency Response for Flooding Events.
     In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin,
     Ireland.
 [5] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu
     Chang. 2013. Large-scale visual sentiment ontology and detectors
     using adjective noun pairs. In Proceedings of the 21st ACM international
     conference on Multimedia. ACM, 223–232.
 [6] Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. 2014.
     Deepsentibank: Visual sentiment concept classification with deep
     convolutional neural networks. arXiv preprint arXiv:1410.8586 (2014).
 [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.
     2009. Imagenet: A large-scale hierarchical image database. In Computer
     Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.
     IEEE, 248–255.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     residual learning for image recognition. In Proceedings of the IEEE
     conference on computer vision and pattern recognition. 770–778.
 [9] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian
     Borth. 2017. EuroSAT: A Novel Dataset and Deep Learning Bench-
     mark for Land Use and Land Cover Classification. arXiv preprint
     arXiv:1709.00029 (2017).
[10] Brendan Jou and Shih-Fu Chang. 2016. Deep Cross Residual Learning
     for Multitask Visual Recognition. In ACM Multimedia. Amsterdam,
     The Netherlands.
[11] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully
     convolutional networks for semantic segmentation. In Proceedings
     of the IEEE Conference on Computer Vision and Pattern Recognition.
     3431–3440.
[12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Ef-
     ficient estimation of word representations in vector space. arXiv
     preprint arXiv:1301.3781 (2013).
[13] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-
     lutional networks for large-scale image recognition. arXiv preprint
     arXiv:1409.1556 (2014).
[14] Alan Woodley, Shlomo Geva, Richi Nayak, and Timothy Campbell.
     2016. Introducing the Sky and the Social Eye. In Working Notes Pro-
     ceedings of the MediaEval 2016 Workshop, Vol. 1739. CEUR Workshop
     Proceedings.