=Paper= {{Paper |id=Vol-1984/Mediaeval_2017_paper_15 |storemode=property |title=CNN and GAN Based Satellite and Social Media Data Fusion for Disaster Detection |pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_15.pdf |volume=Vol-1984 |authors=Kashif Ahmad,Konstantin Pogorelov,Michael Riegler,Nicola Conci,Pål Halvorsen |dblpUrl=https://dblp.org/rec/conf/mediaeval/AhmadPRCH17 }} ==CNN and GAN Based Satellite and Social Media Data Fusion for Disaster Detection== https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_15.pdf
 CNN and GAN Based Satellite and Social Media Data Fusion for
                    Disaster Detection
Kashif Ahmad1, Konstantin Pogorelov2, Michael Riegler2, Nicola Conci1, Pål Halvorsen2
1 DISI, University of Trento, Italy
2 Simula Research Laboratory, Oslo, Norway

kashif.ahmad@unitn.it, konstantin@simula.no, michael@simula.no, nicola.conci@unitn.it, paalh@ifi.uio.no

MediaEval'17, 13-15 September 2017, Dublin, Ireland
Copyright held by the owner/author(s).

ABSTRACT
This paper presents the method proposed by team UTAOS for the MediaEval 2017 Multimedia Satellite Task. In the first task, we mainly rely on different Convolutional Neural Network (CNN) models combined with two different late fusion methods. We also utilize the additional information available in the form of meta-data. The average precision at cutoff 480 and the mean of the average precision at different cutoffs for our best run are 84.94% and 95.11%, respectively. For the second task, we utilize a Generative Adversarial Network (GAN). The mean Intersection-over-Union (IoU) for our best run is 0.8315.

1     INTRODUCTION
Linking social media information to remotely sensed data holds large possibilities for society and research [1–3]. The Multimedia and Satellite task at MediaEval 2017 [4] aims to integrate information from both sources, satellite data and social media, to provide a better overview of a disaster. This paper provides a detailed description of the methods developed by the UTAOS team for the MediaEval 2017 Multimedia Satellite Task. The challenge consists of two subtasks: (i) Disaster Image Retrieval from Social Media (DIRSM) and (ii) Flood Detection in Satellite Images (FDSI).
2 PROPOSED APPROACH

2.1 Methodology for DIRSM Task
To tackle challenge (i), we rely on Convolutional Neural Network (CNN) features. In detail, we first extract CNN features from seven different models from state-of-the-art architectures pre-trained on the ImageNet [5] and Places [14] datasets. These models include AlexNet [8] (pre-trained on both the ImageNet and Places datasets), GoogleNet [12] (pre-trained on ImageNet), VGGNet-19 [10] (pre-trained on both the ImageNet and Places datasets) and different configurations of ResNet [7] with 50, 101 and 152 layers. For feature extraction from AlexNet and VGGNet-19 we use the Caffe toolbox¹, while in the case of GoogleNet and ResNet we exploit VLFeat MatConvNet².
   All in all, we extract eight feature vectors through four different network architectures from the same image. AlexNet and VGGNet-19 provide feature vectors of size 4096, while GoogleNet and ResNet provide feature vectors of size 1024 and 2048, respectively.

¹ http://caffe.berkeleyvision.org/
² http://www.vlfeat.org/matconvnet/
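For illustration, the following is a minimal sketch of this kind of deep-feature extraction, using a pre-trained ResNet-50 in Keras as a stand-in for the Caffe and MatConvNet pipelines actually used in our experiments; the image path is hypothetical.

# Sketch only: the paper used Caffe and VLFeat MatConvNet; Keras is
# substituted here purely for brevity of illustration.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# Pre-trained ResNet-50 without the classification head; global average
# pooling yields a 2048-dimensional feature vector per image.
model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]  # shape: (2048,)

features = extract_features("flood_example.jpg")  # hypothetical image path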
Subsequently, the extracted features are fed into ensembles of Support Vector Machines (SVMs), which provide classification scores in terms of posterior classification probabilities. We also consider the user's tags and the date taken, along with the GPS information, from the available meta-data. For the meta-data we rely on the Random Tree classifier provided by the WEKA toolbox [6]. Finally, the classification scores obtained through the Random Trees and the SVMs trained on meta-data and visual features are fused using late fusion. For the late fusion we propose two different methods, namely (i) an induced ordered fusion scheme inspired by the Induced Ordered Weighted Averaging (IOWA) operators of Yager et al. [13], and (ii) Particle Swarm Optimization (PSO). Figure 1 provides a block diagram of the proposed methodology for the DIRSM task.

[Figure 1: Block diagram of the proposed methodology for the DIRSM task.]
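As an illustration of the IOWA-inspired fusion step, the minimal sketch below combines posterior scores from several models with an ordered weighted average, giving larger weights to more confident outputs per sample. The weight vector shown is a placeholder rather than the one used in our runs; the PSO variant would instead search for the weights that maximize retrieval precision on held-out data.

import numpy as np

def iowa_fuse(scores, weights):
    """IOWA-style late fusion of posterior scores from several models.

    scores:  array of shape (n_models, n_samples); each row holds one
             model's posterior probability for the positive class.
    weights: length n_models, summing to 1; the largest weight is
             applied to the most confident model output per sample.
    """
    scores = np.asarray(scores)
    # Order the model outputs per sample by confidence (descending),
    # then take the weighted average in that induced order.
    ordered = np.sort(scores, axis=0)[::-1]
    return np.average(ordered, axis=0, weights=weights)

# Example with three hypothetical models and two images:
scores = [[0.9, 0.2], [0.7, 0.4], [0.6, 0.1]]
weights = [0.5, 0.3, 0.2]           # placeholder weights
fused = iowa_fuse(scores, weights)  # -> array([0.78, 0.28])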
2.2 Methodology for FDSI Task
For challenge (ii), we started from a visual analysis of the provided development set. We observed that it is not possible to use any existing open-source framework due to the nature of the provided satellite data. Furthermore, we observed that the four-channel 16-bit TIFF file format used is too specific and cannot be correctly processed, or even viewed, by existing libraries.
   To perform the visual analysis we developed conversion code which converts each geo-TIFF into a pair of images: RGB and infrared (IR). For the RGB images we use a per-three-channel normalization which fits all the R, G and B pixel values of the input geo-image into the standard 0-255 RGB range. The normalization coefficients are the same for all three channels in order to preserve the real color balance even in cases of low variation in one of the components. The normalization of the IR component is performed separately:

$$rgb_{min} = \min\Big(\min_{i \in R} r_i,\ \min_{i \in G} g_i,\ \min_{i \in B} b_i\Big)$$
$$rgb_{max} = \max\Big(\max_{i \in R} r_i,\ \max_{i \in G} g_i,\ \max_{i \in B} b_i\Big)$$
$$ir_{min} = \min_{k \in IR} ir_k, \qquad ir_{max} = \max_{k \in IR} ir_k$$
$$\forall i \in \{R,G,B\}: \quad \{r|g|b\}_i^{*} = \frac{(\{r|g|b\}_i - rgb_{min}) \cdot 255}{rgb_{max} - rgb_{min}}$$
$$\forall k \in IR: \quad ir_k^{*} = \frac{(ir_k - ir_{min}) \cdot 255}{ir_{max} - ir_{min}}$$
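A minimal sketch of this normalization, assuming the four-channel geo-image has already been read into a NumPy array with channel order R, G, B, IR (the geo-TIFF reading itself, e.g. with GDAL, is omitted):

import numpy as np

def normalize_geo_image(geo):
    """Convert a 4-channel 16-bit geo-image (R, G, B, IR) into an
    8-bit RGB image plus a separately normalized 8-bit IR image,
    following the equations above."""
    geo = geo.astype(np.float64)
    rgb, ir = geo[..., :3], geo[..., 3]

    # A shared min/max over all three color channels preserves the
    # relative color balance even if one channel has low variation.
    rgb_min, rgb_max = rgb.min(), rgb.max()
    rgb_out = (rgb - rgb_min) * 255.0 / (rgb_max - rgb_min)

    # The IR component is normalized independently.
    ir_min, ir_max = ir.min(), ir.max()
    ir_out = (ir - ir_min) * 255.0 / (ir_max - ir_min)

    return rgb_out.astype(np.uint8), ir_out.astype(np.uint8)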
   Moreover, we performed a human-expert-driven visual analysis of the images and found them all to be low-contrast, blurry and limited in color range. Based on our previous experience [9], we decided to use a generative adversarial network (GAN). GANs³ are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented as a system of two neural networks contesting with each other in a zero-sum game framework.
   As the basis for our method we selected a neural network architecture used for retinal vessel segmentation in fundoscopic images with generative adversarial networks (V-GAN)⁴. The V-GAN architecture is designed [11] for processing retinal images, which have comparable visual properties, and provides the required output in the form of one-class image segmentation masks.

³ http://en.wikipedia.org/wiki/Generative_adversarial_networks
⁴ https://bitbucket.org/woalsdnd/v-gan
   V-GAN is implemented in Python on top of Keras with a TensorFlow GPU-enabled back-end. We modified the network architecture by changing the top-layer configuration in order to support four-channel floating-point geo-image-compatible input. The final generator output layer, which creates the probabilistic segmentation image, was extended with a simple threshold activation layer to generate the binary segmentation map.
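In Keras terms, this binarizing extension can be sketched roughly as follows; the generator model and the threshold value are placeholders, not the exact V-GAN code (which is available in the repository cited above).

import tensorflow as tf
from tensorflow.keras.layers import Lambda

# Sketch: turning the generator's probabilistic segmentation output
# into a binary mask via a fixed-threshold activation layer.
# `generator` is assumed to be the (modified) V-GAN generator model.
def add_threshold_layer(generator, threshold=0.5):
    binary_mask = Lambda(
        lambda p: tf.cast(p >= threshold, tf.float32),
        name="binary_threshold",
    )(generator.output)
    return tf.keras.Model(generator.input, binary_mask)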
   First, we performed experiments with the development set only and found that the modified V-GAN is able to perform the segmentation of the provided satellite images, but the estimated performance metrics were below the expected level. Additional visual analysis of the converted RGB and IR images showed that the IR component of the source geo-images was sometimes irrelevant to the flooded areas, which probably biased our GAN during the training process and prevented it from extracting the correct properties of the flooded areas. Thus, we decided to exclude the IR component from the model input and to process only the RGB components of the converted, normalized geo-images. This resulted in a significant performance improvement and correct segmentation of most of the flooded areas in the development set, except for some images taken in uncommon lighting and cloudy conditions.
3 RESULTS AND ANALYSIS

3.1 Runs Description in DIRSM Task
For DIRSM, we submitted five different runs. Table 1 provides the official results of our methods in terms of average precision at cutoff 480 and the mean of the average precision at different cutoffs (50, 100, 250, 480). Run 1 and run 4 are mainly based on visual information extracted with seven different CNN models, utilized jointly in the PSO and IOWA based fusions, respectively. As can be seen in Table 1, the PSO based fusion method outperforms IOWA with significant gains of 3.79% and 5.34%. On the other hand, run 2 is based on meta-data, achieving the worst results among all runs. Similarly, run 3 and run 5 represent two different variations of our method for combining meta-data and visual information: run 3 is based on IOWA, while run 5 represents our PSO based fusion of meta-data and visual information. Again, the PSO based fusion performs better. One of the main limitations of the IOWA based fusion is its mechanism of assigning more weight to a more confident model. In this particular case, we noticed that our classifier trained on meta-data provides more confident decisions with high probabilities, causing a significant reduction in performance. This can also be concluded from the results of run 2, where the meta-data obtain the worst results. The degradation in performance due to the inclusion of meta-data shows that the additional available information is not very useful.

Table 1: Evaluation of the proposed approach in terms of precision at cutoff 480 and mean over average precision at different cutoffs (50, 100, 250 and 480).

Run   Features               Precision at 480   Mean precision
1     Visual only            84.94%             95.11%
2     Meta-data only         25.88%             31.45%
3     Meta-data and Visual   54.74%             68.12%
4     Visual only            81.15%             89.77%
5     Meta-data and Visual   73.83%             82.68%

3.2 Runs Description in FDSI Task
Table 2 presents the experimental results of our method for the FDSI task. In total, we submitted 5 different runs for 7 different target locations, each represented by image patches of satellite images of different regions affected by flooding. We used a different binarization threshold for the different runs with the same model in order to find the optimal balance between the numbers of false-positive and false-negative pixels in the segmented images. The threshold values were selected based on visual analysis of the segmentation results in order to maximize the variability of the detected flooded area. The best results in all runs are reported for location 03, which has the best ground visibility without clouds and proper lighting with strong light reflections from the water surface in the flooded areas. Overall, the best results are obtained by runs 3 and 4 with a mean IoU of 0.83. For the new location (07), the best results are likewise obtained by runs 3 and 4.

Table 2: Evaluation of our approach for the Flood Detection in Satellite Images (FDSI) task in terms of mean IoU per location.

Run (Thresh.)   01     02     03     04     05     06     Overall   07 (new)
1 (0.78)        0.79   0.81   0.88   0.78   0.75   0.80   0.82      0.73
2 (0.94)        0.77   0.78   0.86   0.74   0.72   0.78   0.80      0.70
3 (0.5)         0.79   0.82   0.88   0.79   0.76   0.81   0.83      0.74
4 (0.35)        0.79   0.82   0.87   0.79   0.77   0.80   0.83      0.74
5 (0.12)        0.78   0.80   0.86   0.78   0.77   0.78   0.81      0.73
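As a sketch of such a threshold sweep, assuming the probabilistic segmentation maps and ground-truth masks are available as NumPy arrays (the example data below is synthetic, and the candidate thresholds mirror the submitted runs):

import numpy as np

def mean_iou(prob_maps, gt_masks, threshold):
    """Mean Intersection-over-Union over a set of images for one
    binarization threshold applied to the probabilistic outputs."""
    ious = []
    for prob, gt in zip(prob_maps, gt_masks):
        pred = prob >= threshold
        gt = gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            continue  # no flooding predicted or labeled in this patch
        ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious))

# Tiny synthetic example: one 4x4 probability map and its ground truth.
prob_maps = [np.array([[0.9, 0.8, 0.1, 0.0],
                       [0.7, 0.6, 0.2, 0.1],
                       [0.3, 0.2, 0.1, 0.0],
                       [0.1, 0.0, 0.0, 0.0]])]
gt_masks = [prob_maps[0] > 0.5]
for t in (0.78, 0.94, 0.5, 0.35, 0.12):
    print(t, mean_iou(prob_maps, gt_masks, t))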
4 CONCLUSION AND FUTURE WORK
This paper provides a detailed description of the methods proposed by UTAOS for the MediaEval 2017 Multimedia Satellite Task. During the experimental evaluation of sub-task 1 (DIRSM), we noticed that visual information seems more useful than meta-data for the retrieval of disaster images. For sub-task 2 (FDSI), we rely on a Generative Adversarial Network, where the best results are obtained by runs 3 and 4. Based on the experiments conducted in this work, we believe that a proper fusion of social media information and satellite data can provide a better story of a natural disaster.


REFERENCES
[1] Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Nicola Conci, Pål Halvorsen, and Francesco De Natale. 2017. JORD: A System for Collecting Information and Monitoring Natural Disasters by Linking Social Media with Satellite Imagery. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing. ACM, 12.
[2] Kashif Ahmad, Michael Riegler, Ans Riaz, Nicola Conci, Duc-Tien Dang-Nguyen, and Pål Halvorsen. 2017. The JORD System: Linking Sky and Social Multimedia Data to Natural Disasters. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. ACM, 461–465.
[3] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual enrichment of remote-sensed events with social media streams. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 1077–1081.
[4] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[6] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11, 1 (2009), 10–18.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[9] Konstantin Pogorelov, Michael Riegler, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Carsten Griwodz, Peter Thelin Schmidt, and Pål Halvorsen. 2017. Efficient disease detection in gastrointestinal videos - global features versus neural networks. Multimedia Tools and Applications (2017), 1–33.
[10] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[11] Jaemin Son, Sang Jun Park, and Kyu-Hwan Jung. 2017. Retinal Vessel Segmentation in Fundoscopic Images with Generative Adversarial Networks. arXiv preprint arXiv:1706.09318 (2017).
[12] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[13] Ronald R. Yager and Dimitar P. Filev. 1999. Induced ordered weighted averaging operators. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29, 2 (1999), 141–150.
[14] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems. 487–495.