Convolutional Neural Networks for Disaster Images Retrieval

Sheharyar Ahmad 1, Kashif Ahmad 2, Nasir Ahmad 1, Nicola Conci 2
1 DCSE, UET Peshawar, Pakistan
2 DISI, University of Trento, Trento, Italy
engr_sheharyar@yahoo.com, kashif.ahmad@unitn.it, n.ahmad@uetpeshawar.edu.pk, nicola.conci@unitn.it

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper presents the method proposed by the MRLDCSE team for the disaster image retrieval task of the MediaEval 2017 Multimedia Satellite challenge. In the proposed work, for visual information, we rely on Convolutional Neural Network (CNN) features extracted with two models pre-trained on the ImageNet and Places datasets. Moreover, a late fusion technique is employed to jointly utilize the visual information and the additional information available in the form of meta-data for the retrieval of disaster images from social media. The mean average precision over the evaluation cutoffs for our three runs, using visual information only, meta-data only, and the combination of meta-data and visual information, is 95.73%, 18.23% and 92.55%, respectively.

1 INTRODUCTION
In recent years, social media has emerged as an important source of information and communication, especially in disaster situations, where news agencies are often unable to report in time due to the unavailability of reporters in the affected area. For instance, the authors in [9, 14] showed that social networks are an effective medium of mass communication in emergency situations. A more recent trend is to infer events from the information shared through social media [2, 15]. The analysis of recent literature reveals that social media platforms, particularly Twitter and Flickr, have been heavily exploited for inferring information about different types of events, such as social and sports events. In this regard, an interesting application is to collect and analyze information about natural disasters shared on social networks. To this aim, a number of solutions have been proposed to effectively utilize social media for collecting information and analyzing the impact of a natural disaster [4, 12].

On the other hand, satellite images have also proved very effective for exploring and monitoring the surface of the earth and its environment [13]. In this regard, Joyce et al. [10] provide a detailed review of techniques developed to efficiently utilize remote-sensed data for the monitoring and assessment of damage due to natural hazards and disasters. A more recent trend is to combine remote-sensed data with social media information, allowing a better overview of a disaster [3, 5]. For instance, in [4], a system called "JORD" is introduced to automatically collect information from different social media platforms and link it with remote-sensed data to provide a more detailed story of a disaster. Similarly, a task to automatically link social media with satellite images was introduced as a challenge at ACM MM 2016 (http://www.acmmm.org/2016/wp-content/uploads/).

This paper provides a detailed description of the method proposed by team MRLDCSE for the first task of the MediaEval 2017 Multimedia Satellite challenge [6]. The basic insight of the task is to jointly utilize satellite imagery and social media as sources of information to provide a detailed story of a disaster. The challenge is composed of two sub-tasks, namely (i) Disaster Image Retrieval from Social Media (DIRSM) and (ii) Flood Detection in Satellite Images (FDSI). A detailed description of the tasks is provided in [6].

2 PROPOSED APPROACH
Figure 1 provides a block diagram of the proposed methodology. As can be seen, the proposed approach is composed of three main phases, namely feature extraction, classification and fusion. In the next sub-sections, we provide a detailed description of each phase.

[Figure 1: Block diagram of the proposed methodology for the DIRSM task.]

2.1 Feature Extraction
DIRSM is composed of three mandatory runs involving (i) visual information only, (ii) meta-data only, and (iii) the combination of meta-data and visual information. For visual information, we extract Convolutional Neural Network (CNN) features from each image via AlexNet [11] pre-trained on the ImageNet [8] and Places [16] datasets. AlexNet is composed of eight layers: five convolutional and three fully connected. The underlying idea of the proposed scheme for visual information is to utilize both object-specific and scene-level information for the representation of disaster-related images: the model pre-trained on ImageNet captures object-specific information, while the one pre-trained on the Places dataset is intended to extract scene-level information. This scheme has also proved very effective for social event detection in single images [1]. We extract a 4096-dimensional feature vector from each model using the Caffe toolbox (http://caffe.berkeleyvision.org/tutorial/). In addition, we also consider user tags, title and GPS information from the available meta-data.
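As an illustration of this feature extraction step, the sketch below shows how the 4096-dimensional vectors could be obtained with pycaffe. The file names, the input preprocessing and the choice of the fc7 layer are assumptions made for the sake of the example; the paper only states that a 4096-dimensional vector is extracted from each pre-trained AlexNet.

```python
# Minimal sketch of the 4096-d CNN feature extraction, assuming pycaffe.
# File names and the fc7 layer are illustrative assumptions.
import caffe

def build_net(prototxt, weights):
    """Load a Caffe model together with a matching input preprocessor."""
    net = caffe.Net(prototxt, weights, caffe.TEST)
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))     # HxWxC -> CxHxW
    transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR (Caffe convention)
    transformer.set_raw_scale('data', 255)           # [0, 1] -> [0, 255]
    return net, transformer

def extract_fc7(net, transformer, image_path):
    """Forward one image and return its 4096-d fc7 activations."""
    image = caffe.io.load_image(image_path)          # HxWx3 float image in [0, 1]
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['fc7'].data[0].copy()

# Hypothetical file names for the two pre-trained AlexNet models.
obj_net, obj_tr = build_net('alexnet_deploy.prototxt', 'alexnet_imagenet.caffemodel')
scn_net, scn_tr = build_net('alexnet_deploy.prototxt', 'alexnet_places.caffemodel')

object_feat = extract_fc7(obj_net, obj_tr, 'flood_image.jpg')  # object-level cue
scene_feat = extract_fc7(scn_net, scn_tr, 'flood_image.jpg')   # scene-level cue
```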
2.2 Classification and Fusion
The next steps of the proposed methodology correspond to the classification and the fusion of the classification results obtained in the previous step. For classification, we rely on Support Vector Machines (SVM), based on their proven performance in object recognition and classification [7]. We train a separate SVM classifier for each of the two CNN models on the complete development dataset. Subsequently, test images are classified with the trained classifiers, which provide results in the form of posterior probabilities. For the meta-data, on the other hand, we rely on the Random Forest classifier of the WEKA machine learning library (http://www.cs.waikato.ac.nz/ml/weka/); the trained classifier likewise provides results in terms of posterior probabilities.

In the subsequent phase, we fuse the scores of the individual classifiers with the late fusion method shown in Eq. (1), where w1, w2 and w3 represent the weights assigned to each type of information, while p1, p2 and p3 represent the posterior probabilities obtained with the classifiers trained on features from AlexNet pre-trained on ImageNet, AlexNet pre-trained on the Places dataset, and the meta-data, respectively:

S = w1 * p1 + w2 * p2 + w3 * p3    (1)

In the current implementation, we use equal weights for each classifier.
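A minimal sketch of this classification and fusion step is given below, directly implementing Eq. (1). Here scikit-learn stands in for the SVM implementation and for the WEKA Random Forest actually used; the feature matrices and labels are assumed to be precomputed, and the positive class is assumed to be "flooded".

```python
# Sketch of late fusion as in Eq. (1): S = w1*p1 + w2*p2 + w3*p3.
# scikit-learn stands in for the SVMs and the WEKA Random Forest.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def train_classifiers(X_obj, X_scene, X_meta, y):
    """One SVM per CNN feature type, a Random Forest for the meta-data."""
    svm_obj = SVC(kernel='linear', probability=True).fit(X_obj, y)
    svm_scene = SVC(kernel='linear', probability=True).fit(X_scene, y)
    rf_meta = RandomForestClassifier(n_estimators=100).fit(X_meta, y)
    return svm_obj, svm_scene, rf_meta

def fused_scores(clfs, X_obj, X_scene, X_meta, w=(1/3, 1/3, 1/3)):
    """Weighted sum of posterior probabilities for the 'flooded' class.

    Assumes labels {0, 1} with 1 = flooded, so column 1 of predict_proba
    is the positive-class posterior."""
    svm_obj, svm_scene, rf_meta = clfs
    p1 = svm_obj.predict_proba(X_obj)[:, 1]      # AlexNet/ImageNet posterior
    p2 = svm_scene.predict_proba(X_scene)[:, 1]  # AlexNet/Places posterior
    p3 = rf_meta.predict_proba(X_meta)[:, 1]     # meta-data posterior
    return w[0] * p1 + w[1] * p2 + w[2] * p3     # Eq. (1), equal weights

# Test images are then ranked by the fused score S for retrieval.
```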
Results could be further improved if optimization techniques, such as Genetic Algorithms (GA), were used to learn these weights. It is important to mention that the fusion method is used in both run 1 (fusion of the two classifiers trained on features extracted with the two CNN models) and run 3 (fusion of all types of information, including meta-data and visual information).

3 RESULTS AND ANALYSIS
Table 1 provides the experimental results of the method we proposed for the MediaEval 2017 Multimedia Satellite task in terms of average precision at cutoff 480.

Table 1: Evaluation of the proposed approach in terms of average precision at cutoff 480

Run | Features                          | Avg. precision (%)
1   | Visual information only           | 86.81
2   | Meta-data only                    | 22.83
3   | Meta-data and visual information  | 83.73

As can be seen, we achieve the best results in run 1, where we use the visual information extracted with the two AlexNet models pre-trained on ImageNet and the Places dataset. On the other hand, in run 2, which is based on meta-data only, we achieve the worst results among all runs, with an average precision of just 22.83%. A difference of around 64 percentage points can be noticed between the performance of meta-data and visual information, which shows a clear advantage of visual information over meta-data in this particular application. Run 2 also exposes the limitations of meta-data: common problems include missing time stamps and geo-location information, and the ambiguous meaning of users' tags also affects the performance of the model.

The third run combines meta-data and visual information. In this experiment, our team achieves an average precision of 83.73%, which is significantly lower than the performance with visual information only. This is mainly caused by treating the classifiers trained on visual information and on meta-data equally, as can also be concluded from the results of run 2, where the classifier trained on meta-data achieves very low precision.

In Table 2, we provide the experimental results of the proposed method in terms of the mean of the average precision over different cutoffs, namely 50, 100, 250 and 480.

Table 2: Evaluation of the proposed approach in terms of the mean over average precision at different cutoffs (50, 100, 250, 480)

Run | Features                          | Mean precision (%)
1   | Visual information only           | 95.73
2   | Meta-data only                    | 18.23
3   | Meta-data and visual information  | 92.55

Again, the best results are reported for run 1, relying on visual information only, and the worst results are achieved with meta-data. It can also be noticed in Table 2 that the mean average precision over the different cutoffs for meta-data is lower than the average precision at the maximum cutoff of 480. Runs 1 and 3, however, show the opposite behaviour, achieving better performance at the lower cutoffs, which demonstrates the strength of visual information in differentiating between flooded and non-flooded images. Moreover, the noticeably higher precision at the lower cutoffs indicates that increasing the cutoff allows more false positives to enter the retrieved list.
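For reference, the sketch below illustrates how average precision at a cutoff, and its mean over the four cutoffs, can be computed from the fused scores. This is one common AP@k variant; the official MediaEval evaluation script may differ in details such as normalization or the treatment of runs with no relevant results.

```python
# Sketch of average precision at a cutoff and its mean over several
# cutoffs, the two quantities reported in Tables 1 and 2. One common
# AP@k variant; the official evaluation script may differ in details.
import numpy as np

def average_precision_at(ranked_labels, cutoff):
    """AP over the top-`cutoff` results; labels are 1 (flooded) / 0 (not)."""
    top = np.asarray(ranked_labels[:cutoff], dtype=float)
    hits = np.cumsum(top)                           # relevant items seen so far
    precisions = hits / np.arange(1, len(top) + 1)  # precision at each rank
    return precisions[top == 1].mean() if hits[-1] else 0.0

def mean_ap_over_cutoffs(scores, labels, cutoffs=(50, 100, 250, 480)):
    """Rank images by fused score, then average AP over the cutoffs."""
    order = np.argsort(scores)[::-1]                # highest score first
    ranked = np.asarray(labels)[order]
    return np.mean([average_precision_at(ranked, c) for c in cutoffs])
```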
4 CONCLUSIONS AND FUTURE WORK
This paper described the method proposed by team MRLDCSE, along with a detailed description and analysis of the experimental results. For visual information, we rely on the combination of object- and scene-level information extracted with two Convolutional Neural Network (CNN) models pre-trained on the ImageNet and Places datasets. On the other hand, we use users' tags, title, description and geo-location information from the available meta-data. Overall, better results are obtained with visual information only, while meta-data produces the worst results among all the runs we submitted. We also noticed that, in this particular application, the inclusion of meta-data degrades the performance of the model when combined with visual information.

In the current implementation we rely on a single deep architecture; in the future, we aim to incorporate multiple deep architectures to better utilize visual information for the retrieval of flooded images. Moreover, since very low performance has been observed with meta-data, we aim to employ more sophisticated methods to better utilize this additional information. Another interesting direction is to use optimization techniques for learning the weight of each classifier so that they can be fused properly (a simple alternative is sketched below).
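As a rough illustration of this last direction, the following sketch replaces the Genetic Algorithm mentioned above with a coarse grid search over the weight simplex on a held-out validation split; the splitting strategy and the choice of evaluation metric are assumptions, not part of the submitted system.

```python
# Illustrative alternative to the equal weights in Eq. (1): a coarse grid
# search over w1 + w2 + w3 = 1 on held-out validation posteriors, as a
# simple stand-in for the Genetic Algorithm suggested as future work.
import numpy as np

def search_fusion_weights(p1, p2, p3, labels, metric, step=0.05):
    """Return the (w1, w2, w3) on the simplex that maximizes `metric`."""
    best_w, best_score = (1/3, 1/3, 1/3), -np.inf
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w1 in grid:
        for w2 in grid[grid <= 1.0 - w1 + 1e-9]:
            w3 = max(0.0, 1.0 - w1 - w2)             # remaining simplex mass
            score = metric(w1 * p1 + w2 * p2 + w3 * p3, labels)
            if score > best_score:
                best_w, best_score = (w1, w2, w3), score
    return best_w

# `metric` could be, e.g., the mean AP over cutoffs from the earlier sketch:
#   metric = lambda scores, y: mean_ap_over_cutoffs(scores, y)
```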
REFERENCES
[1] Kashif Ahmad, Nicola Conci, Giulia Boato, and Francesco G. B. De Natale. 2016. USED: A large-scale social event detection dataset. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 50.
[2] Kashif Ahmad, Francesco De Natale, Giulia Boato, and Andrea Rosani. 2016. A hierarchical approach to event discovery from single images using MIL framework. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 1223–1227.
[3] Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Nicola Conci, Pål Halvorsen, and Francesco De Natale. 2017. JORD: A System for Collecting Information and Monitoring Natural Disasters by Linking Social Media with Satellite Imagery. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing. ACM, 12.
[4] Kashif Ahmad, Michael Riegler, Ans Riaz, Nicola Conci, Duc-Tien Dang-Nguyen, and Pål Halvorsen. 2017. The JORD System: Linking Sky and Social Multimedia Data to Natural Disasters. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. ACM, 461–465.
[5] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual enrichment of remote-sensed events with social media streams. In Proceedings of the 2016 ACM Multimedia Conference. ACM, 1077–1081.
[6] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[7] Hyeran Byun and Seong-Whan Lee. 2002. Applications of support vector machines for pattern recognition: A survey. In Pattern Recognition with Support Vector Machines (2002), 571–591.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.
[9] Amanda Lee Hughes and Leysia Palen. 2009. Twitter adoption and use in mass convergence and emergency events. International Journal of Emergency Management 6, 3-4 (2009), 248–260.
[10] Karen E. Joyce, Stella E. Belliss, Sergey V. Samsonov, Stephen J. McNeill, and Phil J. Glassey. 2009. A review of the status of satellite remote sensing and image processing techniques for mapping natural hazards and disasters. Progress in Physical Geography (2009).
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[12] Chenliang Li, Aixin Sun, and Anwitaman Datta. 2012. Twevent: Segment-based event detection from tweets. In Proceedings of ACM CIKM. ACM, 155–164.
[13] Ashbindu Singh. 1989. Review article: Digital change detection techniques using remotely-sensed data. International Journal of Remote Sensing 10, 6 (1989), 989–1003.
[14] Brian Stelter and Noam Cohen. 2008. Citizen journalists provided glimpses of Mumbai attacks. The New York Times 30 (2008).
[15] Christos Tzelepis, Zhigang Ma, Vasileios Mezaris, Bogdan Ionescu, Ioannis Kompatsiaris, Giulia Boato, Nicu Sebe, and Shuicheng Yan. 2016. Event-based media processing and analysis: A survey of the literature. Image and Vision Computing 53 (2016), 3–19.
[16] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems. 487–495.