Convolutional Neural Networks for Disaster Images Retrieval

Sheharyar Ahmad 1, Kashif Ahmad 2, Nasir Ahmad 1, Nicola Conci 2
1 DCSE, UET Peshawar, Pakistan
2 DISI, University of Trento, Trento, Italy
engr_sheharyar@yahoo.com, kashif.ahmad@unitn.it, n.ahmad@uetpeshawar.edu.pk, nicola.conci@unitn.it

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper presents the method proposed by the MRLDCSE team for the disaster image retrieval task of the MediaEval 2017 Multimedia Satellite challenge. In the proposed work, for visual information, we rely on Convolutional Neural Network (CNN) features extracted with two models pre-trained on the ImageNet and Places datasets. Moreover, a late fusion technique is employed to jointly utilize the visual information and the additional information available in the form of meta-data for the retrieval of disaster images from social media. The mean average precision over the evaluation cutoffs for our three runs, using visual information only, meta-data only, and the combination of meta-data and visual information, is 95.73%, 18.23% and 92.55%, respectively.

1 INTRODUCTION
In recent years, social media has emerged as an important source of information and communication, especially in disaster situations, where news agencies are often unable to report in time due to the unavailability of reporters in the affected area. For instance, the authors in [9, 14] showed that social networks are an effective medium of mass communication in emergency situations. A more recent trend is to infer events from the information shared through social media [2, 15]. The analysis of recent literature reveals that social media platforms, particularly Twitter and Flickr, have been heavily exploited for inferring information about different types of events, such as social and sports events. In this regard, an interesting application is to collect and analyze information about natural disasters shared on social networks. To this aim, a number of solutions have been proposed to effectively utilize social media for collecting information and analyzing the impact of a natural disaster [4, 12].

On the other hand, satellite images have also proved very effective for exploring and monitoring the surface of the earth and its environment [13]. In this regard, Joyce et al. [10] provide a detailed review of techniques developed to efficiently utilize remote-sensed data for the monitoring and assessment of damage due to natural hazards and disasters. A more recent trend is to combine remote-sensed data with social media information, allowing a better overview of a disaster [3, 5]. For instance, in [4], a system called "JORD" is introduced to automatically collect information from different social media platforms and link it with remote-sensed data to provide a more detailed story of a disaster. Similarly, a task to automatically link social media with satellite images was introduced as a challenge at ACM MM 2016 (http://www.acmmm.org/2016/wp-content/uploads/).

This paper provides a detailed description of the method proposed by team MRLDCSE for the first task of the MediaEval 2017 Multimedia Satellite challenge [6]. The basic insight of the task is to jointly utilize satellite imagery and social media as sources of information to provide a detailed story of a disaster. The challenge is composed of two sub-tasks, namely (i) Disaster Image Retrieval from Social Media (DIRSM) and (ii) Flood Detection in Satellite Images (FDSI). A detailed description of the tasks is provided in [6].

2 PROPOSED APPROACH
Figure 1 provides a block diagram of the proposed methodology. As can be seen, the proposed approach is composed of three main phases, namely feature extraction, classification and fusion. In the next sub-sections, we provide a detailed description of each phase.

[Figure 1: Block diagram of the proposed methodology for the DIRSM task.]

2.1 Feature Extraction
DIRSM is composed of three mandatory runs involving (i) visual information only, (ii) meta-data only, and (iii) the combination of meta-data and visual information. For visual information, we extract Convolutional Neural Network (CNN) features from each image via AlexNet [11] pre-trained on the ImageNet [8] and Places [16] datasets. AlexNet is composed of eight layers: five convolutional and three fully connected. The underlying idea of the proposed scheme for visual information is to utilize both object-specific and scene-level information for the representation of disaster-related images: the model pre-trained on ImageNet captures object-specific information, while the one pre-trained on the Places dataset is intended to extract scene-level information. This scheme has also proved very effective for social event detection in single images [1]. We extract a 4096-dimensional feature vector from each model using the Caffe toolbox (http://caffe.berkeleyvision.org/tutorial/). In addition, we also consider user tags, title and GPS information from the available meta-data.
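As an illustration of this feature extraction step, the sketch below shows how the 4096-dimensional vectors could be obtained with pycaffe. The file names, the input preprocessing and the choice of the fc7 layer are assumptions made for the sake of the example; the paper only states that a 4096-dimensional vector is extracted from each pre-trained AlexNet.

```python
# Minimal sketch of the 4096-d CNN feature extraction, assuming pycaffe.
# File names and the fc7 layer are illustrative assumptions.
import caffe

def build_net(prototxt, weights):
    """Load a Caffe model together with a matching input preprocessor."""
    net = caffe.Net(prototxt, weights, caffe.TEST)
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))     # HxWxC -> CxHxW
    transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR (Caffe convention)
    transformer.set_raw_scale('data', 255)           # [0, 1] -> [0, 255]
    return net, transformer

def extract_fc7(net, transformer, image_path):
    """Forward one image and return its 4096-d fc7 activations."""
    image = caffe.io.load_image(image_path)          # HxWx3 float image in [0, 1]
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['fc7'].data[0].copy()

# Hypothetical file names for the two pre-trained AlexNet models.
obj_net, obj_tr = build_net('alexnet_deploy.prototxt', 'alexnet_imagenet.caffemodel')
scn_net, scn_tr = build_net('alexnet_deploy.prototxt', 'alexnet_places.caffemodel')

object_feat = extract_fc7(obj_net, obj_tr, 'flood_image.jpg')  # object-level cue
scene_feat = extract_fc7(scn_net, scn_tr, 'flood_image.jpg')   # scene-level cue
```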
2.2 Classification and Fusion
The next steps of the proposed methodology correspond to the classification and the fusion of the classification results obtained in the previous step. For classification, we rely on Support Vector Machines (SVM), based on their proven performance in object recognition and classification [7]. We train a separate SVM classifier for each of the two CNN models on the complete development dataset. Subsequently, test images are classified with the trained classifiers, which provide results in the form of posterior probabilities. For the meta-data, on the other hand, we rely on the Random Forest classifier of the WEKA machine learning library (http://www.cs.waikato.ac.nz/ml/weka/); the trained classifier likewise provides results in terms of posterior probabilities.

In the subsequent phase, we fuse the scores of the individual classifiers with the late fusion method shown in Eq. (1), where w1, w2 and w3 represent the weights assigned to each type of information, while p1, p2 and p3 represent the posterior probabilities obtained with the classifiers trained on features from AlexNet pre-trained on ImageNet, AlexNet pre-trained on the Places dataset, and the meta-data, respectively:

S = w1 * p1 + w2 * p2 + w3 * p3    (1)

In the current implementation, we use equal weights for each classifier.
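A minimal sketch of this classification and fusion step is given below, directly implementing Eq. (1). Here scikit-learn stands in for the SVM implementation and for the WEKA Random Forest actually used; the feature matrices and labels are assumed to be precomputed, and the positive class is assumed to be "flooded".

```python
# Sketch of late fusion as in Eq. (1): S = w1*p1 + w2*p2 + w3*p3.
# scikit-learn stands in for the SVMs and the WEKA Random Forest.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def train_classifiers(X_obj, X_scene, X_meta, y):
    """One SVM per CNN feature type, a Random Forest for the meta-data."""
    svm_obj = SVC(kernel='linear', probability=True).fit(X_obj, y)
    svm_scene = SVC(kernel='linear', probability=True).fit(X_scene, y)
    rf_meta = RandomForestClassifier(n_estimators=100).fit(X_meta, y)
    return svm_obj, svm_scene, rf_meta

def fused_scores(clfs, X_obj, X_scene, X_meta, w=(1/3, 1/3, 1/3)):
    """Weighted sum of posterior probabilities for the 'flooded' class.

    Assumes labels {0, 1} with 1 = flooded, so column 1 of predict_proba
    is the positive-class posterior."""
    svm_obj, svm_scene, rf_meta = clfs
    p1 = svm_obj.predict_proba(X_obj)[:, 1]      # AlexNet/ImageNet posterior
    p2 = svm_scene.predict_proba(X_scene)[:, 1]  # AlexNet/Places posterior
    p3 = rf_meta.predict_proba(X_meta)[:, 1]     # meta-data posterior
    return w[0] * p1 + w[1] * p2 + w[2] * p3     # Eq. (1), equal weights

# Test images are then ranked by the fused score S for retrieval.
```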
Results could be further improved if optimization techniques, such as Genetic Algorithms (GA), were used to learn these weights. It is important to mention that the fusion method is used in both run 1 (fusion of the two classifiers trained on features extracted with the two CNN models) and run 3 (fusion of all types of information, including meta-data and visual information).

3 RESULTS AND ANALYSIS
Table 1 provides the experimental results of the method we proposed for the MediaEval 2017 Multimedia Satellite task in terms of average precision at cutoff 480.

Table 1: Evaluation of the proposed approach in terms of average precision at cutoff 480

Run | Features                          | Avg. precision (%)
1   | Visual information only           | 86.81
2   | Meta-data only                    | 22.83
3   | Meta-data and visual information  | 83.73

As can be seen, we achieve the best results in run 1, where we use the visual information extracted with the two AlexNet models pre-trained on ImageNet and the Places dataset. On the other hand, in run 2, which is based on meta-data only, we achieve the worst results among all runs, with an average precision of just 22.83%. A difference of around 64 percentage points can be noticed between the performance of meta-data and visual information, which shows a clear advantage of visual information over meta-data in this particular application. Run 2 also exposes the limitations of meta-data: common problems include missing time stamps and geo-location information, and the ambiguous meaning of users' tags also affects the performance of the model.

The third run combines meta-data and visual information. In this experiment, our team achieves an average precision of 83.73%, which is significantly lower than the performance with visual information only. This is mainly caused by treating the classifiers trained on visual information and on meta-data equally, as can also be concluded from the results of run 2, where the classifier trained on meta-data achieves very low precision.

In Table 2, we provide the experimental results of the proposed method in terms of the mean of the average precision over different cutoffs, namely 50, 100, 250 and 480.

Table 2: Evaluation of the proposed approach in terms of the mean over average precision at different cutoffs (50, 100, 250, 480)

Run | Features                          | Mean precision (%)
1   | Visual information only           | 95.73
2   | Meta-data only                    | 18.23
3   | Meta-data and visual information  | 92.55

Again, the best results are reported for run 1, relying on visual information only, and the worst results are achieved with meta-data. It can also be noticed in Table 2 that the mean average precision over the different cutoffs for meta-data is lower than the average precision at the maximum cutoff of 480. Runs 1 and 3, however, show the opposite behaviour, achieving better performance at the lower cutoffs, which demonstrates the strength of visual information in differentiating between flooded and non-flooded images. Moreover, the noticeably higher precision at the lower cutoffs indicates that increasing the cutoff allows more false positives to enter the retrieved list.
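For reference, the sketch below illustrates how average precision at a cutoff, and its mean over the four cutoffs, can be computed from the fused scores. This is one common AP@k variant; the official MediaEval evaluation script may differ in details such as normalization or the treatment of runs with no relevant results.

```python
# Sketch of average precision at a cutoff and its mean over several
# cutoffs, the two quantities reported in Tables 1 and 2. One common
# AP@k variant; the official evaluation script may differ in details.
import numpy as np

def average_precision_at(ranked_labels, cutoff):
    """AP over the top-`cutoff` results; labels are 1 (flooded) / 0 (not)."""
    top = np.asarray(ranked_labels[:cutoff], dtype=float)
    hits = np.cumsum(top)                           # relevant items seen so far
    precisions = hits / np.arange(1, len(top) + 1)  # precision at each rank
    return precisions[top == 1].mean() if hits[-1] else 0.0

def mean_ap_over_cutoffs(scores, labels, cutoffs=(50, 100, 250, 480)):
    """Rank images by fused score, then average AP over the cutoffs."""
    order = np.argsort(scores)[::-1]                # highest score first
    ranked = np.asarray(labels)[order]
    return np.mean([average_precision_at(ranked, c) for c in cutoffs])
```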
4 CONCLUSIONS AND FUTURE WORK
This paper described the method proposed by team MRLDCSE, along with a detailed description and analysis of the experimental results. For visual information, we rely on the combination of object- and scene-level information extracted with two Convolutional Neural Network (CNN) models pre-trained on the ImageNet and Places datasets. On the other hand, we use users' tags, title, description and geo-location information from the available meta-data. Overall, better results are obtained with visual information only, while meta-data produces the worst results among all the runs we submitted. We also noticed that, in this particular application, the inclusion of meta-data degrades the performance of the model when combined with visual information.

In the current implementation we rely on a single deep architecture; in the future, we aim to incorporate multiple deep architectures to better utilize visual information for the retrieval of flooded images. Moreover, since very low performance has been observed with meta-data, we aim to employ more sophisticated methods to better utilize this additional information. Another interesting direction is to use optimization techniques for learning the weight of each classifier so that they can be fused properly (a simple alternative is sketched below).
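As a rough illustration of this last direction, the following sketch replaces the Genetic Algorithm mentioned above with a coarse grid search over the weight simplex on a held-out validation split; the splitting strategy and the choice of evaluation metric are assumptions, not part of the submitted system.

```python
# Illustrative alternative to the equal weights in Eq. (1): a coarse grid
# search over w1 + w2 + w3 = 1 on held-out validation posteriors, as a
# simple stand-in for the Genetic Algorithm suggested as future work.
import numpy as np

def search_fusion_weights(p1, p2, p3, labels, metric, step=0.05):
    """Return the (w1, w2, w3) on the simplex that maximizes `metric`."""
    best_w, best_score = (1/3, 1/3, 1/3), -np.inf
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w1 in grid:
        for w2 in grid[grid <= 1.0 - w1 + 1e-9]:
            w3 = max(0.0, 1.0 - w1 - w2)             # remaining simplex mass
            score = metric(w1 * p1 + w2 * p2 + w3 * p3, labels)
            if score > best_score:
                best_w, best_score = (w1, w2, w3), score
    return best_w

# `metric` could be, e.g., the mean AP over cutoffs from the earlier sketch:
#   metric = lambda scores, y: mean_ap_over_cutoffs(scores, y)
```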
REFERENCES
[1] Kashif Ahmad, Nicola Conci, Giulia Boato, and Francesco G. B. De Natale. 2016. USED: A large-scale social event detection dataset. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 50.
[2] Kashif Ahmad, Francesco De Natale, Giulia Boato, and Andrea Rosani. 2016. A hierarchical approach to event discovery from single images using MIL framework. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 1223–1227.
[3] Kashif Ahmad, Michael Riegler, Konstantin Pogorelov, Nicola Conci, Pål Halvorsen, and Francesco De Natale. 2017. JORD: A System for Collecting Information and Monitoring Natural Disasters by Linking Social Media with Satellite Imagery. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing. ACM, 12.
[4] Kashif Ahmad, Michael Riegler, Ans Riaz, Nicola Conci, Duc-Tien Dang-Nguyen, and Pål Halvorsen. 2017. The JORD System: Linking Sky and Social Multimedia Data to Natural Disasters. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. ACM, 461–465.
[5] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual enrichment of remote-sensed events with social media streams. In Proceedings of the 2016 ACM Multimedia Conference. ACM, 1077–1081.
[6] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[7] Hyeran Byun and Seong-Whan Lee. 2002. Applications of support vector machines for pattern recognition: A survey. In Pattern Recognition with Support Vector Machines (2002), 571–591.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.
[9] Amanda Lee Hughes and Leysia Palen. 2009. Twitter adoption and use in mass convergence and emergency events. International Journal of Emergency Management 6, 3-4 (2009), 248–260.
[10] Karen E. Joyce, Stella E. Belliss, Sergey V. Samsonov, Stephen J. McNeill, and Phil J. Glassey. 2009. A review of the status of satellite remote sensing and image processing techniques for mapping natural hazards and disasters. Progress in Physical Geography (2009).
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[12] Chenliang Li, Aixin Sun, and Anwitaman Datta. 2012. Twevent: Segment-based event detection from tweets. In Proceedings of ACM CIKM. ACM, 155–164.
[13] Ashbindu Singh. 1989. Review article: Digital change detection techniques using remotely-sensed data. International Journal of Remote Sensing 10, 6 (1989), 989–1003.
[14] Brian Stelter and Noam Cohen. 2008. Citizen journalists provided glimpses of Mumbai attacks. The New York Times 30 (2008).
[15] Christos Tzelepis, Zhigang Ma, Vasileios Mezaris, Bogdan Ionescu, Ioannis Kompatsiaris, Giulia Boato, Nicu Sebe, and Shuicheng Yan. 2016. Event-based media processing and analysis: A survey of the literature. Image and Vision Computing 53 (2016), 3–19.
[16] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems. 487–495.