A Domain-based Late-Fusion for Disaster Image Retrieval from Social Media
ELEDIA@UTB and The 2017 Multimedia Satellite Task: Emergency Response for Flooding Events

Minh-Son Dao^1, Quang-Nhat-Minh Pham^2, Duc-Tien Dang-Nguyen^3
^1 ELEDIA@UTB Lab., Universiti Teknologi Brunei, Brunei
^2 FPT Technology Research Institute, FPT University, Hanoi, Vietnam
^3 Dublin City University, Dublin, Ireland
minh_son@utb.edu.bn, minhpqn2@fe.edu.vn, duc-tien.dang-nguyen@dcu.ie

ABSTRACT
We introduce a domain-specific, late-fusion algorithm to cope with the challenges raised in the MediaEval 2017 Multimedia Satellite Task. Several known techniques are integrated based on domain-specific criteria: late fusion, tuning, ensemble learning, object detection using deep learning, and temporal-spatial-based event confirmation. Experimental results show that the proposed algorithm can overcome the main challenges of properly discriminating water levels in different areas and of handling different types of flooding events.

1 INTRODUCTION
This paper presents a method built specifically for subtask 1 of the MediaEval 2017 Multimedia Satellite Task [2]. We propose an ensemble learning and tuning method for Disaster Image Retrieval from Social Media that copes with the restriction to the visual features and metadata provided by the task, and increases classification accuracy by using data from external resources. The methodology is described in Section 2, results are reported and discussed in Section 3, and conclusions and future work are stated in the last section.

2 METHODOLOGY
The following subsections discuss each run required by the organizers.

2.1 Visual-based Image Retrieval
We formalize the problem as an ensemble learning and tuning task, in which we use the visual features provided by the organizers for each image. These visual features are used with supervised learners to create classifiers. Our visual-based method includes three components: late fusion, tuning, and ensemble learning, designed as follows (a code sketch of the full pipeline, including the tuning-set construction of Section 2.1.1, is given at the end of this subsection):

• Stage 1: a set of classifiers, one per feature type, is created using supervised learners (SLs) whose outputs are reported in regression form (i.e. the output variable takes continuous values). Late fusion is applied to these outputs to form the multimodal feature combination (MFC): MFC = Σ_{i=1..N} (w_i * SL_i), where Σ_i w_i = 1.
• Stage 2: bagging is applied to these features and learners to increase the accuracy of the system. A multiple data set (MDS) is created by randomly dividing the training data into m non-intersecting folds DS_1, ..., DS_m. At the end of this stage, a combined classifier is created, namely the combined learner (CL).
• Stage 3: after a testing phase on these data sets, a new training data set for tuning is created and learned with a specific supervised learner and a certain subset of feature types. The output of this stage is called the tuning learner (TL). Subsection 2.1.1 describes how this tuning data set is established. The idea behind this stage is to apply boosting and bagging techniques to create a tuning classifier for samples that fall on the wrong side of the hyperplane of a previous classifier.
• Stage 4: the ensemble learner (EL) is created as EL = w1 * TL + w2 * CL, where w1 + w2 = 1.

2.1.1 Creating a tuning data set.
(1) Let PR_k = {(pred_i, gt_i)}_{i=1..N} be the output set obtained by applying the CL from Stage 2 to DS_k, where pred_i and gt_i denote the predicted value and the ground-truth label of the i-th image, respectively. PR_k is sorted in descending order of predicted value, so that the image with the largest predicted value is on top. The predicted label of the i-th image is then label_i = f(pred_i, threshold).
(2) For each i-th image whose ground-truth label gt_i and predicted label label_i do not match, collect its k nearest-neighbour images whose gt_j equals label_j (i.e. collect k true-positive and true-negative neighbours), taking k/2 samples from position (i-1) down to 1 and k/2 samples from position (i+1) up to N, respectively.
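To make the four stages and the tuning-set construction concrete, here is a minimal Python sketch of the procedure described above. It is only an illustration under stated assumptions: the 0.5 threshold used for label_i = f(pred_i, threshold), the scikit-learn-style predict() interface, and all function names (late_fusion, make_folds, build_tuning_set, ensemble_predict) are ours, not taken from the paper.

```python
import numpy as np

def late_fusion(learner_outputs, weights):
    """Stage 1: weighted late fusion of per-feature regressor outputs.
    MFC = sum_i w_i * SL_i, with sum_i w_i = 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * out for w, out in zip(weights, learner_outputs))

def make_folds(n_samples, m, seed=0):
    """Stage 2: split the training indices into m non-intersecting folds."""
    rng = np.random.RandomState(seed)
    return np.array_split(rng.permutation(n_samples), m)

def build_tuning_set(preds, gts, k):
    """Stage 3 (Sec. 2.1.1): for every misclassified image in the list sorted by
    predicted value, collect k correctly classified neighbours around it."""
    order = np.argsort(-preds)                  # descending by predicted value
    preds, gts = preds[order], gts[order]
    labels = (preds >= 0.5).astype(int)         # assumed threshold of 0.5
    wrong = np.where(labels != gts)[0]          # samples on the wrong side
    correct = np.where(labels == gts)[0]        # samples predicted correctly
    tuning_idx = []
    for i in wrong:
        above = correct[correct < i][-(k // 2):]   # k/2 neighbours towards position 1
        below = correct[correct > i][:k // 2]      # k/2 neighbours towards position N
        tuning_idx.extend([i, *above, *below])
    # map positions in the sorted list back to indices in the original set
    return order[np.unique(tuning_idx).astype(int)]

def ensemble_predict(tl, cl, X, w1=0.4, w2=0.6):
    """Stage 4: EL = w1 * TL + w2 * CL."""
    return w1 * tl.predict(X) + w2 * cl.predict(X)
```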
2.2 Meta-data-based Image Retrieval
We formalize image retrieval as a text categorization task, in which we extract textual features from the meta-data associated with each image. The textual features are used as basic features for training a feed-forward neural network (FFNN) with a single hidden layer. Our metadata-based classification method includes three components: pre-processing, feature extraction, and neural network training, described as follows (a code sketch follows the list):

(1) Pre-processing: we clean the text data by removing hyperlinks, image paths, image names, and all URLs, using regular expressions. After that, we perform word tokenization and remove all punctuation. For a user tag containing multiple words, we join the words in the tag into a phrase and treat that phrase as a single word. We use the nltk toolkit [1] for word tokenization.
(2) Textual feature extraction: for a text categorization task, bag-of-words features are basic and straightforward. We use bag-of-words features only and represent an image as an n-dimensional sparse vector, where n limits the number of words in the vocabulary extracted from the training data. A feature is activated if the meta-data of the image contains the corresponding word in the vocabulary. In our experiments we use n = 10,000, and we extract features from three meta-data attributes: the title, description, and tags of an image.
(3) Neural network architecture: we use a feed-forward neural network with one hidden layer of 128 units. The network is trained with batch size 20 and drop-out with coefficient 0.5. The output layer is a softmax layer, so that the network outputs the probability that an image is related to flooding or not. We adopted the Keras framework [4] to build the neural network.
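A minimal Python/Keras sketch of this metadata classifier is given below, using the stated hyper-parameters (vocabulary size 10,000, one 128-unit hidden layer, dropout 0.5, softmax output, batch size 20). The hidden-layer activation, optimizer, loss, number of epochs, and helper names (preprocess, to_bow, build_model) are our assumptions; the paper does not specify them.

```python
import re
import numpy as np
from nltk.tokenize import word_tokenize   # requires: nltk.download('punkt')
from keras.models import Sequential
from keras.layers import Dense, Dropout

VOCAB_SIZE = 10000  # n = 10,000 as in Sec. 2.2

def preprocess(text):
    """Remove URLs/paths, tokenize, and drop punctuation (Sec. 2.2, step 1)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalnum()]

def to_bow(tokens, vocab):
    """Binary bag-of-words: a feature is 1 if the word occurs in the metadata."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] = 1.0
    return vec

def build_model(input_dim=VOCAB_SIZE):
    """FFNN with one 128-unit hidden layer, dropout 0.5, softmax output."""
    model = Sequential()
    model.add(Dense(128, activation="relu", input_dim=input_dim))  # activation assumed
    model.add(Dropout(0.5))
    model.add(Dense(2, activation="softmax"))   # P(flood), P(not flood)
    model.compile(optimizer="adam",             # optimizer/loss are assumptions
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training (X: bag-of-words matrix, y: one-hot labels), batch size 20 as reported:
# model = build_model(); model.fit(X, y, batch_size=20, epochs=10)
```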
2.3 Visual-metadata-based Image Retrieval
The method used for visual-based image retrieval, described in subsection 2.1, is also utilized for this task. All features and supervised learners of the visual-based and metadata-based methods are reused.

2.4 Visual-metadata-external-resource-based Image Retrieval
We observe that the texture and colour of water, and the spatial relation between a water area and its surrounding area, can lead to misclassification when the proposed method is used with only the limited visual features provided by the organizers. Moreover, the metadata content and its associated image do not always agree on the meaning of an event. Hence, we propose a domain-specific algorithm to overcome these obstacles.
We utilize Faster R-CNN [6] with a pre-computed object detection model running on the TensorFlow platform [3] to generate a bag of words containing objects that are semantically related to flooded areas, especially in industrial, residential, commercial, and agricultural areas. Moreover, we use the location and time information described in the metadata to confirm whether a flood really happened by checking freely accessible weather databases on the Internet (e.g. for Europe https://www.eswd.eu/, for America http://www.weather.gov/ and https://water.usgs.gov/floods/reports/, and for Australia http://www.bom.gov.au/climate/data/). This can be done by reusing the method introduced in [5].
The former component, namely the syn-content model (SC), gives higher weights to image-metadata pairs that share similar content. The latter component, namely the spatio-temporal model (ST), strengthens the accuracy of the metadata-based model. These two components are used to re-label the data and create a tuning data set in Stage 3.

3 EXPERIMENTAL RESULTS
We use the data set and evaluation metrics provided by the organizers [2]. The parameters of our approach are set as follows (a sketch of the RUN 1 learner configuration is given after this list):

• RUN 1: (1) Stage 1: we use the feature set {CED, EH, JC} and SVM (SVM type: eps-regression, kernel: radial, cost: 1, gamma: 0.006944444, epsilon: 0.1) to create the SLs; the weights w_i are all set to 1/3. (2) Stage 2: we divide the development set into m = 10 non-intersecting data sets. (3) Stage 3: k is set to 10, and a random forest (RF) is used to create the TL (type: regression, number of trees: 500, number of variables tried at each split: 48) with the JC feature as input. (4) Stage 4: we set w1 = 0.4 and w2 = 0.6.
• RUN 2: we perform 5-fold cross-validation on the development set and report the average scores over the 5 folds. For testing, we train the model on the whole development set and use the trained model for prediction on the test images.
• RUN 3 and RUN 4: we use the pre-computed object detection model without changing any parameter [6][3], and we reuse the methods of RUN 1 and RUN 2 with the same setup.
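For illustration, the reported RUN 1 learner settings roughly map onto the following scikit-learn configuration. The reported parameter names (eps-regression, cost, number of variables tried at each split) suggest the R packages e1071 and randomForest, so this Python equivalent is our own approximation rather than the authors' actual setup.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Stage 1 supervised learners (one per visual feature: CED, EH, JC):
# eps-regression SVM with radial (RBF) kernel, cost 1, the reported gamma, epsilon 0.1.
svr = SVR(kernel="rbf", C=1.0, gamma=0.006944444, epsilon=0.1)

# Stage 3 tuning learner: regression random forest with 500 trees and
# 48 variables tried at each split (mtry in R terms), trained on the JC feature.
rf = RandomForestRegressor(n_estimators=500, max_features=48)

# Stage 4 ensemble weights as reported for RUN 1.
W1, W2 = 0.4, 0.6
```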
Table 1 shows the evaluation results on the development and test data.

Table 1: Evaluation Results on Development and Test Data

         Dev. Set           Test Set
Run      AP@480   MAP       AP@480   MAP
Run 1    82.17    88.84     77.62    87.87
Run 2    83.4     87.8      57.07    57.12
Run 3    85.78    92.86     85.41    90.39
Run 4    92.53    98.38     90.69    97.36

For runs 1 and 2, there is a large gap between the results on the development set and on the test set. Possible explanations are that (1) the data distribution of the development set differs from that of the test set, and/or (2) there is too much mismatch between the text and image contents. In the current work we did not handle inflections such as "flood" and "flooded", so the feature space is quite sparse. For run 3, fusing the visual-based and metadata-based outputs improves the accuracy of flood image retrieval by around 3%, which shows that the two approaches can compensate for each other's weaknesses. For run 4, there is a significant improvement in accuracy when the SC model and ST information are used, whereas the visual-metadata-based method (run 3) has reached its limit.

4 CONCLUSIONS AND FUTURE WORKS
We introduced a domain-based late-fusion method to retrieve flood images from social media. Although the results at this stage are acceptable, many things can still be improved, especially handling the mismatch between image content and metadata content, and finding suitable features and/or learners to distinguish water/flooded areas by their colour, texture, and spatial relation with surrounding areas.

REFERENCES
[1] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proc. of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] Xinlei Chen and Abhinav Gupta. 2017. An Implementation of Faster RCNN with Study for Region Sampling. arXiv preprint arXiv:1702.02138 (2017).
[4] François Chollet and others. 2015. Keras. https://github.com/fchollet/keras. (2015).
[5] Minh-Son Dao, Giulia Boato, Francesco G.B. De Natale, and Truc-Vien Nguyen. 2013. Jointly Exploiting Visual and Non-visual Information for Event-related Social Media Retrieval. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval (ICMR '13). ACM, New York, NY, USA, 159-166. https://doi.org/10.1145/2461466.2461494
[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NIPS).