Retrieving Social Flooding Images Based on Multimodal Information

Zhengyu Zhao, Martha Larson
Radboud University, Netherlands
z.zhao@cs.ru.nl, m.larson@cs.ru.nl

ABSTRACT
This paper presents the participation of the RU-DS team at the MediaEval 2017 Multimedia Satellite Task. We design a system for retrieving social images that show direct evidence of flooding events, using a multimodal approach based on visual features from images and the corresponding metadata. Specifically, we implement preprocessing operations, including image cropping and test-set pre-filtering based on image color complexity or textual metadata, as well as re-ranking for fusion. Tests on the YFCC100M dataset show that the fusion-based approach outperforms the methods based on only visual features or only metadata.

1 INTRODUCTION
Recent advances in satellite imagery and the popularity of social media are opening up a new interdisciplinary area for earth monitoring, especially of natural disasters. The objective of the MediaEval 2017 Multimedia Satellite Task is to enrich satellite information with multimodal social media for a more comprehensive view of flooding events [2]. We participate in the Disaster Image Retrieval from Social Media subtask, which requires us to retrieve social images that show direct evidence of flooding events. Previous work [1, 4, 5] addresses a similar challenge by leveraging visual and textual content from social media to enrich remotely sensed events in satellite imagery. In this paper, we investigate the exploitation of visual features and textual metadata for image representation, and propose a fusion method based on test-set pre-filtering and list re-ranking.

2 PROPOSED APPROACH
Table 1 describes the approaches used for our three runs, which involve three parts: pre-processing, feature extraction, and fusion strategy. For the first run (Visual), we apply image cropping and test-set pre-filtering based on color complexity, and use an SVM classifier on visual features to rank the images in descending order by the output decision values. For the second run (Text), we rank the images by searching for flood-related keywords in the metadata, without any preprocessing. Finally, for the third run (Fusion), we develop a three-step approach: first the Run 2 system for pre-filtering, then the Run 1 system for ranking, and finally the Run 2 system again for re-ranking.

Table 1: Run Description

Run           | Pre-processing                   | Features    | Fusion
Run 1: Visual | Image cropping and pre-filtering | Visual      | -
Run 2: Text   | -                                | Text        | -
Run 3: Fusion | Pre-filtering                    | Visual+text | Re-ranking

2.1 Visual Features
We investigated the nine conventional visual descriptors provided by the task organizers on the dev-set using an SVM classifier, and found that the CEDD feature, which incorporates color and texture information, achieved the best performance.

The approach of our Visual run is based on the insight that the parts of the image showing the body of flood water are more important for flood retrieval than the other parts. Because the body of flood water is usually located in the lower part of a flooding image [3], we try to extract this part from each test image. Experiments on the dev-set show that eliminating the top 60% of the image, as well as 10% on each side, brings an accuracy improvement of 4.5%. Moreover, using cropped images reduces the computation time of feature extraction and eliminates interference from the sky region.
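As an illustration of this cropping step, the sketch below applies the 60%/10% crop reported above. It is a minimal sketch assuming Pillow is available; the function name and interface are our own, not part of the task framework.

```python
# Minimal sketch of the cropping step, assuming Pillow; the crop
# fractions (top 60%, 10% per side) are taken from the text above,
# while the function name and interface are our own illustration.
from PIL import Image

def crop_flood_region(path, top_frac=0.6, side_frac=0.1):
    """Keep the lower-central part of the image, where flood water
    tends to appear, by removing the top and side margins."""
    img = Image.open(path)
    w, h = img.size
    left = int(w * side_frac)          # drop 10% on the left
    right = w - int(w * side_frac)     # drop 10% on the right
    top = int(h * top_frac)            # drop the top 60%
    return img.crop((left, top, right, h))
```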
Another insight that we use is related to the observation that flood regions are visually homogeneous. We address this insight by computing the color complexity of the cropped images. Color complexity here is defined by the equation

    Color complexity = N_h / S,

where N_h indicates the number of hues of an HSV image and S is the area of the image, i.e., the number of pixels. As shown in Fig. 1, the cropped non-flooding images tend to have higher color complexity than the flooding ones. We set an empirical threshold T = 0.05 so as to remove a good number of non-flooding images, but few flooding ones. These removed images are ranked in ascending order by color complexity and placed at the end of the final list.

[Figure 1: Color Complexity of 3960 Cropped Dev-set Images]
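The sketch below is one possible reading of this filter, assuming NumPy and Pillow. The paper does not specify how hues are counted; counting distinct values of the 8-bit hue channel is our assumption, as are the function names.

```python
# Minimal sketch of the color-complexity pre-filter, assuming NumPy and
# Pillow. Counting distinct 8-bit hue values as N_h is our assumption;
# the text does not specify the hue quantization.
import numpy as np
from PIL import Image

def color_complexity(img: Image.Image) -> float:
    """N_h / S: number of distinct hues over the image area in pixels."""
    hues = np.asarray(img.convert("HSV"))[..., 0]
    return len(np.unique(hues)) / hues.size

def prefilter(images, threshold=0.05):
    """Keep likely-flooding (homogeneous) images; removed images are
    sorted ascending by complexity and appended to the end of the list."""
    scored = [(color_complexity(im), im) for im in images]
    kept = [im for c, im in scored if c <= threshold]
    removed = [im for c, im in sorted(scored, key=lambda t: t[0])
               if c > threshold]
    return kept, removed
```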
2.2 Textual Metadata Features
We search for flood-related keywords, namely "flood(s)", "flooding" and "flooded", in the three main fields "User_tags", "Title" and "Description" of the accompanying metadata to rank the images. Table 2 shows the relationship between keyword occurrence and relevance, reflected by the precision scores on the dev-set images, where 1 indicates that flood-related keywords are present in a specific field and 0 means they are not. We use "x x x" in the first three columns to indicate eight general conditions and two special ones (one condition per row, excluding the header row). The latter two columns show the corresponding retrieval precision scores for each condition.

Table 2: Keyword Distribution on the Dev-set Images

User_tags | Title   | Description | Positives / all | Precision
1         | 1       | 0           | 421 / 478       | 88.07%
1         | 1       | 1           | 166 / 208       | 79.81%
1         | 0       | 0           | 780 / 1013      | 77.00%
1         | 0       | 1           | 174 / 256       | 67.97%
0         | 1       | 1           | 50 / 76         | 65.79%
0         | 1       | 0           | 152 / 266       | 57.14%
0         | 0       | 1           | 177 / 751       | 23.57%
0         | 0       | 0           | 0 / 2233        | 0%
-         | flooded | -           | 181 / 208       | 87.02%
1+water   | -       | -           | 398 / 467       | 85.22%

Overall, we find that keywords in the "User_tags" field are the most helpful, keywords in the "Title" field are less reliable, and the "Description" field tends to give misleading information. Furthermore, because the ground truth defines images showing "unexpected high water levels in industrial, residential, commercial and agricultural areas" as positives [2], the conditions "1+water - -" and "- flooded -" (where the presence of a water body is implied) are more likely to be positive.

To create the final result list for Run 2, we concatenate the sublists retrieved by each of the above eight general conditions, in descending order by precision score. Within each sublist, the images that also meet one of the latter two conditions, "1+water - -" or "- flooded -", are placed at the top.

2.3 Feature Fusion
In this section, we describe our fusion strategy based on pre-filtering and re-ranking using both visual and metadata information. First, we rank all the images that meet the conditions "0 0 0" and "0 0 1" using our metadata-based system to generate sublist 2, which forms the end part of the final list because, as shown in Table 2, these images are very unlikely to be positive. Then, the rest of the images are fed into our visual-based system and ranked in descending order by decision value to generate sublist 1. Finally, we re-rank the images in sublist 1 whose decision values are non-positive, using our metadata-based system again.
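To make the three steps concrete, the sketch below strings them together. It is a minimal sketch, not our exact implementation: the record layout, the helper names, and the use of the Table 2 precision scores as the metadata ranking key are assumptions, and the special "1+water - -" and "- flooded -" boosts are omitted for brevity.

```python
# Minimal sketch of the Run 3 fusion: metadata pre-filtering, visual
# ranking by SVM decision value, then metadata re-ranking of the
# non-positive tail. Record layout and helper names are illustrative.
import re

FLOOD_PATTERN = re.compile(r"\bflood(?:s|ing|ed)?\b", re.IGNORECASE)

# Dev-set precision per keyword condition, from Table 2, used here as
# the ranking key of the metadata-based system (our assumption).
PRECISION = {(1, 1, 0): .8807, (1, 1, 1): .7981, (1, 0, 0): .7700,
             (1, 0, 1): .6797, (0, 1, 1): .6579, (0, 1, 0): .5714,
             (0, 0, 1): .2357, (0, 0, 0): .0}

def condition(rec):
    """The 'x x x' condition of one image record, e.g. (1, 0, 0)."""
    return tuple(int(bool(FLOOD_PATTERN.search(rec.get(f) or "")))
                 for f in ("user_tags", "title", "description"))

def metadata_rank(recs):
    """Run 2 system: order records by the precision of their condition."""
    return sorted(recs, key=lambda r: PRECISION[condition(r)], reverse=True)

def fuse(records):
    """Run 3 system: pre-filter, visually rank, then re-rank the tail."""
    unlikely = {(0, 0, 0), (0, 0, 1)}
    sublist2 = metadata_rank([r for r in records if condition(r) in unlikely])
    rest = [r for r in records if condition(r) not in unlikely]
    sublist1 = sorted(rest, key=lambda r: r["decision_value"], reverse=True)
    head = [r for r in sublist1 if r["decision_value"] > 0]
    tail = metadata_rank([r for r in sublist1 if r["decision_value"] <= 0])
    return head + tail + sublist2
```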
3 RESULTS AND DISCUSSION
Table 3 presents the official results for our three submitted runs on the test-set. The third run achieves the best performance on both evaluation metrics, and the retrieval process clearly benefits from fusing visual and metadata information. Specifically, test-set pre-filtering based on flood-related keywords, as used in our metadata-based approach, leads to considerably better performance than the pre-filtering based on color complexity in our visual-based approach. Further, within the fusion, the visual-based approach performs better than the metadata-based one for all conditions except "0 0 0" and "0 0 1". The reason for this effect could be that some images that mention flooding in the metadata are relevant to flooding but do not visually depict any floodwater; such images are labeled as negatives in the ground truth.

Table 3: Official Evaluation Results on the Test-set (best results in bold)

Run   | AP @ 480 | mAP @ (50, 100, 250, 480)
Run 1 | 51.46    | 64.70
Run 2 | 63.70    | 75.74
Run 3 | 73.16    | 85.43

4 CONCLUSION AND OUTLOOK
In this paper, we presented an approach for retrieving images showing evidence of flooding events based on visual and textual (metadata) information. The final results showed that using both visual and textual features outperforms using either individually.

During the exploratory experiments that led to our Run 1 (Visual) approach, we also tried first dividing the cropped image into blocks and then computing the final score of an image from the scores of its blocks. This approach did not achieve better performance, possibly because blocks taken from a homogeneous region of a non-flooding image are likely to be mistaken for a body of flood water once they lose the context provided by a global feature, such as a white line on a road or boats on a river.

In the future, we will try segmentation algorithms to extract the body of flood water more accurately, and develop better visual descriptors to differentiate the body of flood water from other water bodies in non-flooding images in the large-scale dataset. We will also explore the relations between user tags to avoid mistaken decisions when the flood depicted in the image does not consist of water. Finally, for the fusion strategy, we will explore methods of combining the feature vectors of the different modalities.

ACKNOWLEDGMENTS
This research is partially supported by the China Scholarship Council (201706250044).

REFERENCES
[1] Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual Enrichment of Remote-Sensed Events with Social Media Streams. In Proceedings of the 2016 ACM Multimedia Conference. ACM, 1077-1081.
[2] Benjamin Bischke, Patrick Helber, Christian Schulze, Venkat Srinivasan, Andreas Dengel, and Damian Borth. 2017. The Multimedia Satellite Task at MediaEval 2017: Emergency Response for Flooding Events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.
[3] Paulo Vinicius Koerich Borges, Joceli Mayer, and Ebroul Izquierdo. 2008. A Probabilistic Model for Flood Detection in Video Sequences. In Proceedings of the 2008 IEEE International Conference on Image Processing. IEEE, 13-16.
[4] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proceedings of the 19th International Conference on World Wide Web. ACM, 851-860.
[5] Jie Yin, Andrew Lampert, Mark Cameron, Bella Robinson, and Robert Power. 2012. Using Social Media to Enhance Emergency Situation Awareness. IEEE Intelligent Systems 27, 6 (November 2012), 52-59.