HFFD: Hybrid Fusion Based Multimodal Flood Relevance Detection

Yi Shao

Yang Zhang

Ye Jiang

Wenbo Wan

Jing Li

Jiande Sun

Qingdao University of Science and Technology, China; Shandong Normal University, China

Social media, such as Twitter, has increasingly affected information dissemination and consumption, demonstrating its potential to warn of upcoming natural disasters. This paper describes the design of a novel natural disaster event detection method, called HFFD (Hybrid Fusion based Flood Detection), that uses hybrid fusion to exploit the multimodal information in tweets. The goal of this work is to discover flood-related events at an early stage, when related information starts to spread. The performance on the official dataset confirms the effectiveness of our model.

1. Introduction

With the development of online social media, people can seek and share information more effectively and overcome the barriers of traditional communication, such as time lag or geographical constraints. This characteristic gives social media the capacity to detect natural disasters at an early stage, when disaster-related information starts to spread on platforms such as Twitter [1].

This paper discusses the RCTP subtask of the MediaEval 2022 DisasterMM task [2], which aims to detect flood-related content in tweets based on multimodal Twitter data.

The proposed model adopts hybrid fusion [3], a multimodal fusion method that combines early fusion [4] and late fusion. Since early fusion captures the low-level interactions between different modalities, and late fusion accommodates a large amount of complex modal information, this method can better cope with missing modalities when flood-related information first spreads on the network. In this way, the model can also make fuller use of the modal information that is available, enabling early detection of flood information. In addition, we explore the relationship between the task goal and further modalities contained in tweets, such as mentions (@), hashtags (#), URLs, tweet creation time, and posting location. These are described in detail in Section 3 and Section 4.

2. Related Work

Early fusion was among the first approaches explored in multimodal research. It refers to fusing the features of different modalities at the feature level, before the decision task, which better captures the low-level interactions between the modalities. However, because of the modal gap, it is difficult to find a model that can perform transfer learning between more than two modalities; that is, early fusion cannot fully achieve cross-modal feature fusion.

In contrast, late fusion operates at the final decision level: features from different modalities are used to perform the decision task separately, and the individual results are then combined by a designed mechanism, such as averaging [5], voting schemes [6], or weighting based on noise [7] or variance [8]. Because late fusion does not need to fuse features of different modalities directly, the overall structure of the model is far more flexible than with early fusion. The ability of late fusion to adapt to large amounts of complex modal data gives it an advantage on early flood-related tweets, where modalities are often missing. However, pure decision-level fusion ignores low-level interactions between different modalities.
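Two of the combination mechanisms named above, averaging and majority voting, can be sketched in a few lines. This is a minimal illustration, not the formulation used in the cited works: the function names, the 0.5 threshold, and the convention that a missing modality is simply left out of the input are all assumptions made for the example.

```python
from typing import Sequence


def average_fusion(probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Late fusion by averaging per-modality probabilities.

    `probs` holds one probability per available modality; a missing
    modality is simply omitted (assumption for this sketch).
    """
    if not probs:
        raise ValueError("at least one modality prediction is required")
    return sum(probs) / len(probs) >= threshold


def majority_vote(votes: Sequence[bool]) -> bool:
    """Late fusion by strict majority vote over per-modality decisions."""
    if not votes:
        raise ValueError("at least one modality prediction is required")
    return sum(votes) * 2 > len(votes)
```

Because each function only sees the predictions that exist, a tweet without images can be scored from its text branch alone, which is the flexibility attributed to late fusion above.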

Overall, the hybrid fusion approach combines the ability of early fusion to capture feature-level interactions with the ability of late fusion to cope flexibly with complex modal situations. Hybrid fusion has been successfully applied in multimodal event detection (MED) tasks [9], and the proposed model is inspired by it.

3. Approach

The overall flow chart of the proposed model is shown in Figure 1. The model comprehensively utilizes the body text, images, entities (#, urls), and time features in the tweet data to detect whether the tweet is related to the disaster topic.

3.1. Handling of Different Modalities

The image feature extractor is a ResNet101 pretrained on ImageNet and fine-tuned on the task dataset. Each tweet sample contains a varying number of images, and we feed each of them into ResNet separately to obtain the corresponding image features $F_{img_i}$, where $i = 1, 2, \ldots, n$.

The use of an Italian dataset is a novel aspect, for which we tried variants of BERT models, such as RoBERTa and multilingual BERT, as textual feature extractors. In the end, multilingual BERT outperformed the others. Entities refer to the hashtags (#) and URLs contained in the tweet text. It is common for tweets to mention related users or organizations, or to use hashtags for topic labeling. The text behind a URL attached to a tweet is related to the original tweet text, and both can be regarded as the text content of the tweet [10]. We concatenate the first sentence of each paragraph of the URL text with the tweet text and feed the result into multilingual BERT to obtain an embedding vector as the textual feature $F_t$. At the same time, each hashtag is fed into multilingual BERT separately, so that the hashtag feature vector $F_{h_j}$ does not contain contextual information, where $j = 1, 2, \ldots, m$.
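The text preparation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' preprocessing: the helper names are hypothetical, paragraphs are assumed to be separated by blank lines, sentences are split naively on `.`, `!`, and `?`, and the tweet text is placed before the URL sentences. The encoder step is omitted; the two return values are the inputs that would be passed to the text encoder together and one by one, respectively.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def first_sentences(url_text: str) -> List[str]:
    """Return the first sentence of each non-empty paragraph."""
    sentences = []
    for paragraph in re.split(r"\n\s*\n", url_text):
        paragraph = " ".join(paragraph.split())  # collapse whitespace
        if not paragraph:
            continue
        # Naive split: cut after the first ., ! or ? followed by whitespace.
        first = re.split(r"(?<=[.!?])\s+", paragraph, maxsplit=1)[0]
        sentences.append(first)
    return sentences


def build_text_inputs(tweet_text: str, url_text: str) -> Tuple[str, List[str]]:
    """Build the combined body text and the list of standalone hashtags."""
    hashtags = HASHTAG_RE.findall(tweet_text)
    body = " ".join([tweet_text] + first_sentences(url_text))
    return body, hashtags
```

Keeping the hashtags in a separate list mirrors the idea that each one is encoded on its own, without the surrounding tweet as context.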

Since the number of flood-related tweets rises with the rainy season each year, time is also an important modal feature for detecting flood topics. To capture the periodicity of time over long periods, the proposed model extracts the year and the date of each tweet's creation time separately and encodes the time feature as a sine feature $F_{sin}$ and a cosine feature $F_{cos}$. In this way, we can mine the periodic characteristics in the time information.
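A common way to realize such a sine and cosine encoding is to map the day of the year onto a circle, so that late December and early January end up close together. The sketch below assumes this construction; the function name and the period of 365.25 days are illustrative choices and are not stated in the text.

```python
import math
from datetime import datetime


def encode_time(created_at: datetime) -> dict:
    """Return the year and a cyclic (sine, cosine) encoding of the date."""
    day_of_year = created_at.timetuple().tm_yday
    # Assumption: one cycle per year, with 365.25 days as the period.
    angle = 2.0 * math.pi * day_of_year / 365.25
    return {
        "year": created_at.year,
        "sin": math.sin(angle),
        "cos": math.cos(angle),
    }
```

With this encoding, two tweets posted in the same part of the rainy season in different years receive nearly identical sine and cosine values, while the year is kept as a separate feature.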

3.2. Multimodal Fusion

Since the hashtag features $F_{h_j}$ and the text feature $F_t$ both come from the same source text and are extracted by BERT, they are directly concatenated to obtain the text-entities fusion feature $F_{t-e} = F_t \oplus F_{h_1} \oplus \ldots \oplus F_{h_m}$. We then concatenate the time features onto $F_{t-e}$ to obtain the text-entities-time fusion feature $F_{t-e-t} = F_{t-e} \oplus F_{sin} \oplus F_{cos}$. Since the number of images differs between samples, we OR the prediction results of the individual images to get the image result $R_{img} = R_{img_1} \;\mathrm{OR}\; R_{img_2} \;\mathrm{OR}\; \ldots \;\mathrm{OR}\; R_{img_n}$. Finally, we regard the entire sample as flood-related if either the images or the text-entities-time feature indicates a flood, so we OR the image result with the text-entities-time result $R_{t-e-t}$ to get the final result $R = R_{t-e-t} \;\mathrm{OR}\; R_{img}$.
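The two fusion stages can be sketched with plain Python lists standing in for feature vectors. This is a minimal illustration under assumptions, not the authors' implementation: the function names are hypothetical, the classifier that turns the fused feature into a decision is omitted, and the sketch does not address how a variable number of hashtags is mapped to a fixed input size, which the text does not specify.

```python
from typing import List, Sequence


def early_fuse(
    text_vec: Sequence[float],
    hashtag_vecs: Sequence[Sequence[float]],
    time_sin: float,
    time_cos: float,
) -> List[float]:
    """Concatenate text, hashtag, and time features (feature-level fusion)."""
    fused = list(text_vec)
    for hashtag_vec in hashtag_vecs:
        fused.extend(hashtag_vec)
    fused.extend([time_sin, time_cos])
    return fused


def late_fuse(text_branch_pred: bool, image_preds: Sequence[bool]) -> bool:
    """OR the per-image decisions, then OR with the text branch decision."""
    # any() of an empty sequence is False, so a tweet without images
    # falls back to the text-entities-time decision alone.
    image_result = any(image_preds)
    return text_branch_pred or image_result
```

Note that `late_fuse` returns a positive result as soon as a single image is classified as flood-related, regardless of the text branch.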

4. Results and Analysis

4.1. Textual Feature Extractor Performance Comparison

We tested several different textual feature extractors on the Italian dataset, among which RoBERTa [11] and multilingual BERT are the models officially recommended by Hugging Face for Italian text. We used them to classify the plain text data in the development set, and the results are shown in Table 1.

4.2. Ablation Experiment

As shown in Table 2, we conduct ablation experiments with different modality feature extractors. After introducing entity features, the model performs slightly better than when relying only on uncleaned text or only on cleaned body text. We also found that the model relying only on image features performs poorly. This is because most sample images do not contain obvious flood-related elements, which leaves the image feature extractor undertrained. In fact, the proposed multimodal model achieves the highest precision and recall on the development set, both exceeding 0.93, but its precision on the official final test set drops to 0.6741. This is because the proposed model contains a "path" that passes only through the image classifier: when the image classifier detects a "flood-related" image, the whole model directly outputs this as the final result. The performance of the whole model is then affected by the image classifier and becomes unstable. That is to say, we still need to explore a better late fusion method.
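The effect of an image-only path on precision can be illustrated with a toy calculation. The labels and predictions below are invented for illustration only; they are not taken from the task data and do not reproduce the reported scores.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Compute precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy data: six tweets, two of which are truly flood-related.
y_true = [True, True, False, False, False, False]
text_pred = [True, False, False, False, False, False]   # cautious text branch
image_pred = [False, True, True, True, False, False]    # noisy image branch
fused_pred = [t or i for t, i in zip(text_pred, image_pred)]

text_scores = precision_recall(y_true, text_pred)
fused_scores = precision_recall(y_true, fused_pred)
```

In this toy setting the OR rule raises recall, because the image branch recovers a tweet the text branch missed, but every false alarm of the image branch passes straight through to the final result and lowers precision.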

5. Discussion and Outlook

Because the development set of the RCTP subtask was collected over a short time span (May 25, 2020 to June 12, 2020), the time features have no significant impact on model performance. The dataset of the LETT subtask, however, covers a long time span, from which we derive how the number of flood-related tweets varies periodically with the dates of the rainy season. In addition, some disasters caused by particular weather are related to specific hours of the day. For example, in areas that experience squall-line weather, heavy precipitation tends to occur in the afternoon and evening. However, we did not find hour-level temporal characteristics in the given datasets.

Acknowledgement

Thanks to the organizers of MediaEval 2022, especially the organizers of DisasterMM. This work was supported in part by the Scientific Research Leader Studio of Jinan (Grant No. 2021GXRC081), and in part by the Joint Project for Smart Computing of Shandong Natural Science Foundation (Grant Nos. ZR2021LZH010, ZR2020LZH015, and ZR2022LZH012).

[1] R. M. Merchant, Elmer, Lurie, Integrating social media into emergency-preparedness efforts, New England journal of medicine 365 (2011) 289-291.

[2] Andreadis, Bozas, I. Gialampoukidis, Moumtzidou, Fiorin, Lombardo, Mavropoulos, Norbiato, Vrochidis, Ferri, I. Kompatsiaris, DisasterMM: Multimedia Analysis of Disaster-Related Social Media Data Task at MediaEval 2022, in: Proceedings of the MediaEval 2022 Workshop, Bergen, Norway and Online, 2023.

[3] P. K. Atrey, M. A. Hossain, El Saddik, M. S. Kankanhalli, Multimodal fusion for multimedia analysis: a survey, Multimedia systems 16 (2010) 345-379.

[4] K. D'mello, Kory, A review and meta-analysis of multimodal affect detection systems, ACM computing surveys (CSUR) 47 (2015) 1-36.

[5] Shutova, Kiela, Maillard, Black holes and white rabbits: Metaphor identification with visual features, in: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, 2016, pp. 160-170.

[6] Morvant, Habrard, Ayache, Majority vote of diverse classifiers for late fusion, in: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), Springer, 2014, pp. 153-162.

[7] Potamianos, Neti, Gravier, Garg, A. W. Senior, Recent advances in the automatic recognition of audiovisual speech, Proceedings of the IEEE 91 (2003) 1306-1326.

[8] Evangelopoulos, Zlatintsi, Potamianos, Maragos, Rapantzikos, G. Skoumas, Avrithis, Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention, IEEE Transactions on Multimedia 15 (2013) 1553-1568.

[9] Z.-z. Lan, Bao, S.-I. Yu, Liu, A. G. Hauptmann, Multimedia classification and event detection using double fusion, Multimedia tools and applications 71 (2014) 333-347.

[10] Moumtzidou, Andreadis, I. Gialampoukidis, Karakostas, Vrochidis, I. Kompatsiaris, Flood relevance estimation from visual and textual content in social media streams, in: Companion Proceedings of the The Web Conference 2018, 2018, pp. 1621-1627.

[11] Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).