<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HFFD: Hybrid Fusion Based Multimodal Flood Relevance Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yi Shao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yang Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ye Jiang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbo Wan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiande Sun</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Qingdao University of Science and Technology</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Shandong Normal University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social media platforms such as Twitter increasingly shape how information is disseminated and consumed, which gives them the potential to provide early warning of natural disasters. This paper describes HFFD (Hybrid Fusion based Flood Detection), a natural disaster event detection method that uses hybrid fusion to exploit the multimodal information in tweets. The goal of this work is to discover flood-related events at the early stage, when related information first begins to spread. The performance on the official dataset confirms the effectiveness of our model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the development of online social media, people can seek
and share information more effectively and overcome the barriers of traditional communication,
such as time lag and geographical constraints. This characteristic gives social media the
capacity to detect natural disasters at an early stage, when disaster-related information starts
to spread on platforms such as Twitter [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        This paper discusses the RCTP subtask of MediaEval2022’s DisasterMM task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which aims
to detect flood-related content in tweets based on the multimodal information available on Twitter.
      </p>
      <p>
        The proposed model adopts the hybrid fusion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as its multimodal fusion method, i.e., the
model combines early fusion [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and late fusion. Since early fusion
captures the low-level interactions between different modalities, and late fusion can integrate a
large amount of complex modal information, this method copes better with missing
modalities when flood-related information first spreads on the network. In this way, the model
can also make fuller use of whatever modal information is available, so as to achieve
early detection of flood information. In addition, we explore the relationship
between further modalities and the task goal, such as the mentions (@), hashtags (#), URLs, tweet creation
time, and posting location contained in tweets. These are
described in detail in Section 3 and Section 4.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Early fusion was among the earlier approaches explored in multimodal research. It refers to the
feature-level fusion of different modalities before the decision task, which
captures the low-level interactions between modalities well. However, because of
the modality gap, it is difficult to find a model that can perform transfer learning
across more than two modalities; that is, early fusion cannot fully achieve cross-modal feature
fusion.</p>
      <p>
        In contrast, late fusion operates at the final decision level: features from different
modalities are used to perform the decision task separately, and the
individual results are then combined by a dedicated mechanism, such as averaging [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], voting schemes
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], weighting on noise [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or variance [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Because late fusion works at the decision level, it
does not need to fuse the features of different modalities directly, so the overall structure of
the model is far more flexible than with early fusion. The ability of late fusion to adapt to
large amounts of complex modal data is an advantage for tweets posted early in a flood, where
modalities are often missing. However, pure decision-level fusion ignores the low-level interactions
between different modalities.
      </p>
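      <p>The two simplest late fusion mechanisms mentioned above, averaging and voting, can be sketched as follows. This is only an illustrative sketch and not the fusion used in the proposed model: the function names, the 0.5 threshold, and the convention of marking a missing modality with None are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def average_fusion(probs, threshold=0.5):
    """Average the per-modality probabilities, skipping missing ones (None)."""
    available = [p for p in probs if p is not None]
    if not available:
        raise ValueError("no modality produced a score")
    return mean(available) >= threshold


def majority_vote(probs, threshold=0.5):
    """Each available modality casts one vote; a tie counts as 'not relevant'."""
    votes = [p >= threshold for p in probs if p is not None]
    if not votes:
        raise ValueError("no modality produced a score")
    return sum(votes) * 2 > len(votes)


# Hypothetical scores for (text, image, time); the image modality is missing.
scores = [0.9, None, 0.4]
decision_by_average = average_fusion(scores)
decision_by_vote = majority_vote(scores)
```
]]></preformat>
      <p>Because each modality is scored independently, a missing modality is simply left out of the combination, which is the flexibility that makes late fusion attractive for early flood tweets.</p>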
      <p>
        Hybrid fusion combines the ability of early fusion to capture
feature-level interactions with the ability of late fusion to cope flexibly with complex modal
situations. Hybrid fusion has been successfully applied in multimodal event
detection (MED) tasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and the proposed model is inspired by it.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>The overall flow chart of the proposed model is shown in Figure 1. The model jointly
uses the body text, images, entities (hashtags and URLs), and time features of the tweet data to detect
whether a tweet is related to the disaster topic.</p>
      <sec id="sec-3-1">
        <title>3.1. Handling of Different Modalities</title>
        <p>The image feature extractor is a ResNet101 pretrained on ImageNet and fine-tuned on the task
dataset. Each tweet sample contains a varying number of images, and we input them into ResNet
one by one to obtain the corresponding image features <inline-formula><tex-math>F_{\text{image}_i}</tex-math></inline-formula>, where i = 1, 2, . . ., n.</p>
        <p>
          The task dataset is in Italian, which is a novel aspect of the task. We therefore tried variants of BERT such as
RoBERTa and multilingual BERT as textual feature extractors; in the end, multilingual BERT
outperformed the others. Entities refer to the hashtags (#) and URLs contained in the tweet text. Tweets
commonly mention related users or organizations, or use hashtags for topic labeling.
The text behind a URL attached to a tweet is related to the original text of the tweet, and both
can be regarded as the text content of the tweet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. After concatenating the first sentence of
each paragraph of the URL text with the text of the tweet, we feed the result to multilingual BERT to
obtain an embedding vector as the textual feature <inline-formula><tex-math>F_{\text{text}}</tex-math></inline-formula>. At the same time, each hashtag is input into
multilingual BERT separately, so that the hashtag feature vector <inline-formula><tex-math>F_{\text{hashtag}_j}</tex-math></inline-formula> does not
contain contextual information, where j = 1, 2, . . ., m.
        </p>
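        <p>The text assembly step can be sketched as below. This is a minimal sketch under stated assumptions, not the preprocessing actually used: the helper names are hypothetical, paragraphs of the linked page are assumed to be separated by blank lines, and sentences are assumed to end at ".", "!" or "?". The embedding step with multilingual BERT is left out.</p>
        <preformat><![CDATA[
```python
import re

HASHTAG = re.compile(r"#(\w+)")
# Assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def first_sentences(page_text):
    """Return the first sentence of every blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    return [SENTENCE_END.split(p, maxsplit=1)[0] for p in paragraphs]


def build_text_input(tweet_text, linked_pages):
    """Concatenate the tweet text with the first sentences of each linked page."""
    parts = [tweet_text.strip()]
    for page in linked_pages:
        parts.extend(first_sentences(page))
    return " ".join(parts)


def extract_hashtags(tweet_text):
    """Return hashtags without '#', so each can be embedded on its own."""
    return HASHTAG.findall(tweet_text)


# Hypothetical tweet and linked page text.
page = "Fiume in piena. Strade chiuse.\n\nAllerta rossa! Restate a casa."
text_input = build_text_input("Acqua alta #Venezia", [page])
hashtags = extract_hashtags("Acqua alta #Venezia #alluvione")
```
]]></preformat>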
        <p>Since flood-related tweets become more frequent during the rainy season each year, the time feature
is also an important modal feature for detecting flood topics. To capture the periodicity
of the time feature over long periods, the proposed model extracts the year and the date of each tweet’s creation time
separately and encodes the time feature as a sine feature
<inline-formula><tex-math>F_{\sin}</tex-math></inline-formula> and a cosine feature <inline-formula><tex-math>F_{\cos}</tex-math></inline-formula>. In this way, we can mine the periodic characteristics of the
time information.</p>
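        <p>One common way to realize such a sine and cosine encoding is to map the day of the year onto a circle. The following is a sketch under that assumption; the exact period and the function name are not taken from the proposed model.</p>
        <preformat><![CDATA[
```python
import calendar
import math
from datetime import datetime


def encode_creation_time(created_at):
    """Return (year, sin, cos) for a tweet creation time.

    Assumption: the period is one calendar year, so the day of the year is
    mapped to an angle in [0, 2*pi).
    """
    days_in_year = 366 if calendar.isleap(created_at.year) else 365
    day_index = created_at.timetuple().tm_yday - 1  # 0-based day of the year
    angle = 2.0 * math.pi * day_index / days_in_year
    return created_at.year, math.sin(angle), math.cos(angle)


# Hypothetical creation times on either side of a year boundary.
year_a, sin_a, cos_a = encode_creation_time(datetime(2020, 12, 31))
year_b, sin_b, cos_b = encode_creation_time(datetime(2021, 1, 1))
boundary_gap = math.hypot(sin_a - sin_b, cos_a - cos_b)
```
]]></preformat>
        <p>With this encoding, December 31 and January 1 lie close together in feature space, whereas a raw day number would place them at opposite ends of the range.</p>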
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Multimodal Fusion</title>
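        <p>Since both the hashtag features <inline-formula><tex-math>F_{\text{hashtag}_j}</tex-math></inline-formula> and the text feature <inline-formula><tex-math>F_{\text{text}}</tex-math></inline-formula> are features of the same source text
extracted by BERT, they are directly concatenated to obtain the text-entities
fusion feature
          <disp-formula><tex-math>F_{\text{text-entities}} = F_{\text{text}} \oplus F_{\text{hashtag}_1} \oplus \cdots \oplus F_{\text{hashtag}_m}.</tex-math></disp-formula>
We then concatenate the time features to obtain the text-entities-time fusion feature
          <disp-formula><tex-math>F_{\text{text-entities-time}} = F_{\text{text-entities}} \oplus F_{\sin} \oplus F_{\cos}.</tex-math></disp-formula>
Since the number of images differs from sample to sample, we OR the prediction results for the individual images to get the final image result
          <disp-formula><tex-math>R_{\text{image}} = R_{\text{image}_1} \ \text{OR} \ R_{\text{image}_2} \ \text{OR} \ \cdots \ \text{OR} \ R_{\text{image}_n}.</tex-math></disp-formula>
Finally, we regard the entire sample as flood-related if either the image result or the text-entities-time result is flood-related, so we OR
the two results again to get the final result
          <disp-formula><tex-math>R = R_{\text{text-entities-time}} \ \text{OR} \ R_{\text{image}}.</tex-math></disp-formula></p>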
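        <p>The fusion steps described above can be sketched with plain lists. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the feature vectors are tiny placeholders rather than BERT embeddings, and the classifiers that produce the per-branch predictions are omitted. How a variable number of hashtags is brought to a fixed input length (for example by padding or pooling) is left open here.</p>
        <preformat><![CDATA[
```python
def concat(*vectors):
    """Concatenate feature vectors (the early-fusion operator)."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


def fuse_features(text_vec, hashtag_vecs, sin_t, cos_t):
    """Text, then each hashtag vector, then the two time features."""
    text_entities = concat(text_vec, *hashtag_vecs)
    return concat(text_entities, [sin_t, cos_t])


def hybrid_decision(text_branch_pred, image_preds):
    """Late fusion: OR over all image predictions, then OR with the text branch.

    any([]) is False, so a tweet without images relies on the text branch only.
    """
    image_result = any(image_preds)
    return bool(text_branch_pred) or image_result


# Placeholder vectors and predictions for one hypothetical tweet.
fused = fuse_features([0.1, 0.2], [[0.3], [0.4]], 0.0, 1.0)
final = hybrid_decision(False, [False, True])
```
]]></preformat>
        <p>Note that a single positive image prediction is enough to mark the whole sample as flood-related, regardless of the text branch.</p>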
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Textual Feature Extractor Performance Comparison</title>
        <p>
          We tested several different textual feature extractors on the Italian dataset, among which
RoBERTa [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and multilingual BERT are the models officially recommended by Hugging Face for
Italian text. We applied them to the plain text data of the development set for
classification, and the results are shown in Table 1.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ablation Experiment</title>
        <p>As shown in Table 2, we conduct ablation experiments with different modality feature extractors.
After introducing the entity features, the model performs slightly better than when relying only on
uncleaned text or only on cleaned body text. We also found that the model relying only on image
features performs poorly. This is because most sample images do not contain obvious
flood-related elements, which leaves the image feature extractor undertrained. In fact, the proposed
multimodal model achieves its highest precision and recall on the development set, both exceeding 0.93,
but its precision on the official final test set drops to 0.6741. The reason is that the proposed
model contains a "path" that passes only through the image classifier: when the image
classifier detects a "flood-related" image, the whole model directly outputs that as the final result.
In this case, the performance of the whole model depends on the image classifier and
becomes unstable. That is to say, we still need to explore a better late fusion method.</p>
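        <p>The effect of an image-only path on precision can be illustrated with a toy example. The labels and predictions below are entirely synthetic and chosen only to show the mechanism; they are not taken from the task data or from our results.</p>
        <preformat><![CDATA[
```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = flood-related)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic example: the text branch is perfect, while the image branch
# raises two false alarms on tweets that are not flood-related.
y_true = [1, 1, 0, 0, 0, 0]
text_pred = [1, 1, 0, 0, 0, 0]
image_pred = [0, 0, 1, 1, 0, 0]
fused_pred = [int(t or i) for t, i in zip(text_pred, image_pred)]

text_scores = precision_recall(y_true, text_pred)
fused_scores = precision_recall(y_true, fused_pred)
```
]]></preformat>
        <p>In this toy case the OR fusion keeps recall at 1.0 but halves precision from 1.0 to 0.5, because every false alarm of the image classifier passes straight through to the final result.</p>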
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Outlook</title>
      <p>Because the development set of the RCTP subtask was collected over a short time span (May
25, 2020 to June 12, 2020), the time features have no significant impact on model performance.
The dataset of the LETT subtask, however, covers a long time span, from which we derive the periodicity of
the number of flood-related tweets created over time relative to the dates of the rainy season.
In addition, some disasters caused by particular weather are tied to specific hours of the day. For
example, in areas hit by squall lines, heavy precipitation tends to occur in the
afternoon and evening. However, we did not find hour-level temporal characteristics in the given
datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>We thank the organizers of MediaEval 2022, especially the organizers of the DisasterMM task.
This work was supported in part by the Scientific Research Leader Studio of Jinan (Grant No.
2021GXRC081), and in part by the Joint Project for Smart Computing of Shandong Natural
Science Foundation (Grant No. ZR2021LZH010, ZR2020LZH015, and ZR2022LZH012).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Merchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lurie</surname>
          </string-name>
          ,
          <article-title>Integrating social media into emergency-preparedness efforts</article-title>
          ,
          <source>New England journal of medicine 365</source>
          (
          <year>2011</year>
          )
          <fpage>289</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Andreadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gialampoukidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moumtzidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fiorin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mavropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Norbiato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>DisasterMM: Multimedia Analysis of Disaster-Related Social Media Data Task at MediaEval 2022</article-title>
          , in:
          <source>Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Atrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Saddik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          ,
          <article-title>Multimodal fusion for multimedia analysis: a survey</article-title>
          ,
          <source>Multimedia systems 16</source>
          (
          <year>2010</year>
          )
          <fpage>345</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>D'Mello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kory</surname>
          </string-name>
          ,
          <article-title>A review and meta-analysis of multimodal affect detection systems</article-title>
          ,
          <source>ACM computing surveys (CSUR) 47</source>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Shutova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maillard</surname>
          </string-name>
          ,
          <article-title>Black holes and white rabbits: Metaphor identification with visual features</article-title>
          , in:
          <source>Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Morvant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Habrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ayache</surname>
          </string-name>
          ,
          <article-title>Majority vote of diverse classifiers for late fusion</article-title>
          , in:
          <source>Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR)</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Potamianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Neti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <article-title>Recent advances in the automatic recognition of audiovisual speech</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>91</volume>
          (
          <year>2003</year>
          )
          <fpage>1306</fpage>
          -
          <lpage>1326</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Evangelopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zlatintsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Potamianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maragos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rapantzikos</surname>
          </string-name>
          , G. Skoumas,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Avrithis</surname>
          </string-name>
          ,
          <article-title>Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          <volume>15</volume>
          (
          <year>2013</year>
          )
          <fpage>1553</fpage>
          -
          <lpage>1568</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.-z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Hauptmann</surname>
          </string-name>
          ,
          <article-title>Multimedia classification and event detection using double fusion</article-title>
          ,
          <source>Multimedia tools and applications 71</source>
          (
          <year>2014</year>
          )
          <fpage>333</fpage>
          -
          <lpage>347</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moumtzidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Andreadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gialampoukidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karakostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>Flood relevance estimation from visual and textual content in social media streams</article-title>
          ,
          in:
          <source>Companion Proceedings of The Web Conference 2018</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1621</fpage>
          -
          <lpage>1627</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>