<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alaaeddin Alia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Taimoor Khan</string-name>
          <email>taimoor.nlp@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GESIS - Leibniz Institute for the Social Sciences</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heinrich Heine University Düsseldorf</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The spread of misinformation in the form of rumors is becoming more prevalent on social media as these platforms become the primary means of accessing instant information. Rumors on social media platforms can have damaging consequences unless intercepted in time. Existing studies on rumor verification use linguistic patterns, sentiment orientation, and network structures, which require preparing training data and updating the model to keep up with newer rumors. However, little attention has been paid to exploiting known, trusted, and credible authorities to verify rumors. In this study, we address rumor verification on platform X (previously Twitter) using evidence from the timelines of authority accounts. We propose LLM-based bilingual rumor verification for English and Arabic that uses SBERT and BM25 to retrieve evidence candidates, i.e., relevant tweets from the authority timeline, and a finetuned XLM-RoBERTa to detect their stance on the rumor. It achieves an F1-score of 0.8133 for English and 0.7647 for Arabic in detecting stance labels for the rumor using evidence candidates. The rumor is verified by weighted aggregation of its stance labels, with an accuracy of 0.6923 for English and 0.5769 for Arabic.</p>
      </abstract>
      <kwd-group>
        <kwd>rumor verification</kwd>
        <kwd>LLM-based rumor verification</kwd>
        <kwd>rumor evidence stance detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, social media platforms have become the main sources for accessing information, thereby
disrupting established outlets such as television and newspapers. Social media platforms provide
quick access to unfiltered news and comprise a decentralized opinion landscape that presents multiple
perspectives. However, with the increase in access to information, fake news and rumors are also
widely spread on social media platforms, including platform X (previously Twitter). A rumor is unverified
information that may spread on social media platforms, causing misinformation and confusion and
thereby affecting various areas such as social events, politics, or even personal matters [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example,
in 2020, a rumor circulated on Twitter that a popular fast-food chain had donated to a controversial
political campaign, which led to a brief boycott by customers. Although the claim was later debunked,
the rumor had already affected the company’s public image, demonstrating how quickly a false narrative
can damage the reputation of a brand or an individual.
      </p>
      <p>
        Many studies have been conducted to verify rumors and false news on social media platforms,
focusing on the structure of responses, user profiles, linguistic patterns, sentiment orientation, and
network structure of the rumor [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. However, little attention has been paid to the role of official
authorities in the process of verifying rumors or claims. This is significant given that the authorities are
entities that have the knowledge and power to verify rumors as credible sources. They may support or
refute a claim through verified evidence [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A hybrid model that combines pre-trained large language
models (LLMs) such as BERT, MARBERT, and AraBERT with lexical, semantic, and network-based features
is used to identify authority accounts on Twitter [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This motivates the need for a rumor verification
system that determines relevant tweets from the authority timelines and uses that information to verify
rumors, as shown in Figure 1. For example, to verify disease-related rumors in a country, their health
ministry may be the authority account. Using relevant tweets from this authority timeline can help
address the rumor by supporting or refuting it.</p>
      <p>
        A rumor verification system is needed that benefits from the authority timeline tweets as evidence.
Each rumor has timelines of the corresponding authority accounts, i.e., responsible offices or their
representatives as determined in [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Although the authority accounts may lack sufficient evidence to
confirm or deny a rumor, they are nonetheless assumed to provide correct information. However, there
is evidence of politicians, celebrities, and other public figures involved in spreading misinformation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The problem statement is that given a rumor and the corresponding authority account(s) timelines,
identify relevant tweets for each rumor as evidence candidates, determine the stance label of the rumor
using each evidence candidate, and aggregate to verify the rumor status. A rumor may be supported
or refuted based on the available evidence from the authority timeline. In that case, up to 5 evidence
tweets are to be provided along with the decision that assisted in verifying the rumor status. However, if there
is a lack of conclusive evidence, the rumor is labeled as not having enough info.
      </p>
      <p>We propose a bilingual rumor verification system for English and Arabic having four modules. It
takes a rumor and the corresponding authority timeline tweets as input and outputs the rumor label and
relevant evidence tweets in case the label is supported or refuted. The first module performs cleaning and
preprocessing of all rumors and authority timeline tweets. The second module transforms all rumors
and timeline tweets to vectors using dense representation (SBERT) for English and bag-of-words (BoW)
sparse representation (BM25) for Arabic. Using cosine similarity, it determines evidence candidates
from the authority timelines for each rumor in English and Arabic. The third module uses bilingually
finetuned XLM-RoBERTa to detect the stance for each rumor and evidence candidate pair. The stances
for each rumor can be a mix of supported, refuted, or ”not enough info”, depending on the evidence
candidates identified in module 2. It performs stance detection for both English and Arabic. Finally, the
fourth module performs weighted aggregation of the stance labels to verify the rumor. Our contribution
is to devise a large language model (LLM) based pipeline to automatically verify bilingual rumors
through reliable authority timeline tweets. In retrieving evidence candidates, SBERT achieved a Recall@5 of 0.6362
for English and BM25 achieved 0.7833 for Arabic. The finetuned XLM-RoBERTa achieves the highest stance detection
F1-scores of 0.8133 for English and 0.7647 for Arabic. Weighted stance
label aggregation resulted in rumor verification accuracy of 0.6923 and 0.5769 for English and Arabic,
respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature</title>
      <p>
        Rumor verification: Rumor verification is the process of confirming the veracity of a rumor by
gathering evidence, analyzing relevant information, and determining its truthfulness. Various datasets
are available for rumor verification, such as the AuSTR dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which focuses on the stance of
authoritative accounts in Arabic tweets. Another widely used dataset is the FEVER dataset, which
is designed for fact-checking claims using evidence from Wikipedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The FEVER dataset shares
many similarities with rumor verification tasks in that it challenges systems to classify claims as
either supported, refuted, or ”not enough info” by retrieving relevant evidence. Both datasets focus on
verifying the truthfulness of information using external sources.
      </p>
      <p>
        Rumor evidence retrieval: Evidence retrieval involves identifying relevant information (evidence
documents) from various sources that can either support or refute a given rumor. Several advanced
models have been developed to optimize this process, focusing on retrieving high-quality evidence
that improves the accuracy of rumor verification. Kernel Graph Attention Network (KGAT) leverages
graph-based structures and kernel-based attention mechanisms to perform fine-grained fact verification,
enhancing the model’s ability to reason over multiple sources of evidence [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This approach constructs
an evidence graph in which claims and sentences are nodes, and their relationships are represented as
edges. KGAT’s ability to capture complex dependencies between pieces of evidence makes it a powerful
tool for rumor verification. An evidence-aware model focuses on improving sentence retrieval in
fact-checking tasks by taking relationships between all potential evidence sentences into account and
applying self-attention mechanisms to rank them based on relevance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This evidence-aware approach
improves the precision of fact-checking systems by ensuring that only the most relevant sentences are
selected for verification.
      </p>
      <p>
        Text representation: SBERT (Sentence-BERT) is
a variation of the original BERT model that is specifically designed to generate meaningful embeddings
at the sentence or document level. While BERT produces embeddings for individual tokens (words),
SBERT adapts BERT into a Siamese network architecture to compute embeddings that capture the
semantic meaning of entire sentences. This is highly effective for tasks such as semantic textual
similarity, question answering, and document retrieval [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. SBERT provides sentence embeddings
that can directly be used in downstream tasks such as clustering, ranking, or matching documents
based on their meaning. Its advantage over other embedding techniques is its ability to encode the
context of a sentence, taking word order and relationships between words into account. TFIDF (term
frequency-inverse document frequency) is a BoW text vectorization technique that determines word
importance through its frequency in a document while penalizing terms that occur across most documents.
BM25 is an information retrieval ranking function that improves over TFIDF by applying term-frequency
saturation and document-length normalization.
      </p>
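      <p>For reference, the standard Okapi BM25 score of a document D for a query Q makes the term saturation and document-length normalization mentioned above concrete; the parameter values (k1 and b, commonly around 1.2 and 0.75) are not reported in this study and are shown only for illustration:</p>
      <disp-formula><tex-math>\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\, \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}</tex-math></disp-formula>
      <p>Here f(q_i, D) is the frequency of query term q_i in D, |D| is the document length in tokens, and avgdl is the average document length in the collection; k1 caps the contribution of repeated terms (term saturation) while b controls the strength of length normalization.</p>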
      <p>
        Stance detection: Stance detection determines whether the evidence supports, refutes, or provides
no clear information about the rumor. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced the AuSTR as the Arabic rumor tweets dataset
and finetuned BERT-based models to classify tweets as agreeing, disagreeing, or unrelated to the given
rumors. The coupled hierarchical transformer model performs stance-aware rumor verification in social
media conversations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This model captures both local and global interactions within conversation
threads and uses a coupled transformer module to integrate stance classification with rumor verification,
leading to significant performance improvements. A multi-level attention model for evidence-based
fact-checking uses token-level and sentence-level self-attention mechanisms to process and evaluate evidence
from multiple sentences [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], thereby providing a simple yet effective alternative to more complex
graph-based models. XLM-RoBERTa (cross-lingual language model) is a cross-lingual transformer
model built on the RoBERTa architecture trained on 2.5 TB of filtered CommonCrawl data, covering
over 100 languages. Through unsupervised learning, XLM-RoBERTa effectively handles a wide range
of cross-lingual tasks. While it retains the same architecture as RoBERTa, the fact that it is trained on a
more extensive and diverse dataset makes it particularly well-suited for multilingual classification [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Knowledge-enhanced masked language model (KE-MLM) is a finetuned BERT-based model aimed at
improving stance detection on social media, particularly on Twitter. Instead of random token masking,
KE-MLM focuses on stance-relevant tokens identified using the log-odds ratio, thereby improving the
model’s attention to key contextual words [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Data Preparation</title>
        <p>The architecture of the proposed methodology, consisting of four modules, is outlined in Figure 2. The
following subsections explain the working of each module.</p>
        <p>The first module performs data loading, cleansing, and preprocessing. Rumors whose corresponding
timelines contain error codes instead of tweet content are removed; this appears to be a data collection
problem in the API used. In preprocessing, the tweets are cleaned by removing
unwanted characters, hashtags, URLs, mentions, etc. It prepares the data for the next module. We also
extracted keywords, hashtags, URLs, and emoji embeddings from the rumor and timeline tweets to use
as additional features. These features were incorporated to improve the performance of the finetuned
stance detection models, thereby allowing us to assess their impact on the overall results.</p>
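        <p>As a minimal sketch of the cleaning step described above (the exact rules of our pipeline may differ, and the additional keyword, hashtag, URL, and emoji features are not shown):</p>
        <preformat>
import re

def clean_tweet(text: str) -> str:
    """Approximate tweet cleaning: strip URLs, mentions, hashtag signs, and stray
    symbols, then collapse whitespace. Illustrative only; not the exact pipeline rules."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)              # remove user mentions
    text = text.replace("#", " ")                  # keep hashtag words, drop the '#'
    text = re.sub(r"[^\w\s]", " ", text)           # drop remaining punctuation (\w keeps Arabic letters)
    return re.sub(r"\s+", " ", text).strip()       # normalize whitespace

print(clean_tweet("Breaking: #rumor about the ministry! https://t.co/xyz @someone"))
        </preformat>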
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evidence Candidates Retrieval</title>
        <p>In this module, the rumors and their corresponding timeline tweets are transformed into dense
embeddings using the SBERT model. We also used sparse representation through TFIDF and BM25 with
unigrams and bigrams while keeping only the top 1000 most relevant features. The SBERT model provides
a better semantic representation of the data in its embedding vectors, which leads to efficient evidence
candidate retrieval. Following the text vectorization, we compute the cosine similarity between the
SBERT embeddings of the rumors and their respective timeline tweets. Cosine similarity between a
rumor and an authority timeline tweet can be given as:
<disp-formula id="eq1"><label>(1)</label><tex-math>\mathrm{cos\_sim}(r_i, t_{i,j}) = \frac{r_i \cdot t_{i,j}}{\lVert r_i \rVert \, \lVert t_{i,j} \rVert}</tex-math></disp-formula>
where r_i is the i-th rumor and t_{i,j} is the j-th tweet of the authority account timeline corresponding
to the i-th rumor. It measures the degree of similarity between two vectors, where -1
is complete dissimilarity while 1 is complete similarity. The authority timeline tweets are ordered in
decreasing order of their cos_sim score with the corresponding rumor. Using a fixed threshold as top@k
with k=5,10,15, the evidence candidates, i.e., evidence[c], are identified.</p>
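        <p>A minimal sketch of this retrieval step using the sentence-transformers library; the model checkpoint, the example rumor, and the timeline tweets are assumptions for illustration, not the exact configuration of our system:</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

# Assumed multilingual SBERT checkpoint; the study does not name the exact model used.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

rumor = "The health ministry has suspended all international flights."  # hypothetical rumor
timeline = [  # hypothetical authority timeline tweets
    "Official statement: no flight suspensions are planned at this time.",
    "We opened three new vaccination centres today.",
    "Reminder: report suspected cases through our hotline.",
]

rumor_emb = model.encode(rumor, convert_to_tensor=True)
tweet_embs = model.encode(timeline, convert_to_tensor=True)

# Cosine similarity of the rumor against every timeline tweet, as in Eq. (1).
scores = util.cos_sim(rumor_emb, tweet_embs)[0]

# Keep the top-k most similar tweets as evidence candidates (k = 5, 10, 15 in our experiments).
k = min(2, len(timeline))
top = scores.topk(k)
evidence_candidates = [(timeline[int(i)], float(s)) for s, i in zip(top.values, top.indices)]
print(evidence_candidates)
        </preformat>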
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Stance Detection</title>
        <p>This module performs bilingual (English and Arabic) multi-class rumor stance classification using the
corresponding evidence candidates. For stance detection, we employ the XLM-RoBERTa transformer-based
multilingual model. It is fine-tuned for the given task using a mix of both English and Arabic samples.
A training instance consists of the concatenated vectors of a rumor and one of its evidence candidates, used to predict the rumor
label. This way, a rumor is paired with each of its evidence tweets under the same label, which increases the training data
for better finetuning. We also finetuned the KE-MLM model using the same training samples. Due to
an imbalance in data, we used stratified batches for finetuning these models. This method is especially
useful for nuanced decision-making, particularly when certain stance categories, such as the supported,
are less frequent but important, compared to the more common label i.e., ”not enough info”. Other
models including both traditional and large language models are used for comparison.</p>
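        <p>A sketch of how a single rumor and evidence candidate pair can be scored with XLM-RoBERTa through the Hugging Face transformers API; the checkpoint name, label order, and example texts are illustrative assumptions, and the finetuning loop over stratified batches of paired training instances is omitted:</p>
        <preformat>
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed base checkpoint and label order; in the pipeline the 3-way head is
# finetuned on (rumor, evidence) pairs before use, which is not shown here.
LABELS = ["supported", "refuted", "not enough info"]
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)

rumor = "The ministry will close all schools next week."            # hypothetical rumor
evidence = "Clarification: no decision on school closures exists."  # hypothetical evidence tweet

# Each instance is a (rumor, evidence candidate) pair encoded as one sequence
# with a separator token between the two texts.
inputs = tokenizer(rumor, evidence, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])  # stance for this pair (random until finetuned)
        </preformat>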
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Rumor Verification</title>
        <p>In this module, we aggregate the stance labels produced by the previous module for all pairs of rumors
with their corresponding evidence candidates from the authority timelines. Due to the imbalance in
data, we use weighted voting aggregation to determine a rumor label from the stance labels of all rumor
and evidence candidate pairs. The results are also compared with majority and soft voting aggregation
schemes. The weighted scheme assigns weights inversely proportional to the number of instances
of a label in the training data. This module verifies the status of the rumor as the final decision. The
evidence candidates that helped in determining the label of a rumor as supported or refuted are provided
as evidence for it. However, no evidence is needed for the ”not enough info” label.</p>
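        <p>A minimal sketch of the weighted voting scheme, assuming label weights inversely proportional to training counts as described above; the example stance labels are hypothetical, and the counts follow the English label distribution reported in Section 4:</p>
        <preformat>
from collections import Counter

def weighted_vote(stance_labels, train_label_counts):
    """Aggregate per-evidence stance labels into one rumor label.

    Each label's vote is weighted inversely to its frequency in the training data,
    so rare classes such as 'supported' are not drowned out by frequent ones.
    Sketch only; the exact weighting in the pipeline may differ."""
    weights = {label: 1.0 / count for label, count in train_label_counts.items()}
    totals = Counter()
    for label in stance_labels:
        totals[label] += weights[label]
    return max(totals, key=totals.get)

stances = ["not enough info", "supported", "supported"]            # hypothetical pair-level stances
counts = {"not enough info": 44, "refuted": 51, "supported": 24}   # English training label counts
print(weighted_vote(stances, counts))  # -> 'supported'
        </preformat>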
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We first split the data into 80% for training and 20% for testing using stratified sampling. This ensured
that the ratio of labels remained balanced across both training and test sets. The rumors in the study are
independent of one another while the authority timeline tweets in general show higher relevance to the
rumor, mostly covering similar topics. During data cleaning, 9 rumors and their corresponding 4,319
timeline tweets are removed from the data for not having meaningful text. These tweets have error
codes instead of content, which may stem from the collection API. The cleaned data has 128 rumors
in Arabic that have 53 instances of ”not enough info”, 51 instances of refuted, and only 24 instances
of supported labels. The English data, after data cleansing, has 44 instances of ”not enough
info”, 51 refuted, and 24 supported labels. Both datasets are heavily skewed in favor of ”not enough
info” and refuted labels. Since the training data is not enough to finetune XLM-RoBERTa, we separated
each training sample into multiple instances by pairing the rumor with all its evidence sharing the
same label as provided in the training data. The pairs for each rumor depend on its evidence tweets in
the training data (from 1 to 5). No evidence is provided for the rumors labeled ”not enough info” in
the training data and therefore, to include them in the finetuning process, randomly sampled tweets
from the corresponding authority timeline are used to prepare their training instances. Due to the
specialized nature of this approach, the existing rumor datasets i.e., AuSTR and FEVER that do not
provide corresponding authority account timelines could not be used for analysis. Moreover, due to the
bilingual training cost of XLM-RoBERTa, cross-validation is expensive, and only a one-time random
split is used to train the model. The retrieval approaches are evaluated using Recall@k and mean
average precision (MAP). Stance classification results are evaluated as F1-score (Micro) and F1-score
(Macro).</p>
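      <p>For clarity, a small sketch of how Recall@k and the per-rumor average precision (whose mean over rumors gives MAP) can be computed; the tweet ids and relevance judgments are hypothetical:</p>
      <preformat>
def recall_at_k(relevant, ranked, k=5):
    """Fraction of the relevant evidence tweets found in the top-k retrieved candidates."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]).intersection(relevant)) / len(relevant)

def average_precision(relevant, ranked):
    """Average precision for a single rumor; the mean over all rumors gives MAP."""
    hits, score = 0, 0.0
    for rank, tweet_id in enumerate(ranked, start=1):
        if tweet_id in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

relevant = {"t3", "t7"}                      # gold evidence tweets for one rumor (hypothetical)
ranked = ["t3", "t1", "t7", "t9", "t2"]      # retrieval ranking for that rumor (hypothetical)
print(recall_at_k(relevant, ranked, k=5), round(average_precision(relevant, ranked), 4))
      </preformat>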
      <p>To evaluate the performance of the SBERT model compared to traditional BoW methods in the
evidence retrieval task, we use Recall@k with k as 5, 10, and 15 which measures the proportion of
relevant evidence to the rumor among the top@k retrieved evidence candidates. MAP assesses the
ranking quality by considering the order of the retrieved evidence candidates. The results for both
English and Arabic datasets are presented in Figures 3a and 3b. We observe that for the English dataset,
the SBERT model achieves the best overall performance, with a Recall@5 of 0.6362, Recall@10 of 0.7607,
and Recall@15 of 0.7607. SBERT also outperforms other models in terms of MAP, achieving a score of
0.6635, which indicates that it provides a superior ranking of retrieved evidence. For the Arabic dataset,
the BM25 model performs best in terms of Recall across all values of k, reaching Recall@5 of 0.7833,
Recall@10 of 0.8222, and Recall@15 of 0.9000. BM25 also achieves the highest MAP score (0.7937),
indicating it is particularly effective in ranking relevant evidence in Arabic. Still, SBERT performs
competitively, with a Recall@5 of 0.7778 and a MAP of 0.7085, demonstrating strong effectiveness across
both English and Arabic datasets.</p>
      <p>The stance detection performance is evaluated using the F1-score (Micro) and F1-score (Macro). We
compared the results of our proposed approach with traditional approaches, i.e., random forest and
SVM, and LLM-based stance detection models, i.e., KE-MLM Trump and KE-MLM Biden. Since these models
were finetuned on Trump and Biden tweets respectively, which are different from our dataset, we also
finetuned KE-MLM on our training data, referred to as KE-MLM finetuned. The results of stance
classification for English and Arabic are shown in Figures 4a and 4b. Since the present architecture
does not represent rumors or their corresponding timeline tweets in a graph structure,
the results could not be compared with KGAT. The finetuned XLM-RoBERTa model achieves the best
performance, with an F1-micro score of 0.8133 and an F1-macro score of 0.8179 for English. This suggests
that the finetuned XLM-RoBERTa model effectively handles the stance classes, offering the highest accuracy
for both English and Arabic. The Random Forest model performs better than SVM; however, the KE-MLM
finetuned model outperforms both of these traditional models. The KE-MLM Trump base, KE-MLM Biden
base, and XLM-RoBERTa base models did not perform well. This is due to the large difference between our data
and the pretraining data of these models. For Arabic, the finetuned XLM-RoBERTa outperforms other
models, with an F1-micro score of 0.7647 and an F1-macro score of 0.6480 (Figures 4a and 4b). This highlights
the effectiveness of finetuning LLMs for stance detection tasks. SVM and Random Forest also show
reasonable performance on the Arabic dataset, with F1 (Micro) scores of 0.5588 and 0.6912, respectively.</p>
      <p>The rumor verification is performed through aggregation of the stance labels for each rumor and its
corresponding evidence candidate pairs. We compare our weighted aggregation approach addressing
the imbalance in data with majority voting and soft voting schemes. Weighted voting achieves the
highest performance, with an F1-micro score of 0.6923 and an F1-macro score of 0.6885, as shown in Table 1.
Majority voting and soft voting both yield similar F1-micro and F1-macro scores of 0.5769 and 0.5476,
respectively. These results indicate that weighted voting is the most effective aggregation scheme
for rumor verification with imbalanced data. For evidence retrieval, the weighted voting approach
outperforms other techniques with a Recall@5 of 0.5556 and a MAP of 0.5556. For Arabic rumor stance
classification, the weighted scheme achieves an F1-micro score of 0.5769 and an F1-macro score of 0.5557, with a
Recall@5 of 0.6222 and a MAP of 0.5778 for evidence retrieval. Majority voting and soft voting both
reach an F1-micro score of 0.5000 and an F1-macro score of 0.4002, and have lower Recall@5 and MAP values
of 0.3333 and 0.2889, respectively.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The results highlight important differences in model performance across the evidence retrieval, stance
detection, and rumor verification tasks for both English and Arabic. For evidence retrieval, the SBERT
model demonstrated superior performance in English, particularly in terms of Recall@5 and MAP.
This suggests that SBERT is more effective at capturing semantic similarities for ranking relevant
evidence in English, which is likely due to its deep contextualized embeddings. Conversely, for the
Arabic dataset, the BM25 model outperformed other models, achieving the highest Recall@15 and MAP
scores. This indicates that traditional retrieval techniques such as BM25 are still highly effective for
Arabic text, potentially due to the language’s morphological richness, which enables simple
frequency-based methods to effectively capture relevance. In stance detection, the finetuned XLM-RoBERTa
model consistently achieved the best results across both English and Arabic, which suggests that
domain-specific finetuning of transformer-based models significantly improves the ability to distinguish
between stance classes. However, despite being finetuned on equal instances and similar topics for
both English and Arabic, the accuracy for English is higher than for Arabic. This may be attributed to the
evidence candidates used for finetuning not being very relevant and/or to better English samples being
used in pretraining the model. It is interesting to note that traditional models such as SVM and Random
Forest, while performing reasonably well, were outperformed by XLM-RoBERTa, especially
in terms of F1-macro scores. This indicates that XLM-RoBERTa is better at handling class imbalances
and providing a more balanced prediction across all classes.</p>
      <p>For rumor verification, the aggregation technique of weighted voting proved to be the most effective
for both English and Arabic. In particular, weighted voting achieved the highest F1-micro score and
F1-macro scores, outperforming both majority voting and soft voting. Due to the imbalance in data, the
label weights were inversely proportional to their representation in the training data. The majority
and soft voting schemes yielded the same results for both English and Arabic datasets, with lower
accuracy and F1-macro scores, indicating that there is no large difference in the stance intensities of the
rumor and evidence candidate pairs for a corresponding rumor. The results suggest that weighted voting
is particularly beneficial in handling cases in which some stance classes are more prevalent, thereby
helping to mitigate the impact of class imbalance. There are some limitations to the current approach.
The additional features such as emoji embeddings, hashtags, and URLs did not improve the results of
the stance detection task, which requires more effort on better representation and concatenation with
the content embeddings. Moreover, the retrieval mechanism did not consider the presence of stance in
the authorities’ timeline and therefore did not provide a clear separation between the irrelevant timeline
tweets and evidence candidates. Finetuning SBERT for the task may also have improved the evidence
candidate retrieval mechanism. The results show that transformer-based models such as SBERT and
XLM-RoBERTa are effective for evidence retrieval and stance detection, particularly when finetuned for the task.
Nevertheless, traditional models such as BM25 remain competitive, particularly for non-English data,
and weighted voting emerges as an important technique for improving rumor verification performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this research, we addressed the rumor verification problem on social media platforms, where known
authority accounts correspond to the rumor topic. The proposed system can be deployed as a first-hand
rumor detector to alert on rumor tweets with claims that are not supported by the corresponding
authority accounts. The proposed methodology centers on utilizing evidence retrieved from authority
timelines and stance detection using the transformer-based pipeline. The results show that SBERT and
finetuned XLM-RoBERTa achieve superior performance for evidence retrieval and stance detection.
Our findings emphasize the growing importance of transformer-based models for NLP tasks, while also
highlighting areas where traditional methods and aggregation schemes, such as weighted voting, can
still play a valuable role. In the future, the retrieval module can be improved using the evidence-aware
model to consider the relationships among timeline tweets as well. Feature extraction and utilization
can also be improved to benefit from additional features within the tweet content. Further exploration of hybrid
models that combine traditional retrieval methods, such as BM25, with deep learning techniques could
yield promising results, particularly in multilingual or domain-specific contexts. Similarly, enhancing
aggregation schemes, such as adaptive weighting schemes based on context, could further boost
performance in rumor verification tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Appendix</title>
      <p>Models compared: XLM-RoBERTa Finetuned, XLM-RoBERTa Base, SVM, Random Forest, KE-MLM Biden Finetuned, KE-MLM Trump Base, KE-MLM Biden Base.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cocarascu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>The fact extraction and verification (fever) shared task</article-title>
          ,
          <source>arXiv preprint arXiv:1811.10971</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Fine-grained fact verification with kernel graph attention network</article-title>
          ,
          <source>arXiv preprint arXiv:1910.09796</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. S.</given-names>
            <surname>Khoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Chieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Coupled hierarchical transformer for stance-aware rumor verification in social media conversations</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          , T. Elsayed,
          <article-title>Are authorities denying or supporting? detecting stance of authorities towards rumors in twitter</article-title>
          ,
          <source>Social Network Analysis and Mining</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>34</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          , W. Mansour,
          <article-title>Who can verify this? finding authorities for rumor verification in twitter</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>60</volume>
          (
          <year>2023</year>
          )
          <fpage>103366</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Brennen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <article-title>Types, sources, and claims of COVID-19 misinformation</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bekoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Papagiannopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Deligiannis</surname>
          </string-name>
          ,
          <article-title>Understanding the impact of evidence-aware sentence selection for fact checking</article-title>
          ,
          <source>in: Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>arXiv preprint arXiv:1908.10084</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kruengkrai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A multi-level attention model for evidence-based fact checking</article-title>
          ,
          <source>arXiv preprint arXiv:2106.00950</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>arXiv preprint arXiv:1911.02116</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawintiranon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Knowledge enhanced masked language model for stance detection</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4725</fpage>
          -
          <lpage>4735</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>