<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>F. Primig, The Influence of Media Trust and Normative Role Expectations on the Credibility of Fact Checkers, Journalism Practice</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/arXiv</article-id>
      <title-group>
        <article-title>AuthEv-LKolb at CheckThat! 2024: A Two-Stage Approach To Evidence-Based Social Media Claim Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Kolb</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien, Data Science Research Unit</institution>
          ,
          <addr-line>Favoritenstraße 9-11/194-04, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2024</year>
      </pub-date>
      <volume>18</volume>
      <issue>2024</issue>
      <fpage>1137</fpage>
      <lpage>1157</lpage>
      <abstract>
        <p>This paper covers our submission to CLEF 2024 CheckThat! Lab Task 5: Authority Evidence for Rumor Verification. Misinformation in the form of claims on social media platforms is an ever-present issue. We present a two-stage approach to verifying claims posted on social media based on evidence posted by authority accounts on the same platform. We conduct experiments to find the optimal setup with respect to the target metrics specified for Task 5 of the CLEF 2024 CheckThat! Lab. Our experiments show that Large Language Models, of which we compare GPT-4 and Llama3-70B, are well suited to this particular verification task. The paper finally presents areas where further improvements can be explored.</p>
      </abstract>
      <kwd-group>
        <kwd>fact-checking</kwd>
        <kwd>natural language processing</kwd>
        <kwd>information retrieval</kwd>
        <kwd>CLEF 2024</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This paper covers our submission to CLEF 2024 CheckThat! Lab Task 5: Authority Evidence for Rumor Verification. The descriptions of all tasks, including ours, are provided in the conference paper by the lab organizers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Large platform operators have several options to combat misinformation, such as contracting professional fact-checking services or manually fact-checking claims posted on their platform. However, manually checking every reported post is no longer viable for most large platforms, due to the sheer volume of content uploaded by users. Platforms can improve on these methods, for example by matching new claims to already fact-checked claims and reusing the work that went into fact-checking the original claim. This was already a task at the CLEF 2022 CheckThat! Lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, there are also alternative approaches.
      </p>
      <p>
        Specifically on X.com (formerly Twitter), a community fact-checking system colloquially called “Community Notes” is in place. In a 2022 study, Pröllochs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] investigated the impact of this feature, and one of the findings was that the feature’s “[...] community-driven approach faces challenges concerning opinion speculation and polarization among the user base – in particular with regards to influential user accounts” (p. 11 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
      <p>In this paper, we present a more automated approach to fact-checking claims on social media, using official government statements on the same platform to verify claims. This approach could be used both as a stand-alone service and as a tool to assist human fact-checkers and fact-checking services. There are some drawbacks to relying on official authority accounts rather than the community, which are discussed in Section 5.3.</p>
      <p>
        The official CLEF 2024 CheckThat! Lab Task 5 is defined as: “Given a rumor expressed in a tweet and a set of authorities (one or more authority Twitter accounts) for that rumor, represented by a list of tweets from their timelines during the period surrounding the rumor, the system should retrieve up to 5 evidence tweets from those timelines, and determine if the rumor is supported (true), refuted (false), or unverifiable (in case not enough evidence to verify it exists in the given tweets) according to the evidence” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>We experimented with several setups and combinations of different strategies. Our approach involved and tested two stages: a retrieval stage and a verification stage.
• In the retrieval stage, for a given claim (also referred to as “rumor”), we aim to retrieve evidence from the set of all tweets relevant to that claim.
• In the verification stage, we use the retrieved evidence to predict a label for the claim (REFUTES, SUPPORTS or NOT ENOUGH INFO).</p>
      <p>We structure our paper into the following sections: Section 2 introduces the data we are working with and the target measures we use to evaluate our experiment results. Section 3 discusses the main objectives of the experiments we conducted during our participation, while Section 4 presents our approach to the task. The results of our experiments are presented in Section 5. Finally, Section 6 concludes our paper and presents questions and topics for further research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Dataset and Evaluation Measures</title>
      <p>The data we are working with consists of various tweet texts. For every tweet making a claim, there is a set of tweets authored by authority sources, only some of which are relevant to the claim. Notably, the tweet texts do include links to attached images that were posted with the tweet, which could contain additional information (see Section 6 on future work). The only metadata directly included in the dataset is the username and the tweet ID (which can be used to fetch more metadata from the Twitter API), but these are present only for authority statements, not for claims (which are only a single text string).</p>
      <p>
        Here is an example of what a rumor to be verified looks like: every rumor is provided as JSON, with an ID, the claim text, and a list of statements, each of which contains the account URL that tweeted the statement, the tweet ID, and the tweet text. Labeled data also includes which of the statements are relevant to the claim.
{
  "id": "AuRED_142",
  "claim": "Naturalization decree in preparation: Lebanese passports for sale?! https://t.co/UuQ7yMbSWJ https://t.co/Jf1K1NbZJD",
  "statements": [
    [
      "https://twitter.com/LBpresidency",
      "1555424541509386240",
      "The Information Office of the Presidency of the Republic: What was published by the French newspaper “Liberation” about the “selling” of Lebanese passports to non-Lebanese is false and baseless news."
    ]
  ]
}
        The dataset consists of 160 rumors overall, 128 of which were available with ground truth. Our approach did not involve learning-to-rank [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. So, in our case, the dataset size is only relevant insofar as a larger and more diverse dataset is likely to cover a wider range of scenarios and topics on which the proposed system could be tested.
      </p>
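      <p>To make the structure above concrete, here is a minimal parsing sketch in Python (the class and field names are our own illustration, not part of the task code):

```python
import json
from dataclasses import dataclass

@dataclass
class Statement:
    account_url: str
    tweet_id: str
    text: str

@dataclass
class Rumor:
    rumor_id: str
    claim: str
    statements: list

def parse_rumor(raw: str) -> Rumor:
    """Parse one rumor entry from its JSON string."""
    obj = json.loads(raw)
    statements = [Statement(*triple) for triple in obj["statements"]]
    return Rumor(rumor_id=obj["id"], claim=obj["claim"], statements=statements)
```
      </p>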
      <p>
        For this task, the following target measures are considered when evaluating performance, as specified in the Task description [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
• Macro-F1 as the primary measure for the overall verification performance, which averages the F1 score for each of the three labels to account for class imbalance.
• Strict Macro-F1 as a secondary measure for verification performance, which additionally considers found evidence. For this measure, a “true positive” needs to have a correct label for the rumor, as well as an overlap of at least one piece of evidence between ground-truth evidence and found evidence. More overlap does not increase the Strict F1 score.
• MAP (Mean Average Precision) as the primary measure for retrieval effectiveness.
• R@5 (Recall at 5 “items”, as the system should retrieve at most 5 statements) as a secondary measure for retrieval effectiveness.
      </p>
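      <p>As a sketch of how two of these measures can be computed, here are our own minimal implementations of the definitions above (the official scorer may differ in details, e.g. how a label absent from both gold and predictions is scored):

```python
from statistics import mean

def macro_f1(gold, pred, labels=("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")):
    """Unweighted mean of the per-label F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)

def recall_at_5(gold_evidence, retrieved):
    """Fraction of ground-truth evidence items found in the top 5 retrieved."""
    top5 = set(retrieved[:5])
    hits = sum(1 for ev in gold_evidence if ev in top5)
    return hits / len(gold_evidence) if gold_evidence else 0.0
```
      </p>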
    </sec>
    <sec id="sec-3">
      <title>3. Experiment Design</title>
      <p>For this paper, we ran a set of experiments, the results of which are presented in Section 5. We also made a submission to the CheckThat! Lab. This submission and the experiments are separate: the experiments serve to evaluate the effectiveness of different configurations for our proposed setup, and the submission was created using three of those configurations, since each team could submit up to three runs to the CheckThat! Lab.</p>
      <p>
        Our proposed setup is illustrated in Figure 1. This setup was selected based on initial experimentation with a setup modeled on a paper on “Stance Detection” by Haouari et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which was refined over
the course of development. We narrowed down the methods we used in initial experiments to the
methods described in Section 4.
      </p>
      <p>In the following list, each number refers to a numbered component in Figure 1. Our experiments quantitatively evaluate:
1. The impact of preprocessing and adding external data about the authority.
2. Methods of retrieving relevant statements (“evidence”) in the retrieval stage.
3. The performance of transformer-based approaches for the verification stage.
4. The impact of different options for how the pairwise scores from the verification stage should be combined into the overall label for the rumor.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment Setup</title>
      <p>Our setup consists of two stages: retrieval of evidence related to the current rumor, and a verification
stage using the retrieved evidence from the previous stage to fact-check the rumor. We can also
optionally include some preprocessing and data augmentation steps.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing and Data Augmentation</title>
        <p>Preprocessing approaches and strategies to combine the individual predictions were also part of our
experiments. We aim to obtain the best performing setup of preprocessing strategies for both the
retrieval and verification steps, and the best scoring strategies for the verification step.</p>
        <p>Preprocessing and data augmentation (Figure 1, component 1) are optional features:
• Data augmentation adds the Twitter display name and/or the Twitter author bio to the statement text, depending on the configuration.
• Preprocessing cleans up the text: it removes line breaks, some special characters like quotes and hashtags, URLs, the pattern “RT @&lt;username&gt;” (added by the Twitter API for quote tweets), and emojis.</p>
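        <p>A rough sketch of this cleanup step follows (the regular expressions are our own; the exact character classes in our implementation may differ, and the emoji ranges here are only approximate):

```python
import re

URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"RT @\w+:?")
# Approximate emoji ranges; full coverage would need a dedicated library.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(text: str) -> str:
    """Remove line breaks, URLs, retweet markers, emojis, quotes and hashtags."""
    text = text.replace("\n", " ")
    text = URL_RE.sub("", text)
    text = RT_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    text = re.sub(r'["\u201c\u201d#]', "", text)  # quote marks and hash signs
    return re.sub(r"\s+", " ", text).strip()     # collapse leftover whitespace
```
        </p>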
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Retrieval Stage</title>
        <p>For the retrieval stage (Figure 1, component 2), there are multiple viable options. The main retrieval methods we focused on (and ran experiments for) were:
• PyTerrier, used to create a simple BatchRetrieve pipeline of BM25 followed by PL2 (BM25 &gt;&gt; PL2). See the PyTerrier documentation for details.1
• Cosine distances between embeddings obtained through the OpenAI Embeddings API.
These methods are rather simple, but effective enough for this task. As long as a single relevant source is retrieved, the powerful LLM in the verification stage is able to correctly predict the judgment.</p>
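        <p>The embedding-based variant can be sketched as a plain-Python cosine ranking (in our setup the vectors would come from the OpenAI Embeddings API; here they are ordinary lists of floats, and the function names are our own):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_statements(claim_vec, statement_vecs, k=5):
    """Return (index, score) pairs of the top-k statements by similarity."""
    scored = [(i, cosine(claim_vec, v)) for i, v in enumerate(statement_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```
        </p>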
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Verification Stage</title>
        <p>
          For the verification stage (Figure 1, component 3), we experimented with two major transformer-based approaches:
• A fine-tuned version of BART (specifically bart-large-mnli, available on Hugging Face), a sequence-to-sequence autoencoder by Facebook (Meta AI) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], for zero-shot Natural Language Inference (NLI) [9].2 We initially used NLI models for the verification stage, which use a classification approach to classify a combined text input as ENTAILMENT or CONTRADICTION. However, the deep natural language understanding of LLMs allows them to navigate somewhat complex reasoning tasks very effectively; in fact, they outperform the NLI models we used by a wide margin.
• Large Language Models (LLMs), for the submission GPT-4 by OpenAI [10] (specifically the model version named GPT-4-1106-preview), since it performed the best of all available models at the time of the submission period. Due to the relatively low complexity of the reasoning in the verification step (which is somewhat similar to natural language inference), we theorize that most sufficiently large LLMs would perform similarly (e.g. Llama3-400b or Claude Opus). At the scale of Llama3-70b, we saw significant performance drops compared to GPT-4-Turbo.
        </p>
        <p>Our CheckThat! Lab submission was created using gpt-4-1106-preview as the LLM. After uploading
the submission, we re-ran our experiment setup, for which we used gpt-4o-2024-05-13, as it was cheaper
and faster to run, and was the newest available LLM from OpenAI.</p>
        <p>For OpenAI completions, we invoke the LLM through the OpenAI Assistants API, with each claim-evidence pairing creating a new thread and the system prompt being set in the assistant.3 For the assistant configuration, we used a temperature of 0.01 and a top-p of 0.5; these values should encourage consistent responses.4 Llama3 completions are obtained using the Hugging Face Inference API with the default parameter values and the model Llama3-70B-Instruct, as more powerful Llama3 models are not yet available.5
1https://pyterrier.readthedocs.io/en/latest/terrier-retrieval.html
2https://huggingface.co/facebook/bart-large-mnli
3https://platform.openai.com/docs/api-reference/assistants</p>
        <p>In the verification stage, the LLM is prompted with a template populated with both the claim and the authority statement. The system prompt instructs it to adhere to the output format and to use only information from the prompt, not its domain knowledge or knowledge from training data. The LLM predicts not only a label, but also a confidence in the label between 0 and 1, which is used to combine the pairwise labels in the next step.</p>
        <p>The system prompt, which gives the LLM instructions it must adhere to, is shown below. The OpenAI Assistants API always adhered to this system prompt during our experimentation. We also activated “JSON mode” in the Assistants configuration, which ensures answers follow the format specified in the system prompt, though the system prompt on its own would likely be effective enough to ensure this behavior.</p>
        <p>You are a helpful assistant doing simple reasoning tasks.</p>
        <p>You will be given a statement and a claim.</p>
        <p>You need to decide if a statement either supports the claim ("SUPPORTS"),
refutes the claim ("REFUTES"), or if the statement is not related to the
claim ("NOT ENOUGH INFO").</p>
        <p>USE ONLY THE STATEMENT AND THE CLAIM PROVIDED BY THE USER TO MAKE
YOUR DECISION.</p>
        <p>You must also provide a confidence score between 0 and 1, indicating
how confident you are in your decision.</p>
        <p>You must format your answer in JSON format, like this:
{"decision": ["SUPPORTS"|"REFUTES"|"NOT ENOUGH INFO"],
"confidence": [0...1]}
No yapping.</p>
        <p>Below is a real input message to the LLM (primed with the previous system prompt). In this example,
the data was preprocessed and had no external data added:
"Statement from Authority Account ’LBpresidency’: ’’The
Information Office of the Presidency of the Republic denies
a false news broadcast by the MTV station about Baabda Palace
preparing a decree naturalizing 4 000 people and recalls that
it had denied yesterday the false information published by the
French magazine ’Liberation’ about the same fabricated news ’’"
Claim: "Naturalization decree in preparation: Lebanese passports for sale !"</p>
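        <p>A minimal sketch of constructing such a message and parsing the JSON answer the system prompt enforces (the helper names are hypothetical; the actual Assistants API calls are omitted):

```python
import json

def build_user_message(account: str, statement: str, claim: str) -> str:
    """Hypothetical helper mirroring the input format shown above."""
    return (f"Statement from Authority Account '{account}': ''{statement}''\n"
            f'Claim: "{claim}"')

def parse_verdict(raw: str):
    """Parse the JSON answer enforced by the system prompt and JSON mode."""
    obj = json.loads(raw)
    decision = obj["decision"]
    if decision not in ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO"):
        raise ValueError(f"unexpected decision: {decision}")
    # Defensively clamp the confidence into the [0, 1] range.
    confidence = max(0.0, min(1.0, float(obj["confidence"])))
    return decision, confidence
```
        </p>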
        <p>Since we score every combination of rumor and evidence separately, we have to combine these scores to produce an overall label prediction. As part of our experiments, we tested (Figure 1, component 4):
• Weighting (“scaling”) prediction confidence scores by retrieval score. The retrieval stage, in addition to the top-5 documents, also returns the associated score used to compute the ranking, which can optionally be used here.
• Normalizing retrieval scores, as different retrieval systems return retrieval scores on different scales.
• Including versus ignoring NOT ENOUGH INFO predictions in the final label score calculation.
4https://medium.com/@1511425435311/understanding-openais-temperature-and-top-p-parameters-in-language-modelsd2066504684f
5https://huggingface.co/docs/hub/en/models-inference</p>
        <p>Once we have obtained label predictions for every claim-statement pairing, we weight the confidence predicted by the LLM in the verification stage using the retrieval score (if this feature is active in the configuration), and then calculate the mean of the predicted scores (confidences) to obtain our overall label prediction. If the averaged scores cross a significance threshold, we predict the respective SUPPORTS or REFUTES label. The threshold is not tuned or learned; rather, it is set manually at 0.15, such that two opposing predictions of roughly equal confidence cancel each other out, unless one prediction is much more significant than the other, opposing prediction. Thus, the threshold accounts for some variation between two roughly equally strong predictions. Our experiments show that for this dataset, this simple approach of combining predictions is sufficient.</p>
        <p>Since SUPPORTS predictions are positive and REFUTES predictions are negative, taking the mean of the prediction scores emulates a voting system with votes weighted by the prediction confidences. In this system, NOT ENOUGH INFO predictions do not contribute to the final overall label, as a NOT ENOUGH INFO prediction from the LLM does not indicate any leaning toward either SUPPORTS or REFUTES. Optionally, we include the NOT ENOUGH INFO predictions in the average, lowering the total overall score – potentially below the significance threshold.</p>
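        <p>The combination step described above can be sketched as follows (the function and parameter names are ours; the 0.15 threshold is the manually set value from the text):

```python
THRESHOLD = 0.15  # set manually, not tuned (see text)

def combine(predictions, ignore_nei=True, retrieval_scores=None):
    """Combine pairwise (label, confidence) predictions into an overall label.
    SUPPORTS counts positively, REFUTES negatively; the signed mean is then
    compared against the significance threshold."""
    signed = []
    for i, (label, conf) in enumerate(predictions):
        if retrieval_scores is not None:
            conf = conf * retrieval_scores[i]  # optional scaling by retrieval score
        if label == "SUPPORTS":
            signed.append(conf)
        elif label == "REFUTES":
            signed.append(-conf)
        elif not ignore_nei:
            signed.append(0.0)  # including NEI lowers the average
    if not signed:
        return "NOT ENOUGH INFO"
    score = sum(signed) / len(signed)
    if score >= THRESHOLD:
        return "SUPPORTS"
    if -score >= THRESHOLD:
        return "REFUTES"
    return "NOT ENOUGH INFO"
```
        </p>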
        <p>
          This type of task is related to “stance detection" of authorities, which was introduced in a paper by
Haouari et al. (who are also the organizers of the 2024 CheckThat! Lab task 5) in 2023 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Our approach
follows up on their paper, and expands the implementation to also retrieve evidence from a predefined
dataset. Graves [11] lists three families of approaches to automatic fact verification, one of which is
"[...] consulting authoritative sources" [11]. Manually consulting a third-party authority is definitely
a valid tool for in-depth fact-checkers, and our system aims to assist these fact-checkers by finding
statements an authoritative source already posted publicly, and predicting the stance of the source to
the rumor or claim.
        </p>
        <p>The resources used during development and for the submission are listed here:
• For OpenAI embeddings and GPT-4-Turbo completions we used the OpenAI API.6
• Llama3-70b completions were obtained from the Hugging Face "Inference for Pros" API.7
• BM25, PySerini and TF-IDF retrieval methods, as well as bart-large-mnli for verification, were computationally cheap enough to effectively run on a local desktop PC (AMD Ryzen 5, Nvidia GTX 970, 16 GB memory).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>
        We participated in the CheckThat! Lab Task 5 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and independently ran experiments to find the best
configuration options for our approach. The results of each are reported here, in their own subsections.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Experiment Results</title>
        <p>To test the various configuration options we created, we ran automated experiments on the dev split of the dataset (containing 32 rumors to be verified using the included timelines) to answer this set of research questions:
• RQ1: To what extent can tweets (“evidence”) relevant to a claim be retrieved from timelines of authority accounts, given an initial claim, a set of authority accounts and the timelines of those authority accounts?
• RQ2: To what extent can a claim, given a list of tweets (“evidence”), accurately be identified as being supported by the evidence (true), refuted by the evidence (false), or unverifiable (not enough evidence to verify it)?
• RQ3: To what extent can a pipeline combining the approaches from RQ1 and RQ2 refute or support a claim, automatically retrieving evidence from the timelines of authority accounts?
6https://platform.openai.com/docs/overview
7https://huggingface.co/docs/api-inference/index</p>
        <p>For the experiments presented in Table 1, which aim to find the best retrieval configuration given the features we tested (preprocessing, adding author name and author bio), we did not find significant differences when looking only at the retrieval evaluation. Generally, the best MAP performance was obtained by the simple PyTerrier retrieval method of scoring with BM25, then re-ranking using PL2 (divergence from randomness), with preprocessing enabled and the author bio included in the statement text. It seems that preprocessing slightly improves retrieval performance overall. For the secondary measure, Recall@5, PyTerrier also performed best.</p>
        <p>In our approach, we ran the experiments to optimize the system for the use case of verification as a “pipeline” from start to finish (claim and timeline as input, overall label with evidence as output). Table 2 lists changes in score when a feature is actively used in a configuration versus when it is not. It also shows the score difference between experiments that used LLAMA3 and those that used GPT-4, and changes in verification score between experiments with each retrieval method described above. The features that were tested are described in Section 4. In Table 2, “Ignore NEI” means ignoring NOT ENOUGH INFO (NEI) predictions for the overall score.</p>
        <p>Since we ran experiments in all possible permutations of our configuration options, we calculate the mean score of every configuration where a feature is used, and do the same for every configuration where it is not used (for example, Preprocessing “True” vs. “False”, or in the case of retrieval methods, “OPENAI” vs. “PyTerrier”). The difference in average score gives an indication of the score impact of the feature value. A positive score difference in Table 2 means the average score of the configurations using value option 1 was higher than those using value option 2. In most cases, the difference is not meaningful.</p>
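        <p>This per-feature comparison can be expressed as a simple mean difference over all runs (a sketch with hypothetical field names for the run records):

```python
from statistics import mean

def feature_impact(runs, feature, value1, value2, metric="macro_f1"):
    """Mean metric of configurations where feature == value1, minus the mean
    of those where feature == value2. `runs` is a list of dicts, each holding
    the configuration options plus the achieved scores."""
    scores1 = [r[metric] for r in runs if r[feature] == value1]
    scores2 = [r[metric] for r in runs if r[feature] == value2]
    return mean(scores1) - mean(scores2)
```
        </p>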
        <p>Using GPT-4 over LLAMA3 yields the highest performance gain on average, a noticeable Macro-F1 score increase of about 0.2. This is not surprising, as the GPT-4 model is much more powerful, as mentioned previously. It would be interesting to see score differences on other comparably large language models, like Claude Opus or Google Gemini; however, that comparison is outside the scope of this paper.</p>
        <p>Additionally, it would be interesting to see the influence of different retrieval methods on the verification performance. In our experiments, the difference between retrieval methods is rather small. As mentioned previously, the LLMs in the verification stage are powerful enough that a single piece of relevant evidence usually suffices to predict the correct label. Running the experiments on a more diverse dataset, with different retrieval methods operating in a larger search space, might hinder the verification stage from functioning properly: if no relevant evidence is found, the system is likely to predict NOT ENOUGH INFO – as it should.</p>
        <p>The best performing configurations of the system (at rank 1 and 2, see Table 5 in the Appendix) yielded the best results when not using any preprocessing. Preprocessing nearly always removes some amount of signal along with the noise in the data, which might hurt LLM performance more than it helps. Roughly two thirds of all configurations achieving the highest scores used no preprocessing. Overall, the mean Macro-F1 score of system configurations using preprocessing is lower by 0.0139 in our experiments; see Table 2.</p>
        <p>Proposed features like scaling by retrieval score, normalizing retrieval scores to [0...1] and including external data did not have a significant impact in our experiments with this dataset. The impact of excluding NOT ENOUGH INFO predictions is noticeable: in our configuration, the final label is created by averaging the confidences of all pairwise predictions by the LLM, and if the average over that list passes a threshold, a REFUTES or SUPPORTS label is predicted. Including NOT ENOUGH INFO predictions with a value of 0 simply lowers the average score, which at a retrieval-k of 5 pairs can be significant enough to make a difference. In this case, including the NOT ENOUGH INFO predictions in the average score presumably makes the system too cautious to perform adequately.</p>
        <p>In some cases, the verifier fails to correctly classify SUPPORTS or REFUTES. During our testing, in each of those cases, the system predicted NOT ENOUGH INFO overall, which is the ideal failure case. The system never predicted an overall SUPPORTS where the actual overall label was REFUTES, or the other way around.</p>
        <p>See the Appendix for the full tables, or view the Jupyter Notebook with the full tables on GitHub.8</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. CLEF Submission Results</title>
        <p>In the CheckThat! Lab Task 5, we participated in the challenge for the English dataset. The measures reported by the Lab organizers were MAP and R@5 for retrieval, and Macro-F1 and Strict Macro-F1 for verification (see also Section 2). There was a limit of three runs per team, only one of which was allowed to use external data not included in the dataset (our run labeled “secondary1” used the author display name and author bio from Twitter, if available). The CheckThat! Lab organizers also provide a baseline score. Here, we report our own results and this baseline; the full leaderboard is available on the CheckThat! Lab Task 5 website.9</p>
        <p>We submitted three runs, each with a different configuration of our setup:
• “primary”: No external data and no preprocessing, only OpenAI embeddings with “raw” data.
• “secondary1”: OpenAI embeddings for retrieval, with external Twitter data about the author added, and no preprocessing.</p>
        <p>• “secondary2”: PyTerrier retrieval method, using preprocessed data.
All three runs used GPT-4 in the verification stage, as described in Section 4. Preprocessing and external data are described in Section 4. The configuration options for the combination of the pairwise predictions were all set to “False”, meaning no scaling or weighting using the retrieval score, and NOT ENOUGH INFO predictions being included in the average used to calculate the overall label.
8https://github.com/LuisKolb/clef-2024-authority/blob/main/clef/pipeline/eval_experiment_large.ipynb
9checkthat.gitlab.io/clef2024/task5</p>
        <p>In the retrieval stage, presented in Table 3, the best score our system achieved was a MAP of 0.549, using the primary run setup. Notably, we achieved an R@5 score of 0.619 using the secondary setup with external data, which would have placed 4th on the leaderboard if R@5 had been the targeted measure. The highest score was achieved by team “IAI Group”, with a MAP of 0.628, who used a “Cross-encoder” approach, according to their Run ID on the official leaderboard.</p>
        <p>In the verification stage, we achieved the best result with a Macro-F1 of 0.895 in the secondary system
using external data (authority display name and authority bio, obtained from Twitter). Our results can
be seen in Table 4.</p>
        <p>As the leaderboard results show, our approach to retrieval did not work particularly well in comparison to the other participants. However, our verification component significantly outperformed those of the other participants. Presumably, this demonstrates the strength of Large Language Models in this type of task, where few relevant pieces of evidence are needed to predict correctly, and irrelevant evidence does not introduce significant noise into the overall prediction. Thus, even though our retrieval component was comparatively weaker, the relatively high recall resulted in good predictions overall.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Limitations of the Approach</title>
        <p>There are a few caveats to this proposed setup, and its utility would likely lie in serving as an additional
tool in the toolbox used to combat the spread of misinformation. These caveats are:
• Human fact-checking by neutral sources will most likely be more precise, more reliable and more
trusted, assuming the fact-checkers themselves are seen as neutral and trustworthy (which is
influenced by a multitude of factors, as analyzed in the study by Primig [12]).
• In contrast to “traditional” fact-checking, our approach does not verify the actual truth content of a claim, only whether authority sources support or dispute it. For this paper, we are working with the definition of “authority” laid out by Haouari et al. in their 2023 paper [13]. Authority sources can be government accounts; for example, a Ministry of Health in a given state could be considered an authority for rumors or claims about public health related matters in the same state. In general, authorities are considered experts in a given area, but not all experts are necessarily considered authorities. Additionally, an account is considered an authority when a rumor is about the account holder themselves. For example, the dev split of the dataset contains a rumor about a journalist being involved in a deadly car crash, and the statement “My loved ones and my people who were busy with me: I am fine [...]” posted by the journalist’s account is considered authority evidence refuting the rumor. Because of examples like this, we included experiments with adding external data like the account name and account bio to the dataset.
• Another consideration is model selection. For our submission, we used GPT-4-1106-preview,
the most recent OpenAI model available at the time. It is important to note that closed models
are subject to frequent changes (see the OpenAI changelog at https://platform.openai.com/docs/changelog),
and “open” models like Llama3-400b should produce more predictable output over longer periods
of time. Additionally, closed models are usually subject to content moderation, which could
plausibly impact system performance and reliability, since the area of fact-checking often deals
with controversial claims and statements. Unfortunately,
Llama3-400b was not yet publicly available at the time of writing.</p>
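        <p>The external-data enrichment mentioned above could be sketched as follows; the function and field names are illustrative, not the exact implementation from our repository.</p>
        <preformat>
```python
# Hypothetical sketch: prepend account metadata to a statement's text
# before retrieval and verification. All names here are illustrative.
def enrich_statement(text: str, account_name: str = "", account_bio: str = "") -> str:
    parts = []
    if account_name:
        parts.append(f"Account: {account_name}")
    if account_bio:
        parts.append(f"Bio: {account_bio}")
    parts.append(text)
    return "\n".join(parts)

enriched = enrich_statement(
    "My loved ones and my people who were busy with me: I am fine",
    account_name="@journalist_handle",  # hypothetical handle
    account_bio="Reporter covering regional news",  # hypothetical bio
)
```
        </preformat>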
        <p>Running the system in the best-performing configuration is also the most expensive way to run the
system (both in terms of computing time and API costs). There are possible trade-offs: if deployment
“at scale” or “in production” is desired, some compromises could be necessary:
• BM25 performs similarly to the cosine-distance method over OpenAI embeddings. It is also much
cheaper to execute, as it does not require an external API call and the associated token costs.
• For the best-performing configuration, external data is added to the data
set. This inclusion slightly improves performance, but also requires another API call, which is
itself expensive due to the restructuring of the X.com (formerly Twitter) API (see
https://developer.x.com/en/docs/twitter-api/getting-started/about-twitter-api).</p>
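        <p>For illustration, the cheaper BM25 option can be approximated with a few lines of standard-library code; this is a simplified re-implementation for exposition, not the tuned implementation used in our experiments.</p>
        <preformat>
```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency and idf per query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    idf = {t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t in tf:
                s += idf[t] * tf[t] * (k1 + 1) / (
                    tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                )
        scores.append(s)
    return scores

# Toy example: rank two authority statements against a claim.
claim = "ministry confirms new vaccine rollout".split()
statements = [
    "the ministry of health confirms the vaccine rollout starts monday".split(),
    "weather warning issued for the coastal region".split(),
]
scores = bm25_scores(claim, statements)
best = max(range(len(scores)), key=scores.__getitem__)
```
        </preformat>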
        <p>A recent study by Primig [12] from 2022 looked at the perception of fact-checkers and fact-checking
services in the study population. The author found that, while higher trust in media correlates with
trust in fact-checking, a significant part of the population views fact-checking services as
propaganda tools of the established government. To increase trust in the system, its purpose needs
to be clearly stated: to assist users in verifying rumors using official sources. Users who
distrust and reject official sources out of hand will not find the information provided by our system
helpful.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Perspectives for Future Work</title>
      <p>In this paper, we have demonstrated the ability of our proposed setup to classify, generally accurately
and in a zero-shot fashion, whether official sources SUPPORT or REFUTE unseen rumors, using the data
provided by the task organizers. In a real-world application, some considerations would have to be
made with respect to operational aspects like computation costs, as LLMs are expensive to use “at scale”.
Model selection could also have a significant impact (especially for “closed-source” models), as discussed in
Section 5.3.</p>
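      <p>The zero-shot verification step can be sketched as prompt construction plus label parsing; the prompt wording and helper names below are hypothetical, and the exact prompts used in our runs live in the linked repository.</p>
      <preformat>
```python
# Hypothetical prompt builder and label parser for the verification
# stage; not the exact wording used in our submitted runs.
LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")

def build_prompt(claim: str, evidence: list) -> str:
    lines = [
        "You verify rumors using evidence posted by authority accounts.",
        f"Claim: {claim}",
        "Evidence:",
    ]
    lines += [f"- {e}" for e in evidence]
    lines.append("Answer with exactly one of: " + ", ".join(LABELS))
    return "\n".join(lines)

def parse_label(completion: str) -> str:
    # Map a free-form LLM completion onto a task label, defaulting to
    # NOT ENOUGH INFO when no label appears in the text.
    upper = completion.upper()
    for label in LABELS:
        if label in upper:
            return label
    return "NOT ENOUGH INFO"
```
      </preformat>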
      <p>In future work, the system can be improved and extended, with each change evaluated for its
effect on performance. Intuitive areas for further experimentation and development are:
• Do different embedding models for retrieval influence the performance of the verification stage?
Do they significantly influence the distribution of answers (for example, are there fewer or more NOT
ENOUGH INFO predictions when using another embedding model)?
• How can the retrieval stage be improved? Retrieval is essential for any fact-checking system to
be able to judge a claim, as the verification stage relies on relevant evidence.
• How well does the system generalize to other domains and social media platforms? The datasets
used were mainly focused on a specific geographical region, auto-translated from Arabic, and the
topics of the statement-claim pairings were overall relatively similar.
• Different translation systems could also impact the reliability and effectiveness of any NLP-based
approach, especially if the approach expects English data (as ours does) and data from
other languages has to be automatically translated.
• Does including more metadata improve retrieval or verification performance? How should the
different metadata types be included? For example, if a statement is a direct reply or a “quote
tweet” of the original tweet containing the claim, it is intuitive that this type of metadata would
signal increased relevance.
• Multi-modality: tweets don’t only contain text content, but sometimes also images and video.
Does adding this additional information to the tweet content, for example via transcription
or the multimodal capabilities of modern LLMs, improve retrieval or verification performance?</p>
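      <p>As one example, the reply/quote metadata signal discussed above could be prototyped as a simple multiplicative score adjustment; the field names and boost values here are hypothetical.</p>
      <preformat>
```python
# Hypothetical relevance boost for statements that directly engage
# with the tweet containing the claim. Field names are illustrative.
def adjusted_score(base_score: float, meta: dict,
                   reply_boost: float = 2.0, quote_boost: float = 1.5) -> float:
    if meta.get("in_reply_to_claim"):   # direct reply to the claim tweet
        return base_score * reply_boost
    if meta.get("quotes_claim"):        # quote tweet of the claim tweet
        return base_score * quote_boost
    return base_score
```
      </preformat>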
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>The GitHub repository can be found at github.com/LuisKolb/clef-2024-authority. It includes
all the different components we used for our experiments, and the scripts used to produce our results
for the CheckThat! Task 5 submission.</p>
    </sec>
    <sec id="sec-8">
      <title>B. Glossary</title>
      <p>In this paper, we use some specific words to describe specific concepts:
• “claim”: the individual text snippet/sentence(s) that is to be verified (using authority sources)
• “rumor”: used interchangeably with claim (in the dataset, every rumor consists of a claim and
several statements, and has a "rumor_id")
• “statement”: a social media post, in this context posted by an authority account
• “evidence”: a statement relevant to a specific claim
• “authority”: typically official government social media accounts, but also sometimes the individual
person a claim is about, whose social media posts can be used to verify that claim</p>
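      <p>These concepts can be summarized as a minimal data model; the field names below are illustrative rather than the exact dataset schema.</p>
      <preformat>
```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    statement_id: str
    text: str            # a social media post by an authority account
    account_name: str

@dataclass
class Rumor:
    rumor_id: str
    claim: str                       # the text to be verified
    statements: list = field(default_factory=list)

    def evidence_for(self, retrieved_ids):
        # Statements retrieved as relevant become evidence for the claim.
        return [s for s in self.statements if s.statement_id in retrieved_ids]

# Hypothetical example data, not taken from the task dataset.
r = Rumor("r1", "claim text", [Statement("s1", "post", "@ministry")])
ev = r.evidence_for({"s1"})
```
      </preformat>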
      <sec id="sec-8-1">
        <title>B.1. Verification Experiment Results and Tables</title>
        <p>Configurations with the same score are assigned the same rank, as they produced the same results.
Some column names are abbreviated for layout width reasons:</p>
        <p>[Table: per-configuration verification results. The recoverable columns list the model (LLAMA or OPENAI), a True/False flag, and a Scale column; the remaining columns were lost in extraction.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Piskorski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Di Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2022 CheckThat! Lab Task 2 on Detecting Previously Fact-Checked Claims</article-title>
          ,
          <source>in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</source>
          , Bologna, Italy,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pröllochs</surname>
          </string-name>
          ,
          <article-title>Community-Based Fact-Checking on Twitter's Birdwatch Platform</article-title>
          ,
          <source>Proceedings of the International AAAI Conference on Web and Social Media</source>
          <volume>16</volume>
          (
          <year>2022</year>
          )
          <fpage>794</fpage>
          -
          <lpage>805</lpage>
          . URL: https://ojs.aaai.org/index.php/ICWSM/article/view/19335. doi:10.1609/icwsm.v16i1.19335.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF</source>
          <year>2024</year>
          , Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Learning to Rank for Information Retrieval</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>225</fpage>
          -
          <lpage>331</lpage>
          . URL: https://www.nowpublishers.com/article/Details/INR-016. doi:10.1561/1500000016. Publisher: Now Publishers, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <article-title>The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness</article-title>
          , in: N.
          <string-name>
            <surname>Goharian</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Tonellotto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lipani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macdonald</surname>
          </string-name>
          , I. Ounis (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>449</fpage>
          -
          <lpage>458</lpage>
          . doi:10.1007/978-3-031-56069-9_62.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          , T. Elsayed,
          <article-title>Are authorities denying or supporting? Detecting stance of authorities towards rumors in Twitter</article-title>
          ,
          <source>Social Network Analysis and Mining</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>34</fpage>
          . URL: https://doi.org/10.1007/s13278-023-01189-3. doi:10.1007/s13278-023-01189-3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          , L. Zettlemoyer,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>