AuthEv-LKolb at CheckThat! 2024: A Two-Stage Approach to Evidence-Based Social Media Claim Verification

Luis Kolb¹, Allan Hanbury¹
¹ TU Wien, Data Science Research Unit, Favoritenstraße 9-11/194-04, 1040 Vienna, Austria

Abstract
This paper covers our submission to CLEF 2024 CheckThat! Lab Task 5: Authority Evidence for Rumor Verification. Misinformation spread through claims on social media platforms is an ever-present issue. We present a two-stage approach to verifying claims posted on social media, based on evidence posted by authority accounts on the same platform. We conduct experiments to find the optimal setup with respect to the target metrics specified for Task 5 of the CLEF 2024 CheckThat! Lab. Our experiments show that Large Language Models, of which we compare GPT-4 and Llama3-70B, are well suited to this verification task. The paper closes with areas where further improvements can be explored.

Keywords
fact-checking, natural language processing, information retrieval, CLEF 2024

1. Introduction

This paper covers our submission to CLEF 2024 CheckThat! Lab Task 5: Authority Evidence for Rumor Verification. The descriptions of all tasks, including ours, are provided in the overview paper by the lab organizers [1]. Large platform operators have many options to combat misinformation on their platforms, such as professional fact-checking services or manually fact-checking claims. However, manually checking every reported post is no longer viable for most large platforms, due to the sheer volume of content uploaded by users. Platforms can improve on these methods, for example by identifying and matching new claims to already fact-checked claims and reusing the work that went into fact-checking the original claim; this was a task at the CLEF 2022 CheckThat! Lab [2]. However, there are also alternative approaches.
Specifically on X.com (formerly Twitter), a community fact-checking system is in place, colloquially called "Community Notes". In a 2022 study, Pröllochs [3] investigated the impact of this feature; one finding was that the feature's "[...] community-driven approach faces challenges concerning opinion speculation and polarization among the user base – in particular with regards to influential user accounts" (p. 11, [3]). In this paper, we present a more automated approach to fact-checking claims on social media, using official government statements on the same platform to verify claims. It could be used both as a stand-alone service and as a tool to assist human fact-checkers and fact-checking services. There are some drawbacks to relying on official authority accounts rather than the community, which are discussed in Section 5.3.

The official CLEF 2024 CheckThat! Lab Task 5 is defined as: "Given a rumor expressed in a tweet and a set of authorities (one or more authority Twitter accounts) for that rumor, represented by a list of tweets from their timelines during the period surrounding the rumor, the system should retrieve up to 5 evidence tweets from those timelines, and determine if the rumor is supported (true), refuted (false), or unverifiable (in case not enough evidence to verify it exists in the given tweets) according to the evidence" [4].

CLEF 2024: Conference and Labs of the Evaluation Forum, 9-12 September 2024, Grenoble, France.
kolb.luis@gmail.com (L. Kolb); allan.hanbury@tuwien.ac.at (A. Hanbury); https://luiskolb.at (L. Kolb)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073.

We experimented with several setups and combinations of different strategies. Our approach involves and tests two stages: a retrieval stage and a verification stage.
• In the retrieval stage, for a given claim (also referred to as a "rumor"), we aim to retrieve evidence relevant to that claim from the set of all tweets.
• In the verification stage, we use the retrieved evidence to predict a label for the claim (REFUTES, SUPPORTS or NOT ENOUGH INFO).

We structure our paper into the following sections: Section 2 introduces the data we are working with and the target measures we use to evaluate our experiment results. Section 3 discusses the main objectives of the experiments we conducted during our participation, while Section 4 presents our approach to the task. The results of our experiments are presented in Section 5. Finally, Section 6 concludes our paper and presents questions and topics for further research.

2. Task Dataset and Evaluation Measures

The data we are working with consists of various tweet texts. For every tweet making a claim, there is a set of tweets authored by authority sources, only some of which are relevant to the claim. Notably, the tweet texts do include links to images attached to the tweet, which could contain additional information (see Section 6 on future work). The only metadata directly included in the dataset is the username and the tweet ID (which can be used to fetch more metadata from the Twitter API), but these are present only for authority statements, not for claims (which are only a single text string). Here is an example of a rumor to be verified: every rumor is provided as JSON, with an ID, the claim text, and a list of statements, each of which contains the account URL that tweeted the statement, the tweet ID, and the tweet text. Labeled data also includes which of the statements are relevant to the claim.

{
  "id": "AuRED_142",
  "claim": "Naturalization decree in preparation: Lebanese passports for sale?!
https://t.co/UuQ7yMbSWJ https://t.co/Jf1K1NbZJD",
  "statements": [
    [
      "https://twitter.com/LBpresidency",
      "1555424541509386240",
      "The Information Office of the Presidency of the Republic: What was published by the French newspaper “Liberation” about the “selling” of Lebanese passports to non-Lebanese is false and baseless news."
    ],
    ...
  ]
}

The dataset consists of 160 rumors overall, 128 of which were available with ground truth. Our approach did not involve learning-to-rank [5], so in our case the dataset size is only relevant insofar as a larger and more diverse dataset is likely to cover a wider range of scenarios and topics on which the proposed system could be tested.

For this task, the following target measures are considered when evaluating performance, as specified in the task description [6]:

• Macro-F1 as the primary measure for overall verification performance, which averages the F1 score over the three labels to account for class imbalance.
• Strict Macro-F1 as a secondary measure for verification performance, which additionally considers the retrieved evidence. For this measure, a "true positive" needs both a correct label for the rumor and an overlap of at least one piece of evidence between the ground-truth evidence and the retrieved evidence. More overlap does not increase the Strict Macro-F1 score.
• MAP (Mean Average Precision) as the primary measure for retrieval effectiveness.
• R@5 (Recall at 5 "items", as the system should retrieve at most 5 statements) as a secondary measure for retrieval effectiveness.

Figure 1: Visualization of our system for verifying a rumor. Numbers 1-4 indicate components with multiple configuration options.

3. Experiment Design

For this paper, we ran a set of experiments, the results of which are presented in Section 5. We also made a submission to the CheckThat! Lab. This submission and the experiments are separate.
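For concreteness, the retrieval-side measure R@5 and the overlap condition behind Strict Macro-F1, as described in Section 2, can be sketched in a few lines. The helper names are ours, not from the official scoring code:

```python
def recall_at_5(retrieved, relevant):
    """R@5: fraction of ground-truth evidence tweets found among the top 5 retrieved IDs."""
    hits = len(set(retrieved[:5]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def strict_hit(pred_label, gold_label, retrieved, relevant):
    """A prediction counts toward Strict Macro-F1 only if the label is correct AND
    at least one retrieved tweet overlaps the gold evidence (more overlap adds nothing)."""
    return pred_label == gold_label and bool(set(retrieved[:5]) & set(relevant))

# toy example: one of two relevant tweets retrieved
print(recall_at_5(["t1", "t2", "t9"], ["t1", "t5"]))        # 0.5
print(strict_hit("SUPPORTS", "SUPPORTS", ["t1"], ["t1"]))   # True
```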
The experiments serve to evaluate the effectiveness of different configurations of our proposed setup, while the submission was created using three of those configurations, since each team could submit up to three runs to the CheckThat! Lab. Our proposed setup is illustrated in Figure 1. It was selected based on initial experimentation with a setup modeled on the "stance detection" work of Haouari et al. [7], and was refined over the course of development. We narrowed the methods used in initial experiments down to those described in Section 4. In the following list, each number refers to the correspondingly numbered component in Figure 1. Our experiments quantitatively evaluate:

1. The impact of preprocessing and of adding external data about the authority.
2. Methods of retrieving relevant statements ("evidence") in the retrieval stage.
3. The performance of transformer-based approaches in the verification stage.
4. The impact of different options that influence how the pairwise scores from the verification stage are combined into the overall label for the rumor.

4. Experiment Setup

Our setup consists of two stages: retrieval of evidence related to the current rumor, and a verification stage that uses the retrieved evidence to fact-check the rumor. We can optionally include preprocessing and data augmentation steps.

4.1. Preprocessing and Data Augmentation

Preprocessing approaches and strategies to combine the individual predictions were also part of our experiments. We aim to find the best-performing combination of preprocessing strategies for both the retrieval and verification steps, and the best scoring strategies for the verification step. Preprocessing and data augmentation (component 1 in Figure 1) are optional features:

• Data augmentation adds the Twitter display name and/or the Twitter author bio to the statement text, depending on the configuration.
• Preprocessing cleans up the text: it removes line breaks, some special characters such as quotes and hashtags, URLs, the pattern "RT @" (added by the Twitter API for quote tweets), and emojis.

4.2. Retrieval Stage

For the retrieval stage (component 2 in Figure 1), there are multiple viable options. The main retrieval methods we focused on (and ran experiments for) were:

• PyTerrier, used to create a simple BatchRetrieve pipeline of (BM25) >> (PL2); see the PyTerrier documentation for details.¹
• Cosine distances between embeddings obtained through the OpenAI Embeddings API.

These methods are rather simple, but effective enough for this task: as long as a single relevant source is retrieved, the powerful LLM in the verification stage is able to predict the correct judgment.

4.3. Verification Stage

For the verification stage (component 3 in Figure 1), we experimented with two major transformer-based approaches:

• A fine-tuned version of BART (specifically bart-large-mnli, available on Hugging Face), a sequence-to-sequence autoencoder by Facebook (Meta AI) [8], for zero-shot Natural Language Inference (NLI) [9].² We initially used NLI models for the verification stage, which classify a combined text input as ENTAILMENT or CONTRADICTION. However, the deeper natural language understanding of LLMs allows them to navigate somewhat complex reasoning tasks very effectively; in fact, they outperform the NLI models we used by a wide margin.
• Large Language Models (LLMs): for the submission, GPT-4 by OpenAI [10] (specifically the model version named gpt-4-1106-preview), since it performed best of all models available during the submission period. Given the relatively low complexity of the reasoning in the verification step (which is similar to natural language inference), we theorize that most sufficiently large LLMs would perform similarly (e.g. Llama3-400b or Claude Opus).
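The embedding-based retrieval option from Section 4.2 reduces to ranking statements by cosine similarity to the claim embedding. A minimal sketch with toy vectors standing in for OpenAI embeddings (the helper name is ours):

```python
import numpy as np

def top_k_statements(claim_vec, statement_vecs, k=5):
    """Rank authority statements by cosine similarity to the claim embedding
    and return (index, score) pairs for the top k."""
    A = np.asarray(statement_vecs, dtype=float)
    c = np.asarray(claim_vec, dtype=float)
    sims = A @ c / (np.linalg.norm(A, axis=1) * np.linalg.norm(c))
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# toy 2-d embeddings: statement 0 points the same way as the claim
print(top_k_statements([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], k=1))  # [(0, 1.0)]
```

In the real pipeline, the vectors would come from the OpenAI Embeddings API; the ranking step itself is independent of the embedding provider.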
At the scale of Llama3-70b, we saw significant performance drops compared to GPT-4-Turbo. Our CheckThat! Lab submission was created using gpt-4-1106-preview as the LLM. After uploading the submission, we re-ran our experiment setup using gpt-4o-2024-05-13, as it was cheaper and faster to run and was the newest available LLM from OpenAI.

For OpenAI completions, we invoke the LLM through the OpenAI Assistants API, with each claim-evidence pairing creating a new thread and the system prompt being set in the assistant.³ For the assistant configuration, we used a temperature of 0.01 and a top-p of 0.5; these values should encourage consistent responses.⁴ Llama3 completions are obtained through the Hugging Face Inference API with the default parameter values and the model Llama3-70B-Instruct, as more powerful Llama3 models were not yet available.⁵

The LLM is prompted in the verification stage with a prompt template that is populated with both the claim and the authority statement. The system prompt instructs it to adhere to the output format and to use only information from the prompt, not its domain knowledge or knowledge from training data. The LLM predicts not only a label, but also a confidence in that label between 0 and 1, which is used to combine the pairwise labels in the next step. The system prompt is shown below; the OpenAI Assistants API always adhered to it during our experimentation. We also activated "JSON mode" in the assistant configuration, which ensures answers follow the format specified in the system prompt, though the system prompt on its own would likely suffice to ensure this behavior.

¹ https://pyterrier.readthedocs.io/en/latest/terrier-retrieval.html
² https://huggingface.co/facebook/bart-large-mnli
³ https://platform.openai.com/docs/api-reference/assistants

You are a helpful assistant doing simple reasoning tasks.
You will be given a statement and a claim. You need to decide if a statement either supports the claim ("SUPPORTS"), refutes the claim ("REFUTES"), or if the statement is not related to the claim ("NOT ENOUGH INFO"). USE ONLY THE STATEMENT AND THE CLAIM PROVIDED BY THE USER TO MAKE YOUR DECISION. You must also provide a confidence score between 0 and 1, indicating how confident you are in your decision. You must format your answer in JSON format, like this: {"decision": ["SUPPORTS"|"REFUTES"|"NOT ENOUGH INFO"], "confidence": [0...1]} No yapping.

Below is a real input message to the LLM (primed with the previous system prompt). In this example, the data was preprocessed and no external data was added:

Statement from Authority Account 'LBpresidency': "The Information Office of the Presidency of the Republic denies a false news broadcast by the MTV station about Baabda Palace preparing a decree naturalizing 4 000 people and recalls that it had denied yesterday the false information published by the French magazine 'Liberation' about the same fabricated news"

Claim: "Naturalization decree in preparation: Lebanese passports for sale!"

Since we score every combination of rumor and evidence separately, we have to combine the pairwise predictions to produce an overall label. As part of our experiments, we tested (component 4 in Figure 1):

• Weighting ("scaling") prediction confidence scores by retrieval score. The retrieval stage, in addition to the top-5 documents, also returns the score used to compute the ranking, which can optionally be used here.
• Normalizing retrieval scores, as different retrieval systems return scores on different scales.
• Including versus ignoring NOT ENOUGH INFO predictions in the final label score calculation.
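Even with JSON mode enabled, defensively parsing the answers is cheap. A sketch of a validator for the format the system prompt requests (the helper name and the fallback policy are ours, not from our submission code):

```python
import json

VALID = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}

def parse_verdict(raw):
    """Parse one LLM answer in the JSON format requested by the system prompt.
    Falls back to NOT ENOUGH INFO with zero confidence on malformed output."""
    try:
        obj = json.loads(raw)
        decision, confidence = obj["decision"], float(obj["confidence"])
        if decision in VALID and 0.0 <= confidence <= 1.0:
            return decision, confidence
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass
    return "NOT ENOUGH INFO", 0.0

print(parse_verdict('{"decision": "REFUTES", "confidence": 0.9}'))  # ('REFUTES', 0.9)
print(parse_verdict("no yapping indeed"))                           # ('NOT ENOUGH INFO', 0.0)
```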
⁴ https://medium.com/@1511425435311/understanding-openais-temperature-and-top-p-parameters-in-language-models-d2066504684f
⁵ https://huggingface.co/docs/hub/en/models-inference

Once we have obtained label predictions for every claim-statement pairing, we weight the confidence predicted by the LLM in the verification stage by the retrieval score (if this feature is active in the configuration), and then calculate the mean of the predicted scores (confidences) to obtain our overall label prediction. If the averaged score crosses a significance threshold, we predict the respective SUPPORTS or REFUTES label. The threshold is not tuned or learned; rather, it is set manually at 0.15, such that two opposing predictions of roughly equal confidence cancel each other out unless one prediction is much stronger than the opposing one. Thus, the threshold accounts for some variation between two roughly equally strong predictions. Our experiments show that for this dataset, this simple approach to combining predictions is sufficient.

Since SUPPORTS predictions are counted as positive and REFUTES predictions as negative, taking the mean of the prediction scores emulates a voting system with votes weighted by the prediction confidences. In this system, NOT ENOUGH INFO predictions do not contribute to the final overall label, as a NOT ENOUGH INFO prediction from the LLM does not indicate any leaning toward either SUPPORTS or REFUTES. Optionally, we include NOT ENOUGH INFO predictions in the average, lowering the total overall score – potentially below the significance threshold.

This type of task is related to "stance detection" of authorities, introduced in a 2023 paper by Haouari et al. (who are also the organizers of the 2024 CheckThat! Lab Task 5) [7]. Our approach follows up on their paper and extends the implementation to also retrieve evidence from a predefined dataset.
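The voting scheme described above (signed confidences, mean, manually chosen 0.15 threshold, optional exclusion of NOT ENOUGH INFO votes) can be sketched as follows; the function name is ours:

```python
def combine(predictions, threshold=0.15, ignore_nei=True):
    """Combine pairwise (label, confidence) predictions into one rumor label.
    SUPPORTS votes count positive, REFUTES negative; NEI votes count as 0 or
    are dropped entirely, depending on ignore_nei. The 0.15 threshold mirrors
    the manually set value described in the text."""
    sign = {"SUPPORTS": 1.0, "REFUTES": -1.0, "NOT ENOUGH INFO": 0.0}
    if ignore_nei:
        predictions = [p for p in predictions if p[0] != "NOT ENOUGH INFO"]
    if not predictions:
        return "NOT ENOUGH INFO"
    score = sum(sign[label] * conf for label, conf in predictions) / len(predictions)
    if score > threshold:
        return "SUPPORTS"
    if score < -threshold:
        return "REFUTES"
    return "NOT ENOUGH INFO"

print(combine([("REFUTES", 0.9), ("NOT ENOUGH INFO", 0.8)]))  # REFUTES
print(combine([("SUPPORTS", 0.6), ("REFUTES", 0.5)]))         # NOT ENOUGH INFO
```

The second call illustrates the cancellation effect: two opposing votes of similar confidence average out below the threshold, so no verdict is issued.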
Graves [11] lists three families of approaches to automatic fact verification, one of which is "[...] consulting authoritative sources" [11]. Manually consulting a third-party authority is certainly a valid tool for in-depth fact-checkers, and our system aims to assist such fact-checkers by finding statements an authoritative source has already posted publicly and predicting the stance of the source toward the rumor or claim.

The resources used during development and for the submission are listed here:

• For OpenAI embeddings and GPT-4-Turbo completions, we used the OpenAI API.⁶
• Llama3-70b completions were obtained from the Hugging Face "Inference for Pros" API.⁷
• BM25, Pyserini and TF-IDF retrieval methods, as well as bart-large-mnli for verification, were computationally cheap enough to run on a local desktop PC (AMD Ryzen 5, Nvidia GTX 970, 16 GB memory).

5. Results and Discussion

We participated in CheckThat! Lab Task 5 [4] and independently ran experiments to find the best configuration options for our approach. The results of each are reported in their own subsections below.

5.1. Experiment Results

To test the various configuration options, we ran automated experiments on the dev split of the dataset (containing 32 rumors to be verified using the included timelines) to answer this set of research questions:

• RQ1: To what extent can tweets ("evidence") relevant to a claim be retrieved from the timelines of authority accounts, given an initial claim, a set of authority accounts, and the timelines of those accounts?
• RQ2: To what extent can a claim, given a list of tweets ("evidence"), accurately be identified as being supported by the evidence (true), refuted by the evidence (false), or unverifiable (not enough evidence to verify it)?
• RQ3: To what extent can a pipeline combining the approaches from RQ1 and RQ2 refute or support a claim, automatically retrieving evidence from the timelines of authority accounts?
⁶ https://platform.openai.com/docs/overview
⁷ https://huggingface.co/docs/api-inference/index

Table 1
Experiment results for retrieval configurations on the dev set.

Rank  MAP    R@5    Retrieval   Preprocess  Author Name  Author Bio
1     0.688  0.754  PyTerrier   True        False        True
2     0.674  0.747  Embeddings  False       False        False
3     0.671  0.728  Embeddings  True        True         False
4     0.659  0.752  PyTerrier   True        True         True
5     0.657  0.708  PyTerrier   True        False        False
6     0.643  0.717  Embeddings  True        True         True
7     0.643  0.717  PyTerrier   False       True         False
8     0.641  0.710  Embeddings  True        False        False
9     0.641  0.708  PyTerrier   False       False        False
10    0.640  0.657  Embeddings  False       False        True
11    0.637  0.708  PyTerrier   True        True         False
12    0.634  0.719  Embeddings  False       True         False
13    0.633  0.699  PyTerrier   False       True         True
14    0.628  0.675  PyTerrier   False       False        True
15    0.620  0.708  Embeddings  False       True         True
16    0.590  0.719  Embeddings  True        False        True

Table 2
Differences in average verification performance score over all configurations, for each feature. A positive difference represents the average score increase when value option 1 is used over value option 2.

Feature tested  Value option 1  Value option 2  Macro-F1 Diff.  Strict Macro-F1 Diff.
Verification    OPENAI          LLAMA           +0.1911         +0.2066
Retrieval       PyTerrier       Embeddings      +0.0239         +0.0132
Preprocessing   False           True            +0.0139         +0.0117
External Data   False           True            +0.0066         +0.0078
Normalize       False           True            +0.0300         +0.0306
Scale           False           True            +0.0635         +0.0637
Ignore NEI      True            False           +0.0709         +0.0708

For the experiments presented in Table 1, which aim to find the best retrieval configuration given the features we tested (preprocessing, adding the author name, and adding the author bio), we did not find significant differences looking only at the retrieval evaluation.
Generally, the best MAP performance was obtained by the simple PyTerrier retrieval method of scoring with BM25 and then re-ranking with PL2 (divergence from randomness), using preprocessing and including the author bio in the statement text. Preprocessing seems to slightly improve retrieval performance overall. For the secondary measure, Recall@5, PyTerrier also performed best.

In our approach, we ran the experiments to optimize the system for the use case of verification as a "pipeline" from start to finish (claim and timeline as input, overall label with evidence as output). Table 2 lists changes in score when a feature is actively used in a configuration versus when it is not. It also shows the score difference between experiments that used Llama3 and those that used GPT-4, and changes in verification score between experiments with each retrieval method described above. The features that were tested are described in Section 4. In Table 2, "Ignore NEI" means ignoring NOT ENOUGH INFO (NEI) predictions in the overall score.

Since we ran experiments over all possible permutations of our configuration options, we calculate the mean score of every configuration where a feature is used, and likewise for every configuration where it is not (for example, Preprocessing "True" vs. "False", or, in the case of the verification methods, "OPENAI" vs. "LLAMA"). The difference between these averages indicates the score impact of the feature value. A positive score difference in Table 2 means the average score of the configurations using value option 1 was higher than of those using value option 2. In most cases, the difference is not meaningful. Using GPT-4 over Llama3 yields the highest performance gain on average, a noticeable Macro-F1 increase of about 0.2. This is not surprising, as GPT-4 is a much more powerful model, as mentioned previously.
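The numbers in Table 2 come from exactly this kind of grouped comparison. A sketch with made-up scores (not our experiment data) showing the mean-difference computation for one feature:

```python
from statistics import mean

def feature_impact(runs, feature):
    """Average score of configurations with the feature enabled minus the
    average of those with it disabled, as in Table 2."""
    on = [r["score"] for r in runs if r[feature]]
    off = [r["score"] for r in runs if not r[feature]]
    return mean(on) - mean(off)

# toy experiment log: each entry is one configuration and its Macro-F1
runs = [
    {"preprocess": True,  "score": 0.80},
    {"preprocess": True,  "score": 0.70},
    {"preprocess": False, "score": 0.85},
    {"preprocess": False, "score": 0.75},
]
print(round(feature_impact(runs, "preprocess"), 3))  # -0.05
```

A negative value here would mean the feature lowered the mean score, matching how the Preprocessing row of Table 2 is read.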
It would be interesting to see score differences with other comparably large language models, like Claude Opus or Google Gemini; however, that comparison is outside the scope of this paper. It would also be interesting to see the influence of different retrieval methods on verification performance. In our experiments, the difference between retrieval methods is rather small. As mentioned previously, the LLMs in the verification stage are powerful enough that a single piece of relevant evidence usually suffices to predict the correct label. Running the experiments on a more diverse dataset, with different retrieval methods and a larger search space, might hinder the verification stage more noticeably. If no relevant evidence is found, the system is likely to predict NOT ENOUGH INFO – as it should.

The best-performing configurations of the system (at ranks 1 and 2, see Table 5 in the Appendix) did not use any preprocessing. Preprocessing nearly always removes some amount of signal along with the noise in the data, which might hurt LLM performance more than it helps. Roughly two thirds of the highest-scoring configurations used no preprocessing. Overall, the mean Macro-F1 score of system configurations using preprocessing is lower by 0.0139 in our experiments (see Table 2).

Proposed features like scaling by the retrieval score, normalizing retrieval scores to [0...1], and including external data did not have a significant impact in our experiments with this dataset. The impact of excluding NOT ENOUGH INFO predictions is noticeable: in our configuration, the final label is created by averaging the confidences of all pairwise predictions by the LLM, and if that average passes a threshold, a REFUTES or SUPPORTS label is predicted. Including NOT ENOUGH INFO predictions with a value of 0 simply lowers the average score, which at a retrieval depth of 5 pairs can be significant enough to make a difference.
In this case, including the NOT ENOUGH INFO predictions in the average score presumably makes the system too cautious to perform adequately. In some cases, the verifier fails to correctly classify SUPPORTS or REFUTES; during our testing, in each of those cases, the system predicted NOT ENOUGH INFO overall, which is the ideal failure mode. The system never predicted an overall SUPPORTS where the actual label was REFUTES, or the other way around. See the Appendix for the full tables, or view the Jupyter notebook with the full tables on GitHub.⁸

5.2. CLEF Submission Results

In CheckThat! Lab Task 5, we participated in the challenge for the English dataset. The measures reported by the lab organizers were MAP and R@5 for retrieval, and Macro-F1 and Strict Macro-F1 for verification (see also Section 2). There was a limit of 3 runs submitted per team, only one of which was allowed to use external data not included in the dataset (our run labeled "secondary1" used the author display name and author bio from Twitter, if available). The CheckThat! Lab organizers also provide a baseline score. Here, we report our own results and this baseline; the full leaderboard is available on the CheckThat! Lab Task 5 website.⁹

We submitted three runs, each with a different configuration of our setup:

• "primary": no external data and no preprocessing, only OpenAI embeddings with "raw" data.
• "secondary1": OpenAI embeddings for retrieval, with external Twitter data about the author added, and no preprocessing.
• "secondary2": PyTerrier retrieval, using preprocessed data.

⁸ https://github.com/LuisKolb/clef-2024-authority/blob/main/clef/pipeline/eval_experiment_large.ipynb
⁹ checkthat.gitlab.io/clef2024/task5

Table 3
Selected results for the English retrieval leaderboard. For each other team, the best submission score is presented.
Team                 Run Label   MAP    R@5    Retrieval   Preprocess  External Data
IAI Group            secondary1  0.628  0.676
bigIR                primary     0.604  0.677
Axolotl              primary     0.566  0.617
Team DEFAULT         primary     0.559  0.634
AuthEv-LKolb (ours)  primary     0.549  0.587  Embeddings  False       False
AuthEv-LKolb (ours)  secondary2  0.524  0.563  PyTerrier   True        False
AuthEv-LKolb (ours)  secondary1  0.510  0.619  Embeddings  False       True
(baseline)                       0.335  0.445

Table 4
Selected results for the English verification leaderboard. For all runs, we used GPT-4 as the verification component. For each other team, the best submission score is presented.

Team                 Run Label   Macro-F1  Strict Macro-F1  Retrieval   Preprocess  External Data
AuthEv-LKolb (ours)  secondary1  0.895     0.876            Embeddings  False       True
AuthEv-LKolb (ours)  primary     0.879     0.861            Embeddings  False       False
AuthEv-LKolb (ours)  secondary2  0.831     0.831            PyTerrier   True        False
Axolotl              primary     0.687     0.687
(baseline)                       0.495     0.495
Team DEFAULT         primary     0.482     0.454
IAI Group            secondary1  0.459     0.444
bigIR                primary     0.458     0.428

All three runs used GPT-4 in the verification stage, as described in Section 4, where preprocessing and external data are also described. The configuration options for combining the pairwise predictions were all set to "False", meaning no scaling or weighting by the retrieval score, and NOT ENOUGH INFO predictions were included in the average used to calculate the overall label.

In the retrieval stage, presented in Table 3, the best score our system achieved was a MAP of 0.549 with the primary run setup. Notably, we achieved an R@5 of 0.619 with the secondary setup using external data, which would have placed 4th on the leaderboard if R@5 were the targeted measure. The highest score was achieved by team "IAI Group" with a MAP of 0.628, who used a cross-encoder approach, according to their run ID on the official leaderboard.
In the verification stage, we achieved the best result, a Macro-F1 of 0.895, with the secondary1 system using external data (authority display name and bio, obtained from Twitter). Our results can be seen in Table 4. As the leaderboard results show, our approach to retrieval did not work particularly well in comparison to the other participants. However, our verification component significantly outperformed theirs. Presumably, this demonstrates the strength of Large Language Models in this type of task, where few relevant pieces of evidence are needed to predict correctly and irrelevant evidence does not introduce significant noise into the overall prediction. Thus, even though our retrieval component was comparatively weaker, its relatively high recall resulted in good predictions overall.

5.3. Limitations of the Approach

There are a few caveats to the proposed setup, and its utility would likely lie in serving as an additional tool in the toolbox used to combat the spread of misinformation. These caveats are:

• Human fact-checking by neutral sources will most likely be more precise, more reliable and more trusted, assuming the fact-checkers themselves are seen as neutral and trustworthy (which is influenced by a multitude of factors, as analyzed in the study by Primig [12]).
• In contrast to "traditional" fact-checking, our approach does not verify the actual truth content of a claim, only whether authority sources support or dispute it. For this paper, we work with the definition of "authority" laid out by Haouari et al. in their 2023 paper [13]. Authority sources can be government accounts; for example, a Ministry of Health in a given state could be considered an authority for rumors or claims about public health matters in that state. In general, authorities are considered experts in a given area, but not all experts are necessarily considered authorities.
Additionally, an account is considered an authority when a rumor concerns the account holder themselves. For example, the dev split of the dataset contains a rumor about a journalist being involved in a deadly car crash, and the statement "My loved ones and my people who were busy with me: I am fine [...]" posted by the journalist's account is considered authority evidence refuting the rumor. Because of examples like this, we included experiments that add external data, such as the account name and account bio, to the dataset.
• Another consideration is model selection. For our submission, we used gpt-4-1106-preview, the most recent OpenAI model available at the time. It is important to note that closed models are subject to frequent changes, while "open" models like Llama3-400b should produce more predictable output over longer periods of time.¹⁰ Additionally, closed models are usually subject to content moderation, which could plausibly impact system performance and reliability – the area of fact-checking often deals with controversial claims and statements, after all. Unfortunately, Llama3-400b was not yet publicly available at the time of writing.

Running the system in the best-performing configuration is also the most expensive way to run it (both in terms of computing time and API costs). If deployment "at scale" or "in production" is desired, some trade-offs could be necessary:

• BM25 performs similarly to the OpenAI embedding cosine-distance method, and it is much cheaper to execute, as it does not require an external API call and the associated tokens.
• In the best-performing configuration, external data is added to the dataset.
This inclusion slightly improves performance, but requires another, likewise expensive, API call, due to the restructuring of the X.com (formerly Twitter) API.11

A study by Primig [12] examined the perception of fact-checkers and fact-checking services in its study population. The author found that, while higher trust in media correlates with trust in fact-checking, a significant part of the population views fact-checking services as propaganda tools of the established government. To increase trust in the system, its purpose needs to be clearly stated: to assist users in verifying rumors using official sources. Users who distrust and reject official sources out of hand will not find the information provided by our system helpful.

6. Conclusion and Perspectives for Future Work

In this paper, we have demonstrated the ability of our proposed setup to accurately classify, in a zero-shot fashion, whether official sources SUPPORT or REFUTE unseen rumors, using the data provided by the task organizers. In a real-world application, operational aspects such as computation costs would have to be considered, as LLMs are expensive to use “at scale”. Model selection could also have a significant impact (especially for “closed-source” models), as discussed in Section 5.3. In future work, improvements and extensions of the system should be evaluated for performance gains. Intuitive areas for further experimentation and development are:

10 https://platform.openai.com/docs/changelog
11 https://developer.x.com/en/docs/twitter-api/getting-started/about-twitter-api

• Do different embedding models for retrieval influence the performance of the verification stage? Do they significantly influence the distribution of answers (for example, are there fewer or more NOT ENOUGH INFO predictions with another embedding model)?
• How can the retrieval stage be improved?
Retrieval is essential for any fact-checking system, as the verification stage relies on relevant evidence to judge a claim.
• How well does the system generalize to other domains and social media platforms? The datasets used were mainly focused on a specific geographical region, were auto-translated from Arabic, and the statement-claim pairings were overall relatively similar in topic.
• Different translation systems could also impact the reliability and effectiveness of any NLP-based approach, especially if the approach expects English data (as ours does) and data from other languages has to be automatically translated.
• Does including more metadata improve retrieval or verification performance? How should the different metadata types be included? For example, if a statement is a direct reply to or a “quote tweet” of the original tweet containing the claim, it is intuitive that this type of metadata would signal increased relevance.
• Multi-modality: tweets do not only contain text, but sometimes also images and video. Does adding this additional information to the tweet content, for example via transcription or the multimodal capabilities of modern LLMs, improve retrieval or verification performance?

References

[1] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[2] P. Nakov, H. Mubarak, N. Babulkov, Overview of the CLEF-2022 CheckThat!
Lab Task 2 on Detecting Previously Fact-Checked Claims, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, 2022.
[3] N. Pröllochs, Community-Based Fact-Checking on Twitter’s Birdwatch Platform, Proceedings of the International AAAI Conference on Web and Social Media 16 (2022) 794–805. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/19335. doi:10.1609/icwsm.v16i1.19335.
[4] F. Haouari, T. Elsayed, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024.
[5] T.-Y. Liu, Learning to Rank for Information Retrieval, Foundations and Trends® in Information Retrieval 3 (2009) 225–331. URL: https://www.nowpublishers.com/article/Details/INR-016. doi:10.1561/1500000016.
[6] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458. doi:10.1007/978-3-031-56069-9_62.
[7] F. Haouari, T. Elsayed, Are authorities denying or supporting? Detecting stance of authorities towards rumors in Twitter, Social Network Analysis and Mining 14 (2024) 34. URL: https://doi.org/10.1007/s13278-023-01189-3. doi:10.1007/s13278-023-01189-3.
[8] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019.
URL: http://arxiv.org/abs/1910.13461. doi:10.48550/arXiv.1910.13461, arXiv:1910.13461 [cs, stat].
[9] B. MacCartney, Natural Language Inference, Stanford University Computer Science Department, 2009. URL: https://books.google.at/books?id=F55EAQAAIAAJ.
[10] OpenAI, J. Achiam, S. Adler, S. Agarwal, et al., GPT-4 Technical Report, 2024. URL: http://arxiv.org/abs/2303.08774. doi:10.48550/arXiv.2303.08774, arXiv:2303.08774 [cs].
[11] L. Smars, FACTSHEET: Understanding the Promise and Limits of Automated Fact-Checking, 2018. URL: https://www.digitalnewsreport.org/publications/2018/factsheet-understanding-promise-limits-automated-fact-checking/.
[12] F. Primig, The Influence of Media Trust and Normative Role Expectations on the Credibility of Fact Checkers, Journalism Practice 18 (2024) 1137–1157. URL: https://doi.org/10.1080/17512786.2022.2080102.
doi:10.1080/17512786.2022.2080102.
[13] F. Haouari, T. Elsayed, W. Mansour, Who can verify this? Finding authorities for rumor verification in Twitter, Information Processing & Management 60 (2023) 103366. URL: https://www.sciencedirect.com/science/article/pii/S0306457323001036. doi:10.1016/j.ipm.2023.103366.

A. Online Resources

The GitHub repository can be found at github.com/LuisKolb/clef-2024-authority. The repository includes all the components we used for our experiments, and the scripts used to produce our results for the CheckThat! Task 5 submission.

B. Glossary

In this paper, we use specific words to describe specific concepts:
• “claim”: the individual text snippet/sentence(s) that is to be verified (using authority sources)
• “rumor”: used interchangeably with claim (in the dataset, every rumor consists of a claim and several statements, and has a "rumor_id")
• “statement”: a social media post, in this context posted by an authority account
• “evidence”: a statement relevant to a specific claim
• “authority”: typically an official government social media account, but sometimes also the individual person a claim is about, whose social media posts can be used to verify that claim

B.1. Verification Experiment Results and Tables

Configurations with the same score are assigned the same rank, as they produced the same results. Some column names are abbreviated for layout width reasons:
• MF1: Macro-F1 score
• SMF1: Strict-Macro-F1 score
• Pre: whether Preprocessing was used
• ExtData: whether External Data (Author Name and Bio) was used
• IgnNEI: whether NOT ENOUGH INFO (NEI) pairwise predictions are included in the decision weighting (False) or ignored (True)

Table 5
Experiment results for verification configurations/feature combinations on the dev set. Sorted by Macro-F1. Ranks 1-5.
Rank MF1 SMF1 Retrieval Verification Pre ExtData Scale Norm IgnNEI
1 0.872 0.856 Embeddings OPENAI False True False False True
1 0.872 0.856 Embeddings OPENAI False True False False False
1 0.872 0.856 Embeddings OPENAI False True True False True
1 0.872 0.856 Embeddings OPENAI False True False True True
1 0.872 0.856 Embeddings OPENAI False True False True False
1 0.872 0.856 Embeddings OPENAI False True True True True
1 0.872 0.872 PyTerrier OPENAI False True False False True
1 0.872 0.872 PyTerrier OPENAI False True False False False
1 0.872 0.872 PyTerrier OPENAI False True True False True
1 0.872 0.872 PyTerrier OPENAI False True True False False
1 0.872 0.872 PyTerrier OPENAI False True False True True
1 0.872 0.872 PyTerrier OPENAI False True False True False
1 0.872 0.856 Embeddings OPENAI False False False False True
1 0.872 0.856 Embeddings OPENAI False False False False False
1 0.872 0.856 Embeddings OPENAI False False True False True
1 0.872 0.856 Embeddings OPENAI False False False True True
1 0.872 0.856 Embeddings OPENAI False False False True False
1 0.872 0.856 Embeddings OPENAI False False True True True
1 0.872 0.872 PyTerrier OPENAI False False False False True
1 0.872 0.872 PyTerrier OPENAI False False False False False
1 0.872 0.872 PyTerrier OPENAI False False True False True
1 0.872 0.872 PyTerrier OPENAI False False True False False
1 0.872 0.872 PyTerrier OPENAI False False False True True
1 0.872 0.872 PyTerrier OPENAI False False False True False
1 0.872 0.856 Embeddings OPENAI True True True True True
1 0.872 0.872 PyTerrier OPENAI True True False False True
1 0.872 0.872 PyTerrier OPENAI True True False False False
1 0.872 0.872 PyTerrier OPENAI True True True False True
1 0.872 0.872 PyTerrier OPENAI True True True False False
1 0.872 0.872 PyTerrier OPENAI True True False True True
1 0.872 0.872 PyTerrier OPENAI True True False True False
1 0.872 0.856 Embeddings OPENAI True False False False True
1 0.872 0.856 Embeddings OPENAI True False False False False
1 0.872 0.856 Embeddings OPENAI True False True False True
1 0.872 0.856 Embeddings OPENAI True False False True True
1 0.872 0.856 Embeddings OPENAI True False False True False
1 0.872 0.856 Embeddings OPENAI True False True True True
2 0.855 0.841 PyTerrier OPENAI True False False False True
2 0.855 0.841 PyTerrier OPENAI True False False False False
2 0.855 0.841 PyTerrier OPENAI True False True False True
2 0.855 0.841 PyTerrier OPENAI True False True False False
2 0.855 0.841 PyTerrier OPENAI True False False True True
2 0.855 0.841 PyTerrier OPENAI True False False True False
3 0.831 0.816 Embeddings OPENAI True True False False True
3 0.831 0.816 Embeddings OPENAI True True False False False
3 0.831 0.816 Embeddings OPENAI True True True False True
3 0.831 0.816 Embeddings OPENAI True True False True True
3 0.831 0.816 Embeddings OPENAI True True False True False
4 0.820 0.820 PyTerrier OPENAI False True True True True
4 0.820 0.820 PyTerrier OPENAI False False True True True
4 0.820 0.820 PyTerrier OPENAI True True True True True
5 0.806 0.790 PyTerrier OPENAI True False True True True

Table 6
Experiment results for verification configurations/feature combinations on the dev set. Sorted by Macro-F1. Ranks 6-30.
Rank MF1 SMF1 Retrieval Verification Pre ExtData Scale Norm IgnNEI
6 0.723 0.699 Embeddings OPENAI False True True True False
6 0.723 0.699 Embeddings OPENAI False False True True False
7 0.713 0.696 Embeddings LLAMA True True True True False
8 0.700 0.677 Embeddings OPENAI True False True True False
9 0.691 0.661 Embeddings LLAMA True True True False True
9 0.691 0.661 Embeddings LLAMA True True True True True
10 0.682 0.657 PyTerrier LLAMA False True True False True
10 0.682 0.657 PyTerrier LLAMA False True True False False
11 0.661 0.620 Embeddings LLAMA False True True True True
12 0.661 0.632 Embeddings LLAMA True True False False True
12 0.661 0.632 Embeddings LLAMA True True False False False
12 0.661 0.632 Embeddings LLAMA True True False True True
12 0.661 0.632 Embeddings LLAMA True True False True False
13 0.657 0.634 Embeddings OPENAI True True True True False
14 0.651 0.620 PyTerrier LLAMA False True True True True
14 0.651 0.620 PyTerrier LLAMA False False True True True
15 0.647 0.634 PyTerrier LLAMA True True True False True
15 0.647 0.634 PyTerrier LLAMA True True True False False
16 0.645 0.628 PyTerrier LLAMA True True True True True
17 0.643 0.619 PyTerrier LLAMA False False True False True
17 0.643 0.619 PyTerrier LLAMA False False True False False
18 0.640 0.611 PyTerrier LLAMA False True False False False
18 0.640 0.611 PyTerrier LLAMA False True False True False
19 0.637 0.606 Embeddings LLAMA True False True False True
19 0.637 0.606 Embeddings LLAMA True False True True True
20 0.637 0.617 Embeddings LLAMA True False True True False
21 0.630 0.613 PyTerrier LLAMA True False True True True
22 0.628 0.605 Embeddings OPENAI False True True False False
22 0.628 0.605 Embeddings OPENAI False False True False False
23 0.627 0.598 PyTerrier LLAMA False True False False True
23 0.627 0.598 PyTerrier LLAMA False True False True True
23 0.627 0.598 Embeddings LLAMA True False False False True
23 0.627 0.598 Embeddings LLAMA True False False False False
23 0.627 0.598 Embeddings LLAMA True False False True True
23 0.627 0.598 Embeddings LLAMA True False False True False
24 0.626 0.612 PyTerrier LLAMA True True False False False
24 0.626 0.612 PyTerrier LLAMA True True False True False
25 0.624 0.611 PyTerrier LLAMA True False True False True
25 0.624 0.611 PyTerrier LLAMA True False True False False
26 0.618 0.576 Embeddings LLAMA False True False False True
26 0.618 0.576 Embeddings LLAMA False True False False False
26 0.618 0.576 Embeddings LLAMA False True True False True
26 0.618 0.576 Embeddings LLAMA False True False True True
26 0.618 0.576 Embeddings LLAMA False True False True False
27 0.615 0.601 PyTerrier LLAMA True True False False True
27 0.615 0.601 PyTerrier LLAMA True True False True True
28 0.606 0.583 Embeddings OPENAI True True True False False
29 0.605 0.577 PyTerrier LLAMA False False False False False
29 0.605 0.577 PyTerrier LLAMA False False False True False
30 0.595 0.575 Embeddings LLAMA True False True False False

Table 7
Experiment results for verification configurations/feature combinations on the dev set. Sorted by Macro-F1. Ranks 31-42.
Rank MF1 SMF1 Retrieval Verification Pre ExtData Scale Norm IgnNEI
31 0.590 0.562 PyTerrier LLAMA False False False False True
31 0.590 0.562 PyTerrier LLAMA False False False True True
31 0.590 0.576 PyTerrier LLAMA True False False False True
31 0.590 0.576 PyTerrier LLAMA True False False True True
32 0.585 0.571 PyTerrier LLAMA True False False False False
32 0.585 0.571 PyTerrier LLAMA True False False True False
33 0.557 0.525 Embeddings LLAMA False True True True False
34 0.545 0.521 Embeddings OPENAI True False True False False
35 0.537 0.519 Embeddings LLAMA True True True False False
36 0.537 0.511 PyTerrier LLAMA True False True True False
37 0.489 0.489 PyTerrier OPENAI True True True True False
38 0.485 0.457 PyTerrier LLAMA False True True True False
38 0.485 0.457 PyTerrier LLAMA False False True True False
39 0.468 0.444 PyTerrier LLAMA True True True True False
40 0.453 0.420 Embeddings LLAMA False True True False False
41 0.413 0.413 PyTerrier OPENAI False False True True False
41 0.413 0.413 PyTerrier OPENAI True False True True False
42 0.394 0.394 PyTerrier OPENAI False True True True False
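To make the IgnNEI column concrete, the following minimal sketch illustrates how pairwise claim-statement predictions could be aggregated into a rumor-level verdict, with NEI predictions either counted or ignored. This is an illustration only: the function name `aggregate_verdict` and the simple majority vote are assumptions for the sketch, not our exact decision-weighting implementation.

```python
from collections import Counter

def aggregate_verdict(pairwise_preds, ignore_nei=True):
    """Aggregate pairwise SUPPORTS/REFUTES/NEI predictions into one verdict.

    pairwise_preds: one label per retrieved (claim, statement) pair.
    ignore_nei: if True, NEI predictions carry no weight in the decision
    (corresponding to IgnNEI=True in Tables 5-7).
    """
    counts = Counter(pairwise_preds)
    if ignore_nei:
        counts.pop("NOT ENOUGH INFO", None)
    if not counts:
        # no evidence left either way
        return "NOT ENOUGH INFO"
    # simple majority vote over the remaining labels
    return counts.most_common(1)[0][0]

# Three irrelevant pairs and two refuting pairs:
preds = ["NOT ENOUGH INFO"] * 3 + ["REFUTES"] * 2
print(aggregate_verdict(preds, ignore_nei=True))   # REFUTES (NEI pairs dropped)
print(aggregate_verdict(preds, ignore_nei=False))  # NOT ENOUGH INFO (NEI outvotes)
```

This mirrors the observation in Section 5.3 that a weaker retrieval stage can still yield good verdicts: with IgnNEI=True, a handful of relevant evidence pairs decides the label even when most retrieved statements are irrelevant.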