=Paper=
{{Paper
|id=Vol-3740/paper-62
|storemode=property
|title=Fraunhofer SIT at CheckThat! 2024: Adapter Fusion for Check-Worthiness Detection
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-62.pdf
|volume=Vol-3740
|authors=Inna Vogel,Pauline Möhle
|dblpUrl=https://dblp.org/rec/conf/clef/VogelM24
}}
==Fraunhofer SIT at CheckThat! 2024: Adapter Fusion for Check-Worthiness Detection==
Notebook for the CheckThat! Lab at CLEF 2024
Inna Vogel 1,2,*, Pauline Möhle 1
1 Fraunhofer Institute for Secure Information Technology SIT | ATHENE - National Research Center for Applied Cybersecurity, Rheinstrasse 75, Darmstadt, 64295, Germany, https://www.sit.fraunhofer.de/
2 Advisori FTC GmbH, Kaiserstraße 44, 60329 Frankfurt am Main, Germany, https://www.advisori.de/
Abstract
This paper describes the Fraunhofer SIT team’s third-place approach for CLEF-2024 CheckThat! lab Challenge
Task 1 for English. The "Check-Worthiness Estimation" task is to determine whether a text snippet from a political
debate should be prioritised for fact-checking. Identifying check-worthy statements aims to facilitate manual
fact-checking by prioritising claims that fact-checkers should consider first. It can also be considered as the
primary step of a fact-checking system. Our proposed system is an adapter fusion model that integrates a task
adapter with a Named Entity Recognition (NER) adapter. Adapters offer a resource-efficient alternative to fully
fine-tuning transformer models. Our submitted model achieves an F1 score of 0.78 on the English test set and was ranked third in the competition.
Keywords
check-worthiness detection, fact-checking, adapter fusion, task adapter, NER
1. Introduction
The fact-checking process typically involves three main steps. The first step is to identify statements or
claims within a text that need to be fact-checked, as not all claims are equally important or contain
pertinent information that needs to be verified. This can include false claims, statistics or other
objectively verifiable inaccuracies. Fact-checkers prioritise claims for verification based on their potential
impact, factual consistency or public interest. Once a claim has been selected, the second step is to
gather credible evidence to support or refute it by consulting reliable sources such as academic journals,
official reports, reputable news organisations, subject matter experts and primary sources such as
original documents or statistics. To ensure consistency and accuracy, fact-checkers and journalists
cross-reference information from multiple sources. The main challenge is that the majority of fact-
checkers’ work remains manual. As a result, there is an urgent need to develop technologies that can
facilitate, accelerate and improve journalists’ fact-checking and fake news and misinformation detection
tasks.
The first step in the fact-checking pipeline, automatically identifying statements worthy of verification,
has the potential to assist fact-checkers and journalists by locating and highlighting statements within
a text that warrant further verification. This process could streamline the fact-checking workflow and
reduce the potential for human bias in selecting claims for verification. Check-worthy sentences or
statements are usually those that contain factual information such as dates, definitions, statistics or
descriptions of events or laws.
The CheckThat! Lab has been tackling this scientific problem for the past several years. The aim
of this year’s CheckThat! Lab Task 1 "Check-Worthiness Estimation" is to determine whether a claim
in a tweet and/or a political debate/speech is worth fact-checking. The task is considered a binary
classification task with data available in Arabic, English and Dutch [1]. Fraunhofer SIT participated
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
inna.vogel@advisori.de (I. Vogel)
www.linkedin.com/in/inna-vogel-nlp (I. Vogel)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
in Task 1 of the CLEF 2024 CheckThat! Lab Challenge for the English language, identifying relevant claims in political debates.
In this paper, we propose an adapter fusion approach that integrates a task adapter with a Named
Entity Recognition (NER) adapter. Adapters are a resource-efficient alternative to fully fine-tuned
transformer models [2]. Initially, we trained a task adapter to effectively detect check-worthy statements.
As check-worthy claims often contain facts in the form of named entities, such as personal names, dates, financial and percentage values, we combined the task adapter with a NER adapter. With an F1 score of 0.78, our proposed adapter fusion model placed third in the competition.
2. Related Work
While early approaches focused on a fixed set of features (such as sentiment, word count, part of speech
(PoS) tags and named entities (NE)) and utilized traditional machine learning models (Naive Bayes,
SVM and Random Forest) [3], recent work focuses on pre-trained language models such as BERT [4, 5].
The CLEF CheckThat! challenge, which was introduced in 2018 and is still ongoing, has contributed
a considerable amount of research in recent years. Despite the diversity of models and representations
employed in the initial years of the challenge, including k-nearest neighbors [6] and recurrent neural
networks [7] for models, and character n-grams [6] and word embeddings [8] for representation,
neural approaches utilizing word embeddings demonstrated superior performance compared to classical
methods [9]. This trend continued in the 2019 challenge, where the top-performing team used an
LSTM model trained with dual token embeddings (domain-specific word embeddings and syntactic
dependencies) after pre-training on previous debates [10].
After the emergence of transformers in 2019 [11], there was a shift in contributions towards utilizing
transformers for check-worthiness detection in subsequent years [12, 13]. Following the introduction
of GPT-3, the best-performing approach for the English subtask in 2023 was to fine-tune GPT-3 with
7.7k examples from pre-existing datasets. However, subsequent experiments by the same group using
DeBERTaV3 yielded almost identical results to GPT-3 [14].
Schlicht et al. [15] conducted an investigation into the cross-training of adapter fusion models across
various world languages, including Arabic, English, and Spanish, for the purpose of multilingual check-
worthiness detection. They used mBERT and XLM-R and adapter fusion models on multilingual datasets
from the CLEF CheckThat! Lab 2022 and 2021 challenges. They showed that the models outperformed
monolingual task adapters and fully tuned models. An F1 score of 0.51 was achieved for the detection of
English check-worthy claims. Vogel et al. [16] combined a task adapter and a NER adapter and achieved
state-of-the-art results on two challenging check-worthiness benchmarks. The best-performing model
achieved an F1 score of 0.92 on the CheckThat! Lab 2023 dataset. In addition, the authors interpreted the
fusion attentions, demonstrating the effectiveness of their approach.
3. Data Set Description
The data for this year’s CheckThat! 2024 Challenge Task 1 "Check-Worthiness Estimation" is available
in Arabic, English, and Dutch [1] (Spanish was only offered for training). However, our approach focuses only on the English data set. While the methodology employed could theoretically be applied to other languages, specific modifications to the model would be required to account for linguistic differences.
For the English task, the data set consists of political debates collected from the US presidential general election debates. Examples from the data set are shown in Table 1.
The goal of Task 1 is to identify entries that contain check-worthy claims. The data set was annotated by human labelers. The label distributions and data set splits were provided by the organisers and are shown in Table 2. As can be seen, the data set consists of 23,851 entries, divided into two classes: check-worthy and non-check-worthy, labeled "YES" and "NO" respectively. The data set is significantly unbalanced, with approximately 25% of the entries labeled as check-worthy and 75% labeled as non-check-worthy.

Table 1
Instances of check-worthy (Yes) and non-check-worthy (No) sentences for Task 1

| Instance | Class |
|---|---|
| 1. It called for an increase in the production of energy in the United States. | Yes |
| 2. There are 9 countries that spend more than we do on public education. | Yes |
| 3. I’d like to mention one thing. | No |
| 4. "And for that to happen, we have to strengthen our economy here at home." | No |
To train and test the system, the data set is divided into three subsets: training (Train: 22,501 entries),
development (Dev: 1,032 entries), and development test (Dev Test: 318 entries). The development test
set contains a slightly higher proportion of check-worthy entries (33%) compared to the other data sets.
The unlabeled test set (Test) was provided for evaluation purposes and consists of 341 sentences.
Table 2
Class distribution of the CheckThat! Lab 2024 Task 1 English data set ("Test" set is unlabelled for evaluation purposes.)

|  | Total | Yes | No |
|---|---|---|---|
| Train | 22,501 | 5,413 | 17,088 |
| Dev | 1,032 | 238 | 794 |
| Dev Test | 318 | 108 | 210 |
| Sum | 23,851 | 5,759 | 18,092 |
| Test | 341 | - | - |
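The imbalance reported for this data set follows directly from the counts in Table 2; a quick sanity check in plain Python (figures taken from the table):

```python
# Per-split (check-worthy, non-check-worthy) counts from Table 2.
counts = {"Train": (5413, 17088), "Dev": (238, 794), "Dev Test": (108, 210)}

yes_total = sum(yes for yes, no in counts.values())  # 5,759 check-worthy
no_total = sum(no for yes, no in counts.values())    # 18,092 non-check-worthy
total = yes_total + no_total                         # 23,851 labelled entries

print(f"check-worthy:     {yes_total / total:.1%}")  # ~24.1%
print(f"non-check-worthy: {no_total / total:.1%}")   # ~75.9%
```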
4. Methodology and Results
In this section, we present our submitted adapter fusion approach, which combines a task adapter with
a NER adapter. Adapters are a lightweight alternative to full model fine-tuning, consisting of a small set
of newly initialised weights at each layer of the pre-trained model [2]. These newly introduced weights are
updated during fine-tuning, while the pre-existing parameters of the model remain fixed. This feature
makes adapters parameter-efficient, speeds up training iterations and, due to their compact and modular nature, enables sharing and composition without compromising model performance.
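As a rough illustration (our own sketch, not the authors' code), the bottleneck adapter of Houlsby et al. [2] is a down-projection, a non-linearity, an up-projection and a residual connection inserted at each layer; only these small projection matrices are trained while the transformer's own weights stay frozen. The dimensions below are illustrative:

```python
import numpy as np

hidden, bottleneck = 768, 64          # illustrative sizes (RoBERTa-base hidden dim)
rng = np.random.default_rng(0)

# Trainable adapter weights: one small down/up projection pair per layer.
W_down = rng.normal(0.0, 0.02, (hidden, bottleneck))
W_up = rng.normal(0.0, 0.02, (bottleneck, hidden))

def adapter(h):
    """Bottleneck adapter: down-project, ReLU, up-project, residual."""
    z = np.maximum(h @ W_down, 0.0)   # (seq, bottleneck)
    return h + z @ W_up               # residual keeps the frozen model's signal

h = rng.normal(size=(12, hidden))     # hidden states of one 12-token sequence
out = adapter(h)
print(out.shape)                      # (12, 768)
```

Each such module adds on the order of 10^5 parameters per layer (here 2 x 768 x 64, plus biases in practice), a small fraction of the roughly 125M parameters of RoBERTa-base, which is what makes the approach parameter-efficient.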
Adapter fusion is a method that combines the knowledge derived from different pre-trained adapters
that were trained for distinct tasks. An attention module dynamically merges the knowledge acquired from the individual adapters into a unified representation. Various fusion techniques, including
weighted summation, gating mechanisms, or attention mechanisms, can be employed for this purpose
[2]. The goal of the adapter fusion method is to harness the synergies between different tasks and
adapters.
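A minimal sketch of the attention-based mixing step (again our own illustration; the learned query, key and value projections of the actual AdapterFusion method are omitted): given the outputs of N adapters at one layer, a softmax over per-token scores yields mixing weights, and the fused representation is the weighted sum of the adapter outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(query, adapter_outputs):
    """AdapterFusion-style mixing over N adapter outputs, per token.

    query:           (seq, hidden)    layer input acting as the query
    adapter_outputs: (N, seq, hidden) one output per adapter
    """
    # Score each adapter's output against the query, token by token.
    scores = np.einsum("sh,nsh->sn", query, adapter_outputs)   # (seq, N)
    weights = softmax(scores, axis=-1)                         # mixing weights
    # Weighted sum of adapter outputs -> one fused representation.
    return np.einsum("sn,nsh->sh", weights, adapter_outputs)

rng = np.random.default_rng(1)
q = rng.normal(size=(12, 768))           # 12 token states
outs = rng.normal(size=(2, 12, 768))     # e.g. task adapter + NER adapter
fused = fuse(q, outs)
print(fused.shape)                       # (12, 768)
```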
Initially, we trained a task adapter on the CheckThat! Lab 2024 [1] dataset to effectively identify
check-worthy sentences. No data pre-processing or cleaning was applied to the dataset. To train
the task adapter, we applied adapter transformers from the "Adapter Hub" repository for pre-trained
adapter models [17]. We used the pre-trained RoBERTa model [18] and tokenized the input data with a maximum sequence length of 512 (truncation=True, padding="max_length"). The task adapter model was trained for 6 epochs with a learning rate of 1e-4 and a batch size of 32.
The task adapter was trained on the "Train" dataset containing 22,501 instances, while the performance
of the models during training was evaluated on the "Dev" set containing 1,032 instances. Finally, the
"Dev Test" set of 318 samples was used to evaluate the trained model (Table 2). Our model achieves an F1 of 0.866 over the positive (check-worthy) class. The results of the evaluation are shown in Table 3.
Table 3
Evaluation scores precision (P), recall (R), F1 score and accuracy for the task adapter model.

|  | P | R | F1 | Accuracy |
|---|---|---|---|---|
| Dev | 0.943 | 0.975 | 0.959 | 0.981 |
| Dev Test | 0.977 | 0.778 | 0.866 | 0.918 |
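The reported scores are internally consistent: F1 is the harmonic mean of precision and recall, so the rows of Table 3 can be checked directly.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Dev and Dev Test rows of Table 3
print(round(f1(0.943, 0.975), 3))  # 0.959
print(round(f1(0.977, 0.778), 3))  # 0.866
```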
We chose the adapter fusion approach, combining the task adapter with a NER adapter, to exploit named entity information in the dataset. Previous studies have shown that check-worthy sentences tend
to contain more named entities than non-check-worthy sentences [16]. This is because factual information often appears in the form of names and numerical data, including personal names, company names, geographical locations, dates, years, and percentages. Table 4 gives examples
of sentences from the dataset containing named entities.
Table 4
Examples of check-worthy sentences with named entities

| Instance | Class |
|---|---|
| 1. "Today, 47 million people are on food stamps." | Yes |
| 2. "Of the nine million people put to work in new jobs since I’ve been in office, 1.3 million of those has been among black Americans, and another million among those who speak Spanish." | Yes |
| 3. If you take the tax cut that the president of the United States has given – President Bush gave to Americans in the top 1 percent of America – just that tax cut that went to the top 1 percent of America would have saved Social Security until the year 2075. | Yes |
The adapter fusion model takes as input the representations generated by multiple adapters, each
trained for distinct tasks, and learns a parameterized mixer of the encoded information. The previously
trained task adapter was fused with a fine-tuned DistilRoBERTa-based [19] NER model.
The NER model was trained and evaluated on the CoNLL 2003 dataset and achieves an 𝐹1 score of 0.92
[20].
We trained our adapter fusion model for 6 epochs with a learning rate of 5e-5, a batch size of 32 and a maximum sequence length of 512. The model was evaluated on the "Dev Test" set and achieves an F1 of 0.916 over the check-worthy class. The results of the approach are shown in Table 5.
Table 5
Evaluation scores precision (P), recall (R), F1 score and accuracy for the adapter fusion model, compared with the top-ranked models and the baseline on the test set.

|  | P | R | F1 | Accuracy |
|---|---|---|---|---|
| Fraunhofer SIT (Dev) | 0.983 | 0.967 | 0.975 | 0.989 |
| Fraunhofer SIT (Dev Test) | 0.979 | 0.861 | 0.916 | 0.947 |
| Fraunhofer SIT (Test) | - | - | 0.780 | - |
| Baseline (Test) | - | - | 0.307 | - |
| FactFinders (Test) | - | - | 0.802 | - |
| teamopenfact (Test) | - | - | 0.796 | - |
Since the adapter fusion model outperformed the task adapter model in terms of F1 score, we used the former to classify the private test set of this year’s CheckThat! 2024 competition. Our model achieves an F1 score of 0.78 over the positive check-worthy class.
5. Conclusion and Future Work
Identifying check-worthy statements can be seen as a first step in detecting the spread of false informa-
tion online. Used as a pre-filter, this approach can significantly reduce the amount of data requiring
manual evaluation by human experts. In this paper, we presented an adapter fusion method that
combines a task-specific adapter and a NER adapter.
Initially, we trained a task adapter to detect check-worthy statements effectively. Given that check-
worthy statements often contain named entities (such as references to persons, locations, or dates),
we integrated this task adapter with a pre-trained NER adapter. This integration aimed to exploit the
synergies between different tasks. Our approach achieves an F1 score of 0.78 on the CheckThat! Lab 2024 test dataset and was ranked third in the competition.
Future research may explore the integration of additional task-specific or pre-trained adapters. In
our current approach, we utilized a pre-trained NER adapter developed to detect four NER classes.
Subsequent work could investigate the use of a NER classifier trained to identify a broader range of
NER classes.
Acknowledgements
This work was supported by the German Federal Ministry of Education and Research (BMBF) and the
Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of
“ATHENE – CRISIS”.
References
[1] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari,
M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! lab: Check-worthiness,
subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonel-
lotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information
Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[2] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan,
S. Gelly, Parameter-efficient transfer learning for NLP, in: K. Chaudhuri, R. Salakhutdinov (Eds.),
ICML, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 2790–2799. URL:
http://dblp.uni-trier.de/db/conf/icml/icml2019.html#HoulsbyGJMLGAG19.
[3] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates,
in: Proceedings of the 24th ACM International on Conference on Information and Knowledge
Management, CIKM ’15, Association for Computing Machinery, New York, NY, USA, 2015, p.
1835–1838. URL: https://doi.org/10.1145/2806416.2806652. doi:10.1145/2806416.2806652.
[4] F. Alam, A. Barrón-Cedeño, G. S. Cheema, S. Hakimov, M. Hasanain, C. Li, R. Míguez, H. Mubarak,
G. K. Shahi, W. Zaghouani, P. Nakov, Overview of the CLEF-2023 CheckThat! lab task 1 on check-
worthiness in multimodal and multigenre content, in: Working Notes of CLEF 2023—Conference
and Labs of the Evaluation Forum, CLEF ’2023, Thessaloniki, Greece, 2023.
[5] K. Meng, D. Jimenez, F. Arslan, J. D. Devasier, D. Obembe, C. Li, Gradient-based adversarial
training on transformer networks for detecting check-worthy factual claims, ArXiv abs/2002.07725
(2020). URL: https://api.semanticscholar.org/CorpusID:211146392.
[6] B. Ghanem, M. Montes, F. Rangel Pardo, P. Rosso, Upv-inaoe-autoritas - check that: Preliminary
approach for checking worthiness of claims, 2018.
[7] C. Hansen, C. Hansen, J. Simonsen, C. Lioma, The copenhagen team participation in the check-
worthiness task of the competition of automatic identification and verification of claims in political
debates of the CLEF-2018 CheckThat! lab, in: L. Cappellato, N. Ferro, J. Nie, L. Soulier (Eds.), CLEF
2018 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, 2018. 19th Working Notes of
CLEF Conference and Labs of the Evaluation Forum, CLEF 2018 ; Conference date: 10-09-2018
Through 14-09-2018.
[8] R. Banerjee, C. Zuo, A. Karakaş, A hybrid recognition system for check-worthy claims using
heuristics and supervised learning, 2018.
[9] P. Atanasova, A. Barron-Cedeno, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. D. S.
Martino, P. Nakov, Overview of the clef-2018 checkthat! lab on automatic identification and
verification of political claims. task 1: Check-worthiness, 2018. arXiv:1808.05542.
[10] C. Hansen, C. Hansen, J. Simonsen, C. Lioma, Neural weakly supervised fact check-worthiness
detection with contrastive sampling-based ranking loss, volume 2380, ceur workshop proceedings,
2019. 20th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2019 ;
Conference date: 09-09-2019 Through 12-09-2019.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423.
doi:10.18653/v1/N19-1423.
[12] A. Barron-Cedeno, T. Elsayed, P. Nakov, G. D. S. Martino, M. Hasanain, R. Suwaileh, F. Haouari,
N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. S. Ali, Overview of checkthat! 2020: Automatic
identification and verification of claims in social media, 2020. arXiv:2007.07997.
[13] P. Nakov, G. D. S. Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari,
M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M.
Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the clef–2021 checkthat! lab on detecting
check-worthy claims, previously fact-checked claims, and fake news, 2021. arXiv:2109.12987.
[14] M. Sawinski, K. Węcel, E. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz,
Openfact at checkthat! 2023: Head-to-head gpt vs. bert - a comparative study of transformers
language models for the detection of check-worthy claims, 2023.
[15] I. B. Schlicht, L. Flek, P. Rosso, Multilingual detection of check-worthy claims using world
languages and adapter fusion, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis,
C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval, Springer Nature
Switzerland, Cham, 2023, pp. 118–133.
[16] I. Vogel, P. Möhle, M. Meghana, M. Steinebach, Adapter fusion for check-worthiness detection – combining a task adapter with a NER adapter, ROMCIR 2024: The 4th Workshop on Reducing
Online Misinformation through Credible Information Retrieval, held as part of ECIR 2024: the
46th European Conference on Information Retrieval, March 24, 2024, Glasgow, UK (2024). URL:
https://romcir.disco.unimib.it/wp-content/uploads/sites/151/2024/03/Paper6_Vogel.pdf.
[17] C. Poth, H. Sterz, I. Paul, S. Purkayastha, L. Engländer, T. Imhof, I. Vulić, S. Ruder, I. Gurevych,
J. Pfeiffer, Adapters: A unified library for parameter-efficient and modular transfer learning, 2023.
arXiv:2311.11077.
[18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, ArXiv abs/1907.11692 (2019). URL:
https://api.semanticscholar.org/CorpusID:198953378.
[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, ArXiv abs/1910.01108 (2019).
[20] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An easy-to-use
framework for state-of-the-art NLP, in: NAACL 2019, 2019 Annual Conference of the North
American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp.
54–59.