Adapter fusion for check-worthiness detection - combining a task adapter with a NER adapter

Inna Vogel*, Pauline Möhle, Meghana Meghana and Martin Steinebach
Fraunhofer Institute for Secure Information Technology SIT | ATHENE - National Research Center for Applied Cybersecurity, Rheinstrasse 75, Darmstadt, 64295, Germany, https://www.sit.fraunhofer.de/

ROMCIR 2024: The 4th Workshop on Reducing Online Misinformation through Credible Information Retrieval, held as part of ECIR 2024: the 46th European Conference on Information Retrieval, March 24, 2024, Glasgow, UK
* Corresponding author: inna.vogel@sit.fraunhofer.de (I. Vogel)

Abstract
Detecting check-worthy statements aims to facilitate manual fact-checking efforts by identifying the claims that fact-checkers should prioritize first. It can also be considered the first step of a fact-checking system. In this paper, we present an adapter fusion model that combines a task adapter with a NER adapter and achieves state-of-the-art results on two challenging check-worthiness benchmarks. Adapters are a resource-efficient alternative to fully fine-tuning transformer models. Our best performing model obtains an F1 score of 0.92 on the CheckThat! Lab 2023 dataset. Additionally, we interpret the fusion attentions, demonstrating the effectiveness of our approach. The quantitative analysis of the fusion attentions shows that named entities contribute significantly to the predictions of the adapter fusion model.

Keywords: check-worthiness detection, fact-checking, adapter fusion, NER

1. Introduction

Fact-checking online content is essential to ensure the reliability of information shared through various online communication channels, such as news websites and social media platforms. Fact-checkers and journalists are constantly working to identify and correct misinformation and to communicate their findings as quickly as possible. But with the amount of information published online every day and the limited resources available to journalists and fact-checkers, it is almost impossible to keep up with this critical work.

The fact-checking process usually consists of three main steps. The first step involves identifying statements or claims in a text that are worth fact-checking, as not all claims are equally important or contain relevant information that needs to be verified. Such claims can be false allegations, statistics or other objectively verifiable information. Fact-checkers prioritize claims for verification based on their potential impact, the claim's factual coherence or the public interest in the claim. Once a claim has been selected, the second step is to gather trustworthy evidence to confirm or disprove it by researching reliable sources. These sources can include academic journals, official reports, reputable news organizations, subject matter experts and primary sources such as original documents or statistics. To ensure consistency and accuracy, fact-checkers and journalists compare information from multiple sources. In the third step, the gathered evidence is used to reach and publish a verdict on the claim. The main challenge is that the vast majority of the fact-checker's work is still done manually. Therefore, there is a need for technologies that facilitate, speed up and improve the work of fact-checkers and journalists in detecting fake news and misinformation.
The first step in the fact-checking pipeline, the automatic identification of check-worthy statements, could facilitate the work of fact-checkers or journalists by identifying and highlighting statements within a text that require further verification. This could streamline the fact-checking process and reduce the potential for human bias in claim selection.

We consider check-worthiness detection as a binary classification task. Check-worthy sentences or statements are usually those that contain factual information such as dates, definitions, statistics or descriptions of events or laws. They are usually of interest to or likely to affect the general public [1]. Nakov et al. [2] extend the definition and add that such statements can also be potentially damaging to society, public figures or a company. Non-check-worthy statements, on the other hand, contain subjective opinions and beliefs rather than factual claims [1, 2].

In this paper, we propose an adapter fusion approach that combines a task adapter with a NER (Named Entity Recognition) adapter. Adapters are a resource-efficient alternative to fully fine-tuned transformer models [3]. They are lightweight and modular neural networks that learn tasks with fewer parameters and transfer knowledge across tasks and languages [4]. We first trained a task adapter to effectively detect check-worthy statements. As we noticed that check-worthy claims frequently contain facts in the form of named entities, such as personal names, dates, financial and percentage values or names of events or historical occurrences (e.g., "World War II" or "the Great Depression"), we fused the task adapter with a NER adapter. Our approach achieves state-of-the-art results on two challenging check-worthiness detection benchmarks.

Our contributions are summarized as follows:
• We are the first to propose an adapter fusion model that combines a task adapter with a NER adapter.
• Our model achieves state-of-the-art performance by a substantial margin on challenging check-worthiness benchmarks.
• We use an explainability tool to interpret the classification results, demonstrating the effectiveness of our approach.

2. Related Work

The first methods for check-worthiness detection were based on the extraction of meaningful features from the text. Given a US presidential election transcript, ClaimBuster [5] predicts check-worthiness using a support vector machine over a set of 6,615 features in total (such as word count, sentiment, tf-idf weighted bag-of-words, part-of-speech tags or entity type). Gencheva et al. [6] extended the work of Hassan et al. [5] by including contextual features such as the position of the sentence, the size of the segment belonging to a speaker, topics or word embeddings. A MAP of 0.427 was achieved using a neural network with all features combined. Meng et al. [7] applied adversarial training to transformer models for the detection of check-worthy factual claims, using both the ClaimBuster dataset and the CLEF CheckThat! Lab 2019 dataset [8]. Their model achieved an improvement of 4.7 points in F1 score compared to other models on these datasets.

The CheckThat! Lab has organised multilingual check-worthiness detection tasks since 2020, supporting more languages every year as well as multimodal and multigenre content [9]. The aim of the CheckThat! Lab 2023 challenge [10] was to determine whether the information contained in a political debate is reliable and worthy of further fact-checking.
Sawiński et al. [11] were the best performing team for English. They experimented with fine-tuning a variety of BERT models and found that fine-tuning DeBERTaV3 [12] yielded near-identical performance to GPT-3, achieving an F1 score of 0.89. Frick et al. [13] came second in the competition with an F1 score of 0.87 by fine-tuning BERT three times, each time with a different seed for model initialisation, resulting in three models. They combined these models into an ensemble using a model souping technique that adaptively adjusts the influence of each model based on its performance.

Schlicht et al. [9] investigated cross-training of adapter fusion models on world languages (such as Arabic, English and Spanish) to detect check-worthiness in multiple languages. To this end, they used mBERT and XLM-R with adapter fusion models (combining task and language adapters) as well as fully fine-tuned transformers. They showed that these models can outperform monolingual task adapters and fully fine-tuned models. For the detection of English check-worthy claims, an F1 score of 0.51 was achieved on the multilingual datasets of the CLEF CheckThat! Lab challenges 2022 [14] and 2021 [15], applying XLM-R in combination with a fully fine-tuned transformer.

3. Dataset Description

We used the CheckThat! Lab 2023 dataset [10] and the ClaimBuster dataset [1] for our experiments. The CheckThat! Lab 2023 English dataset consists of political debates collected from the US presidential general election debates [10]. The aim of the CheckThat! Lab 2023 task was to predict whether a text snippet from a political debate needs to be assessed manually by an expert by estimating its check-worthiness. Examples from the dataset are shown in Table 1.

Table 1: Examples from the CheckThat! Lab 2023 dataset for check-worthy (Yes) and non-check-worthy (No) statements
1. "And that means 98 percent of American families, 97 percent of small businesses, they will not see a tax increase." - Yes
2. "I said we'd get tougher with child support and child support enforcement's up 50 percent." - Yes
3. "But I'm not going to do that." - No
4. "But the important thing is what are we going to do now?" - No

The dataset is divided into four subsets. The "Train" subset consists of 16,876 entries. Each entry is labelled as to whether it is worth checking (Yes) or not (No). The development set "Dev" contains 5,625 statements, the development test set "Dev Test" contains 1,032 entries and the test set for the final evaluation "Test" contains 318 statements. The label distributions and dataset splits are shown in Table 2.

Table 2: Class distribution of the CheckThat! Lab 2023 check-worthiness dataset for English

            Total    Yes     No
  Train     16,876   4,058   12,818
  Dev        5,625   1,355    4,270
  Dev Test   1,032     238      794
  Test         318     108      210
  Sum       23,851   5,759   18,092

While the first three partitions primarily use the ClaimBuster dataset described in Arslan et al. [1], some updates were made by the CheckThat! Lab 2023 organizers to improve the quality of the annotations. The test set includes sentences that were not featured in the ClaimBuster dataset.

The ClaimBuster dataset was labelled by 101 annotators over a period of 26 months. The following three classes have been annotated:
1. Check-worthy factual sentences: These sentences contain statements of fact that the general public will have an interest in finding out whether they are true or not. Journalists and fact-checkers look for these kinds of statements to check their veracity.
2. Unimportant factual sentences: These are factual statements, but they are not verifiable or the general public is not interested in knowing whether they are true or false. Fact-checkers do not consider these sentences to be worth checking.
3. Non-factual sentences: These sentences contain no factual claims. Subjective sentences such as opinions and beliefs fall into this category [1].

To compare our results with the work of Meng et al. [7], the ClaimBuster dataset was used as a baseline. The dataset consists of 9,674 sentences, of which 6,910 are non-check-worthy and 2,764 are check-worthy [7]. The authors excluded the class of unimportant factual sentences, having observed that this class was not particularly useful and could negatively affect the performance of models. To make our results comparable, we used the same split as the authors: 67.5% of the dataset was used for training, 7.5% for validation and 25% for testing.

Both datasets are highly unbalanced, with about a quarter of the sentences being check-worthy. This reflects the fact that check-worthy sentences occur less frequently in text than non-check-worthy ones.

4. Methodology

4.1. Adapter Fusion - Combining a Task Adapter with a NER Adapter

Transformer models, pre-trained on massive amounts of text data and then fine-tuned on target tasks, have led to significant advances in NLP, achieving state-of-the-art results in a variety of tasks. However, models such as RoBERTa [16] and BERT [17] consist of millions of parameters, making it prohibitively expensive to share and distribute fully tuned models for each individual downstream task. Adapters are a lightweight alternative to full model fine-tuning, consisting of only a tiny set of newly initialised weights at each layer of the pre-trained model [3]. These new weights are updated during fine-tuning while the parameters of the pre-trained model remain frozen. This means that adapters are parameter efficient, speed up training iterations, and can be shared and composed due to their modularity and compact size without compromising the performance of the model. Adapters have been shown to perform on par with full fine-tuning by adapting the representations at each layer [4].

Adapter fusion is a method of combining the knowledge of multiple pre-trained adapters trained for different tasks. It consists of an attention module that learns how to dynamically combine knowledge from different task adapters, fusing the information learned by the individual adapters into a coherent representation. Different fusion strategies can be used, such as weighted summation, gating mechanisms or attention mechanisms. The goal is to capture the synergies between different tasks and adapters.

First, we trained a task adapter to detect and classify the sentences worth checking by journalists and fact-checkers. This model serves as our baseline. In a quantitative analysis counting the frequencies of named entities in the two classes, we found that check-worthy sentences tend to contain more named entities than non-check-worthy sentences. Table 3 shows the distribution of named entities in the CheckThat! Lab 2023 dataset.

Table 3: Distribution of named entities in the CheckThat! Lab 2023 dataset using 5,759 sentences per class

                               PER     LOC     ORG    MISC
  Check-worthy sentences       1,226   1,739    973   1,240
  Non-check-worthy sentences     708   1,015    356     630

Since the dataset contains fewer check-worthy sentences (5,759) than non-check-worthy sentences (18,092), we reduced the number of sentences in the negative class so that the dataset used for this analysis is balanced, i.e. 5,759 sentences per class. To count the frequencies of the named entities, we used the four-class English NER model "Flair", which achieves an F1 score of 0.93 on the CoNLL 2003 dataset [18]. The four named entity classes are Person Name (PER), Location Name (LOC), Organisation Name (ORG) and Miscellaneous (MISC). A short counting sketch is given below.
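The following minimal sketch illustrates how such per-class entity counts can be collected with a Flair tagger. The sentence lists, the model identifier "flair/ner-english" and the exact API calls are assumptions based on recent Flair releases and may differ from the setup actually used to produce Table 3.

```python
# Sketch: counting named-entity tags per class with Flair's 4-class English tagger.
from collections import Counter

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english")  # tags: PER, LOC, ORG, MISC

def count_entities(sentences):
    """Count predicted named-entity tags over a list of raw sentence strings."""
    counts = Counter()
    for text in sentences:
        sentence = Sentence(text)
        tagger.predict(sentence)
        for span in sentence.get_spans("ner"):
            counts[span.get_label("ner").value] += 1
    return counts

# Toy stand-ins for the balanced check-worthy / non-check-worthy samples
# (5,759 sentences per class in the paper's analysis).
check_worthy = ["Barack Obama visited Berlin in 2008."]
non_check_worthy = ["But I'm not going to do that."]

print("CW:", count_entities(check_worthy))
print("NCW:", count_entities(non_check_worthy))
```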
The distribution of the classes shows that all classes of named entities are more present in the sentences that are worth fact-checking. We assume that named entities, as in journalism, are essential carriers of information. The "Five Ws" of journalism ("Who?", "What?", "Where?", "When?" and "Why?") are essential questions that aim to provide a complete understanding of a situation. The questions "Who?", "Where?", "When?" and partly also "What?" (e.g. "30% of those vaccinated") can be detected by a named entity recogniser. This motivated us to experiment with fusing a trained task adapter with a NER adapter.

The architecture of our final model is illustrated in Figure 1. The input to the architecture is a statement, while the output is a probability score that determines the check-worthiness of the given input statement. The adapter fusion component takes as input the representations of multiple adapters trained on different tasks and learns a parameterized mixer of the encoded information.

Figure 1: The architecture of the adapter fusion model combining a task adapter and a NER adapter.

4.2. Implementation Details

The task adapter model "AF-TA" and the adapter fusion model "AF-TA-NER" were trained on the CheckThat! Lab 2023 dataset [10]. For this, adapter transformers from the AdapterHub repository of pre-trained adapter modules were employed [4]. We used the pre-trained RoBERTa model as it showed significant improvements on various NLP benchmarks compared to the original BERT model [16]. Our models were trained on the "Train" set with 16,876 instances, while the performance of the models during training was evaluated on the "Dev" set with 5,625 instances. Finally, the "Dev Test" set of 1,032 samples was used to compare the performance of the different trained models. The overall best performing model was evaluated on the "Test" set (Table 2) using the F1 score over the positive (check-worthy) class.

We tried a number of different parameters and report the ones that performed best. The task adapter model "AF-TA" was trained for 6 epochs with a learning rate of 1e-4 and a batch size of 32, using a maximum sequence length of 512. To train our "AF-TA-NER" model, we fused the task adapter model "AF-TA" with a fine-tuned DistilRoBERTa-based [19] NER model. The NER model was trained and evaluated on the CoNLL 2003 dataset and achieves an F1 score of 0.92 [20]. The adapter fusion model "AF-TA-NER", which combines the task adapter and the NER adapter, performed best when the number of epochs was set to 5, the learning rate to 5e-5, the batch size to 8 and the maximum sequence length to 512.
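A minimal sketch of such a fusion setup, following the AdapterHub "adapters" library's fusion training pattern, is shown below. The local adapter path "./af-ta" and the Hub identifier of the NER adapter are placeholders (the paper's NER component is a DistilRoBERTa-based CoNLL-2003 model), so treat the identifiers and the dataset objects as assumptions rather than the exact artifacts used in the experiments.

```python
# Sketch: fusing a trained check-worthiness task adapter with a NER adapter.
from adapters import AdapterTrainer, AutoAdapterModel
from adapters.composition import Fuse
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoAdapterModel.from_pretrained("roberta-base")

# Load the previously trained task adapter (placeholder path) and a
# pre-trained NER adapter (illustrative Hub id), both without heads.
model.load_adapter("./af-ta", load_as="cw", with_head=False)
model.load_adapter("AdapterHub/roberta-base-pf-conll2003",
                   load_as="ner", with_head=False)

# Fuse the two adapters and add a binary classification head.
fusion = Fuse("cw", "ner")
model.add_adapter_fusion(fusion)
model.add_classification_head("cw_fusion", num_labels=2)

# Freeze the base model and the adapters; only the fusion layer (and head) train.
model.train_adapter_fusion(fusion)

# Best settings reported for AF-TA-NER: 5 epochs, lr 5e-5, batch size 8,
# maximum sequence length 512 (applied during tokenization, not shown here).
args = TrainingArguments(output_dir="af-ta-ner", num_train_epochs=5,
                         learning_rate=5e-5, per_device_train_batch_size=8)

# train_ds / dev_ds: tokenized CheckThat! 2023 "Train" and "Dev" splits (assumed).
# trainer = AdapterTrainer(model=model, args=args,
#                          train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()
```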
5. Baselines

To compare the performance of our adapter fusion model "AF-TA-NER", three different baselines were used. As the first baseline, we chose the best performing system of the CheckThat! Lab 2023 challenge, "OpenFact" [11]. Their highest F1 score of 0.89 was obtained with the GPT-3 Curie model, fine-tuned on approximately 7,690 examples selected on label quality criteria.

As the second baseline, we chose Meng et al. [7], who proposed a method that applies adversarial perturbations to the BERT architecture. To detect verifiable factual claims, they used the ClaimBuster dataset [1]. To validate their model, they performed 4-fold cross-validation, selecting the best model from each fold using the weighted F1 score calculated on the validation set. Following the authors, we split the data into 25% test, 7.5% validation and 67.5% training and applied stratified 4-fold cross-validation in order to compare the performance of our model. The reported F1 score is based on the classifications across all folds. Their best performing model, "CB-BBA", achieves an average F1 score of 0.83 on the positive (check-worthy) class. We used the same dataset split for our proposed models, applying stratified 4-fold cross-validation.

To determine whether the proposed NER adapter fusion model "AF-TA-NER" can improve the classification results, we used the task adapter model "AF-TA" as the third baseline. The implementation details were given in Section 4.2.

6. Evaluation Results

Table 4 shows the performance results of the "AF-TA-NER" model and two baselines, the "AF-TA" model and the best performing system of the CheckThat! Lab 2023 challenge, "OpenFact" by Sawiński et al. [11]. The "AF-TA" task adapter model achieves an F1 score of 0.87, while the GPT-3 Curie model used in "OpenFact" achieves an F1 score of 0.89. The proposed "AF-TA-NER" outperforms both models, achieving an F1 score of 0.92 on the positive class (check-worthy). On the negative class (not check-worthy), it achieves an F1 score of 0.96.

Table 4: Precision (P), recall (R), F1 score and accuracy on the CheckThat! Lab 2023 dataset

  Model        P      R      F1     Accuracy
  OpenFact     0.95   0.85   0.89   0.93
  AF-TA        0.96   0.79   0.87   0.92
  AF-TA-NER    0.98   0.86   0.92   0.95

Table 5 shows the classification results of our two proposed models compared to the approach presented by Meng et al. [7]. Our models outperform "CB-BBA" (F1 score 0.84) by a substantial margin. The "AF-TA-NER" model (F1 score 0.89) performs slightly better than the "AF-TA" model (F1 score 0.88). Compared to the "CB-BBA" model, our best performing model achieves a 5 point improvement in F1 score. The "AF-TA-NER" model outperforms all three baselines and achieves state-of-the-art performance on different benchmarks. The confusion matrix of our "AF-TA-NER" model is shown in Figure 2.

Table 5: Precision (P), recall (R) and F1 score averaged across stratified 4-fold cross-validation

  Model        P      R      F1
  CB-BBA       0.84   0.83   0.84
  AF-TA        0.87   0.90   0.88
  AF-TA-NER    0.88   0.90   0.89

Figure 2: Confusion matrix of the "AF-TA-NER" model. CW refers to the check-worthy class, while NCW refers to the non-check-worthy class.

7. Interpretation of the Fusion Attentions

In this section, we present the interpretation of the fusion attentions using the "Transformers Interpret" library (https://github.com/cdpierse/transformers-interpret). The core attribution methods on which Transformers Interpret is built are "Integrated Gradients" and a variant thereof, "Layer Integrated Gradients". The feature attribution score is a summary or average of the attributions from each layer, explaining which features were most important for a model's prediction on a given input.
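The sketch below shows how per-token attributions can be obtained with Transformers Interpret. The checkpoint path is a placeholder for the trained AF-TA-NER classifier, and whether an adapter-fusion model can be passed to the explainer directly may depend on library versions, so this should be read as an assumed setup rather than the exact analysis pipeline.

```python
# Sketch: token-level attributions with Transformers Interpret (Integrated Gradients).
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

# Placeholder path; any sequence-classification checkpoint works for trying the tool.
model = AutoModelForSequenceClassification.from_pretrained("path/to/af-ta-ner")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

explainer = SequenceClassificationExplainer(model, tokenizer)

# Positive scores push the input towards the predicted class, negative scores away.
word_attributions = explainer("I paid $38 million one year in taxes.")
print(explainer.predicted_class_name)
for token, score in word_attributions:
    print(f"{token:>15s} {score:+.3f}")

# Optional HTML heat map, as used for the qualitative analysis below.
explainer.visualize("af_ta_ner_attributions.html")
```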
The aim of interpreting the classification results is to analyze whether our proposed adapter fusion model is capable of reliably classifying check-worthy statements in a text. This analysis can also be useful for journalists and fact-checkers to inspect the basis on which the model has made its decision. The following quantitative and qualitative analyses are based on the classification results of our trained adapter fusion model "AF-TA-NER".

To determine which named entity class contributes the most to the classification, we chose a NER model for the quantitative analysis that can recognize a larger number of classes. We therefore used spaCy's (https://spacy.io/) "en_core_web_sm" pipeline. The model provides a NER system that can identify named entities and classify them into 18 predefined categories.

The aim of the quantitative analysis was to investigate whether NER features were important for the predictions of the "AF-TA-NER" model. Table 6 lists the classes that contributed most to the classification of the check-worthy class. For comparison, we also show how the respective NER class relates to the negative class. Positive attribution numbers indicate that a word contributes positively to the predicted class, while negative numbers indicate the opposite.

Table 6: Contribution of NER features to the prediction of the adapter fusion model "AF-TA-NER". CW refers to the check-worthy class, NCW to the non-check-worthy class.

             CW      NCW
  Money      0.87   -0.38
  Percent    0.67   -0.97
  Cardinal   0.48    0.11
  Date       0.50   -0.07
  Event      0.27   -0.18
  Ordinal    0.23   -0.11

The quantitative analysis shows that named entities contribute significantly to the prediction of the adapter fusion model. The NER class "Money" contributes most positively to the positive class (e.g. "I paid $38 million one year..."), while at the same time having little relevance for the classification of the negative class. The same holds for the "Percent", "Cardinal" and "Date" classes, each of them contributing significantly to the positive class (e.g. "When we were in office, there was 15% less violence in America..."), while contributing negatively to the non-check-worthy class.

It is interesting to note that the classes "Person" (0.18), "Place" (0.04) and "Organisation" (0.14), although also relevant for the classification of the check-worthy class, are less significant than the NER classes listed in Table 6. Even the class "Event" (0.27), which refers to mentions of events in the text such as hurricanes, battles, wars or sports events, contributes more to the model's attention. This suggests that the model would still give good classification results even if the names of people and places were removed from the dataset, e.g., for data protection reasons.
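A rough sketch of the aggregation behind such a per-class analysis is given below: attribution scores of tokens that fall inside spaCy entities are averaged per entity label. The whitespace-based matching between attribution tokens and entity words is a simplification (RoBERTa sub-tokens do not align one-to-one with spaCy tokens), so the exact aggregation used for Table 6 may differ.

```python
# Sketch: averaging token attributions per spaCy NER label.
from collections import defaultdict

import spacy

nlp = spacy.load("en_core_web_sm")  # OntoNotes NER, 18 labels (e.g. MONEY, PERCENT, DATE)

def attribution_by_ner_label(text, word_attributions):
    """word_attributions: (token, score) pairs, e.g. from SequenceClassificationExplainer."""
    doc = nlp(text)
    per_label = defaultdict(list)
    for ent in doc.ents:
        ent_words = {w.lower() for w in ent.text.split()}
        for token, score in word_attributions:
            if token.lstrip("Ġ").lower() in ent_words:  # strip RoBERTa's space marker
                per_label[ent.label_].append(score)
    return {label: sum(scores) / len(scores) for label, scores in per_label.items()}

# Usage: compute per-sentence label averages, then average over the test set.
# attribution_by_ner_label("I paid $38 million one year in taxes.", word_attributions)
```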
Using the Transformers Interpret heat map visualization, we analyzed the contribution of individual tokens to the model's prediction. This type of visualization is particularly useful for understanding and interpreting the decisions made by complex transformer models. Each token in the input text is colour-coded based on its contribution score, and the colour intensity represents the magnitude of the contribution.

Our qualitative analysis shows that action verbs (or dynamic verbs), which describe the action that the subject of a sentence performs (e.g. "run", "fight", "sleep"), contribute to the prediction of the check-worthy class. Examples are shown in Figure 3. The colour intensity shows that words like "paid", "released", "reduced" and "beat" contribute positively to the prediction of the adapter fusion model.

Figure 3: Heat map using the Transformers Interpret library showing how different tokens contribute to the model's prediction of the TP (check-worthy) class.

As there are only two examples of False Positive (FP) classifications (Figure 2), no reliable analysis can be made. However, we suspect that these two examples, shown in Figure 4, were incorrectly labelled as not check-worthy by the human annotators. In our opinion, these examples contain relevant facts that should be fact-checked. The same applies to the False Negative cases. In the following example, we found no evidence of a verifiable statement: "Yes, you said that". However, the annotation is difficult to evaluate, as it is also subjective.

Figure 4: Heat map using the Transformers Interpret library, showing the prediction of the model on the FP class.

8. Conclusion and Future Work

In this paper, we presented our work on detecting check-worthy factual claims employing an adapter fusion approach that combines a task adapter with a NER adapter. We first trained a task adapter to effectively detect check-worthy statements and used it as our first baseline. As our analysis showed that check-worthy claims frequently contain facts in the form of named entities, we fused the task adapter with a NER adapter. The goal was to capture the synergies between different tasks and adapters. Our approach achieves state-of-the-art results on two challenging benchmarks. Our best adapter fusion model, "AF-TA-NER", achieves an F1 score of 0.92 on the CheckThat! Lab 2023 dataset and an F1 score of 0.89 on the ClaimBuster dataset.

Additionally, we used an explainability tool to interpret the fusion attentions, demonstrating the effectiveness of our approach. The quantitative analysis showed that named entities contribute significantly to the prediction of the adapter fusion model. To determine which NER class contributes the most to the classification results, we used a NER system that can identify 18 predefined categories of named entities. By analysing the attention weights, we found that it is not the NER classes "Person", "Location" or "Organization" that contribute most to the positive class, but rather the classes "Money", "Percent", "Cardinal" and "Date". In the future, we therefore plan to employ a NER adapter that can classify more than four classes. Additionally, we want to investigate the fusion of different task adapters and interpret how fusion attention differs across adapters and fusion models.

Acknowledgements

This work was supported by the German Federal Ministry of Education and Research (BMBF) and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of "ATHENE – CRISIS" and "Lernlabor Cybersicherheit" (LLCS).

References

[1] F. Arslan, N. Hassan, C. Li, M. Tremayne, A benchmark dataset of check-worthy factual claims, in: 14th International AAAI Conference on Web and Social Media, AAAI, 2020. URL: https://api.semanticscholar.org/CorpusID:216870066.
[2] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21-24, 2021, Proceedings, Springer-Verlag, Berlin, Heidelberg, 2021, pp. 264–291. URL: https://doi.org/10.1007/978-3-030-85251-1_19. doi:10.1007/978-3-030-85251-1_19.
[3] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, in: K. Chaudhuri, R. Salakhutdinov (Eds.), ICML, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 2790–2799. URL: http://dblp.uni-trier.de/db/conf/icml/icml2019.html#HoulsbyGJMLGAG19.
[4] J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulić, S. Ruder, K. Cho, I. Gurevych, AdapterHub: A framework for adapting transformers, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020): Systems Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 46–54. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.7.
[5] N. Hassan, F. Arslan, C. Li, M. Tremayne, Toward automated fact-checking: Detecting check-worthy factual claims by ClaimBuster, Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017).
[6] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, I. Koychev, A context-aware approach for detecting worth-checking claims in political debates, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, INCOMA Ltd., Varna, Bulgaria, 2017, pp. 267–276. URL: https://doi.org/10.26615/978-954-452-049-6_037. doi:10.26615/978-954-452-049-6_037.
[7] K. Meng, D. Jimenez, F. Arslan, J. D. Devasier, D. Obembe, C. Li, Gradient-based adversarial training on transformer networks for detecting check-worthy factual claims, ArXiv abs/2002.07725 (2020). URL: https://api.semanticscholar.org/CorpusID:211146392.
[8] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. D. S. Martino, Overview of the CLEF-2019 CheckThat! lab: Automatic identification and verification of claims. Task 1: Check-worthiness, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: https://ceur-ws.org/Vol-2380/paper_269.pdf.
[9] I. B. Schlicht, L. Flek, P. Rosso, Multilingual detection of check-worthy claims using world languages and adapter fusion, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2023, pp. 118–133.
[10] F. Alam, A. Barrón-Cedeño, G. S. Cheema, S. Hakimov, M. Hasanain, C. Li, R. Míguez, H. Mubarak, G. K. Shahi, W. Zaghouani, P. Nakov, Overview of the CLEF-2023 CheckThat! lab task 1 on check-worthiness in multimodal and multigenre content, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF 2023, Thessaloniki, Greece, 2023.
[11] M. Sawiński, K. Węcel, E. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz, OpenFact at CheckThat!-2023: Head-to-head GPT vs. BERT - a comparative study of transformers language models for the detection of check-worthy claims, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441775.
[12] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, OpenReview.net, 2023. URL: https://openreview.net/pdf?id=sE7-XhLxHA.
[13] R. A. Frick, I. Vogel, J. Choi, Fraunhofer SIT at CheckThat!-2023: Enhancing the detection of multimodal and multigenre check-worthiness using optical character recognition and model souping, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 337–350. URL: https://ceur-ws.org/Vol-3497/paper-029.pdf.
[14] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[15] S. Shaar, F. Haouari, W. Mansour, M. Hasanain, N. Babulkov, F. Alam, G. D. S. Martino, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st to 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 393–405. URL: https://ceur-ws.org/Vol-2936/paper-29.pdf.
[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, ArXiv abs/1907.11692 (2019). URL: https://api.semanticscholar.org/CorpusID:198953378.
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[18] A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling, in: COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.
[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).
[20] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An easy-to-use framework for state-of-the-art NLP, in: NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 54–59.