=Paper=
{{Paper
|id=Vol-3740/paper-28
|storemode=property
|title=Overview of the CLEF-2024 CheckThat! Lab Task 6 on Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE)
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-28.pdf
|volume=Vol-3740
|authors=Piotr Przybyła,Ben Wu,Alexander Shvets,Yida Mu,Kim Cheng Sheang,Xingyi Song,Horacio Saggion
|dblpUrl=https://dblp.org/rec/conf/clef/PrzybylaWSMSSS24
}}
==Overview of the CLEF-2024 CheckThat! Lab Task 6 on Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE)==
Piotr Przybyła1,2,* , Ben Wu3 , Alexander Shvets1 , Yida Mu3 , Kim Cheng Sheang1 ,
Xingyi Song3 and Horacio Saggion3
1 Universitat Pompeu Fabra, Barcelona, Spain
2 Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
3 University of Sheffield, Sheffield, UK
Abstract
Task 6 at CheckThat! Lab, organised at CLEF-2024, is devoted to assessing the robustness of misinformation
detection solutions implemented as text classification models. The participants of the task were provided with
prediction models and data examples for several problems of credibility estimation and their goal was to come
up with adversarial examples (AEs): small modifications to the provided text fragments, such that the original
meaning is preserved, but the victim classifier changes its decision. The evaluation involved five domains
(detection of: biased news, propaganda techniques, false claims, rumours and COVID-19 misinformation) and
three classifiers (BiLSTM, BERT and adversarially fine-tuned RoBERTa). Six teams participated in the task,
representing a variety of approaches and substantially outperforming previous AE generation solutions. We also
performed manual evaluation, which highlighted some modification techniques that are particularly likely to
pass unnoticed by human readers. Overall, the task results emphasise the need to assess the robustness of text
classification solutions before implementing them in content filtering on large platforms, such as social media.
Keywords
adversarial examples, robustness, misinformation detection, credibility, text classification, natural language
processing
1. Introduction
The challenges of misinformation have been taken up with great energy and vigour by the NLP and IR
communities. The main reasons for such enthusiastic adoption of the new tasks are the wide availability of textual data to train on and the tantalisingly simple dichotomy of fake vs. real, clearly fitting the familiar task of binary classification. Among a great deal of work in the domain [1, 2], this framework has also
enabled numerous shared tasks, including detecting hyperpartisan news [3], propaganda [4], bots [5],
false claims [6] and more. These research results quickly found applications in content moderation for
large media platforms, which increasingly rely on ML tools to support, but also to replace, human effort [7].
However, a shared task framework is far from the real-world application scenario, where the test
data are not fixed, but are generated continuously by users. This means that if a malicious actor sees
their non-credible content rejected by the system, they are likely to try to modify it to pass the filters,
rather than simply abandon their goals. Unfortunately, the deep learning architectures, which many of
the best-performing solutions use, are known for their susceptibility to adversarial examples, i.e. data
instances modified with the intent of fooling a classifier [8]. While discovering adversarial examples
for text is more challenging than in other domains, it is definitely possible [9]. Thus, investigating
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: piotr.przybyla@upf.edu (P. Przybyła); bpwu1@sheffield.ac.uk (B. Wu); y.mu@sheffield.ac.uk (Y. Mu); kimcheng.sheang@upf.edu (K. C. Sheang); x.song@sheffield.ac.uk (X. Song); horacio.saggion@upf.edu (H. Saggion)
ORCID: 0000-0001-9043-6817 (P. Przybyła); 0009-0002-0918-526X (B. Wu); 0000-0002-8255-9435 (Y. Mu); 0000-0002-4662-0358 (K. C. Sheang); 0000-0002-4188-6974 (X. Song); 0000-0003-0016-7807 (H. Saggion)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org, ISSN 1613-0073)
the robustness of text credibility assessment solutions is indispensable for making them applicable in
real-world adversarial scenarios.
Here we report on the shared task on Investigating Robustness of Credibility Assessment with
Adversarial Examples (InCrediblAE), which was organised as Task 6 of the CheckThat! 2024 evaluation
lab [10, 11] at the CLEF 2024 conference. In InCrediblAE, participants get access to the following
resources for each domain:
1. three victim classifiers, assessing the credibility of the input text and returning a score,
2. an attack dataset, including around 400 instances unseen by the classifier in training.
In the task we take into account five domains, corresponding to important challenges in credibility assessment – see section 2 for details.
The goal of the participants is to make modifications to the instances in the attack dataset, turning
them into adversarial examples. Each adversarial example (AE) is evaluated on meaning preservation, i.e.
how similar it is to the original; and classifier confusion, i.e. whether the output of the victim classifier
is different than for the original.
The evaluation consists of two stages. The automatic evaluation follows the framework established
in the field of adversarial learning, with the above factors assessed through automatic measures, i.e.
BODEGA score [12], leading to a ranking list – see section 3.1. In the manual evaluation we use human
judgement to assess the semantic similarity between attack sentences and their original counterparts.
This process aims to highlight potential errors arising from automatic evaluation metrics, as well as to
create a high-quality text similarity dataset for the future development and evaluation of metrics – see
section 3.2.
The task has attracted six teams submitting various approaches (section 4.1), some of which have
clearly outperformed previous solutions applied to the same problems (section 4.2), generally confirming
the high vulnerability of popular text classifiers to adversarial attacks. What is more, the results of the
manual evaluation (section 4.3) highlight cases where AEs might appear far from the original text if
judged by automatic measures, but are in fact quite convincing to human annotators. The results of the
manual annotation are made openly available for future research1 . We also share the code2 and data3
for the automatic evaluation.
2. Task data
The shared task uses the foundation of the BODEGA framework [12], which has been created to enable
systematic robustness testing in the area of misinformation detection. Within this framework, several
domains are available, each organised around a credibility assessment problem, defined as a binary text
classification task. Within a domain, an expert-annotated corpus of documents is used to train victim
classifiers (train subset) and test the attack performance (attack subset).
The participants are provided with Python code, hosted on Google Colab, allowing them to interact with the victim models in an attack scenario through the OpenAttack interface [13]. Their goal is to prepare
a procedure that modifies the text fragments in the attack dataset to achieve a different decision of a
classifier with a minimal meaning alteration. The participant submission includes the AEs, as well as
the number of victim model queries needed to find them.
2.1. Datasets
Four of the domains in InCrediblAE have been prepared in BODEGA based on previously published
corpora. The final one (C19) is new and was not available before.
Style-based news bias assessment (HN) is a task of verifying credibility of a news article based
on its overall writing style. It relies on previous work indicating that stylistic analysis of fake news
1 https://github.com/GateNLP/CLEF2024_InCrediblAE_Manual_Evaluation_Dataset
2 https://github.com/piotrmp/BODEGA
3 https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task6
Table 1
The sizes of subsets in each domain and the percentage of cases labelled as positive.
Task Training Attack Development Positive
HN 60,235 400 3,600 50.00%
PR 12,675 416 3,320 29.42%
FC 172,763 405 19,010 51.27%
RD 8,694 415 2,070 32.68%
C19 1,130 595 0 42.55%
outlets can be used to distinguish them from credible sources [14, 15]. The corpus contains news
bias annotations assigned at the level of the source (whole website) by journalists from BuzzFeed and
MediaBiasFactCheck.com4 . For the purpose of BODEGA, 10% of the training instances were used and
the non-credibility label was assigned to articles from sources marked as hyperpartisan, both left- and
right-wing.
Propaganda detection (PR) is focused on recognising specific manipulation techniques appealing
to emotions [16], for example name-calling, flag-waving or straw-man fallacy. This approach has
the advantage of being fine-grained by highlighting manipulative fragments in text, akin to the NER
(Named Entity Recognition) tasks. We rely on the token-level annotations of 14 techniques, marked
by professional annotators for the SemEval 2020 Task 11 (Detection of Propaganda Techniques in News
Articles) [4], for which the training set is public5 . In order to cast the task as binary text classification for
BODEGA, the corpus was split into sentences and those including some tokens marked as propaganda
were labelled as non-credible.
Fact checking (FC) is an approach to misinformation detection based on extracting claims made in
a piece of text and verifying them with respect to a trusted knowledge base [17]. In order to represent
the problem as binary classification, we focus on the final stage of the workflow, when a claim is
compared to relevant evidence from the knowledge base, which either confirms its validity or refutes
it. In BODEGA, the data from the FEVER shared task [18] is used, consisting of claims that were
paired with relevant passages from Wikipedia articles. The instances where a claim is supported by the
evidence were labelled as credible, and those where it is refuted as non-credible.
Rumour detection (RD) is aimed at detecting information spreading widely over social media
despite it not coming from a credible source. Rumours can be detected using many indicators [19],
but here we focus on the textual content of a social media post, as well as the reactions of other
users. In BODEGA, this is achieved thanks to the augmented dataset of rumours and non-rumours for
rumour detection [20], created from Twitter threads relevant to six real-world events, labelled by experts
according to the source reliability of the initial post. One of the events (Charlie Hebdo shooting) was
set aside as the attack dataset.
COVID-19 misinformation detection (C19) focuses on binary classification6 of misinformation
related to COVID-19 [21, 22]. Given a known false claim about the disease, the task is to determine
whether a user’s tweet supports that false claim. If so, the tweet is classified as COVID-19 misinformation
(positive class). Alternative responses, such as contradicting, questioning, commenting, or irrelevance
regarding the false claim are reserved for the negative class. Appendix B provides examples from the
dataset.
Table 1 summarises the information on datasets, including the sizes of subsets: training (for training
victim classifiers), development (reserved for future use) and attack (to be modified into AEs), as well as
the percentage of positive (non-credible) instances.
4 https://zenodo.org/record/1489920
5 https://zenodo.org/record/3952415
6 We merged the IRRELEVANT and DEBUNK classes into non-misinformation to convert the original dataset into a binary classification task.
2.2. Victim classifiers
Training datasets were used to prepare victim models, representing popular approaches to text classification. Two of the models (BiLSTM and BERT) were trained as in the BODEGA framework, but the surprise classifier was trained specifically for the shared task and was revealed to the participants in the test phase, one week before the submission deadline.
BiLSTM classifier consists of an embedding layer (token representations of size 32), two LSTM [23]
layers (forwards and backwards, hidden representation of size 128) and a dense linear layer converting
the text fragment representation (of size 256) into two-class probability, normalised using softmax.
BERT classifier is a bert-base-uncased model [24] from the HuggingFace Transformers library
[25], fine-tuned for sequence classification using Adam optimiser with linear weight decay [26] for 5
epochs.
Surprise classifier is a RoBERTa model [27], i.e. roberta-base from HuggingFace Transformers,
adversarially-trained to be more robust to adversarial attacks. We use data augmentation to improve
robustness: First, the model is fine-tuned for one epoch on the train dataset, then adversarial examples
are generated from the entire train dataset using BERT-ATTACK [28], and then the model is fine-tuned
for one epoch on a combination of the train dataset and the successful adversarial examples. We train
with a constant learning rate of 2 × 10−5 and the Adam optimiser. We use a batch size of 32 for all tasks
except PR, which uses a batch of size 64. Due to computational constraints, for HN we only generate
adversarial examples from a subset (6000 samples) of the training data.
3. Evaluation
The evaluation procedure consists of two stages. Firstly, the BODEGA framework is used to automatically
assess the attack effectiveness of each participant in 15 scenarios (5 domains × 3 victims). The average
BODEGA score is used to create the leaderboard, expressing the overall performance. Secondly, the
task most challenging for automatic evaluation (fact-checking) is used to perform manual annotation of
meaning preservation in selected instances.
3.1. Automatic evaluation
In automatic evaluation, when an example 𝑥𝑖 is modified into an AE 𝑥𝑖*, the quality of the transformation is assessed through the BODEGA score, defined as follows [12]:
BODEGA_score(𝑥𝑖, 𝑥𝑖*) = Con_score(𝑥𝑖, 𝑥𝑖*) × Sem_score(𝑥𝑖, 𝑥𝑖*) × Char_score(𝑥𝑖, 𝑥𝑖*),
where:
• Con_score, i.e. confusion score, takes the value of 1 when the attacked classifier predicts a different class for 𝑥𝑖* than it did for 𝑥𝑖, and 0 otherwise.
• Sem_score, i.e. semantic similarity score, is a measure of meaning preservation between 𝑥𝑖 and 𝑥𝑖*, computed using the BLEURT [29] evaluation measure (BLEURT-20 variant), clipped to the (0-1) range.
• Char_score, i.e. character similarity score, is a measure of the similarity of 𝑥𝑖 and 𝑥𝑖* as character sequences, computed through Levenshtein distance [30], scaled to (0-1) similarity.
We can see that an AE will be ranked highly if it changes the output of the classifier (Con_score(𝑥𝑖, 𝑥𝑖*) = 1), but at the same time preserves both the meaning (Sem_score(𝑥𝑖, 𝑥𝑖*) ≈ 1) and the appearance (Char_score(𝑥𝑖, 𝑥𝑖*) ≈ 1) of the original text.
To measure the overall attack success in a particular scenario, the BODEGA score averaged over all instances in the attack set is employed. However, the constituent scores, also averaged over the dataset, can be used to understand the results. We also report the average number of queries needed to find a single AE.
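To make the computation concrete, the score can be sketched in Python. This is a minimal illustration, not the BODEGA implementation itself: Sem_score is passed in as a precomputed BLEURT value (running BLEURT requires the trained model), and the exact Levenshtein-to-similarity scaling shown is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_score(x: str, x_adv: str) -> float:
    # Edit distance scaled to a (0-1) similarity; scaling by the longer
    # string's length is an assumption made for this sketch.
    if not x and not x_adv:
        return 1.0
    return 1.0 - levenshtein(x, x_adv) / max(len(x), len(x_adv))

def bodega_score(orig_pred: int, adv_pred: int, sem_score: float,
                 x: str, x_adv: str) -> float:
    con = 1.0 if adv_pred != orig_pred else 0.0   # Con_score: did the label flip?
    sem = min(max(sem_score, 0.0), 1.0)           # BLEURT output clipped to (0-1)
    return con * sem * char_score(x, x_adv)
```

An unsuccessful attack thus always scores 0, regardless of how similar the texts are, because the three factors are multiplied.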
Table 2
Category definitions for the manual evaluation.
Preserve the Semantic Meaning: This label is used when the semantic content of the attack sample closely aligns with that of the original sample. Participants should use this label if the meaning, context, and intent of the compared texts remain essentially unchanged, indicating that the attack sample has effectively maintained the core message of the original.
Change the Semantic Meaning: Participants should apply this label when there is a noticeable alteration in the semantic content between the original and the attack sample. This label indicates that while the attack sample may be related or similar in some aspects to the original, it diverges enough in meaning or intent to be considered distinct or modified. For example, named entities (including years, names, locations, etc.) are changed in the modified text.
No sense: The content of the attack sample does not make any sense.
Table 3
Confidence scores and definitions used by Mu et al. [31].
1: Extremely unconfident about the annotation (I’m really unsure about the annotation. It may belong to another category as well; you may wish to discard this instance from the training.)
2: Not confident about the annotation (I’m not sure about the annotation; it seems it also belongs to other categories, but you can still include this instance as a “silver standard instance” in training.)
3: Pretty confident about the annotation (I’m pretty sure about the annotation, but there is a high chance other annotators may label it in a different category.)
4: Fairly confident about the annotation (I’m confident about the annotation, but there is a small chance other annotators may label it in a different category.)
5: Extremely confident about the annotation (I’m certain about the annotation without a doubt.)
3.2. Manual evaluation
The goal of the manual evaluation is to highlight the cases where the automatic evaluation measures, especially regarding semantic similarity, might not be an accurate representation of the human reception of the adversarially modified content. To that end, we have selected the samples from the fact-checking domain and the surprise victim, where even small changes in text can alter the meaning and, consequently, the credibility label.
Task Description We aim to gather assessments of the semantic similarity between attack samples and the original samples. Participants in the shared task are requested to dedicate approximately 60 minutes to this manual evaluation (i.e., 100 samples per participant), which is conducted using an open-source, collaborative annotation platform, GATE Teamware 2 [32]. Judges rate the sample pairs on the following scale: (a) Preserve the Semantic Meaning, (b) Change the Semantic Meaning, and (c) No sense. Table 2 shows the categories and their descriptions. Similarly to the work of Mu et al. [31], annotators are required to indicate the confidence level (see Table 3) of their assigned class in the ‘Confidence Row’.
Data Sampling We randomly select 100 paired samples (i.e., original and modified texts) from
each submission, resulting in a total of 600 paired samples. Note that only the successful attacks are
considered.
Annotator Training We train the annotators by providing a training document detailing the anno-
tation pipeline, which includes (i) a step-by-step tutorial for using the GATE Teamware platform, (ii)
a user information sheet to inform about any potential issues and risks that may occur during data
annotation, and (iii) a user consent sheet as required by the ethical approval from the University of
Sheffield, where the annotation was performed.
A total of 12 annotators (i.e., 6 participants and 6 organisers) were recruited to manually annotate the
paired samples. These 12 annotators were further divided into 6 separate groups (i.e., two annotators
per group). In each group, 100 tweets were assigned to each annotator. Finally, this process yielded 100
double-annotated paired samples from each group, resulting in 600 double-annotated samples in total.
Annotation Methodology and Quality Assurance All samples are double-annotated by the shared task organisers and participants: each paired sample is annotated by one participant and one shared task organiser. A third annotator from the shared-task organisers resolves any conflicts. Given that there are three categories in total, in the case of three differing annotations, the annotation with the highest confidence score is taken.
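The adjudication rule above can be sketched as follows; the `(label, confidence)` pair representation is an assumption made for illustration.

```python
from collections import Counter

def resolve_label(annotations):
    """annotations: (label, confidence) pairs for one sample.
    If any label is shared by at least two annotators, it wins; when all
    labels differ, the annotation with the highest confidence is taken."""
    counts = Counter(label for label, _ in annotations)
    label, n = counts.most_common(1)[0]
    if n > 1:
        return label
    return max(annotations, key=lambda a: a[1])[0]
```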
4. Results
Here we outline the results of the InCrediblAE shared task in three steps: first, we describe the solutions
submitted by the participants (section 4.1), then we present the results of the automatic (section 4.2)
and manual evaluation (section 4.3).
4.1. Participating solutions
The SINAI team [33] proposes a method for adversarial attacks based on the substitution of characters
by homoglyphs (e.g. characters which resemble the target such as l ≈ 1). The method uses exhaustive
search with two variants: with memory and without memory. They ground their approach in the fact
that homoglyphs could deceive the human eye while at the same time provoke a Large Language Model
classifier to reverse its prediction due to the presence of an unexpected token. According to the official leaderboard, the approach is suboptimal, obtaining the last rank in the task under the official evaluation metrics. However, it ranked high in the human evaluation of content preservation.
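A minimal sketch of such a homoglyph attack is shown below; the substitution map and the `attack`/`victim_predict` names are illustrative assumptions, not the team's actual code.

```python
# Illustrative homoglyph map (not the team's actual table); the last three
# substitutes are Cyrillic letters that resemble their Latin counterparts.
HOMOGLYPHS = {"l": "1", "O": "0", "a": "\u0430", "e": "\u0435", "o": "\u043e"}

def homoglyph_variants(text):
    # Yield every single-character homoglyph substitution, one position at a time.
    for i, ch in enumerate(text):
        if ch in HOMOGLYPHS:
            yield text[:i] + HOMOGLYPHS[ch] + text[i + 1:]

def attack(text, victim_predict):
    # Exhaustive (memoryless) search: return the first variant that flips
    # the victim's decision, or None if no single substitution succeeds.
    original = victim_predict(text)
    for variant in homoglyph_variants(text):
        if victim_predict(variant) != original:
            return variant
    return None
```

Because the substituted glyph is visually near-identical, the character similarity stays high while the classifier sees an entirely different token.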
The MMU_NLP participation [34] features a system to attack classifiers based on lexical substitution
and character replacement. The proposed method searches for candidate words to attack followed
by a word replacement mechanism. The word search mechanism masks words to check their vulner-
ability with those words having a high impact on the classifier performance retained for the attack.
The replacement step uses homoglyphs for character replacement or lexical substitution. Character
replacement is tested in two different conditions: random character attack or begin/end of word attack.
The lexical replacement attack uses a large language model to retrieve a word similar to the target
word. Overall results fall short compared to other participants; however, the proposed method improves
over the baseline in several settings. The character attack method seems more effective than the word
replacement approach.
The Palöri team [35] proposes an approach to identify vulnerable words by computing (relying on a
masked language model) the difference between the probability distribution of the original sentence
and the sentence with a word masked. These differences are used as scores to rank words by their
"vulnerability" according to the model. The ranked words (most to least vulnerable) are then replaced
using a word from a list of substitutes proposed by the language model. The sentence with the replaced
word is used to attack the victim classifier. In case of success, the new sentence is returned, otherwise
the method loops using the sentence with the "best" possible substitution (i.e. one which reduced the
victim’s confidence the most). The method produces successful attacks which, however, do not preserve the original sentence’s meaning. To address this problem, a "synonym" dictionary to draw substitutes from is created using GloVe embeddings and the aclImdb dataset. The new method contributes only minor improvements over the masked language model.
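The word-vulnerability ranking can be sketched as follows; `dist_fn` stands in for the masked language model, and total-variation distance is an assumed divergence measure (the overview does not specify which one the team used).

```python
def rank_vulnerable_words(words, dist_fn, mask_token="[MASK]"):
    """Rank word positions by how much masking each word shifts the model's
    output distribution. dist_fn maps a token list to a probability
    distribution; larger shift means a more "vulnerable" word."""
    base = dist_fn(words)
    scores = []
    for i in range(len(words)):
        masked = words[:i] + [mask_token] + words[i + 1:]
        # Total-variation distance between the two distributions.
        shift = sum(abs(p - q) for p, q in zip(base, dist_fn(masked))) / 2
        scores.append((shift, i))
    # Most vulnerable (largest shift) first.
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]
```

The attack would then try substitutes at the top-ranked positions first, querying the victim after each replacement.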
The solution of the OpenFact team [36] consists in a coupling of various word-substitution approaches
in an ensemble in such a way that if the first approach does not succeed in changing the classifier’s
decision, then the second one is called. In particular, a modification of BERT-ATTACK [28] was
proposed (it features changed parameter values; an alternative selection of the replacement position by an exhaustive search for candidates that provide the largest difference in probabilities for the predicted class; and an iterative replacement of 1 to 7 words, including punctuation and digits at the later stages, until the attack succeeds), backed up by a genetic algorithm [37] realised in the OpenAttack framework [13]. Another ensemble was compiled of a proposed greedy search by word swap with
synonyms in the word embedding space (prebuilt "counter-fitted" GloVe embeddings [38]) and another
model available in the OpenAttack, TextFooler [39], which unlike other similar approaches replaces
words in agreement with the syntax of the attacked text. Apart from ensemble models, approaches
from the TextAttack framework [40] were used. They demonstrated superior performance over the
baseline methods in automatic scores. In particular, CLARE [41] – a model that implements a special
mask-then-infill procedure that incorporates replace/insert/merge operations allowing for outcomes
varied in length – was applied for PR, FC and C19 tasks and consistently yielded better results in all
automatic scores. Overall, this solution gained the best automatic scores in most of the domains for
most of the victims, which made it the first in the leaderboard created by averaging the scores across
all scenarios. However, it was ranked very low within the human evaluation, as in the majority of the
cases the meaning of the text was changed.
The TurQUaz [42] team leverages a genetic algorithm to search for a combination of character modifications. The modifications introduced by the mutation operator include homoglyph replacement,
three options of word splitting (random, favouring existing words in the subword outcome, and special
heuristic-wise), insertion or removal of individual random letters, and shuffling the order of the letters
within the word. The search is carried out until the first label flip is found. Only the use of homoglyphs and word splits proved to be effective. Apart from the genetic algorithm, the team experimented with attacks
made by utilizing large language models (LLMs) such as Llama 37 and Mistral8 . Three approaches have
been tried: (i) prompting a model for text paraphrasing, (ii) leveraging a model for identifying words to
be changed, and (iii) generating adversarial examples with one LLM and verifying whether an attack is
going to be successful with another LLM. None of the approaches outperformed the genetic algorithm,
however, the team believes in the potential of LLMs and suggests fine-tuning the models specifically for
this task to improve their performance in the future. The primary solution has been ranked third based
on the BODEGA score.
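A few of these mutation operators can be sketched as follows; the homoglyph map and the exact splitting/shuffling schemes are assumptions for illustration, not the team's implementation.

```python
import random

HOMOGLYPHS = {"l": "1", "o": "0", "a": "@"}  # illustrative map, not the team's

def swap_homoglyph(word, rng):
    # Replace one randomly chosen character that has a homoglyph substitute.
    idx = [i for i, c in enumerate(word) if c in HOMOGLYPHS]
    if not idx:
        return word
    i = rng.choice(idx)
    return word[:i] + HOMOGLYPHS[word[i]] + word[i + 1:]

def split_word(word, rng):
    # Insert a space at a random interior position (the "random" split variant).
    if len(word) < 2:
        return word
    i = rng.randrange(1, len(word))
    return word[:i] + " " + word[i:]

def shuffle_letters(word, rng):
    # Shuffle interior letters, keeping first and last fixed (an assumed
    # scheme; the overview does not specify the exact shuffling rule).
    if len(word) < 4:
        return word
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

def mutate(word, rng):
    # The genetic search would apply one randomly chosen operator per mutation.
    op = rng.choice([swap_homoglyph, split_word, shuffle_letters])
    return op(word, rng)
```

In the genetic algorithm, such mutations are applied to members of a candidate population, and the search stops as soon as a mutant flips the victim's decision.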
The TextTrojaners team [43] introduces BeamAttack, a method that attacks at the word level, using RoBERTa [27] and beam search as a backbone to produce contextually appropriate word substitutions. The beam search algorithm, adapted to allow replacing, skipping, or removing words, generates and evaluates multiple alternative word replacement combinations in a single run. For the identification of the most vulnerable words to be replaced, the team experiments with
two ranking approaches that use the explainable AI framework LIME [44] and logit-based importance
scores, as proposed in [28]. In a series of ablation studies, they show that the choice of a ranking method
varies across victims and depends on the dataset. The solution achieved the second-best result on the
BODEGA evaluation metric but gained rather low manual evaluation scores.
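The core of such a beam search can be sketched as follows. This is a simplified sketch: `candidates_fn` stands in for RoBERTa mask-filling, word removal is omitted (skipping is implicit, since the unmodified hypothesis stays in the beam), and the left-to-right position order is an assumption.

```python
def beam_attack(words, candidates_fn, victim_prob, beam_width=3):
    """Word-level beam search. candidates_fn(i) returns replacement words
    for position i; victim_prob(words) is the victim's probability of the
    original class, which the search tries to minimise."""
    beams = [(victim_prob(words), words)]
    for i in range(len(words)):
        expanded = list(beams)  # keeping the old hypotheses = "skip" option
        for _, ws in beams:
            for cand in candidates_fn(i):
                new = ws[:i] + [cand] + ws[i + 1:]
                expanded.append((victim_prob(new), new))
        # Keep the beam_width hypotheses with the lowest victim confidence.
        beams = sorted(expanded, key=lambda b: b[0])[:beam_width]
    return beams[0]
```

Keeping several hypotheses per step lets the search combine substitutions at multiple positions in a single run, at the cost of more victim queries, which matches the high query counts reported for this method.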
4.2. Automatic evaluation
Table 4 includes the results of the evaluation against the BiLSTM victim. Generally, we can see that
different approaches dominate in different scenarios. However, in every domain, the best BODEGA
score (in boldface) is achieved by a solution submitted to InCrediblAE, rather than a reference solution
from previous work (BERT-ATTACK or DeepWordBug). HN appears to be the easiest domain, with the
leading solution (TextTrojaners) achieving BODEGA score of 0.91 through 100% confusion with 91%
semantic similarity and 99% character similarity. This result, closely followed by OpenFact, is especially
impressive when compared to the scores of BERT-ATTACK (BODEGA score of 0.64). The TextTrojaners
approach also leads in PR, but we need to note the high number of queries it needs – 593 in this case, compared to TurQUaz, which achieves almost the same result (0.68 instead of 0.70) with six times fewer queries.
C19 appears to be the most challenging domain for attacks, although even here we note a vast
improvement over the reference method (OpenFact: 0.72 vs BERT-ATTACK: 0.50).
In the attacks against the BERT victim, evaluated in table 5, the OpenFact method dominates, ceding
only in FC to homoglyph-based SINAI. We can also note that the best BODEGA score is either equivalent
(for FC, HN and C19) or significantly lower (for PR and RD) than for BiLSTM, indicating higher difficulty
of attacking a Transformer-based classifier. This also results in a higher number of queries necessary to
7 https://huggingface.co/docs/transformers/model_doc/llama3
8 https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
Table 4
Results of the automatic evaluation of attacks against the BiLSTM victims, for each of the five domains, expressed
through BODEGA score and its constituents, as well as the number of queries (except for OpenFact, whose
submission did not include this value).
Domain Method BODEGA Confusion Semantic Character Queries
PR MMU_NLP 0.40 0.57 0.74 0.95 390.42
OpenFact 0.65 0.94 0.77 0.89 -
Palöri 0.60 0.98 0.68 0.90 52.31
SINAI 0.34 0.37 0.93 0.98 318.27
TextTrojaners 0.70 0.97 0.80 0.90 593.38
TurQUaz 0.68 0.95 0.76 0.94 96.22
BERT-ATTACK 0.53 0.80 0.72 0.91 61.41
DeepWordBug 0.29 0.38 0.79 0.96 27.45
FC MMU_NLP 0.49 0.66 0.77 0.96 175.32
OpenFact 0.80 0.98 0.84 0.97 -
Palöri 0.69 1.00 0.71 0.97 76.47
SINAI 0.82 0.98 0.86 0.96 507.16
TextTrojaners 0.76 1.00 0.81 0.94 1,549.12
TurQUaz 0.79 1.00 0.82 0.97 39.44
BERT-ATTACK 0.60 0.86 0.73 0.95 132.80
DeepWordBug 0.48 0.58 0.85 0.98 54.36
RD MMU_NLP 0.12 0.28 0.44 0.97 2,198.42
OpenFact 0.84 0.95 0.91 0.98 -
Palöri 0.30 0.71 0.44 0.95 532.04
SINAI 0.14 0.17 0.85 1.00 140.96
TextTrojaners 0.83 1.00 0.87 0.96 3,831.72
TurQUaz 0.39 0.67 0.61 0.95 276.93
BERT-ATTACK 0.29 0.79 0.41 0.89 985.52
DeepWordBug 0.16 0.24 0.68 0.99 232.75
HN MMU_NLP 0.54 0.91 0.59 0.99 898.83
OpenFact 0.89 0.97 0.93 0.99 -
Palöri 0.64 0.99 0.65 0.99 373.22
SINAI 0.41 0.48 0.87 1.00 172.92
TextTrojaners 0.91 1.00 0.91 0.99 936.96
TurQUaz 0.62 1.00 0.65 0.96 68.30
BERT-ATTACK 0.64 0.98 0.66 0.99 487.85
DeepWordBug 0.41 0.53 0.77 1.00 396.18
C19 MMU_NLP 0.45 0.78 0.60 0.95 149.15
OpenFact 0.72 0.91 0.83 0.96 -
Palöri 0.53 0.99 0.57 0.93 110.90
SINAI 0.26 0.30 0.88 1.00 33.03
TextTrojaners 0.72 0.99 0.77 0.92 837.35
TurQUaz 0.58 0.93 0.65 0.96 103.10
BERT-ATTACK 0.50 0.84 0.62 0.95 127.17
DeepWordBug 0.33 0.48 0.70 0.99 61.15
find an AE, e.g. while the leading method for HN domain needed 937 queries to obtain the score of 0.91
with BiLSTM, attacking BERT with the same approach requires 4328 queries on average.
The results for the surprise victim (table 6) show that the adversarially-trained classifier is indeed more challenging to attack, leading to lower scores in the PR and RD domains. However, the general picture remains similar, with OpenFact again dominating, with the exception of FC. Their approach clearly works very well with Transformer-based victims, except in this single task. The increasing level of difficulty is again reflected in rising numbers of queries, reaching a record value of over 15,000 for TextTrojaners attacking in the RD domain.
Finally, table 7 shows the final leaderboard obtained by averaging the BODEGA scores across
Table 5
Results of the automatic evaluation of attacks against the BERT victims, for each of the five domains, expressed
through BODEGA score and its constituents, as well as the number of queries (except for OpenFact, whose
submission did not include this value).
Domain Method BODEGA Confusion Semantic Character Queries
PR MMU_NLP 0.33 0.47 0.75 0.95 438.26
OpenFact 0.68 0.97 0.77 0.89 -
Palöri 0.56 0.97 0.64 0.88 65.37
SINAI 0.38 0.43 0.92 0.98 288.73
TextTrojaners 0.62 0.99 0.71 0.86 4,097.37
TurQUaz 0.46 0.68 0.72 0.94 254.75
BERT-ATTACK 0.43 0.70 0.68 0.90 80.16
DeepWordBug 0.28 0.36 0.79 0.96 27.43
FC MMU_NLP 0.55 0.73 0.78 0.96 710.29
OpenFact 0.80 1.00 0.83 0.97 -
Palöri 0.62 0.98 0.66 0.96 102.52
SINAI 0.82 0.97 0.86 0.98 250.74
TextTrojaners 0.79 1.00 0.83 0.96 1,390.83
TurQUaz 0.74 1.00 0.78 0.95 70.70
BERT-ATTACK 0.53 0.77 0.73 0.95 146.73
DeepWordBug 0.44 0.53 0.84 0.98 54.32
RD MMU_NLP 0.15 0.37 0.42 0.93 986.24
OpenFact 0.65 0.78 0.86 0.95 -
Palöri 0.18 0.45 0.42 0.94 1,077.24
SINAI 0.12 0.14 0.87 1.00 143.06
TextTrojaners 0.59 0.80 0.79 0.91 10,618.93
TurQUaz 0.22 0.38 0.61 0.95 417.75
BERT-ATTACK 0.18 0.44 0.43 0.96 774.31
DeepWordBug 0.16 0.23 0.70 0.99 232.74
HN MMU_NLP 0.47 0.86 0.55 0.97 806.60
OpenFact 0.91 1.00 0.92 0.99 -
Palöri 0.60 0.96 0.64 0.98 502.03
SINAI 0.24 0.27 0.87 1.00 245.10
TextTrojaners 0.85 1.00 0.87 0.97 4,327.67
TurQUaz 0.46 0.84 0.59 0.93 223.54
BERT-ATTACK 0.60 0.96 0.64 0.97 648.41
DeepWordBug 0.22 0.29 0.78 1.00 395.94
C19 MMU_NLP 0.45 0.82 0.58 0.95 142.18
OpenFact 0.72 0.91 0.82 0.96 -
Palöri 0.52 0.96 0.57 0.93 201.01
SINAI 0.41 0.47 0.89 1.00 32.16
TextTrojaners 0.71 0.98 0.78 0.92 2,628.90
TurQUaz 0.57 0.96 0.62 0.95 102.92
BERT-ATTACK 0.42 0.74 0.60 0.95 161.70
DeepWordBug 0.27 0.39 0.71 0.99 61.06
victims and domains. We can see that all solutions submitted to the task have beaten the DeepWordBug
reference, and most have also outperformed BERT-ATTACK, which is a strong reference point. OpenFact
and TextTrojaners are the clear leaders, with the former performing slightly better, especially against
Transformer victims. However, we need to emphasise that the averaged ranking does not show the
whole picture, and different methods work best in different scenarios. For example, SINAI is the best
approach for the BERT-FC combination.
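For reference, the BODEGA score behind these rankings is, per example, the product of the three components reported in the tables: a binary confusion score (did the victim change its decision), a semantic similarity score (BLEURT-based in the official setup) and a character similarity score derived from the Levenshtein edit distance; the corpus-level score averages this product. The sketch below is a minimal illustration of that aggregation, using a plain normalised Levenshtein similarity rather than the official implementation's exact variant:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def character_score(original: str, modified: str) -> float:
    # Character similarity in [0, 1]: 1 minus edit distance
    # normalised by the longer string's length.
    longest = max(len(original), len(modified), 1)
    return 1.0 - levenshtein(original, modified) / longest

def bodega_score(confusion: float, semantic: float, character: float) -> float:
    # Per-example product of the three components; the reported
    # BODEGA score is this product averaged over all test examples.
    return confusion * semantic * character
```

In particular, the multiplicative form explains why methods with near-perfect confusion but low semantic similarity (or vice versa) end up with modest overall scores.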
Table 6
Results of the automatic evaluation of attacks against the Surprise victims, for each of the five domains,
expressed through BODEGA score and its constituents, as well as the number of queries (except for OpenFact,
whose submission did not include this value).
Domain Method BODEGA Confusion Semantic Character Queries
PR MMU_NLP 0.28 0.40 0.76 0.94 525.66
OpenFact 0.62 0.93 0.75 0.87 -
Palöri 0.25 0.54 0.55 0.83 482.21
SINAI 0.26 0.31 0.89 0.97 374.20
TextTrojaners 0.45 0.97 0.55 0.79 10,286.82
TurQUaz 0.20 0.26 0.78 0.95 471.40
BERT-ATTACK 0.20 0.32 0.69 0.91 117.64
DeepWordBug 0.13 0.17 0.81 0.96 26.87
FC MMU_NLP 0.51 0.68 0.78 0.96 201.15
OpenFact 0.80 1.00 0.82 0.97 -
Palöri 0.66 1.00 0.68 0.97 117.77
SINAI 0.44 0.50 0.89 0.99 43.14
TextTrojaners 0.82 1.00 0.84 0.97 498.93
TurQUaz 0.71 1.00 0.75 0.94 90.55
BERT-ATTACK 0.56 0.79 0.73 0.96 164.07
DeepWordBug 0.37 0.46 0.83 0.98 53.39
RD MMU_NLP 0.16 0.35 0.46 0.97 2,894.29
OpenFact 0.55 0.71 0.82 0.93 -
Palöri 0.19 0.47 0.43 0.93 1,513.25
SINAI 0.09 0.10 0.85 1.00 149.12
TextTrojaners 0.54 0.87 0.69 0.84 15,458.12
TurQUaz 0.17 0.28 0.63 0.96 466.13
BERT-ATTACK 0.17 0.41 0.42 0.95 951.87
DeepWordBug 0.12 0.18 0.69 0.99 229.56
HN MMU_NLP 0.47 0.77 0.62 0.97 713.89
OpenFact 0.83 0.99 0.86 0.97 -
Palöri 0.34 0.57 0.62 0.98 1,453.38
SINAI 0.36 0.41 0.88 1.00 202.67
TextTrojaners 0.67 1.00 0.72 0.92 4,596.62
TurQUaz 0.28 0.47 0.61 0.94 376.77
BERT-ATTACK 0.38 0.67 0.60 0.95 1,781.97
DeepWordBug 0.16 0.21 0.76 1.00 384.34
C19 MMU_NLP 0.42 0.76 0.58 0.94 155.45
OpenFact 0.72 0.99 0.78 0.93 -
Palöri 0.46 0.99 0.51 0.89 299.37
SINAI 0.17 0.18 0.92 1.00 37.97
TextTrojaners 0.65 1.00 0.71 0.91 6,491.39
TurQUaz 0.41 0.75 0.59 0.92 253.78
BERT-ATTACK 0.37 0.68 0.58 0.93 198.26
DeepWordBug 0.20 0.28 0.72 0.98 60.94
4.3. Manual Evaluation
We randomly selected 100 successful adversarial samples from each team’s submission for the scenario
combining fact checking and the surprise victim. During the evaluation, both the original and adversarial
samples were presented to the annotators, with differences highlighted. The annotators were asked to
categorise each sample pair into one of the three categories described in Section 3.2. They also provided
a confidence score for each annotation (5: very confident, 1: not confident). Each sample pair was
judged by at least two annotators; agreement between the two initial annotators was 0.52 (Cohen’s
kappa). A third annotator was invited whenever the two initial annotators disagreed.
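The agreement figure above follows the standard Cohen's kappa computation, sketched below for two annotators (illustrative code, not the tooling used in the lab):

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    # Observed agreement between two annotators, corrected for the
    # agreement expected by chance given each annotator's label rates.
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```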
We determined the ranking of participants by the number of samples that fell into the ’Preserve the
Table 7
Final leaderboard, created by averaging BODEGA scores across all scenarios (five domains and three victims).
# Method BODEGA avg.
1. OpenFact 0.7458
2. TextTrojaners 0.7074
3. TurQUaz 0.4859
4. Palöri 0.4776
5. MMU_NLP 0.3848
6. SINAI 0.3507
- BERT-ATTACK 0.4261
- DeepWordBug 0.2682
Table 8
Manual evaluation results.
Team % of Preserve the meaning
SINAI 99%
MMU_NLP 96%
TurQUaz 62%
Palöri 14%
OpenFact 11%
TextTrojaners 7%
Semantic Meaning’ category, adhering to the principle that a higher count indicates better performance.
This ranking method was used because the task demands that the adversarial samples maintain their
original meaning. Table 8 presents the final manual evaluation results for all participants.
In general, we observe a discrepancy between the manual and automatic evaluations (see Table 7). This
may be because the manually evaluated domain is fact checking, where even a slight replacement of a
named entity (such as changing the year from 1990 to 1991) can result in a change of meaning.
The team leading the manual evaluation (SINAI) uses an adversarial attack method based mainly on
substituting characters with homoglyphs and achieves a 99% score. This suggests that homoglyphs
can successfully deceive the annotator’s eye. Team MMU_NLP employs related attack approaches,
such as lexical substitution and character replacement, and also achieves a high manual evaluation
score (96%).
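As an illustration of this character-level technique, the sketch below replaces selected Latin letters with visually near-identical Cyrillic codepoints. The mapping is a small illustrative subset, not the actual homoglyph table used by SINAI or MMU_NLP:

```python
# Illustrative Latin-to-Cyrillic lookalike pairs; real homoglyph
# attacks draw on much larger confusable-character tables.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}

def homoglyph_attack(text: str, positions: set[int]) -> str:
    # Swap only the characters at the chosen positions, so the text
    # looks unchanged to a human reader but tokenises differently.
    return "".join(
        HOMOGLYPHS.get(ch, ch) if i in positions else ch
        for i, ch in enumerate(text)
    )
```

Because the rendered glyphs are indistinguishable, such modifications tend to pass manual meaning-preservation checks, consistent with the scores above.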
Team TurQUaz proposes a method of inserting white space to split English words, which achieved
third place on the leaderboard (62%). However, this method may sometimes change the meaning of
the original fact-checking document, resulting in a lower manual evaluation score than the methods
proposed by teams SINAI and MMU_NLP. Moreover, inserting white space into the original text can
make it uninterpretable to humans, resulting in a high proportion of Team TurQUaz’s submissions
falling into the third category, i.e., “The sentence does not make any sense.”
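The word-splitting idea can be sketched as below (a hypothetical helper, not TurQUaz's genetic-algorithm implementation, which additionally searches for which words to split):

```python
def split_word(text: str, word_index: int) -> str:
    # Insert a space near the middle of one word; humans can often
    # still read the result, but a tokeniser now sees two tokens.
    words = text.split(" ")
    w = words[word_index]
    mid = max(1, len(w) // 2)
    words[word_index] = w[:mid] + " " + w[mid:]
    return " ".join(words)
```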
We observe that OpenFact (11%) and TextTrojaners (7%) obtain lower manual evaluation scores. By
manually investigating text modified by Team OpenFact, we notice that some key information, such as
times and locations, has been changed. Such named entities play a vital role in the context of
fact-checking tasks, so changes to them result in a higher number of samples labelled as “Change
the Semantic Meaning.” As for the solution of TextTrojaners, the choice of alternative words depends
solely on the context rather than on the word to be replaced, which can significantly alter the meaning.
In addition, removing words without adjusting the agreement of the surrounding words may lead to
nonsensical sentences.
5. Discussion
The first conclusion from the results obtained is clear: the state of the art in AE generation for
misinformation detection, established by previous solutions, has advanced considerably. A variety of
solutions were submitted to the shared task, but they follow the established lines of research in the area:
word replacements (preserving meaning similarity) or character replacements (preserving visual similarity).
The word-level solutions (especially TextTrojaners and OpenFact) performed best in most
scenarios, but not all. Fact checking is a clear outlier due to its nature: every word matters,
making it hard to perform any change without drastically affecting the meaning. This opens the avenue
for character-level modifications and, indeed, such a solution (SINAI) provided the best results in the
manual evaluation.
We also need to acknowledge the limitations of the evaluation setup. It aims to predict the likelihood
of an AE fooling the victim classifier while transmitting the intended message, as encoded in the
BODEGA score. However, human readers are the intended recipients of misinformation, and they are
also able to refuse to engage with a message that seems suspicious, artificial or distorted, e.g. due to
the use of letters with non-standard shapes. Thus, the success of AEs will also depend on the visual
appearance of the manipulated content, which is not directly evaluated in the current setup. Quantifying
this effect would be challenging in manual evaluation and even more so in an automatic setup.
In any case, the results that we do have leave one thing clear: popular architectures for text
classification are very vulnerable to attacks with AEs. While the adversarially-trained model posed a
slightly harder challenge, AEs were ultimately found for nearly all cases in these scenarios as well.
How can we protect real-world deployments of text classifiers against such attacks? A first
barrier can be established by limiting access to the victim model: we can see from the results that
the advance over the previous state of the art was accompanied by a rise in the number of queries sent,
well into the hundreds and thousands for each generated AE sample. Nevertheless, machine-learning-based
solutions for content filtering should only be deployed after a thorough analysis of their adversarial
robustness.
6. Conclusion
In the InCrediblAE shared task, six teams participated with various solutions, operating with both
word-level and character-level changes. The participants’ approaches (and two reference solutions)
were evaluated using five misinformation-detection scenarios and three victim models.
In total, 53,544 text modifications were considered and automatically assessed in terms of classifier
confusion, meaning preservation and character similarity. The submitted solutions easily outperformed
previous work in all of the tested scenarios. The manual evaluation highlighted the special role of the
fact-checking task and the efficacy of character replacement in performing modifications imperceptible
to humans.
We hope that the combined effort of the participants and organisers of the InCrediblAE shared task
will succeed in both highlighting the importance of robustness testing and showcasing the best solutions.
To facilitate this outcome, the code and resources necessary for performing the automatic evaluation
remain openly available at https://github.com/piotrmp/BODEGA.
Acknowledgments
The work of P. Przybyła is part of the ERINIA project, which has received funding from the European
Union’s Horizon Europe research and innovation programme under grant agreement No 101060930.
This work has also been partially funded by the European Commission under contract numbers
HE-101070278 and ISF-101080090. Views and opinions expressed are however those of the author(s) only
and do not necessarily reflect those of the funders. Neither the European Union nor the granting
authority can be held responsible for them. We also acknowledge support from Departament de Recerca
i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from Maria de Maeztu Units of
Excellence Programme CEX2021-001195-M, funded by MCIN/AEI /10.13039/501100011033.
A. Annotation Guidelines
InCrediblAE Shared Task Annotation All participants who submit entries will have their submis-
sions manually evaluated for semantic similarity. To aid in this human evaluation process, participants
in the shared tasks will be required to manually assess a selection of the submitted examples.
Manual Scoring We will gather assessments regarding the semantic similarities between attack
samples and the original samples. Participants in the shared task are requested to dedicate approximately
60 minutes to this manual evaluation (i.e., 100 samples per participant), which will be conducted using an
online annotation tool. Judges will rate the sample pairs based on the following scale: (a). Preserve the
Semantic Meaning, (b). Change the Semantic Meaning, (c). No sense.
Labels and Definitions (a). Preserve the Semantic Meaning: This label is used when the semantic
content of the attack sample closely aligns with that of the original sample. Participants should use this
label if the meaning, context, and intent of the compared texts remain essentially unchanged, indicating
that the attack sample has effectively maintained the core message of the original. For example:
Text Original: middle-earth. middle-earth is the fictional setting of much of british writer
j. r. r. tolkien ’s legendarium . middle-earth is a setting.
Text Modified: middle-earth. middle-earth is the fictional background of much of british
writer j. r. r. tolkien ’s legendarium . middle-earth is a setting.
(b). Change the Semantic Meaning: Participants should apply this label when there is a noticeable
alteration in the semantic content between the original and the attack sample. This label indicates that
while the attack sample may be related or similar in some aspects to the original, it diverges enough in
meaning or intent to be considered distinct or modified, e.g. when named entities (including years,
names, locations, etc.) are changed in the modified text. For example:
Text Original: carol danvers. the character has also been known as binary , warbird , and
captain marvel at various points in her history , and has been featured in other marvel
licensed products including video games , animated television series , and merchandise
such as trading cards . carol danvers is a real person.
Text Modified: carol danvers. the character has also been known as binary , warbird , and
captain marvel at various points in her history , and has been featured in other marvel
licensed products including video games , animated television series , and merchandise
such as trading cards . carol danvers is a fictional person.
(c). No sense: The sentence does not make any sense.
Confidence Scores Please indicate how confident you are in your assigned class in the Confidence
Row. The confidence scores range from 1 to 5.
A.1. Information Sheet for InCrediblAE Participants
You are invited to participate in this research by contributing to the evaluation of the semantic similarity
of adversarial examples. It is important for you to understand the goals of the task and what your
participation will involve. Please take the time to read the following information. Please, ask us if there
is anything that is not clear or if you would like more information. Thank you very much for your time.
What is the task’s purpose? The aim of this research is to evaluate the robustness of the text
classifier in adversarial attacks (a detailed description of the task can be found here: https://checkthat.
gitlab.io/clef2024/task6/). In this manual evaluation, we will gather assessments regarding the semantic
similarities between attack samples and the original samples. Participants in the shared task are
requested to dedicate approximately 8 hours to this manual evaluation, which will be conducted using
an online tool. Judges will rate the sample pairs based on the following scale: 3 (Preserve the Semantic
Meaning), 2 (Change the Semantic Meaning), and 1 (No sense).
Your participation, what it involves and why we are grateful It is up to you to decide whether
or not you want to participate in this annotation task. If you do decide to support us in the project, you
will be given this information sheet to keep (and be asked to sign a separate consent form). You can
still withdraw at any time without any consequences and without giving any reason. The entire data
collected from withdrawn participants will be destroyed immediately, and no personal information
will be kept. If you submitted a solution to the shared task, withdrawing also means that your submission
will not be manually scored. If you wish to withdraw, please contact Dr Xingyi Song
(details in section 10). You will be asked to annotate adversarial examples into 4 categories using the
GATE Teamware Platform (https://annotate.gate.ac.uk/).
What are the possible advantages, disadvantages and risks of being involved? Participating
in this evaluation will support our research on the development of a more accurate assessment of the
robustness of text classifiers and the effectiveness of adversarial attack methods. No major disadvantages
or risks are foreseen; however, it is worth mentioning that the content being annotated may cause distress.
You will be in charge of selecting the content to be annotated, therefore, you are free to only select
content that you are comfortable with. If, at any part of the experiment you feel uncomfortable with
the content you are accessing, please talk to one of the responsible researchers.
Will my involvement be kept confidential? All the information that we collect from you and
about you during the course of the research will be kept strictly confidential and will only be accessible
to members of the research team. You will not be able to be identified in any reports or publications
unless you have given your explicit consent for this on your participant consent form.
What is the legal basis for processing my personal data? According to data protection legislation,
we are required to inform you that the legal basis we are applying in order to process your personal
data is that ‘processing is necessary for the performance of a task carried out in the public interest’
(Article 6(1)(e)). Further information can be found in the University of Sheffield’s Privacy Notice, which
is available online under https://www.sheffield.ac.uk/govern/data-protection/privacy/general.
What will happen to the data collected in this study? The data gathered during this scoring
task will be used to assess the effectiveness of adversarial examples generation. Since this data will
also benefit other researchers, we plan to release a version of the annotated dataset. You will not be
identified and your scoring will be aggregated with multiple other annotators. In the case that you
provided us with your e-mail address at the beginning of the scoring task, we will destroy it after the
scoring task is finished. The University of Sheffield will act as the Data Controller for this study.
Who is organising and funding the research? This study is organised jointly by Universitat
Pompeu Fabra and the University of Sheffield at the CheckThat! Lab at CLEF 2024. Universitat Pompeu
Fabra is funded by the European Union’s Horizon Europe research and innovation programme under
grant agreement No 101060930. The University of Sheffield is funded by the UK’s innovation agency
(Innovate UK) grant 10039055 (approved under the Horizon Europe Programme as vera.ai EU grant
agreement 101070093).
Who has ethically reviewed this study? This project has been ethically approved via the University
of Sheffield’s Ethics Review Procedure, as administered by the Computer Science Department.
What if something goes wrong and I wish to complain about the research? If you have any
complaints, either about the researcher or about something occurring during or following your participation
in the project (e.g. a reportable serious adverse event), please contact Dr. Xingyi Song (contact
details in section 10). Should you feel your complaint has not been handled to your satisfaction, you
can also contact the Head of Department at the University of Sheffield, Professor Heidi Christensen
(heidi.christensen@sheffield.ac.uk) who will then escalate the complaint through the appropriate
channels. If the complaint relates to how your personal data has been handled, information about
how to raise a complaint can be found in the University’s Privacy Notice: https://www.sheffield.ac.uk/
govern/data-protection/privacy/general.
Contact for further information Details of who you should contact if you wish to obtain further
information are as follows: Dr. Xingyi Song, Department of Computer Science, University of Sheffield,
211 Portobello, Sheffield S1 4DP, UK. E-mail: x.song@sheffield.ac.uk Telephone: +44 114 222 18577.
B. COVID-19 Misinformation Dataset Examples
Table 9
Positive and negative examples from the COVID-19 misinformation dataset (C19).

Label: 1 (Misinformation related to false claim)
False Claim: N95 masks block few, if any COVID-19 particles due to their size
User Text: “the average diameter of the virus particles is around 120 nm (.12𝜇m). Any mask including N95 masks can’t filter this particle size, to filter it completely would prevent the subject being able to breath. Therefore, the mask are just the new normal, a fashion statement.”

Label: 0 (Other)
False Claim: N95 masks block few, if any COVID-19 particles due to their size
User Text: “A #COVID19 particle is about 1 to 4 microns, an N95 will block 95% of tiny air particles, down to 0.3 microns, and surgical masks aren’t effective at blocking particles smaller than 100 microns. Hence surgical masks cannot stop the spread of #COVID19.”