Palöri at CheckThat! 2024 Shared Task 6: GloTa - Combining GloVe Embeddings with RoBERTa for Adversarial Attack
Notebook for the CheckThat! Lab at CLEF 2024

Haokun He¹,†, Yafeng Song¹,*,† and Dylan Massey¹,†
¹ University of Zurich

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
haokun.he@uzh.ch (H. He); yafeng.song@uzh.ch (Y. Song); dylan.massey@uzh.ch (D. Massey)

Abstract
This paper describes the submission of attack methods and results for Shared Task 6 at the CheckThat! Lab at CLEF 2024. We present two novel attack methods to test the robustness of credibility assessment (CA) classifiers across five tasks: fact-checking, COVID-19 misinformation detection, propaganda detection, style-based news bias assessment, and rumor detection. The methods were evaluated using the BODEGA score, which accounts for the success of the attack while preserving the original text's meaning. Our GloTa method, which combines GloVe embeddings with RoBERTa-based substitutions, demonstrated superior effectiveness in most tasks compared to the baselines. Notably, GloTa achieved the highest BODEGA scores in propaganda detection and fact-checking, indicating significant vulnerability in these areas. However, the method showed comparable performance to the baselines in style-based news bias assessment and rumor detection, reflecting the inherent robustness of classifiers in these tasks. Against a more robust pre-trained RoBERTa classifier, GloTa still outperformed RoBERTa-ATTACK, although with generally lower success rates. These findings highlight the need for continuous improvement in adversarial attack techniques to enhance the robustness of CA systems against evolving threats.

Keywords
robustness, adversarial attack, BODEGA score, GloVe embeddings, credibility assessment, RoBERTa

1. Introduction

Credibility assessment (CA) can be understood as a family of tasks whose goal is to determine whether a given textual document adheres to constraints, such as factuality, or not [1]. Advances in NLP techniques and the increased availability of high-quality domain-specific data have made classifiers for CA viable for real-world deployment in contexts such as automated moderation of comments on online platforms.

However, recent studies [2, 3] indicate that text classifiers can be easily deceived through simple manipulations. For example, a user might circumvent a misinformation classifier by selectively replacing alphabetic characters with numbers. In the statement drinking water kills, this would lead to a perturbation such as drinking w4ter k1lls, which a classifier might not robustly handle, leading to misclassification. Such alterations, both simple and sophisticated, highlight that classifiers still lack the robustness needed to withstand attacks from users with potentially malicious intent.

The robustness of classifiers can be systematically assessed by automatically perturbing initially correctly classified input examples until the classifier's decision is altered. From an attacker's perspective, the goal is to develop an algorithm that generates an adversarial example for each text sequence, resulting in the opposite label compared to the original text sequence. If a decision can be changed (i.e., the classifier gets confused), the attack is considered successful. To ensure that perturbations remain human-readable and convey the original content, semantic and character-based distance metrics can be employed in systematic robustness assessments.
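As a concrete illustration of the simple character-level manipulations mentioned above (the drinking w4ter k1lls example), the following toy snippet shows how such a perturbation can be produced. It is purely illustrative and is not one of our submitted attack methods; the substitution table is an arbitrary choice.

```python
# Toy illustration of leet-style character substitution; not one of our submitted methods.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0"}

def leet_perturb(text: str) -> str:
    """Replace selected alphabetic characters with visually similar digits."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)

print(leet_perturb("drinking water kills"))  # -> dr1nk1ng w4t3r k1lls
```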
In the present working notes, we detail two attack methods used to assess the robustness of classifiers for the five tasks given by this shared task: fact checking (FC), COVID-19 misinformation detection (C19), propaganda detection (PR), style-based news bias assessment (HN), and rumor detection (RD). The detailed code for our methods is available at https://github.com/yafengsong/InCrediblAE-2024-GloTa. These tasks are interpreted as binary classification tasks aimed at determining whether a given piece of text is credible or not. We evaluate our attack methods on several classifier models, referred to as victim models, which differ in architecture but are applied to the same tasks for comparison. The evaluation metric employed is the BODEGA score, as proposed by Przybyła et al. [1], which measures the success rate of confusion under the constraint of meaning preservation.

2. Background

Credibility assessment (CA) is the high-level task of determining whether a given natural language expression is credible with regard to some aspect, e.g. veracity. Assessing the robustness of classifiers performing CA is vital, for otherwise users with malicious intent might easily bypass such classifiers in contexts such as automated content moderation, for example the screening of contributions to online forums.

We restrict our focus to five CA tasks. The first task, FC, is concerned with classifying whether a given natural language statement is true or false relative to some body of knowledge [1]. The fact checking classifiers we attack are based on data from Thorne et al. [4]. The other four tasks are similar to FC but differ by text type and, consequently, by the datasets they were trained and evaluated on. Their goal is to assess whether a given text constitutes misinformation [5], a rumor [6], propaganda [7] or fake news [8], respectively. A summary of the statistical information for these tasks is presented in Table 1.

Table 1
Detailed information about the task datasets used to attack the victim models. The numbers of texts refer to the data used for the BERT classifier; they may vary slightly for the other two models.

Task name                HN          PR                     FC              RD                C19
Domain                   News bias   Propaganda detection   Fact checking   Rumor detection   COVID-19 misinformation
Number of texts          400         416                    405             415               541
Average words per text   323         21                     47              147               43

Previous work on classifier attack strategies can be broadly categorized by the information available to an attacker (black-box, grey-box, white-box) and by perturbation granularity (sentence-level, word-level, character-level). In a white-box setting, an attacker has full visibility of the model's internals, including model weights [9]. In a black-box scenario, as understood in the context of this task, an attacker can only obtain the (binary) classification decision or the confidence scores. We address the grey-box version of the task as outlined in Przybyła et al. [1]. An attacker hence (1) can obtain the confidence scores from the model, (2) is provided information about the high-level architecture (though not the model parameters), and (3) has access to the training and development datasets as well as the evaluation method. Further, an attacker is free to query the model as many times as needed in order to confuse it.
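To make the grey-box setting concrete, the following minimal sketch shows the kind of victim-model interface an attacker works against: only class probabilities are exposed, and queries can be issued freely. The wrapper and the checkpoint name are illustrative assumptions; the shared task supplies its own trained victim models through the evaluation framework.

```python
# Minimal sketch of a grey-box victim interface: the attacker only observes confidence
# scores (class probabilities) for submitted texts. The model name is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class GreyBoxVictim:
    def __init__(self, model_name="some-org/credibility-classifier"):  # placeholder checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.num_queries = 0  # unlimited queries are allowed, but they can still be counted

    def get_prob(self, texts):
        """Return class probabilities for a batch of texts (all the attacker gets to see)."""
        self.num_queries += len(texts)
        enc = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            logits = self.model(**enc).logits
        return torch.softmax(logits, dim=-1).numpy()
```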
Przybyła et al. [1] introduce an attack effectiveness metric, called the BODEGA score, which is composed of the confusion success rate while also penalizing semantic and lexical (character-level) distance. They report BODEGA scores for a number of methods on the five aforementioned tasks (FC, C19, PR, HN, RD), covering three different model types: BERT [10], BiLSTM and RoBERTa [11]. The datasets used to train the victim models are openly available (cf. https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task6/incrediblAE_public_release).

The measure of evaluation, the BODEGA score, is computed as the product of one binary and two real-valued numbers:

    S_BODEGA = succ · sem_dist · chr_dist,    succ ∈ {0, 1},    sem_dist, chr_dist ∈ [0, 1].

The success variable (succ) takes the value 1 when confusion is achieved and 0 when not. For the other two values, the semantic distance (sem_dist) and the character-edit distance (chr_dist), a value of 1 indicates that similarity to the original is preserved, whereas a value closer to 0 signifies a higher divergence. The BODEGA scores of the individual adversarial generations are then mean-aggregated across test data points and tasks to generate a final score. Thus, a high BODEGA score against a classifier implies low robustness of that classifier and high effectiveness of the attack method. The BODEGA score can either count as successful only those attacks that are directed towards a potential attacker's goal, i.e. only changes from 1 → 0 (targeted), or treat both confusion directions as a success (untargeted). We consider only the untargeted scenario.
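To make the scoring concrete, here is a minimal Python sketch of a per-example BODEGA computation as described above. It is an illustration under simplifying assumptions rather than the official scorer: the semantic similarity is passed in as a precomputed value (BLEURT is used in the official evaluation), and the character score is approximated by a normalized Levenshtein similarity.

```python
# Minimal sketch of the BODEGA score described above (not the official implementation).
import Levenshtein  # pip install python-Levenshtein

def char_score(original: str, adversarial: str) -> float:
    """1 minus a normalized character-edit (Levenshtein) distance; an approximation."""
    dist = Levenshtein.distance(original, adversarial)
    return 1.0 - dist / max(len(original), len(adversarial), 1)

def bodega_score(orig_label: int, adv_label: int, sem_sim: float, chr_sim: float) -> float:
    """succ * sem_dist * chr_dist for one adversarial example (untargeted setting)."""
    succ = 1.0 if adv_label != orig_label else 0.0
    return succ * sem_sim * chr_sim

# Per-example scores are then mean-aggregated over the test set of each task.
```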
As the basis for our experiments, we follow two promising methods, the latter of which also serves as one of our baselines: first, Li et al. [9], who replace words using nearest-neighbor search, and second, Li et al. [12], who use BERT to detect potential replacements for each input instance. The method of Li et al. [12], BERT-ATTACK, consists of probing the victim model for words that have a high potential to change the classification confidence and then, in a subsequent step, looking for suitable replacement words for the most vulnerable words that still preserve the meaning. Their method outperforms previous methods and is shown to work relatively well independently of the specific classifier architecture or task (model-agnostic). The second baseline method is from Alzantot et al. [13], who use a genetic algorithm over multiple generations to generate adversarial samples that are maximally fit to confuse the classifier. As a framework for evaluation, we rely on the OpenAttack toolkit developed by Zeng et al. [14]. While our work focuses on the word level, some approaches have addressed perturbations on a more coarse- or fine-grained level [15], such as character switching [16] or paraphrasing [17].

Our contributions in the CLEF CheckThat! 2024 edition [18, 19] of the Shared Task on Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE) [20] can be summarized as follows:

• We introduce two novel methods to efficiently generate adversarial text samples for robustness assessment of CA classifiers.
• We outperform previous baselines in the majority of CA tasks, with our GloTa approach appearing as the most promising.

3. Methods

Motivated by previous work, we initially attempted to address this task using either rule-based algorithms or neural network-based methods. However, our experiments indicated that rule-based algorithms, such as randomly rearranging characters or replacing words using a preset synonym list, did not achieve satisfactory performance across the five test tasks. Consequently, we focused primarily on developing a new model-based method to solve this shared task.

3.1. Contextual Embedding with RoBERTa Attacker

Inspired by BERTAttacker [12], which provides a framework for automatically generating adversarial samples (an implementation can be found at https://github.com/thunlp/OpenAttack/blob/master/OpenAttack/attackers/bert_attack/__init__.py), we first adopted a similar approach. BERTAttacker calculates an importance score for each word and generates a candidate word list for substitution. However, we opted to use the RoBERTa model [11] instead of BERT, given RoBERTa's focus on masked word prediction with dynamic masking and its training on a larger dataset, which should enhance its semantic understanding.

We use the RoBERTa-base model to generate importance scores for each word by calculating the difference in output probability distributions between the original input sequence and the masked input sequence. Once the importance scores for each word are obtained, we rank the words in the sequence based on these scores. This ranking identifies the most vulnerable words, with the highest-ranking word being the most susceptible to an attack that could alter the classifier's output. Then, we iteratively substitute these words to execute the attack on the victim model.

For each substitution, we identify the word's position and extract its contextual embedding from the second-to-last layer of the RoBERTa model. We select k (in our case k = 36) other words from the masked sequence's predictions at that position and extract their contextual embeddings after substituting the original word in the RoBERTa model. We experimented with different values of k and determined that 36 offers the optimal balance between attack success rate and semantic preservation. By comparing the original word's contextual embedding with those of the k selected words, we retain only those with a similarity score above a preset threshold (0.3) as candidates for substitution.

After obtaining a list of candidate words for each position in the original sequence, we substitute each original word with its candidates and check whether the substitution fools the victim model. If successful, we stop and return the modified sequence. If none of the candidates succeeds, we retain the word that most reduces the confidence in the original label and repeat the process for the next position. This continues until either the attack is successful or all words in the input sequence have been processed. The whole process is shown in Algorithm 1.

Algorithm 1 Adversarial Attack using RoBERTa
Require: Original sequence X, victim model M, RoBERTa model R, number of candidates k, similarity threshold τ
Ensure: Modified sequence X′
 1: S ← CalculateImportanceScores(X, R)
 2: W ← RankWordsByImportance(S)
 3: for word wᵢ in W do
 4:     eᵢ ← ExtractEmbedding(wᵢ, R)
 5:     C ← GenerateCandidateWords(wᵢ, k, R)
 6:     E ← ExtractEmbeddings(C, R)
 7:     C′ ← {c ∈ C | Similarity(eᵢ, e_c) > τ}
 8:     for candidate c in C′ do
 9:         X′ ← SubstituteWord(X, wᵢ, c)
10:         if M(X′) ≠ M(X) then
11:             return X′
12:         end if
13:     end for
14:     wᵢ′ ← WordThatReducesConfidenceMost(C′, M)
15:     X ← SubstituteWord(X, wᵢ, wᵢ′)
16: end for
17: return X
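The candidate-generation step of Algorithm 1 (GenerateCandidateWords) can be illustrated with the Hugging Face transformers library, as in the hedged sketch below. This is not our exact implementation (see the linked repository); it only shows how the top-k masked predictions of roberta-base for a given position can be obtained, omitting the importance scoring and the embedding-similarity filter.

```python
# Illustrative sketch of obtaining top-k masked-prediction candidates with roberta-base.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

def candidate_words(tokens, position, k=36):
    """Mask the word at `position` and return RoBERTa's top-k predictions for that slot."""
    masked = tokens.copy()
    masked[position] = fill_mask.tokenizer.mask_token  # "<mask>" for RoBERTa
    predictions = fill_mask(" ".join(masked), top_k=k)
    return [p["token_str"].strip() for p in predictions]

# Example: candidates for "love" in "I love you"
print(candidate_words(["I", "love", "you"], position=1, k=10))
```

In our pipeline, such candidates are further filtered by the contextual-embedding similarity threshold τ described above.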
3.2. GloTa: Combining GloVe Embeddings with RoBERTa

GloTa, which stands for GloVe and RoBERTa, is a method that combines GloVe [21] embeddings and RoBERTa to enhance adversarial attack techniques. Applying the aforementioned method yielded a high success rate, but the semantic score was often low due to extensive substitution with RoBERTa-generated candidates. These substitutions do not necessarily preserve the original meaning and may even introduce opposite meanings. For example, RoBERTa-generated candidates for the word love in the sentence I love you might include miss, forgive, or hate.

To address this issue, we use GloVe embeddings to generate candidate lists for substituting vulnerable words in the input sequence. The candidate lists are generated using a process akin to the initial step of the Genetic algorithm [13]. We build a large synonym dictionary by computing the N nearest neighbors of each selected word based on distance in the GloVe embedding space (Common Crawl, 840B tokens, 2.2M vocabulary), using the vocabulary of the aclImdb dataset [22] of IMDB movie reviews to construct the synonym dictionary, thereby mitigating the semantic loss associated with candidates generated by masked language models. We nevertheless use the aclImdb dataset because it was employed in the original Genetic algorithm paper, allowing us to maintain consistency and compare our results with the original findings.

Once the synonym dictionary is constructed, we extract the most vulnerable word using the method in Algorithm 1, generate a list of the closest words in the GloVe embedding space as candidates, and then substitute the original word with candidates until the decision is flipped. If none of the words in the list achieves this, we proceed to the next vulnerable word in the sequence and repeat the process. However, because the input sequences span five different domains and may even include URLs and emojis, many words are absent from the constructed synonym dictionary. For these out-of-vocabulary words, we revert to using RoBERTa to generate candidate lists and then re-rank the substitution words by BLEURT [23] to prioritize semantically close words for substitution. The entire process is illustrated in Algorithm 2.

Algorithm 2 Adversarial Attack using RoBERTa and GloVe
Require: Original sequence X, victim model M, RoBERTa model R, GloVe embeddings G, synonym dictionary D, number of candidates k, similarity threshold τ
Ensure: Modified sequence X′
 1: S ← CalculateImportanceScores(X, R)
 2: W ← RankWordsByImportance(S)
 3: for word wᵢ in W do
 4:     if wᵢ ∈ D then
 5:         C ← D[wᵢ]
 6:     else
 7:         eᵢ ← ExtractEmbedding(wᵢ, R)
 8:         C ← GenerateCandidateWords(wᵢ, k, R)
 9:         E ← ExtractEmbeddings(C, R)
10:         C′ ← {c ∈ C | Similarity(eᵢ, e_c) > τ}
11:         C ← ReRankCandidatesByBLEURT(C′, X)
12:     end if
13:     for candidate c in C do
14:         X′ ← SubstituteWord(X, wᵢ, c)
15:         if M(X′) ≠ M(X) then
16:             return X′
17:         end if
18:     end for
19:     wᵢ′ ← WordThatReducesConfidenceMost(C, M)
20:     X ← SubstituteWord(X, wᵢ, wᵢ′)
21: end for
22: return X
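The synonym dictionary D used in Algorithm 2 can be precomputed in many ways; the following sketch shows one possibility under our own illustrative assumptions (cosine similarity as the distance criterion, N = 8 neighbors, and a naive loader for the glove.840B.300d.txt file). It is not the exact configuration of our submission.

```python
# Sketch of precomputing a GloVe-based synonym dictionary restricted to a given vocabulary.
import numpy as np

def load_glove(path):
    """Load GloVe vectors (e.g., glove.840B.300d.txt) into a word -> vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vals = parts[0], parts[1:]
            if len(vals) != 300:  # skip the rare malformed entries in the 840B file
                continue
            vectors[word] = np.asarray(vals, dtype=np.float32)
    return vectors

def build_synonym_dict(glove, vocab, n_neighbors=8):
    """For every in-vocabulary word, keep its N nearest neighbors by cosine similarity."""
    words = [w for w in vocab if w in glove]
    matrix = np.stack([glove[w] for w in words])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    synonyms = {}
    for i, w in enumerate(words):
        sims = matrix @ matrix[i]
        top = np.argsort(-sims)[1 : n_neighbors + 1]  # skip the word itself
        synonyms[w] = [words[j] for j in top]
    return synonyms
```

Restricting the dictionary to a fixed vocabulary keeps it small, which is why out-of-vocabulary words fall back to the RoBERTa-generated candidates in Algorithm 2.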
Additionally, we set two types of hyperparameters that can be tuned for different input datasets: (1) max_candidates and (2) max_substitutes together with max_sub_rate. The first is the number of candidate substitution words; a longer candidate list may increase the success rate but reduce semantic integrity. The second type comprises two hyperparameters: the number of substitutions made to the original sequence and the substitution rate relative to the original input sentence. Ideally, we could substitute every word, but excessive substitutions may significantly alter the semantic meaning. We therefore establish these thresholds to balance the trade-off between the semantic score and the success score: the attack on a sequence terminates as soon as either threshold is reached. In our experiments, we tested different parameter values and set the optimal parameters as follows: max_candidates to 30, max_substitutes to 80, and max_sub_rate to 0.5.

4. Results

We ran RoBERTa-ATTACK and our GloTa method, along with BERT-ATTACK and the Genetic algorithm as baselines, on five tasks to attack three victim models: the BERT classifier, the Bi-LSTM classifier, and the RoBERTa classifier. The last of these, the RoBERTa classifier, was introduced as a "surprise" model for this shared task. The results are summarized in Table 2. GloTa achieved the highest BODEGA scores in the Propaganda Detection (PR) and Fact Checking (FC) tasks. Analyzing the sub-scores of these tasks, the gains come primarily from the success scores compared to the baseline methods, indicating that our method is more effective at fooling the classifier in these tasks.

Table 2
Performance comparison of different attack methods on the BERT, Bi-LSTM, and RoBERTa classifiers. Evaluation measures include the BODEGA score (BO), success score (suc), semantic score (sem), and character score (chr). The best score in each task and scenario is in boldface. A dash marks method-classifier combinations that were not evaluated.

                          BERT Classifier          Bi-LSTM Classifier       RoBERTa Classifier
Task  Method              BO    suc   sem   chr    BO    suc   sem   chr    BO    suc   sem   chr
HN    BERT-ATTACK         0.60  0.96  0.64  0.97   0.64  0.98  0.66  0.99   -     -     -     -
      Genetic             0.40  0.86  0.47  0.98   0.44  0.94  0.48  0.98   -     -     -     -
      RoBERTa-ATTACK      0.58  0.95  0.62  0.98   0.63  0.99  0.64  0.99   0.34  0.57  0.60  0.98
      GloTa               0.59  0.96  0.62  0.98   0.63  0.99  0.64  0.99   0.44  0.77  0.59  0.97
PR    BERT-ATTACK         0.43  0.70  0.68  0.90   0.53  0.80  0.72  0.91   -     -     -     -
      Genetic             0.50  0.84  0.65  0.89   0.54  0.88  0.67  0.89   -     -     -     -
      RoBERTa-ATTACK      0.53  0.95  0.62  0.88   0.58  0.97  0.65  0.89   0.25  0.54  0.54  0.82
      GloTa               0.56  0.97  0.64  0.88   0.60  0.98  0.68  0.90   0.45  0.95  0.56  0.81
FC    BERT-ATTACK         0.53  0.77  0.73  0.95   0.60  0.86  0.73  0.95   -     -     -     -
      Genetic             0.52  0.79  0.70  0.95   0.61  0.90  0.71  0.95   -     -     -     -
      RoBERTa-ATTACK      0.62  1.00  0.64  0.96   0.67  0.99  0.70  0.97   0.67  0.99  0.69  0.97
      GloTa               0.62  0.98  0.66  0.96   0.69  1.00  0.71  0.97   0.67  0.99  0.70  0.96
RD    BERT-ATTACK         0.18  0.44  0.43  0.96   0.29  0.79  0.41  0.89   -     -     -     -
      Genetic             0.20  0.46  0.45  0.96   0.32  0.71  0.47  0.96   -     -     -     -
      RoBERTa-ATTACK      0.20  0.52  0.42  0.89   0.31  0.76  0.44  0.92   0.20  0.47  0.44  0.93
      GloTa               0.19  0.45  0.44  0.94   0.31  0.71  0.46  0.95   0.21  0.52  0.44  0.92
C19   RoBERTa-ATTACK      0.53  0.99  0.57  0.92   0.53  1.00  0.57  0.92   0.47  0.99  0.51  0.89
      GloTa               0.52  0.96  0.58  0.93   0.53  1.00  0.57  0.92   0.45  0.96  0.51  0.89

For the Style-based News Bias Assessment (HN) and Rumor Detection (RD) tasks, our GloTa method achieved BODEGA scores similar to the baselines. In the HN task, the baseline BERT-ATTACK already achieved a high success score, limiting the potential for additional gains; at the same time, GloTa's semantic score did not outperform the baseline, resulting in a comparable BODEGA score. In the RD task, the victim models' robustness led to low success scores across all methods, limiting GloTa's effectiveness in this context. The COVID-19 Misinformation Detection (C19) task is based on a new dataset released for this shared task, so no prior baselines are available for comparison. Both GloTa and RoBERTa-ATTACK achieved a relatively high success score, indicating this task's susceptibility.
However, the semantic score was not high due to the presence of non-word tokens such as URLs, hashtags, and emojis from Twitter data, making it challenging to find semantically similar alternative words. The newly introduced classifier in this shared task, the RoBERTa classifier, is purported to be more robust. We employed both RoBERTa-ATTACK and GloTa methods to attack this model, although we did not have baseline results for comparison. GloTa significantly outperformed RoBERTa-ATTACK in the HN and PR tasks, while achieving comparable results in the FC and RD tasks. These outcomes from the RoBERTa classifier align with the comparative performance between GloTa and RoBERTa-ATTACK observed in the BERT and Bi-LSTM classifier results. However, the performance gap between them has widened, suggesting that GloTa is more effective at attacking more robust classifiers. Nonetheless, RoBERTa-ATTACK marginally outperformed GloTa in the C19 task, primarily due to its higher success score. Furthermore, when compared to all the BODEGA scores in BERT and Bi-LSTM classifiers, only the FC task exhibited a higher BODEGA score in the RoBERTa classifier, while the other tasks showed lower scores due to decreased success score or semantic score. This indicates that the RoBERTa classifier is indeed more robust. 5. Discussion and Future Work Given that the BODEGA score consists of three components—success score, semantic score, and character score—the evaluation of adversarial attacks can be analyzed within these divisions. As shown in Table 2, the character scores consistently remain high across different tasks. Therefore, our discussion is focused on improving the success score and semantic score. 5.1. Success Score Our GloTa method achieves near-perfect success scores across all three classifiers, with the exception of the RD task in these three classifiers and the HN task in the RoBERTa classifier. We attribute this high success rate primarily to our "greedy" attack method, which identifies vulnerable words and replaces them sequentially until the classification is altered. While it is possible to replace all words in the input sentences, this approach would significantly slow the process and lower the semantic score, given that GloTa creates word-level replacements without considering the context. Therefore, we introduced hyperparameters to control the number of substitutions in the original sequences, balancing the trade-off between success and semantic scores. Additionally, as shown in Table 1, we observed that HN and RD tasks involve longer texts compared to other tasks. This can lead to a more robust trained victim model, which requires either replacing more words to alter the result or failing to attack when the thresholds set by hyperparameters are reached. Future research could investigate allowing varying numbers of substitutions in low-success tasks, such as RD, to determine if increasing the success rate can enhance the overall BODEGA score. 5.2. Semantic Score Our initial goal of combining BERT-ATTACK and Genetic was due to their superior performance in the experiment conducted by Przybyła et al. [1]. We found that using a masked language model to replace words could not maintain a relatively high semantic score, and character-level replacement resulted in a low success rate. Therefore, we employed GloVe embeddings at word level to improve the semantic score in the BODEGA calculation while maintaining a high success rate. 
However, results indicate that the semantic score of our GloTa method showed only minor improvement compared to using words from masked language models. For some tasks, such as FC, the semantic score decreased despite an increase in the success score, due to more words being replaced in the original sequences. This may be because the synonym dictionary was constructed from the aclImdb dataset, which differs from the domains of the test tasks (HN, PR, FC, RD, and C19). As a result, many out-of-vocabulary words were found in the original sequences, which required using RoBERTa to generate alternative words. This likely explains the similarity in semantic scores between RoBERTa-ATTACK and GloTa. In addition, our system achieved an average BODEGA score of 0.4776 across all tasks and classifiers with an average semantic score of 0.5867, ranking 4th out of 6 teams. However, in human evaluations, our system managed to preserve the meaning in only 14% of the attack samples, also ranking 4th out of 6 teams. The details of the ranking and human evaluation methodology are explained in the overview paper of this shared task by Barrón-Cedeño et al. [18]. The human evaluation result is notably lower compared to the semantic score achieved by our system. This discrepancy is likely due to several factors. First, in the automatic evaluation, the BODEGA score utilizes BLEURT to calculate the semantic score, where substituting a single word with its close synonym often results in a high score, whereas human evaluation considers the entire context. Second, as previously mentioned, the synonym dictionary did not adequately cover the domain of the test tasks, leading to candidate words that were not semantically similar enough to the original words. Third, word-level replacements do not consider the full context, leading to decreased sentence fluency and greater deviation from the original meanings as more replacements are performed. However, we found that using GloVe embeddings performed better in human evaluation compared to other teams that used only masked language models for word substitution. This is likely because, as noted in Section 3.2, the words generated by the masked language model do not necessarily preserve the original meaning and may even introduce opposite meanings. To further improve semantic scores, constructing the synonym dictionary using text data from the five test tasks could reduce the occurrence of out-of-vocabulary words, thereby enhancing the semantic score. Additionally, methods such as DeepWordBug [24], which performs character-level modifications, could be explored to enhance the semantic score in both automatic and human evaluations after identifying vulnerable words by the masked language model. Furthermore, our current methodology overlooks the impact of emojis, which were ignored despite their importance in conveying emotional information. Future research should incorporate emoji embeddings to enhance semantic understanding and model performance. 6. Conclusion In this shared task, we explored the robustness of classifiers for credibility assessment (CA) tasks by developing and evaluating two attack methods: RoBERTa-ATTACK and GloTa. The GloTa method, which combines GloVe embeddings and RoBERTa to enhance adversarial attack capabilities, demonstrated superior effectiveness, achieving the highest BODEGA scores in propaganda detection (PR) and fact- checking (FC) tasks. This indicates a significant vulnerability in these areas. 
However, its performance was on par with baselines in style-based news bias assessment (HN) and rumor detection (RD), reflecting the inherent robustness of classifiers in these tasks. When tested against a more robust RoBERTa classifier, GloTa outperformed RoBERTa-ATTACK, although with generally lower success rates, underscoring the enhanced robustness of the RoBERTa classifier. These findings highlight the trade-off between achieving high attack success rates and maintaining semantic integrity. Our study enhances the understanding of CA classifier robustness and demonstrates that using a masked language model to identify vulnerable words and replace them with similar word embeddings in the original texts can be an effective method for adversarial attacks. Acknowledgments Special thanks to all the organizers of the CheckThat! Lab at CLEF 2024. We would also like to extend our gratitude to Andrianos Michail, Simon Clematide and the Department of Computational Linguistics at the University of Zurich for valuable suggestions and continuous support for this shared task. References [1] P. Przybyła, A. V. Shvets, H. Saggion, Verifying the robustness of automatic credibility assessment, 2023. URL: https://api.semanticscholar.org/CorpusID:257505431. [2] B. Liang, H. Li, M. Su, P. Bian, X. Li, W. Shi, Deep text classification can be fooled, arXiv preprint arXiv:1704.08006 (2017). URL: https://arxiv.org/abs/1704.08006. [3] Z. Kong, J. Xue, Y. Wang, L. Huang, Z. Niu, F. Li, A survey on adversarial attack in the age of artificial intelligence, Wireless Communications and Mobile Computing 2021 (2021) 1–22. doi:10.1155/2021/4907754. [4] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The fact extraction and VERification (FEVER) shared task, in: J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopou- los, A. Mittal (Eds.), Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1–9. URL: https://aclanthology.org/W18-5501. doi:10.18653/v1/W18-5501. [5] Y. Jiang, X. Song, C. Scarton, I. Singh, A. Aker, K. Bontcheva, Categorising fine-to-coarse grained misinformation: An empirical study of the COVID-19 infodemic, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 556–567. URL: https: //aclanthology.org/2023.ranlp-1.61. [6] S. Han, J. Gao, F. Ciravegna, Neural language model based training data augmentation for weakly supervised early rumor detection, in: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ACM, Vancouver British Columbia Canada, 2019, pp. 105–112. URL: https://dl.acm.org/doi/10.1145/3341161.3342892. doi:10.1145/3341161. 3342892. [7] G. Da San Martino, A. Barrón-Cedeño, H. Wachsmuth, R. Petrov, P. Nakov, SemEval-2020 task 11: Detection of propaganda techniques in news articles, in: A. Herbelot, X. Zhu, A. Palmer, N. Schnei- der, J. May, E. Shutova (Eds.), Proceedings of the Fourteenth Workshop on Semantic Evaluation, International Committee for Computational Linguistics, Barcelona (online), 2020, pp. 1377–1414. URL: https://aclanthology.org/2020.semeval-1.186. doi:10.18653/v1/2020.semeval-1.186. [8] M. Potthast, J. Kiesel, K. Reinartz, J. Bevendorff, B. Stein, A stylometric inquiry into hyperpartisan and fake news, in: I. Gurevych, Y. 
Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 231–240. URL: https://aclanthology.org/P18-1022. doi:10.18653/v1/P18-1022. [9] J. Li, S. Ji, T. Du, B. Li, T. Wang, Textbugger: Generating adversarial text against real-world appli- cations, ArXiv abs/1812.05271 (2018). URL: https://api.semanticscholar.org/CorpusID:54815878. [10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transform- ers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423. [11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, ArXiv abs/1907.11692 (2019). URL: https://api.semanticscholar.org/CorpusID:198953378. [12] L. Li, R. Ma, Q. Guo, X. Xue, X. Qiu, BERT-ATTACK: Adversarial attack against BERT using BERT, ArXiv abs/2004.09984 (2020). URL: https://api.semanticscholar.org/CorpusID:216036179. [13] M. F. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. B. Srivastava, K.-W. Chang, Generating natural language adversarial examples, ArXiv abs/1804.07998 (2018). URL: https://api.semanticscholar. org/CorpusID:5076191. [14] G. Zeng, F. Qi, Q. Zhou, T. Zhang, Z. Ma, B. Hou, Y. Zang, Z. Liu, M. Sun, OpenAttack: An open-source textual adversarial attack toolkit, in: H. Ji, J. C. Park, R. Xia (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 363–371. URL: https://aclanthology.org/2021.acl-demo. 43. doi:10.18653/v1/2021.acl-demo.43. [15] X. Han, Y. Zhang, W. Wang, B. Wang, Text Adversarial Attacks and Defenses: Issues, Taxonomy, and Perspectives, Security and Communication Networks 2022 (2022) 6458488. URL: https://doi. org/10.1155/2022/6458488. doi:10.1155/2022/6458488, publisher: Hindawi. [16] J. Gao, J. Lanchantin, M. L. Soffa, Y. Qi, Black-box generation of adversarial text sequences to evade deep learning classifiers, in: 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 50–56. doi:10.1109/SPW.2018.00016. [17] M. Iyyer, J. Wieting, K. Gimpel, L. Zettlemoyer, Adversarial example generation with syntactically controlled paraphrase networks, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Lin- guistics, New Orleans, Louisiana, 2018, pp. 1875–1885. URL: https://aclanthology.org/N18-1170. doi:10.18653/v1/N18-1170. [18] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab: Check-Worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. 
García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024. [19] G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024. [20] P. Przybyła, B. Wu, A. Shvets, Y. Mu, K. C. Sheang, X. Song, H. Saggion, Overview of the CLEF- 2024 CheckThat! lab task 6 on robustness of credibility assessment with adversarial examples (InCrediblAE), in: [19], 2024. [21] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL: http: //www.aclweb.org/anthology/D14-1162. [22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp. 142–150. URL: http://www.aclweb.org/anthology/P11-1015. [23] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust metrics for text generation, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–7892. URL: https://aclanthology.org/2020.acl-main.704. doi:10.18653/v1/2020.acl-main.704. [24] J. Gao, J. Lanchantin, M. L. Soffa, Y. Qi, Black-box generation of adversarial text sequences to evade deep learning classifiers, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 50–56. URL: http://arxiv.org/abs/1801.04354.