SINAI at CheckThat! 2024: Stealthy Character-Level Adversarial Attacks Using Homoglyphs and Iterative Search

José Valle-Aguilera*,†, Alberto J. Gutiérrez-Megías†, Salud María Jiménez-Zafra†, L. Alfonso Ureña-López† and Eugenio Martínez-Cámara†

Computer Science Department, SINAI, CEATIC, Universidad de Jaén, Campus Las Lagunillas, 23071, Jaén, Spain

Abstract
This paper describes the participation of the SINAI research group in Task 6 of the CheckThat! Lab 2024 at CLEF. The task presents a language model adversarial attack scenario based on the BODEGA framework. Given this scenario, participants have to develop an Adversarial Attack Model Class, that is, a system that attacks a given text by modifying it so that the model is confused into making an incorrect prediction. We propose an adversarial attack experiment using iterative search algorithms. Specifically, we propose a stealthy character-level heuristic that replaces a letter (character) with one of its homoglyphs. Our system obtained 99% preservation of meaning in the manual evaluation phase of this task, making it the most stealthy proposal presented.

Keywords
Adversarial Attack, Homoglyph, Iterative Search, Stealth Attack

1. Introduction
With the rise of language models and their application to different tasks and tools, new needs have arisen regarding the security, robustness and reliability of these models. In a scenario where language models are used to classify and label given data, for example, spam detection, a malicious actor could modify the model's input data, which is known as an untargeted adversarial attack, or its internal architecture, which is called a targeted adversarial attack, in order to alter the behavior of the model stealthily, without the user noticing. These attacks reduce the trustworthiness of the model, since its performance is significantly degraded [1].
This paper presents the details and results of different types of adversarial attacks developed by the SINAI team for Task 6 of the CheckThat! lab [2] at CLEF 2024 [3][4]. The goal of this task is to develop an Adversarial Attack Model Class for attacking the input data of various language models trained for a binary classification task, changing the label of the input data via transformations. These transformations must be general attacks, not tailored to a single model, because three different victim models are involved, BERT [5], BiLSTM [6] and RoBERTa-base [7], the last one being unknown to the participants. In addition, these adversarial attacks need to be stealthy [8], that is, they have to maintain semantic and character similarity between the provided data and the result of the attack, called the adversarial example. The attacks should be carried out with as few transformations as necessary, keeping the text as intact as possible. This is measured using the BODEGA framework [9], a system that provides metrics for semantic and character similarity. The data for this task consists of five datasets on different classification tasks: Propaganda Detection (PR), Fact-Checking (FC), Covid-19 (C19), Style-based News Bias Assessment (HN) and Rumour Detection (RD).

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
jvalle@ujaen.es (J. Valle-Aguilera); agmegias@ujaen.es (A. J. Gutiérrez-Megías); sjzafra@ujaen.es (S. M. Jiménez-Zafra); laurena@ujaen.es (L. A. Ureña-López); emcamara@ujaen.es (E. Martínez-Cámara)
ORCID: 0000-0003-3274-8825 (S. M. Jiménez-Zafra); 0000-0001-7540-4059 (L. A. Ureña-López); 0000-0002-5279-8355 (E. Martínez-Cámara)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
For the development of our Adversarial Attack Class, we use different search algorithms together with character-based attacks, called heuristics, to determine which one (or which combination) offers better performance at reversing the label of the input data. We define character-based attacks as modifications, deletions or additions of a character in a given string. Among them, we define heuristics that replace a letter with its homoglyph and that delete simple characters such as punctuation marks or symbols. Additionally, to discover the tokens we need to attack in order to change the label, we have developed two search algorithms based on brute force. These search algorithms select each token iteratively and attack it with a specified heuristic; in our case, we select a homoglyph attack because it is the most stealthy to the human eye. Once we have the n most important tokens, we attack those strings within a search space, in this case using character-level homoglyph modifications. With this approach, we obtained 99% preservation of meaning in the manual evaluation phase of the competition, making ours the most stealthy proposal presented in this task.

The rest of the work is organized as follows: In Section 2 we describe the data used in this task. In Section 3 we define our solution, explaining all the contemplated proposals and the final attack used. We report the results of our proposal in Section 4. We conduct an error analysis in Section 5. Finally, the conclusion can be found in Section 6.

2. Datasets
The datasets provided by the BODEGA framework are the ones used for this task. The BODEGA framework comprises 5 datasets from different contexts:

• Fact-Checking (FC): The most advanced way in which human experts can verify the credibility of a given text, by assessing the veracity of the claims it includes against a knowledge base.
It deals with Natural Language Inference (NLI) in the field of encyclopedic knowledge and newsworthy events [10].
• Style-based news bias assessment (HN): Assessing news credibility is usually based on three factors: writing style, veracity and context. Training data includes statistical information to recognize sources with known credibility [11].
• Propaganda detection (PR): Detecting propaganda in a text does not necessarily imply that something is false; it also involves detecting the persuasion techniques used in this type of journalism. This corpus belongs to SemEval 2020 Task 11, scored by practitioners on 371 articles [12].
• Rumour detection (RD): A rumour is information that spreads among people even though it does not originate from a reliable source. Not all rumours have to be false; some can later be confirmed by legitimate sources [13].
• Covid-19 (C19): With the COVID-19 pandemic, the creation of pandemic conspiracy theories and disinformation became an increasingly critical problem. Conspiracy theories about COVID-19 find more support among the general public than misinformation about treatment and transmissibility [14].

Table 1 shows the distribution of these datasets, which have been used to train the victim models of the competition. There is variability in the amount of training data for each dataset, and we observe that datasets such as PR and RD are unbalanced.

3. Proposal description
The proposal presented for this task consists of different search algorithms, an attack heuristic and several constraints:

• Search Algorithms: Search systems that find the tokens of a target text to attack. These algorithms are based on brute force, with two variants: one stores information between attacks and the other does not.
Table 1
Distribution of the datasets: (Training) is the number of training instances used to train the victim models, (Attack) is the number of instances of each dataset that are attacked, and (Positive) is the percentage of positive labels in the dataset.

Dataset   Training   Attack   Positive %
HN          60235      400      50.00%
PR          12675      416      29.42%
FC         172763      405      51.27%
RD           8694      415      32.68%
C19          1130      595      40.26%

Figure 1: Types of brute force attacks implemented for this task.

• Heuristics: Attack algorithms that modify one or multiple elements of the input data. For this task, we have focused on homoglyph attacks.
• Constraints: Restrictions to take into account when attacking. These constraints can change or limit the attacks to make them less time-consuming or to improve their performance.

Each of these approaches is explained in detail below, and the proposal in which they are combined is presented at the end.

3.1. Search Algorithms
A search algorithm is defined as the method for selecting the tokens whose attack will change the predicted label. In our proposal, we base these search algorithms on brute force, trying every single token from the start of the text until the end. Two types of brute force attack are designed: one keeping "memory" of the previously attacked tokens, and the other without "memory". Figure 1 shows how the search algorithm and attack heuristic work together. In the case of the Non-Memory Search Algorithm, the system selects the first token of the text and applies the attack heuristic. Then, it predicts the label of the modified text, called the adversarial example. If the label remains the same, the attack was unsuccessful, and the search algorithm selects the next token of the unmodified text to attack, until the label changes to its goal or we reach the end of the text. If the label changes to its goal, the attack was successful and we may proceed with the next provided instance, starting again from the beginning.
If we reach the end of the text without changing the label, we have failed to create an adversarial example and move on to the next one. In the case of the Memory Search Algorithm, the process is similar to the previous one, with a single change: when an attack is unsuccessful, in the next iteration the search algorithm selects the next token of the text produced by the previous iteration, memorizing the changes made to the text. With this approach, we "store" the changes made in the adversarial example and attack more tokens with every query we make to the model. Now that the method for selecting a token to attack has been defined, it is time to establish the heuristics and attacks implemented in the Adversarial Attack class.

Figure 2: Types of heuristics implemented for this task.

3.2. Heuristic
We select a character-level technique as the type of perturbation used in the search space. This type of attack is more interesting, for both the model and the human eye, as the use of imperceptible characters or small modifications makes the perturbations more difficult to identify [8]. The heuristic implemented for this task is a homoglyph-based attack. It is a character-based modification of the input string that does not take into account the meaning or structure of the target tokens. For the development of our system, we focused on character-level attacks due to their stealthiness. The words to be attacked are selected by brute force. Modifications are made by sweeping the target text, attacking each word individually, and checking the success of the modification. For each selected word, a letter is randomly chosen to be replaced by one of the available homoglyphs, also randomly selected, since the particular choice of homoglyph does not change the final result. This heuristic has two variations, explained in Section 3.4. A homoglyph of a character is a symbol that resembles that character, usually a letter of the alphabet.
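The heuristic and the two search algorithms can be sketched together as follows. This is a minimal illustration, not the actual system: it uses a small hand-picked homoglyph table instead of the Homoglyphs library, whitespace tokenization, and a hypothetical predict function standing in for the victim model.

```python
import random

# Illustrative table of Unicode look-alikes (Cyrillic/Greek letters);
# the real system queries the Homoglyphs library instead.
HOMOGLYPHS = {
    "a": ["а", "ɑ"],
    "c": ["с", "ϲ"],
    "e": ["е"],
    "i": ["і", "ı"],
    "o": ["о", "ο"],
    "p": ["р"],
}

def homoglyph_attack(token):
    """Replace one randomly chosen letter of the token with a randomly
    chosen homoglyph; if no letter has a homoglyph in the table
    (e.g. 'm'), the token is returned unchanged."""
    positions = [i for i, ch in enumerate(token) if ch.lower() in HOMOGLYPHS]
    if not positions:
        return token
    i = random.choice(positions)
    glyph = random.choice(HOMOGLYPHS[token[i].lower()])
    return token[:i] + glyph + token[i + 1:]

def non_memory_search(text, predict, target_label):
    """Attack each token of the ORIGINAL text in turn; modifications
    are not accumulated between iterations."""
    tokens = text.split()
    for i in range(len(tokens)):
        candidate = tokens[:i] + [homoglyph_attack(tokens[i])] + tokens[i + 1:]
        adversarial = " ".join(candidate)
        if predict(adversarial) == target_label:
            return adversarial  # successful adversarial example
    return None                 # end of text reached: attack failed

def memory_search(text, predict, target_label):
    """Like the non-memory variant, but each modification is kept, so
    every new query to the model carries all previous perturbations."""
    tokens = text.split()
    for i in range(len(tokens)):
        tokens[i] = homoglyph_attack(tokens[i])
        adversarial = " ".join(tokens)
        if predict(adversarial) == target_label:
            return adversarial
    return None
```

In this sketch a successful non-memory attack differs from the original text in exactly one token, whereas the memory variant may differ in every token visited so far, which is why it is more powerful but more expensive.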
Such a homoglyph confuses the language model, which has not been trained on these symbols, causing it to fail and change the label of the input data. In addition, since the symbol resembles the alphabet character it replaces, the attack is very stealthy to the human eye. In order to obtain the homoglyphs of a letter, we use the Homoglyphs [15] Python library, which returns a list of homoglyphs for a given letter. For this approach, we randomly select a letter of the token, obtain the list of homoglyphs for this letter, and exchange it for a randomly selected homoglyph from the list. Another character-based heuristic is the use of invisible characters. Similar to homoglyphs, invisible characters are almost impossible to detect by the human eye and can reduce the model's precision. Finally, other character-based heuristics are the deletion, modification or addition of arbitrary characters. Normally these heuristics follow a rule-based algorithm or rely simply on random modifications. However, they are easier for the human eye to notice than homoglyph attacks. Additionally, homoglyph attacks do not worsen the character and semantic scores as much as other heuristics. Figure 2 shows examples of these attacks, with the homoglyph attack being the most stealthy to the human eye.

3.3. Constraints
We define small attack adjustments and restrictions that change the selection of tokens to attack or the heuristic to be used. Among the constraints, we find:

• Remove simple characters: When attacking a selected token, regardless of the attack heuristic applied, if the selected token has only one character, that token is removed from the text and marked as attacked. On several occasions, simple characters are key to determining a label, such as the beginning and end of a quote (" "). Finding a homoglyph is more complicated in these cases, as the character is not a letter.
It is therefore considered necessary to remove this type of character.
• Exception management in attacks: When an exception occurs during an attack, the system tries to recover, normally by selecting another letter or word to attack. In the case of a homoglyph attack, for example, the letter "m" does not have any available homoglyph; in this case, another letter is randomly selected until the attack finishes correctly.

All these constraints are taken into account in the development of our system and can easily be activated, deactivated or configured through the Adversarial Attack class variables.

3.4. Final System
Now that each approach has been defined, we can describe our final proposal. One of the most important issues limiting our system is execution time. Approaches such as using the Memory Search Algorithm on large texts are too time-consuming. That is why we decided to limit and adapt some attack proposals, sacrificing performance for more efficient approaches. For short texts (PR, C19 and FC datasets) we define our system as follows:

• Non-memory Search Algorithm: In the development phase, we detected that nearly 20% of the attacks were successful after modifying only one token of the text. In order to make the attack less time-consuming, a first brute force attack without memory is made, that is, each token is attacked without maintaining the changes made to the text.
• Memory Search Algorithm: If that attack is unsuccessful, we start again from the beginning of the text and initiate a brute force attack with memory, that is, maintaining the previous changes to the text until the attack succeeds or reaches the end of the text. In the latter case, the attack has been unsuccessful and we move on to the next text in the dataset.

For large texts (HN and RD), a different approach is taken in order to reduce the time and resources consumed. In this case, only a Non-memory brute force attack is made.

4.
Results
We perform the experiments with the test sets provided in the competition, using our two search algorithms over RoBERTa-base as the victim model. Considering the results of attacking this new model with a character-level technique, we conclude that we obtain a favorable result taking into account the semantic score and the Levenshtein distance (character score). This search space performs best on the PR2, FC and HN datasets, which have a higher hit rate than the other datasets when using a homoglyph-based search space.

Table 2
Final adversarial attack results using a homoglyph-based search space on the competition datasets. The victim model is RoBERTa-base, the model unknown to the participants in the test phase. PR2, FC and C19 are considered short-text datasets, and HN and RD long-text datasets, which determines the search algorithm used to attack each one.

Dataset   Success Score   Semantic Score   Character Score   BODEGA Score
PR2            0.3052           0.8943            0.9660           0.2645
FC             0.9160           0.8648            0.9660           0.7689
C19            0.1848           0.8242            0.9961           0.1702
HN             0.4300           0.8848            0.9994           0.3802
RD             0.1012           0.8642            0.9970           0.0870

In general, the search space using homoglyphs works best with short text entries, although the HN dataset has also performed well compared to the other long-text datasets; this may be because unbalanced datasets such as RD and PR2 are more difficult to attack. Due to the nature of long texts, an iterative search that does not store information cannot compromise the decision of the model on this type of dataset. In short texts, in many cases, by removing punctuation marks or modifying a single word we are able to change the label of the victim model in a short time, without the need to store information.
However, in large texts such strategies are not so successful, and using greedy techniques such as an iterative search that stores the information of these changes is computationally time-consuming, which in a real detection environment would be impractical. Our final result in the competition is the average BODEGA score of each attack performed on each dataset, using a different victim model for each one. This means that the final score is the average of 15 executions, our result being 0.3507 avg. BODEGA score. This result is low because, as shown in Table 2, we obtain very disparate results, especially on the long-text datasets compared to the short ones. On the other hand, besides the automatic evaluation, a manual evaluation of a subset of the data has been carried out. Each of the attacked texts was evaluated anonymously by two annotators, obtaining an average pairwise annotator agreement of 0.59 in Cohen's Kappa. The results are expressed as the percentage of the text's meaning that has been preserved. Our work achieved a 99% meaning score from the annotators, making ours the team with the highest score in the competition. We designed our attacks mostly for stealth, and therefore chose character-level attacks. Among these, we selected homoglyphs: as they are very small modifications of the text, they are almost imperceptible to humans and so do not change the meaning of the phrase. The results obtained in the manual evaluation highlight that, when only short texts are involved, or with the use of other more sophisticated techniques, character-level attacks, especially homoglyphs, can be difficult to detect in a real environment.

5. Error analysis
After analyzing the results, we found that perturbing a single word of the input can change the outcome of the model, but in specific cases, such as longer input texts, this may be less successful.
Success is also higher when the dataset is balanced, as is the case for FC and HN. Iterative search on long texts has not been as successful as expected, so using advanced explainability methods or employing word- or phrase-based techniques may be more effective in these cases. Many of the successful modifications made in short texts are due to the removal of punctuation marks. We found that search spaces based on brute-force techniques, without storing changes, can be fast but not as successful. Therefore, it is necessary to identify the tokens of importance to the victim model and attack these targets with attacks such as homoglyphs, as they prove to be sufficiently stealthy for both the model and humans, and powerful enough to flip the label. One of the most relevant problems is the processing of long texts using these heuristics, because of the large processing time. One possible solution could be to split the texts to be attacked, in order to obtain better performance with the memory-based methodology. Another possible solution would be to identify which words are most important to modify in these longer texts, in order to increase efficiency.

6. Conclusion
In this paper an Adversarial Attack Model Class is proposed, using iterative search algorithms with a homoglyph attack. It is worth noting the stealthiness of this attack and the difficulty of detecting it by an external agent, in addition to the fact that no advanced techniques are needed to attack widely used language models. Our proposal obtained 99% meaning preservation across all datasets, making it the most silent Attack Model presented in this task. Finally, future work could involve using explainability, as it allows computers to understand human language through the grammatical structure of the text and the meaning of individual words.
Such methods could be applied to long texts by splitting them into smaller ones and applying different heuristics, either the iterative ones mentioned above or a mixture of greedy iterative search and an explainability-based search space. Additionally, we could further refine these attacks to make them less time-consuming on large texts and try to apply explainability without penalties in time and resource consumption.

Acknowledgements
This paper has been partially supported by projects CONSENSO (PID2021-122263OB-C21), MODERATES (TED2021-130145B-I00), SocialTOX (PDC2022-133146-C21) and FedDAP (PID2020-116118GA-I00) funded by MCIN/AEI/10.13039/501100011033 and by the "European Union NextGenerationEU/PRTR". The research work conducted by Salud María Jiménez-Zafra has been supported by Action 7 from Universidad de Jaén under the Operational Plan for Research Support 2023-2024.

References
[1] N. Papernot, P. McDaniel, A. Sinha, M. P. Wellman, SoK: Security and privacy in machine learning, in: 2018 IEEE European Symposium on Security and Privacy (EuroS&P), 2018, pp. 399–414. doi:10.1109/EuroSP.2018.00035.
[2] P. Przybyła, B. Wu, A. Shvets, Y. Mu, K. C. Sheang, X. Song, H. Saggion, Overview of the CLEF-2024 CheckThat! lab task 6 on robustness of credibility assessment with adversarial examples (InCrediblAE), 2024.
[3] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[4] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R.
Suwaileh, Overview of the CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[6] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[8] N. Boucher, I. Shumailov, R. Anderson, N. Papernot, Bad characters: Imperceptible NLP attacks, in: 2022 IEEE Symposium on Security and Privacy (SP), IEEE, 2022, pp. 1987–2004.
[9] P. Przybyła, A. Shvets, H. Saggion, BODEGA: Benchmark for adversarial example generation in credibility assessment, arXiv preprint arXiv:2303.08032 (2023).
[10] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and VERification, in: NAACL-HLT, 2018.
[11] J. Kiesel, M. Mestre, R. Shukla, E. Vincent, D. Corney, P. Adineh, B. Stein, M. Potthast, Data for PAN at SemEval 2019 Task 4: Hyperpartisan news detection, November.
Type: dataset (2018).
[12] G. Da San Martino, A. Barrón-Cedeño, H. Wachsmuth, R. Petrov, P. Nakov, SemEval-2020 Task 11: Detection of propaganda techniques in news articles, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020, pp. 1377–1414.
[13] S. Han, J. Gao, F. Ciravegna, Augmented dataset of rumours and non-rumours for rumour detection, 2019. doi:10.5281/zenodo.3249977.
[14] A. M. Enders, J. E. Uscinski, C. Klofstad, J. Stoler, The different forms of COVID-19 misinformation and their consequences, The Harvard Kennedy School Misinformation Review (2020).
[15] Homoglyphs: get similar letters, convert to ASCII, detect possible languages and UTF-8 group, https://pypi.org/project/homoglyphs/#description, 2022. [Accessed 31-05-2024].