TextTrojaners at CheckThat! 2024: Robustness of Credibility Assessment with Adversarial Examples through BeamAttack⋆
Notebook for the CheckThat! Lab at CLEF 2024
David Guzman Piedrahita1,*,†, Arnisa Fazla1,† and Lucas Krauter1,†
1 University of Zurich, Rämistrasse 71, CH-8006 Zürich, Switzerland

Abstract
This paper describes the submission of TextTrojaners for the CheckThat! 2024 lab task 6: Robustness of Credibility Assessment with Adversarial Examples [1]. We achieved the second-best results on the BODEGA evaluation metric. Our approach, BeamAttack, is a novel algorithm for generating adversarial examples in natural language processing through the application of beam search. To further improve the search process, we integrate a semantic filter that prioritizes examples with the highest semantic similarity to the original sample, enabling early termination of the search. Additionally, we leverage a model interpretability technique, LIME, to determine the priority of word replacements, alongside existing methods that determine word importance through the model’s logits. Our approach also allows for skipping and removing words, enabling the discovery of minimal modifications that flip the label. Furthermore, we utilize a masked language model to predict contextually plausible alternatives to the words to be replaced, enhancing the coherence of the generated adversarial examples. BeamAttack demonstrates state-of-the-art performance, outperforming existing methods with scores of up to 0.90 on the BiLSTM, 0.84 on BERT, and 0.82 on the RoBERTa classifier.1

Keywords
Model Robustness, Adversarial Attack, Beam search, Masked Language Models

1. Introduction
Social media platforms increasingly rely on machine learning algorithms for content filtering to identify misleading, harmful, or illegal content. Despite advancements, these models remain vulnerable to adversarial attacks, wherein the input text is manipulated to deceive the model, even after undergoing adversarial training [2]. As adversaries continue to develop new attack techniques, assessing the robustness of text classification approaches becomes imperative.
We participated in “Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE)” at the ‘CheckThat! lab’ during CLEF 2024, which was designed to assess the robustness of popular text classification methods used in credibility assessment tasks [1]. The task involves conducting adversarial attacks on multiple text classification systems trained on various credibility assessment datasets. The goal is to modify the input text in a way that flips the classification of the models while minimizing semantic and syntactic modifications. These attacks are categorized as grey-box attacks, where we have access to the output logits of the victim models, and we can query the victim models as many times as we want.
Previous research on grey-box adversarial attacks on text classification systems has explored modifications at the character [3], word [4, 5, 6], and sentence levels of the input text. These methods encompassed a range of techniques, including random character and word replacements [7, 8, 9, 10, 11, 5], substitution with similar words [12, 13, 6], and controlled sentence paraphrasing to generate semanti-
1 Code is available at https://github.com/LucK1Y/BeamAttack
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
$ david.guzmanpiedrahita@uzh.ch (D. G. Piedrahita); arnisa.fazla@uzh.ch (A. Fazla); lucassteffen.krauter@uzh.ch (L. Krauter) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings cally similar text [14]. Our submission follows the line of research in the field that focuses on sequential word replacement using Masked Language Models (MLMs) combined with a search algorithm. We introduce BeamAttack, our word-level adversarial attack and submission to the shared task. At its core, BeamAttack employs a beam search algorithm to identify optimal word substitutions, thereby enabling the generation of multiple alternative adversarial text hypotheses capable of flipping the model’s classification. Leveraging techniques from existing literature and innovative approaches, BeamAttack integrates a Masked Language Model (MLM), specifically RoBERTa [15], to produce contextually appropriate word substitutions. BeamAttack demonstrated superior perfor- mance compared to all baseline methods across all datasets, securing the second place in the competition. Our contributions include the following: • We employ a beam search algorithm to explore combinations of multiple word substitutions, optimizing for the output logits of classifier models. This approach significantly expands the search space compared to the greedy search commonly used in the literature and allows for the generation of multiple alternative adversarial samples in a single run. While beam search has been utilized in white-box attacks by Ebrahimi et al. [16], to our knowledge, it has not been applied in a grey-box scenario. • In line with existing methodologies [6], we generate contextual word substitution alternatives using a Masked Language Model. Although Li et al. [6] utilized a BERT model [17], we employ a RoBERTa model [15] to enhance the effectiveness of our approach. • We adapt the beam search algorithm to allow for the option of not modifying certain words or removing them instead of substituting them at each step. This modification enables a more flexible and comprehensive exploration of possible word replacement combinations. • In addition to the logit-based word importance ranking method commonly used in the literature [6], we experiment with identifying vulnerable words in the input text using the explainable AI framework LIME [18], offering a novel perspective on word vulnerability in adversarial text generation. In the following sections, we will present the details of our approach addressing the challenge of robustness in credibility assessment. Section 2 provides an overview of related work in the field of NLP adversarial attacks, highlighting the fundamental principles of various attack methods and their applications. Section 3 describes the specific task we participate in, namely “Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE)”, and outlines the datasets and victim models used in our experiments. In Section 4, we present our algorithm and methodology for generating adversarial examples. The results of our experiments are presented in Section 5, where we evaluate the effectiveness of our algorithm in attacking the victim models. In Section 6, we conduct an ablation study to analyze the impact of hyper-parameters on our algorithm’s performance. 
Finally, we conclude our paper in Section 7 and outline limitations and potential avenues for future work in Section 8.

2. Related Work
Despite their powerful capabilities, neural networks suffer from a lack of interpretability, making it difficult to predict their behavior, which can raise security concerns. Szegedy et al. [19] highlighted that while neural networks exhibit impressive generalization capabilities, they are also vulnerable to adversarial examples. An adversarial sample is generated from the original sample by applying the smallest possible change that leads to a misclassification by the target model, known as the victim classifier. The goal is to alter the sample just enough to trick the classifier into making an incorrect prediction. A formal definition for adversarial samples or examples can be found in [2]. While much of the existing research focuses on adversarial attacks in image classification [20], similar vulnerabilities exist in text classification tasks [21, 22, 4].

2.1. Categories of Adversarial Attacks
Adversarial attacks in NLP can be classified into three categories: white-box attacks, black-box attacks, and grey-box attacks. White-box attacks are also called gradient-based attacks, since the adversary (the attacker) has access to the weights and thus the gradients of the victim model [23, 24, 16, 5]. Black-box attacks are the more realistic scenario, where the adversary only has access to the prediction of the victim for any given input sample. Most algorithms assume an infinite number of queries to find the best perturbation [25, 14], or make use of surrogate models [26]. Przybyła et al. [27] propose a grey-box scenario, where the adversary has access not only to the predictions, but also to the probabilities given to each class by the victim. This scenario enables adversaries to optimize their attacks based on the output probability distributions [28, 29, 6, 13, 30, 31, 32, 8].
We can also distinguish between targeted and untargeted attacks. In a targeted attack, the attacker aims to mislead the model into a specific incorrect class or label. In an untargeted attack, the attacker aims to mislead the model into any incorrect class or label, without a specific target in mind. The shared task at hand, InCrediblAE, focuses solely on untargeted attacks.

2.2. Granularity of Modifications in Adversarial Attacks
Adversarial attacks in NLP differ fundamentally from those in computer vision due to the symbolic and discrete nature of text versus the continuous nature of images. This discreteness complicates the creation of adversarial examples in NLP, as modifications must preserve the grammatical and semantic integrity of the text while still deceiving the model. Consequently, NLP adversarial attacks often focus on specific granularities of modification, such as characters, words, subwords, or entire sentences.
Character-level methods involve changing individual characters in the adversarial text [5, 16, 8, 3]. Sentence-level methods modify longer passages of a text, including paraphrasing entire sentences in a constrained manner to create adversarial samples [14]. Word-level methods modify the original text by removing, adding, or substituting words, often combining these techniques to create adversarial examples [10, 11]. This process typically involves identifying vulnerable words in the input text and then modifying these selected words to generate adversarial samples [4, 33, 28, 29, 6, 13, 23].
Most of the recent work in adversarial text is based on search algorithms that first rank the words in an input sequence by their importance scores and then search the space of possible replacements for each word according to this pre-computed ranking. This common approach requires a method for calculating word importance scores to determine the optimal order for word replacement, a strategy for selecting suitable word replacements, and a search strategy for iterative word replacement to guide the process of finding the most effective substitutions. The replacement strategies involve replacing selected words with out-of-vocabulary words [7, 8, 9]; introducing common typos [10, 11, 5]; replacing words with their synonyms or using rule-based systems like dictionaries or WordNet [12]; leveraging embedding spaces to identify semantically similar words [5]; and using masked language models (MLM) to find a word that fits the given context [13, 6]. The methods for calculating the word importance scores involve grey-box methods utilizing the change in the output probabilities of the victim model, when some words are masked or removed [4, 5, 6] and black-box attacks such as calculating the gradient of a surrogate model’s output regarding the input words [26]. For example, BERT-Attack [6] iteratively masks each word, measuring the resulting change in prediction, and thereby assessing the degree to which the outcome depends on the deleted word. One of the most common search strategies is greedy search where each word is replaced sequentially in the order of the word importance [33, 4, 26, 6]. For example, BERT-Attack tries replacing each word with alternatives iteratively until it either manages to flip the classification or moves on to the next word to be replaced. This is a relatively query-efficient strategy, however the search space is severely limited, and it is highly dependent on the performance of the word importance scoring method. Other search strategies with a larger search space include evolutionary algorithms [28, 29]; and beam search [16]. For example, HotFlip [16], a white-box adversarial attack for text classification, uses a beam search-based approach that leverages gradient information to guide the search process. 3. Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE) The CheckThat! lab at CLEF 2024 [34] introduced a novel task designed to assess the robustness of popular text classification approaches. The task 6 [1] simulates the content filtering processes commonly employed in social media, spanning multiple domains. For each domain, we were provided with three pre-trained classifiers and a corresponding attack dataset. Our objective was to craft adversarial examples for each sample in the dataset, which would flip the binary classification outcome while preserving the original meaning. 3.1. Dataset Description The basic statistics of the provided datasets are summarized in Table 1. Each dataset is divided into three subsets: training, development (dev), and attack. The training and dev subsets are used to train the classifier, while the attack subset serves as the evaluation dataset, where the effectiveness of our attack is measured. Each dataset is designed for a binary classification task, and therefore, the positive rate corresponds to the portion of samples belonging to class 1, and is therefore also a measure of class imbalance. 
Notably, we observed significant variations in the length of samples across the datasets, measured in terms of the number of characters, which are also reported in the table.

Table 1
Datasets used in BODEGA, described by the task, number of instances in training, attack and development subsets, an overall percentage of positive (non-credible) class, and maximum length of samples in characters.
Task Training Attack Dev Positive Max Sample Length
RD 8,683 415 2,070 34.24% 38,350
FC 172,763 405 19,010 51.27% 6,983
HN 60,234 400 3,600 50.00% 142,492
PR2 11,546 407 3,186 71.03% 11,626
C19 1,130 595 - 34.96% 349

Next, we briefly outline the different text domains. All except C19 are already published and described in the CheckThat! lab at CLEF 2023 [27]:
• Style-based news bias assessment (HN): Categorizing news articles as either credible or non-credible based on stylistic cues.
• Propaganda detection (PR): Text passages from news articles that employ propaganda techniques to influence readers.
• Fact checking (FC): Evaluating the accuracy of news articles by considering contextual information from given related Wikipedia snippets.
• Rumor detection (RD): Identifying Twitter threads that disseminate information without a reliable source.
• COVID-19 misinformation detection (C19): Comprises social media messages that convey either factual information or misinformation about the COVID-19 pandemic.
The classifier must rely on subtle cues, such as writing styles reminiscent of those found in high-quality news sources (HN), to make its assessments.

3.2. Target Classifier Models
For each domain, the organizers provided three classifiers, each trained in a binary classification setup on the training split of the corresponding dataset:
• Fine-tuned BERT
• Trained BiLSTM
• Fine-tuned RoBERTa (Surprise classifier): Trained to improve robustness, only available in the test phase.

Figure 1: Diagram depicting the series of actions in our BeamAttack algorithm, where beam size 𝑘 = 2, branching factor 𝑏 = 6, hypothesis count ℎ = 2.

3.3. Evaluation
The evaluation process is designed to assess the effectiveness of generated adversarial examples in altering the victim’s decision while preserving the original meaning. Driven by the formulation of the shared task itself, our approach employs an untargeted attack strategy (see Section 2.1). This means that, for every sample, we aim to alter the victim’s decision by flipping it to the opposite of its original prediction.
To quantify the success of an attack, InCrediblAE uses the BODEGA score [27], a comprehensive metric that evaluates the adversarial examples from multiple angles. The BODEGA score is calculated by multiplying three components: the confusion score, which indicates whether the victim classifier’s decision has changed and the attack was successful; the semantic score, which measures the similarity between the original and adversarial examples using BLEURT; and the character score, which converts the Levenshtein distance into a similarity score. All three scores range from 0 to 1.
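To make the evaluation criterion concrete, the following Python sketch computes a BODEGA-style score for a single example. It is an illustration rather than the official BODEGA implementation: the semantic component is left as a placeholder callable semantic_score (BODEGA derives it from BLEURT), and the character component is approximated here as a Levenshtein similarity normalized by the longer string; the exact normalization used by BODEGA may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def bodega_like_score(original: str, adversarial: str,
                      original_pred: int, adversarial_pred: int,
                      semantic_score) -> float:
    """Product of confusion, semantic, and character components, each in [0, 1].

    `semantic_score(original, adversarial)` is a placeholder assumed to return
    a value in [0, 1]; BODEGA itself obtains this component from BLEURT.
    """
    confusion = 1.0 if adversarial_pred != original_pred else 0.0
    character = 1.0 - levenshtein(original, adversarial) / max(len(original), len(adversarial), 1)
    return confusion * semantic_score(original, adversarial) * character
```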
4. Methodology
Our approach focuses on word-level attacks, a well-regarded technique in recent literature for its balance between edit distance and semantic similarity. We aim to improve upon this by evaluating and addressing the shortcomings of previous techniques.1 We redefine previous word-level attacks, such as HotFlip [16] and BERT-Attack [6], as search problems. In this perspective, the total search space is determined by the number of words to be replaced and the number of valid replacements for each word. This redefinition highlights the potential of search algorithms in optimizing word replacements.

4.1. BeamAttack
BeamAttack combines three components: a method for calculating a ranking of words to determine their replacement order, a word replacement strategy, and the beam search algorithm to explore the search space of possible word replacement combinations. The algorithm overview is given in Figure 1.

4.1.1. Beam Search
Most existing techniques employ greedy search algorithms, selecting the first word(s) that effectively alters the model’s prediction. This greedy strategy can lead to sub-optimal adversarial samples, requiring more word changes than necessary. Our approach adopts beam search as a replacement for greedy search, leveraging its efficacy as a decoding strategy for language models.
Beam search explores multiple potential solutions concurrently, structured in a tree-like fashion, where each node represents a partial solution and branches signify possible extensions. At each step, beam search maintains a predefined number of the most promising partial solutions, known as the beam size 𝑘. This parameter balances exploration and computational efficiency: larger 𝑘 values increase the likelihood of finding optimal or near-optimal solutions but elevate computational complexity, while smaller 𝑘 values reduce demands at the cost of potentially overlooking superior solutions.
In our approach, beam search operates by iteratively expanding the most promising nodes until a final solution is reached. The root node represents the input text, and subsequent nodes denote candidate adversarial texts. At each depth, one word is replaced, with branching occurring when alternative words are considered for replacement. To manage computational complexity, we decouple the number of alternatives evaluated at each depth (branching factor 𝑏) from the number of branches retained after pruning. At each depth beyond the first, 𝑘 × 𝑏 nodes are evaluated: 𝑏 candidate replacements for each of the 𝑘 nodes retained from the previous depth after pruning. Selection of promising nodes is based on their effectiveness in reducing the predicted probability of the original class while increasing the probability of the target class. This objective guides branch pruning during the search process.
The search continues until cumulative changes are sufficient to alter the model’s classification for a designated number of hypotheses. To preserve semantic similarity to the original sample, we introduce the hypothesis count ℎ: the search extends until at least ℎ branches flip the model classification. The final adversarial sample is chosen as the node with the highest semantic similarity to the original sample, as measured by BLEURT scores [35], among the selected hypotheses.
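The following sketch illustrates the search loop just described. It is a simplified illustration, not the exact submitted implementation: rank_words (word importance ordering), get_candidates (up to 𝑏 alternatives for a position, including keeping or deleting the word), predict_proba (the victim’s class probabilities), and similarity (a BLEURT-style scorer) are placeholder callables, with the first two detailed in the following subsections; a binary classification task is assumed.

```python
def beam_attack(words, gold_label, predict_proba, rank_words, get_candidates,
                similarity, k=10, b=30, h=10):
    """Beam search over word-level edits; returns the successful adversarial
    candidate that is most similar to the original text, or None."""
    original = list(words)
    order = rank_words(original)          # positions, most important first
    beam = [original]                     # k most promising states so far
    successes = []                        # states that already flip the label

    for position in order:
        # Expand every retained state with up to b alternatives for this
        # position (keeping the word and deleting it count as alternatives).
        expansions = []
        for state in beam:
            for candidate in get_candidates(state, position, b):
                new_state = state.copy()
                new_state[position] = candidate if candidate is not None else ""
                expansions.append(new_state)

        # Prune to the k states that most reduce the probability of the
        # original (gold) class, pushing the victim towards a flip.
        scored = [(predict_proba(" ".join(s))[gold_label], s) for s in expansions]
        scored.sort(key=lambda t: t[0])
        beam = [s for _, s in scored[:k]]

        # Binary task: the prediction flips once the gold-class probability
        # drops below 0.5. Stop as soon as h hypotheses have flipped it.
        successes.extend(s for p, s in scored[:k] if p < 0.5)
        if len(successes) >= h:
            break

    if not successes:
        return None
    # Semantic filter: keep the hypothesis closest to the original sample
    # (the paper uses BLEURT for this similarity).
    return max(successes, key=lambda s: similarity(" ".join(original), " ".join(s)))
```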
4.1.2. Word Replacement Strategy
For each individual replacement, we enable the option to either keep the word intact or remove it altogether at any depth of the beam search. This added flexibility allows us to preserve the original word if replacements have no impact on the predicted probability, thereby keeping the edit distance low and the semantic similarity high. Conversely, removing the word entirely may be the best strategy to confuse the model in more challenging cases.
To maximize text fluency, we use a masked language model to suggest replacements for any given word. This approach leverages the model’s pretraining to ensure that the replacements are likely to be coherent with the rest of the text. We primarily use RoBERTa-Large for this purpose.
1 The pseudocode for the algorithm can be found in Appendix A (Algorithm 1).
For each depth of beam search, we evaluate the top 𝑏 (branching factor) highest-likelihood word replacements as suggested by the masked language model. This strategy ensures that the potential replacements not only fit well within the context of the sentence but also maintain the overall semantic coherence of the text.

4.1.3. Word Importance Scoring Method
To determine the order in which words should be replaced, we use two different techniques to identify the most important words for the model’s decision: LIME [18] and logit-based importance scores, as proposed by Li et al. [6].
LIME is an explainability algorithm that provides local explanations for model predictions. It works by perturbing the input data and observing the changes in the output. Specifically, LIME generates a set of perturbed versions of the sentence in which subsets of words are removed. It then fits a simple, interpretable model to these perturbed instances to estimate the importance of each word based on how much the model’s output changes.
Logit-based importance scores, on the other hand, measure the influence of each word on the model’s output logits directly. Let 𝑆 = [𝑤0, . . . , 𝑤𝑖, . . .] denote the input sentence, and 𝑜𝑦(𝑆) represent the logit output by the target model for the correct label 𝑦. The importance score 𝐼𝑤𝑖 for each word 𝑤𝑖 is defined as the difference in logits when the word is masked: 𝐼𝑤𝑖 = 𝑜𝑦(𝑆) − 𝑜𝑦(𝑆 ∖ 𝑤𝑖), where 𝑆 ∖ 𝑤𝑖 is the sentence with 𝑤𝑖 replaced by a [MASK] token. This score quantifies how much the presence of 𝑤𝑖 contributes to the model’s prediction.
Our approach is specifically designed to maximize the risk of incorrect model predictions by exploiting the vulnerabilities in the model’s decision-making process. We achieve this by identifying the most susceptible words in the input text and strategically replacing them with semantically consistent alternatives. This targeted approach enables us to effectively manipulate the model’s decision while preserving the overall coherence of the text. Moreover, by focusing on the most critical words, we increase the likelihood of achieving our goal with a minimal number of word replacements, thereby maintaining the semantic integrity of the original text.
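The two components above, contextual replacement generation and logit-based word importance, can be sketched with standard Hugging Face and Python tooling as follows. This is an illustrative sketch under stated assumptions, not the exact implementation used in the submission: victim_logit is a placeholder for the victim’s logit (or probability) for the gold label, while the fill-mask pipeline with roberta-large stands in for the RoBERTa-Large model mentioned above.

```python
from transformers import pipeline

# Masked language model used to propose contextual replacements.
fill_mask = pipeline("fill-mask", model="roberta-large")


def top_replacements(words, position, b):
    """Top-b contextual substitutes for the word at `position`,
    as proposed by the masked language model."""
    masked = words.copy()
    masked[position] = fill_mask.tokenizer.mask_token   # "<mask>" for RoBERTa
    suggestions = fill_mask(" ".join(masked), top_k=b)
    return [s["token_str"].strip() for s in suggestions]


def logit_importance(words, victim_logit):
    """Logit-based word importance: I(w_i) = o_y(S) - o_y(S with w_i masked),
    following Li et al. (BERT-Attack)."""
    base = victim_logit(" ".join(words))
    scores = []
    for i in range(len(words)):
        masked = words.copy()
        masked[i] = "[MASK]"
        scores.append(base - victim_logit(" ".join(masked)))
    # Positions sorted by decreasing importance, i.e. the replacement order.
    return sorted(range(len(words)), key=lambda i: scores[i], reverse=True)
```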
4.2. Hyper-parameters
For each dataset and victim model combination, we determined a different set of hyperparameters for our BeamAttack approach. To achieve this, we employed an informed random search on subsets of 10-50 samples for each task and victim model. Our search strategy was guided by intuition and initial observations, allowing us to effectively fine-tune the hyperparameters. We considered the following ranges for the hyperparameter values:
1. Beam size 𝑘: We tested values from 10 to 100, increasing in steps of 5 (i.e., 10, 15, 20, ...). We increased the beam size until the improvement in the BODEGA score became negligible (less than 1 point). Figure 2 illustrates how this approach led to convergence.
2. Branching factor 𝑏: We tested a range of values similar to the chosen 𝑘 values, within a range of -20 to +20 around the selected 𝑘 values.
3. Hypothesis count ℎ: Based on initial experiments, we decided to test hypothesis counts of 5, 10, and 20. This decision was made considering the difficulty of the task at hand and the BODEGA scores obtained during the hyper-parameter search. We observed that further increasing ℎ did not significantly improve the scores, as explained in Section 6.4.
4. Word importance scoring method: We compared the results of both word importance scoring methods, LIME and logit-based, in the initial experiments of each dataset and victim model combination.
We report the selected parameters for each scenario in Table 2. As mentioned, we employed RoBERTa-large2 to replace the masks, with one exception: for the Covid-19 (C19) task on the BERT classifier, we utilized the "vinai/bertweet-large"3 model instead. This decision was motivated by the intuition that, given the Twitter domain of the COVID-19 dataset, this model might perform better.

Table 2
Hyper-parameters for each combination of victim model and dataset.
Victim Task beam size 𝑘 hypothesis count ℎ branching factor 𝑏 importances
BiLSTM PR2 25 30 50 logit-based
BiLSTM FC 10 10 30 logit-based
BiLSTM RD 10 10 30 logit-based
BiLSTM HN 20 10 20 logit-based
BiLSTM C19 15 10 30 logit-based
BERT PR2 60 10 80 logit-based
BERT FC 10 10 30 LIME
BERT RD 10 10 10 logit-based
BERT HN 20 5 20 LIME
BERT C19 10 10 30 logit-based
RoBERTa PR2 60 10 80 logit-based
RoBERTa FC 60 10 80 logit-based
RoBERTa RD 20 20 40 logit-based
RoBERTa HN 10 10 10 LIME
RoBERTa C19 40 10 60 logit-based

4.3. Computational Resources
This section outlines the computational resources utilized and the hyper-parameter tuning strategies employed to optimize our BeamAttack approach, which demanded substantial computational power. To overcome the limitations imposed by computational and time constraints, we resorted to a random search strategy to determine the optimal parameter set for each scenario, recognizing that this approach may not necessarily result in the identification of the globally optimal combination.
We primarily used Kaggle’s free GPU infrastructure, which offers NVIDIA Tesla P100 and T4 GPUs.4 We report the running time statistics for the final runs of each model and dataset combination in Table 3.

5. Results
The outcomes of this shared task and a comparison with those reported by Przybyła et al. [27] are presented in Tables 4, 5, and 6. Notably, we do not report scores for the Covid (C19) classification subtask besides our own, as no official published scores are available. The same applies to the RoBERTa victim in Table 6.
Our algorithm outperforms all other algorithms on BERT and BiLSTM victims, demonstrating the superior effectiveness of BeamAttack. While other algorithms may generate adversarial samples with better character or semantic scores in specific tasks, they fail to achieve this consistently across the entire dataset. Moreover, our confusion scores are always superior to those of other techniques.
Nevertheless, our algorithm needs many victim queries to achieve this goal. For example, in the task RD on the BERT classifier (see Table 4), our BeamAttack takes roughly double the number of queries
2 FacebookAI/roberta-large model from Hugging Face: https://huggingface.co/FacebookAI/roberta-large
3 vinai/bertweet-large model from Hugging Face: https://huggingface.co/vinai/bertweet-large
4 For the submission phase, we used an NVIDIA Tesla T4 GPU.
Table 3
Running time statistics (in seconds) for each combination of victim model and dataset, measured on the entire dataset.
Victim Task Time per Example (s) Total Attack Time (s) Total Evaluation Time (s) BiLSTM Pr2 1.513 629.279 15.843 BiLSTM FC 2.111 854.859 54.206 BiLSTM RD 13.248 5497.992 560.019 BiLSTM HN 9.578 3831.309 615.956 BiLSTM C19 2.26 1344.562 95.838 BERT PR2 12.586 5235.587 15.986 BERT FC 7.988 3235.195 51.033 BERT RD 70.347 29193.831 488.945 BERT HN 86.049 34419.532 660.498 BERT C19 12.453 7409.627 103.194 RoBERTa PR2 72.192 30031.747 20.433 RoBERTa FC 3.467 1403.967 44.0466 RoBERTa RD 72.192 30031.747 20.433 RoBERTa HN 144.188 57675.011 544.444 RoBERTa C19 54.731 32564.714 117.173 compared to the Genetic algorithm. This highlights a trade-off between the quality of the generated adversarial samples and the computational resources required to achieve them. In summary, our BeamAttack algorithm has shown exceptional capability in producing high-quality adversarial samples that are highly effective in deceiving victims, thereby surpassing the performance of other algorithms when tested on BERT and BiLSTM victims. 6. Ablation Experiments We conducted a small-scale ablation study to investigate the impact of various hyper-parameters on the performance of our BeamAttack algorithm, shedding some light on the relationship between these parameters and the algorithm’s effectiveness. Specifically, we investigate the impact of logit-based versus LIME, beam size 𝑘, branching factor 𝑏 and the hypothesis count ℎ on the algorithm’s performance, exploring each of these parameters in the context of specific scenarios of task and victim model. The full results of the ablation studies can be found in the Appendix B. 6.1. Word Importance Besides the logit-based approach from BERT-Attack [6], we utilized LIME, a state-of-the-art inter- pretability framework, to identify crucial words in a sample that influence the victim’s decision-making process. By modifying these words, we increased the victim’s susceptibility to our adversarial attack. We report some comparisons of LIME in the appendix in Table 7. From this table, we can gather multiple insights. For instance, for the BERT victim the choice of importance method depends on the task dataset, with RD favoring the logit-based method and FC preferring LIME. For the BiLSTM, the logit-based method is superior on all datasets. For RoBERTa, which is architecturally similar to BERT, the choice of importance method also depends on the dataset, with FC favoring LIME. Furthermore, we observed that for the RoBERTa victim and C19 task, the choice between LIME and logit-based depends on other hyper-parameters such as beam size, branching factors, and hypothesis count. More specifically, our results suggest transformer-based classifiers, namely classifiers BERT and RoBERTa are generally more vulnerable to LIME over logit-based for the HN and FC datasets. Both these datasets classify whether a statement contains facts, with HN using subtle clues in writing styles and FC using external knowledge from Wikipedia. We leave the investigation behind this observation Table 4 Results of adversarial attacks on the BERT classifier in five misinformation detection tasks in an untargeted scenario, including previously reported results by Przybyła et al. [27] and our findings. Evaluation measures include BODEGA score (B.), confusion score (con), semantic score (sem), character score (char) and number of queries to the attacked model (Q.). The best score in each task is in boldface. Task Method B. con sem char Q. 
HN BAE 0.34 0.60 0.58 0.96 606.83 BERT-ATTACK 0.60 0.96 0.64 0.97 648.41 DeepWordBug 0.22 0.29 0.78 1.00 395.94 Genetic 0.40 0.86 0.47 0.98 2713.80 SememePSO 0.16 0.34 0.50 0.99 341.70 PWWS 0.38 0.82 0.47 0.98 2070.78 SCPN 0.00 0.92 0.08 0.02 11.84 TextFooler 0.39 0.92 0.44 0.94 660.52 BeamAttack (Ours) 0.84 1.00 0.86 0.97 4327.67 PR BAE 0.11 0.18 0.69 0.94 33.96 BERT-ATTACK 0.43 0.70 0.68 0.90 80.16 DeepWordBug 0.28 0.36 0.79 0.96 27.43 Genetic 0.50 0.84 0.65 0.89 962.40 SememePSO 0.41 0.68 0.66 0.90 96.17 PWWS 0.47 0.75 0.68 0.91 131.92 SCPN 0.09 0.47 0.36 0.46 11.47 TextFooler 0.43 0.77 0.64 0.87 57.94 BeamAttack (Ours) 0.69 0.98 0.77 0.89 1373.41 FC BAE 0.34 0.51 0.70 0.96 80.69 BERT-ATTACK 0.53 0.77 0.73 0.95 146.73 DeepWordBug 0.44 0.53 0.84 0.98 54.32 Genetic 0.52 0.79 0.70 0.95 1215.19 SememePSO 0.44 0.64 0.71 0.96 148.20 PWWS 0.48 0.69 0.72 0.96 225.27 SCPN 0.09 0.90 0.29 0.31 11.90 TextFooler 0.46 0.70 0.70 0.93 106.13 BeamAttack (Ours) 0.79 1.00 0.83 0.96 1390.83 RD BAE 0.07 0.18 0.41 0.98 313.01 BERT-ATTACK 0.18 0.44 0.43 0.96 774.31 DeepWordBug 0.16 0.23 0.70 0.99 232.74 Genetic 0.20 0.46 0.45 0.96 4425.11 SememePSO 0.10 0.21 0.46 0.97 345.89 PWWS 0.16 0.38 0.45 0.95 1105.99 SCPN 0.01 0.38 0.16 0.10 11.35 TextFooler 0.16 0.41 0.43 0.91 657.15 BeamAttack (Ours) 0.59 0.80 0.80 0.91 10618.93 C19 BeamAttack (Ours) 0.71 0.98 0.78 0.92 2628.90 to future work. Our results suggest that exploring explainable AI frameworks for adversarial attacks is a promising direction of research. However, our experiments show that LIME only improved upon the logit-based approach on a few datasets. Additionally, LIME requires more queries, and when the improvement was marginal, we opted for the logit-based approach due to resource constraints. Table 5 Results of adversarial attacks on the BiLSTM classifier in five misinformation detection tasks in an untargeted scenario, including previously reported results by Przybyła et al. [27] and our findings. Evaluation measures include BODEGA score (B.), confusion score (con), semantic score (sem), character score (char) and number of queries to the attacked model (Q.). The best score in each task is in boldface. Task Method B. Con Sem Char Q. 
HN BAE 0.48 0.77 0.64 0.98 489.27 BERT-ATTACK 0.64 0.98 0.66 0.99 487.85 DeepWordBug 0.41 0.53 0.77 1.00 396.18 Genetic 0.44 0.94 0.48 0.98 2029.31 SememePSO 0.21 0.42 0.50 0.99 313.51 PWWS 0.44 0.93 0.48 0.99 2044.96 SCPN 0.00 0.94 0.08 0.02 11.86 TextFooler 0.43 0.94 0.47 0.97 543.68 BeamAttack (Ours) 0.90 1.00 0.91 0.99 936.96 PR BAE 0.15 0.23 0.72 0.94 32.94 BERT-ATTACK 0.53 0.80 0.72 0.91 61.41 DeepWordBug 0.29 0.38 0.79 0.96 27.45 Genetic 0.54 0.88 0.67 0.89 782.15 SememePSO 0.47 0.76 0.68 0.89 85.34 PWWS 0.53 0.84 0.69 0.90 130.85 SCPN 0.12 0.55 0.39 0.50 11.55 TextFooler 0.51 0.85 0.67 0.88 52.59 BeamAttack (Ours) 0.70 0.97 0.80 0.90 593.38 FC BAE 0.36 0.55 0.69 0.96 77.76 BERT-ATTACK 0.60 0.86 0.73 0.95 132.80 DeepWordBug 0.48 0.58 0.85 0.98 54.36 Genetic 0.61 0.90 0.71 0.95 840.99 SememePSO 0.53 0.76 0.72 0.96 112.84 PWWS 0.57 0.82 0.73 0.96 221.60 SCPN 0.08 0.75 0.29 0.32 11.75 TextFooler 0.55 0.82 0.71 0.94 98.31 BeamAttack (Ours) 0.76 1.00 0.81 0.94 1549.12 RD BAE 0.09 0.21 0.43 0.98 312.77 BERT-ATTACK 0.29 0.79 0.41 0.89 985.52 DeepWordBug 0.16 0.24 0.68 0.99 232.75 Genetic 0.32 0.71 0.47 0.96 3150.24 SememePSO 0.15 0.31 0.48 0.97 314.63 PWWS 0.29 0.64 0.47 0.97 1059.07 SCPN 0.01 0.55 0.17 0.09 11.53 TextFooler 0.24 0.64 0.41 0.87 639.97 BeamAttack (Ours) 0.83 1.00 0.87 0.96 3831.72 C19 BeamAttack (Ours) 0.72 0.99 0.78 0.92 837.35 6.2. Beam Size 𝑘 We also conducted an experiment to investigate the effect of the beam size 𝑘 on our algorithm’s performance. Specifically, we experimented on a subset of 30 samples from the PR2 dataset using the RoBERTa victim model. The results are presented in Figure 2, with the exact scores provided in Appendix Table 8. Our analysis reveals a consistent improvement in all scores as the beam size increases. However, we encountered a significant limitation. As shown in the table, a beam size of 100 requires 16,500 queries per sample, which is impractical for limited GPU setup. Therefore, we sought an optimal Table 6 Results of our adversarial attack on the RoBERTa classifier in five misinformation detection tasks in an untargeted scenario. Evaluation measures include BODEGA score (B.), confusion score (con), semantic score (sem), character score (char) and number of queries to the attacked model (Q.). Task Method B. Con Sem Char Q. HN BeamAttack (Ours) 0.67 1.00 0.72 0.91 4596.62 PR BeamAttack (Ours) 0.45 0.97 0.55 0.79 10286.82 FC BeamAttack (Ours) 0.82 1.00 0.843 0.97 498.93 RD BeamAttack (Ours) 0.54 0.87 0.69 0.83 15458.12 C19 BeamAttack (Ours) 0.66 1.00 0.72 0.91 6491.39 trade-off where the scores would reach a plateau. In the case of PR2 and RoBERTa, we determined that a beam size of 60 strikes a reasonable balance between performance and computational feasibility. Figure 2: Comparison of the scores across different settings for the beam size 6.3. branching factor 𝑏 We further explored the impact of the branching factor 𝑏 on our algorithm’s performance, using a fixed beam size of 40. We conducted an experiment on a subset of 30 samples from the PR2 dataset, employing the RoBERTa victim model. The results are illustrated in Figure 3, which displays the scores against varying branching factors. The exact scores can be found in Appendix Table 9. Our analysis reveals that increasing the branching factor 𝑏 leads to a slight improvement in the BODEGA score. 6.4. 
Hypothesis Count ℎ
Our algorithm’s flexibility in replacing, skipping, or removing words, combined with the use of a masked language model, ensures that the resulting sentence remains semantically correct. The beam search approach enables us to efficiently query for multiple adversarial samples and use the one that is most semantically close to our original sentence.
We conducted an ablation study on the hypothesis count ℎ, which defines how many adversarial samples to collect before selecting the closest. We explored different scenarios, incrementing the parameter from 10 to 20, but observed only minimal effects, resulting in a negligible improvement or deterioration of less than 1% in the BODEGA score. We provide a small-scale report of our findings for the RoBERTa victim and various tasks in the appendix (Table 10). Notably, in some cases, it came at the cost of requiring more queries. Therefore, we opted to maintain a constant value of 10 for the hypothesis count in most cases. We acknowledge that a more comprehensive investigation of this parameter could have potentially yielded further improvements.

Figure 3: Comparison of the scores across different settings for the branching factor

6.5. LIME Parameters
The LIME explainability method itself presents several hyper-parameters that can be optimized. Although tuning these parameters was not our primary focus, we recognized their potential impact on our approach. Due to time constraints, we initially chose to keep the parameters fixed, but we later realized that adjusting them could have a significant effect on our algorithm’s performance. The complete documentation of parameters can be found here.5
Specifically, we found that by reducing the number of samples used by LIME, we could substantially decrease the number of victim model queries required without compromising our algorithm’s performance. Notably, by setting the number of samples to 500, a significant reduction of 4500 queries per explanation was achieved, compared to LIME’s default setting of 5000. This adjustment may come at the cost of reduced performance for longer samples, highlighting the need for further exploration of these parameters in future work.
In future studies, researchers could explore matching the explainability model better to the victim model. This could involve trying out different kernel sizes or entirely different kernels. We think this adjustment could improve the efficiency of generating an explanation.
5 Code documentation of LimeTextExplainer: https://lime-ml.readthedocs.io/en/latest/lime.html#lime.lime_text.LimeTextExplainer
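As an illustration of the query-saving adjustment described above, the sketch below configures LIME’s text explainer with a reduced sampling budget. It is a hedged example rather than the exact setup used in the experiments: the class names and the victim_predict_proba function (mapping a list of texts to class probabilities) are assumed placeholders, while num_features and num_samples are genuine arguments of LimeTextExplainer.explain_instance.

```python
from lime.lime_text import LimeTextExplainer

# Assumed binary label names for illustration; the victim's own labels apply.
explainer = LimeTextExplainer(class_names=["credible", "non-credible"])


def lime_importance(text, victim_predict_proba, num_words=20):
    """Rank words by their LIME weight for the victim's prediction.

    `victim_predict_proba` must map a list of strings to an array of class
    probabilities (shape: n_samples x n_classes). Setting num_samples=500
    instead of the default 5000 cuts the number of victim queries per
    explanation by roughly 4500, as discussed above.
    """
    explanation = explainer.explain_instance(
        text,
        victim_predict_proba,
        num_features=num_words,
        num_samples=500,
    )
    # as_list() returns (word, weight) pairs; sort by absolute influence.
    return sorted(explanation.as_list(), key=lambda x: abs(x[1]), reverse=True)
```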
7. Conclusion
In this shared task, we introduced BeamAttack, an innovative adversarial text generation method that combines multiple techniques from the literature with novel approaches. At the core of BeamAttack lies the beam search algorithm, which serves as a fundamental tool in optimizing adversarial text generation.
Our method leverages a Masked Language Model (MLM), specifically RoBERTa, to produce contextually appropriate word substitutions. Notably, RoBERTa’s ability to generate alternative words with typos and controversial connotations significantly enhances BeamAttack’s effectiveness.
Additionally, we integrate LIME, an interpretability method, to calculate the contribution of each word to the model’s classification. Our experiments demonstrate significant improvements in attack effectiveness, particularly in specific task contexts.
The key feature of BeamAttack is its utilization of beam search, which enables the identification of optimal word-level modifications. Unlike previous methods relying on simplistic greedy search approaches, BeamAttack’s beam search strategy allows for more nuanced modification decisions. Moreover, by incorporating the capability to skip or remove words, inspired by prior work, we enhance the algorithm’s flexibility in exploring potential solutions.
Furthermore, BeamAttack’s beam search produces multiple alternative adversarial text hypotheses capable of flipping the model’s classification. By selecting the hypothesis with the highest semantic similarity to the input text, our method effectively filters out dissimilar or contradictory hypotheses.
In summary, BeamAttack’s efficacy is significantly attributed to its beam search component, which plays a crucial role in optimizing adversarial text generation. The flexibility provided by tunable hyper-parameters allows users to prioritize semantic or character similarity based on their specific requirements.

8. Limitations
Our word replacement strategy in BeamAttack is currently limited to substituting and removing words. Removing a word can sometimes result in syntactically incorrect and nonsensical sentences, especially since RoBERTa is not utilized for this operation. This decreases the coherence of the context after the change. To enhance the quality of our outputs, we recommend exploring the integration of word insertion techniques into the beam search process. This approach could potentially improve the syntactic and semantic integrity of the sentences.
The BeamAttack approach strives to maintain semantic similarity to the original sample via semantic filtering. However, there could be a benefit to scoring the beams at each step of the beam search, rather than solely relying on selecting the hypothesis with the highest semantic score.
Moreover, while RoBERTa provides context-based alternatives, these alternatives solely depend on the context rather than the word to be replaced itself. This limitation may impact the performance of the beam search, potentially resulting in alternatives that are opposite in meaning to the word to be replaced. To address this, potential solutions include prompting Large Language Models (LLMs) to produce word candidates at each step of the beam search or filtering out produced alternatives based on similarity to the word to be replaced. However, these approaches may introduce trade-offs and require careful consideration to maintain performance.
In our experiments, we limited ourselves to the use of the Explainable AI method LIME. However, we suggest that future research should explore the potential of diverse Explainable AI methods to uncover vulnerabilities in the victim’s decision-making process. Our experience with LIME has demonstrated the value of these approaches, but we believe that other Explainable AI frameworks could provide even more insightful explanations. By leveraging these techniques, researchers may be able to identify more effective strategies for generating adversarial examples, ultimately leading to more robust and reliable machine learning models.

Acknowledgments
We would like to express our gratitude to the Department of Computational Linguistics at the University of Zurich (UZH) for the valuable advice and guidance they offered throughout our research. Additionally, we appreciate their efforts in providing us with access to a GPU during the evaluation phase of our project.
For more information about the department, please visit their website at https://www.cl.uzh.ch/en.html. References [1] P. Przybyła, B. Wu, A. Shvets, Y. Mu, K. C. Sheang, X. Song, H. Saggion, Overview of the CLEF- 2024 CheckThat! lab task 6 on robustness of credibility assessment with adversarial examples (incrediblae), in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024. [2] W. E. Zhang, Q. Z. Sheng, A. Alhazmi, C. Li, Adversarial attacks on deep-learning models in natural language processing: A survey, ACM Trans. Intell. Syst. Technol. 11 (2020). URL: https://doi.org/10.1145/3374217. doi:10.1145/3374217. [3] B. Formento, C. S. Foo, L. A. Tuan, S. K. Ng, Using punctuation as an adversarial attack on deep learning-based NLP systems: An empirical study, in: A. Vlachos, I. Augenstein (Eds.), Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 1–34. URL: https://aclanthology.org/2023.findings-eacl.1. doi:10.18653/v1/2023.findings-eacl.1. [4] D. Jin, Z. Jin, J. T. Zhou, P. Szolovits, Is bert really robust? natural language attack on text classification and entailment, 2019. URL: https://api.semanticscholar.org/CorpusID:198967870. [5] J. Li, S. Ji, T. Du, B. Li, T. Wang, Textbugger: Generating adversarial text against real-world appli- cations, ArXiv abs/1812.05271 (2018). URL: https://api.semanticscholar.org/CorpusID:54815878. [6] L. Li, R. Ma, Q. Guo, X. Xue, X. Qiu, BERT-ATTACK: Adversarial attack against BERT using BERT, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6193–6202. URL: https://aclanthology.org/2020.emnlp-main.500. doi:10.18653/v1/2020.emnlp-main.500. [7] Y. Belinkov, Y. Bisk, Synthetic and natural noise both break neural machine translation, ArXiv abs/1711.02173 (2017). URL: https://api.semanticscholar.org/CorpusID:3513372. [8] J. Gao, J. Lanchantin, M. L. Soffa, Y. Qi, Black-box generation of adversarial text sequences to evade deep learning classifiers, in: 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 50–56. doi:10.1109/SPW.2018.00016. [9] H. Hosseini, S. Kannan, B. Zhang, R. Poovendran, Deceiving google’s perspective api built for detecting toxic comments, ArXiv abs/1702.08138 (2017). URL: https://api.semanticscholar.org/ CorpusID:15418780. [10] B. Liang, H. Li, M. Su, P. Bian, X. Li, W. Shi, Deep text classification can be fooled, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, AAAI Press, 2018, p. 4208–4215. [11] S. Samanta, S. Mehta, Towards crafting text adversarial samples (2017). [12] G. A. Miller, WordNet: A lexical database for English, in: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994. URL: https: //aclanthology.org/H94-1111. [13] S. Garg, G. Ramakrishnan, Bae: Bert-based adversarial examples for text classification, ArXiv abs/2004.01970 (2020). URL: https://api.semanticscholar.org/CorpusID:214802269. [14] M. T. Ribeiro, S. Singh, C. Guestrin, Semantically equivalent adversarial rules for debugging NLP models, in: I. Gurevych, Y. 
Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 856–865. URL: https://aclanthology.org/P18-1079. doi:10.18653/v1/P18-1079. [15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019. arXiv:1907.11692. [16] J. Ebrahimi, A. Rao, D. Lowd, D. Dou, HotFlip: White-box adversarial examples for text clas- sification, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computa- tional Linguistics, Melbourne, Australia, 2018, pp. 31–36. URL: https://aclanthology.org/P18-2006. doi:10.18653/v1/P18-2006. [17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805. [18] M. T. Ribeiro, S. Singh, C. Guestrin, "why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 2016, pp. 1135–1144. [19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2014. arXiv:1312.6199. [20] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, D. Song, Natural adversarial examples, 2021. arXiv:1907.07174. [21] T. McCoy, E. Pavlick, T. Linzen, Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3428–3448. URL: https://aclanthology.org/P19-1334. doi:10. 18653/v1/P19-1334. [22] R. Jia, P. Liang, Adversarial examples for evaluating reading comprehension systems, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2021–2031. URL: https://aclanthology.org/D17-1215. doi:10.18653/v1/D17-1215. [23] N. Papernot, P. McDaniel, A. Swami, R. Harang, Crafting adversarial input sequences for recurrent neural networks, in: MILCOM 2016 - 2016 IEEE Military Communications Conference, IEEE Press, 2016, p. 49–54. URL: https://doi.org/10.1109/MILCOM.2016.7795300. doi:10.1109/MILCOM.2016. 7795300. [24] P. Gaiński, K. Bałazy, Step by step loss goes very far: Multi-step quantization for adversarial text attacks, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 2038–2048. URL: https://aclanthology.org/2023.eacl-main. 149. doi:10.18653/v1/2023.eacl-main.149. [25] Z. Yu, X. Wang, W. Che, K. He, TextHacker: Learning based hybrid local search algorithm for text hard-label adversarial attack, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 622–637. URL: https://aclanthology.org/ 2022.findings-emnlp.44. doi:10.18653/v1/2022.findings-emnlp.44. [26] R. 
Maheshwary, S. Maheshwary, V. Pudi, A strong baseline for query efficient attacks in a black box setting, ArXiv abs/2109.04775 (2021). URL: https://api.semanticscholar.org/CorpusID:237485150. [27] P. Przybyła, A. Shvets, H. Saggion, Verifying the robustness of automatic credibility assessment, 2023. arXiv:2303.08032. [28] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, K.-W. Chang, Generating natural language adversarial examples, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 2890–2896. URL: https://aclanthology.org/ D18-1316. doi:10.18653/v1/D18-1316. [29] Y. Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, M. Sun, Word-level textual adversarial attacking as combinatorial optimization, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Compu- tational Linguistics, Online, 2020, pp. 6066–6080. URL: https://aclanthology.org/2020.acl-main.540. doi:10.18653/v1/2020.acl-main.540. [30] X. Yang, Y. Gong, W. Liu, J. Bailey, D. Tao, W. Liu, Semantic-preserving adversarial text attacks, IEEE Transactions on Sustainable Computing 8 (2023) 583–595. doi:10.1109/TSUSC.2023.3263510. [31] M. Ni, Z. Sun, W. Liu, Fraud’s bargain attack: Generating adversarial text samples via word manipulation process, IEEE Transactions on Knowledge and Data Engineering (2024) 1–14. doi:10.1109/TKDE.2024.3349708. [32] D. Lee, S. Moon, J. Lee, H. O. Song, Query-efficient and scalable black-box adversarial attacks on discrete sequential data via bayesian optimization, 2022. arXiv:2206.08575. [33] S. Ren, Y. Deng, K. He, W. Che, Generating natural language adversarial examples through probability weighted word saliency, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1085–1097. URL: https://aclanthology.org/ P19-1103. doi:10.18653/v1/P19-1103. [34] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The clef-2024 checkthat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonel- lotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458. [35] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust metrics for text generation, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–7892. URL: https://aclanthology.org/2020.acl-main.704. doi:10.18653/v1/2020.acl-main.704. A. BeamAttack Algorithm Input: 𝑆 = [𝑤0 , 𝑤1 , . . . 
, 𝑤𝑛 ] // input tokenized sentence, 𝑌 // gold-label, 𝐾 // beam size, 𝐵 // branching factor, 𝐻 // hypothesis count, 𝑚𝑜𝑑𝑒𝑙 // the target model to attack Output: 𝑆𝑎𝑑𝑣 // adversarial sample Initialize the beam as a priority queue with the original sentence and its score 𝑏𝑒𝑎𝑚 ← PriorityQueue() 𝑏𝑒𝑎𝑚.𝑎𝑑𝑑((𝑆, 𝑚𝑜𝑑𝑒𝑙.𝑝𝑟𝑒𝑑𝑖𝑐𝑡_𝑝𝑟𝑜𝑏𝑎(𝑆)[𝑌 ])) begin FCalculateWordImportance 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒_𝑠𝑐𝑜𝑟𝑒𝑠 ← [] foreach word 𝑤𝑖 in 𝑆 do 𝑠𝑐𝑜𝑟𝑒_𝐿𝐼𝑀 𝐸 ← LIME_importance(𝑤𝑖 , 𝑆, 𝑚𝑜𝑑𝑒𝑙) 𝑠𝑐𝑜𝑟𝑒_𝑙𝑜𝑔𝑖𝑡 ← logit_importance(𝑤𝑖 , 𝑆, 𝑚𝑜𝑑𝑒𝑙) 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒_𝑠𝑐𝑜𝑟𝑒𝑠.𝑎𝑝𝑝𝑒𝑛𝑑((𝑤𝑖 , 𝑠𝑐𝑜𝑟𝑒_𝐿𝐼𝑀 𝐸 + 𝑠𝑐𝑜𝑟𝑒_𝑙𝑜𝑔𝑖𝑡)) end return 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒_𝑠𝑐𝑜𝑟𝑒𝑠 end begin FGetTopReplacements Input: 𝑤𝑜𝑟𝑑, 𝑆, 𝐵 𝑟𝑒𝑝𝑙𝑎𝑐𝑒𝑚𝑒𝑛𝑡𝑠 ← masked_language_model.predict(𝑤𝑜𝑟𝑑, 𝑆, 𝐵) return 𝑟𝑒𝑝𝑙𝑎𝑐𝑒𝑚𝑒𝑛𝑡𝑠 end 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒_𝑠𝑐𝑜𝑟𝑒𝑠 ← CalculateWordImportance(𝑆, 𝑚𝑜𝑑𝑒𝑙) Sort 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒_𝑠𝑐𝑜𝑟𝑒𝑠 by descending order while beam is not empty do 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑏𝑒𝑎𝑚 ← [] foreach (𝑆𝑐𝑢𝑟𝑟𝑒𝑛𝑡 , 𝑠𝑐𝑜𝑟𝑒) in beam do foreach word 𝑤𝑗 in 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒_𝑠𝑐𝑜𝑟𝑒𝑠 do 𝑟𝑒𝑝𝑙𝑎𝑐𝑒𝑚𝑒𝑛𝑡𝑠 ← GetTopReplacements(𝑤𝑗 , 𝑆𝑐𝑢𝑟𝑟𝑒𝑛𝑡 , 𝐵 ) foreach replacement in replacements do 𝑆𝑛𝑒𝑤 ← 𝑆𝑐𝑢𝑟𝑟𝑒𝑛𝑡 with 𝑤𝑗 replaced by replacement 𝑛𝑒𝑤_𝑠𝑐𝑜𝑟𝑒 ← 𝑚𝑜𝑑𝑒𝑙.𝑝𝑟𝑒𝑑𝑖𝑐𝑡_𝑝𝑟𝑜𝑏𝑎(𝑆𝑛𝑒𝑤 )[𝑌 ] 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑏𝑒𝑎𝑚.𝑎𝑑𝑑((𝑆𝑛𝑒𝑤 , 𝑛𝑒𝑤_𝑠𝑐𝑜𝑟𝑒)) end end end Sort 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑏𝑒𝑎𝑚 by score 𝑏𝑒𝑎𝑚 ← top 𝐾 items from 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑏𝑒𝑎𝑚 Check for success: foreach (𝑆𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 , 𝑠𝑐𝑜𝑟𝑒) in beam do if model.predict(𝑆𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 ) ̸= 𝑌 then 𝑠𝑢𝑐𝑐𝑒𝑠𝑠_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠.𝑎𝑑𝑑((𝑆𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 , 𝑠𝑐𝑜𝑟𝑒)) end end if length of 𝑠𝑢𝑐𝑐𝑒𝑠𝑠_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 ≥ 𝐻 then break end end 𝑆𝑎𝑑𝑣 ← SelectBestCandidate(𝑠𝑢𝑐𝑐𝑒𝑠𝑠_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠) return 𝑆𝑎𝑑𝑣 if 𝑠𝑢𝑐𝑐𝑒𝑠𝑠_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 is not empty, otherwise return None begin FSelectBestCandidate 𝑏𝑒𝑠𝑡_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 ← None 𝑏𝑒𝑠𝑡_𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 ← −∞ foreach (𝑆𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 , 𝑠𝑐𝑜𝑟𝑒) in 𝑠𝑢𝑐𝑐𝑒𝑠𝑠_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 do 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 ← Bleurt_score(𝑆𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 , 𝑆) if 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 > 𝑏𝑒𝑠𝑡_𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 then 𝑏𝑒𝑠𝑡_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 ← 𝑆𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 𝑏𝑒𝑠𝑡_𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 ← 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 end end return 𝑏𝑒𝑠𝑡_𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒 end begin FLIME_importance Input: 𝑤𝑜𝑟𝑑, 𝑆, 𝑚𝑜𝑑𝑒𝑙 return LIME_score end begin FLogit_importance 𝑚𝑎𝑠𝑘𝑒𝑑_𝑆 ← 𝑆 with 𝑤𝑜𝑟𝑑 replaced by [MASK] return 𝑚𝑜𝑑𝑒𝑙.𝑝𝑟𝑒𝑑𝑖𝑐𝑡_𝑝𝑟𝑜𝑏𝑎(𝑆)[𝑌 ] − 𝑚𝑜𝑑𝑒𝑙.𝑝𝑟𝑒𝑑𝑖𝑐𝑡_𝑝𝑟𝑜𝑏𝑎(𝑚𝑎𝑠𝑘𝑒𝑑_𝑆)[𝑌 ] end Algorithm 1: BeamAttack B. Ablation Studies Queries Task Victim Importance Method 𝑘, ℎ, 𝑏 Success Semantic Character BODEGA per example RD BERT LIME 10, 10, 10 16342.9 0.7 0.8160 0.9174 0.5380 RD BERT logit-based 10, 10, 10 5947.0 0.9 0.8189 0.9130 0.6874 FC BERT LIME 10, 10, 10 5520.44 1.0 0.8392 0.9546 0.8044 FC BERT logit-based 10, 10, 10 732.66 1.0 0.8188 0.9466 0.7787 FC BiLSTM LIME 10, 10, 30 2597.3 0.985 0.8111 0.9332 0.7511 FC BiLSTM logit-based 10, 10, 30 1549.1 1.0 0.8059 0.9410 0.7642 HN BiLSTM LIME 10, 10, 20 1529.04 1.0 0.9043 0.9837 0.8905 HN BiLSTM logit-based 10, 10, 20 665.72 1.0 0.9112 0.9893 0.9031 FC RoBERTa LIME 10, 10, 20 557.08 1.0 0.8398 0.9644 0.8110 FC RoBERTa logit-based 10, 10, 20 112.4 1.0 0.8392 0.9641 0.8105 C19 RoBERTa LIME 40, 10, 40 9714.4 1.0 0.5625 0.8602 0.4860 C19 RoBERTa logit-based 40, 10, 40 5505.0 1.0 0.6541 0.8939 0.5853 C19 RoBERTa LIME 10, 10, 20 1858.0 1.0 0.6304 0.8669 0.5500 C19 RoBERTa logit-based 10, 10, 20 972.7 1.0 0.6120 0.8677 0.5316 Table 7 Evaluation metrics for different importance methods on different tasks and victims. We always bold the better BODEGA score with the same settings for either LIME or BERT. 𝑘 refers to the selected beam size, ℎ to hypothesis counts and 𝑏 to the branching factor. 
beam size 𝑘 Queries Subset size Success Semantic Character BODEGA branching factor 𝑏 per example 10 30 530.53 0.93 0.58 0.78 0.44 20 30 1334.07 1.00 0.56 0.78 0.47 30 30 2451.77 1.00 0.58 0.79 0.49 35 30 3090.13 1.00 0.59 0.81 0.50 40 30 3656.93 1.00 0.60 0.83 0.51 50 30 5308.23 1.00 0.60 0.84 0.52 60 30 7151.87 1.00 0.61 0.84 0.53 70 30 9279.90 1.00 0.61 0.83 0.53 80 30 11374.77 1.00 0.62 0.84 0.54 90 30 14546.27 1.00 0.62 0.84 0.54 100 30 16499.97 1.00 0.63 0.84 0.55 Table 8 Scores for Different Beam Sizes 𝑘 for task PR2 and victim RoBERTa. Here hypothesis counts ℎ is always 10 and 𝑏 = 𝑘 for each case. Queries branching factor 𝑏 Beam size 𝑘 Subset size Success Semantic Character BODEGA per example 30 40 30 3032.733 1.0 0.5835 0.7924 0.4904 40 40 30 3656.933 1.0 0.5993 0.8280 0.5130 50 40 30 4476.300 1.0 0.5957 0.8331 0.5115 60 40 30 4836.333 1.0 0.6058 0.8309 0.5210 70 40 30 5799.533 1.0 0.6113 0.8249 0.5253 Table 9 Scores for Different branching factors 𝑏 with a fixed Beam size 𝑘 for task PR2 and victim RoBERTa with hypothesis counts ℎ = 10. hypothesis beam branching Queries Task Subset size Success Semantic Character BODEGA count ℎ size 𝑘 factor 𝑏 per example RD 20 20 40 20 37576.3 0.95 0.6645 0.7952 0.5501 RD 10 20 40 20 37764.7 0.95 0.6654 0.7951 0.5508 PR2 20 40 40 30 3744.8 1.0 0.5996 0.8235 0.5101 PR2 10 40 40 30 3656.9 1.0 0.5993 0.8280 0.5130 C19 20 40 40 10 5505.0 1.0 0.6541 0.8939 0.5853 C19 10 40 40 10 5505.0 1.0 0.6541 0.8939 0.5853 Table 10 Results of different hypothesis counts ℎ with the RoBERTa victim and different tasks.