                         A Black-box Adversarial Attack on Fake News Detection
                         Systems
                         Farwa Batool1,* , Federico Canino2 , Federico Concone2 , Giuseppe Lo Re2 and Marco Morana2
                         1
                             Scuola IMT Alti Studi Lucca, Lucca, Italy
                         2
                             Università degli Studi di Palermo, Palermo, Italy


                                         Abstract
The widespread diffusion of misinformation through digital platforms has raised significant concerns due to its adverse impact on society and the economy. Nowadays, the adoption of Artificial Intelligence and Machine Learning based mechanisms to automate fact-checking processes and distinguish genuine from fake content has become essential. However, recent studies reveal that AI models are vulnerable to adversarial attacks, where slight modifications of the input can deceive the classifiers. Adversarial Machine Learning strategies aim to compromise machine learning algorithms, posing challenges also for fake news detection models. This study focuses on the impact of adversarial attacks on fake news detection systems, adopting a black-box attack approach against the unknown algorithm used by the target online platform. The research introduces a methodology that leverages a surrogate model to test the validity of malicious samples offline, with the aim of overcoming known limitations such as the high number of queries made to the target model.

                                         Keywords
                                         Adversarial Machine Learning, Fake News Detection, Cybersecurity




                         1. Introduction
Nowadays, the spread of misinformation through the internet has become a major concern due to its negative impact on the social and economic community we live in [1]. For example, stock market losses of $130 billion were reported after a false report that US President Barack Obama had been injured in an explosion [2]. More recently, misinformation related to the Covid-19 pandemic has led many people to drink bleach to counteract the virus, thus seriously endangering human lives [3].
   In general, whether it is a social network platform, a blog, or any other virtual environment for sharing facts and news, the current trend to counter misinformation is to leverage fact-checking companies [4]. Meta's platforms, such as Facebook, Instagram, and WhatsApp, provide tangible evidence of industry interest in the problem of misinformation detection [5]. These platforms provide their users with internal mechanisms to report misleading content (e.g., community feedback), but have also created partnerships with independent third-party fact-checkers. However, the speed and efficiency of manual fact-checking cannot keep up with the pace at which online information is posted and circulated. The community may benefit from tools that, at least partially, automate fact-checking, particularly its more mechanical tasks, so that human effort can instead be dedicated to more labor-intensive ones [6, 7, 8]. In this sense, both the scientific and industrial communities exploit Artificial Intelligence (AI) and Machine Learning (ML) techniques that allow timely identification of potentially fake content and trigger fact-checkers only for the most uncertain cases. Most ML-based fake news detection techniques [9] share a common structure, based on the use of Natural Language Processing (NLP) techniques such as data pre-processing and word embedding, together with Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The idea is that a malicious entity tends to

                          ITASEC 2024: Italian Conference on CyberSecurity, April 08–12, 2024, Salerno, Italy
                         *
                           Corresponding author.
 farwa.batool@imtlucca.it (F. Batool); federico.canino@unipa.it (F. Canino); federico.concone@unipa.it (F. Concone);
                          giuseppe.lore@unipa.it (G. Lo Re); marco.morana@unipa.it (M. Morana)
                           0009-0003-2767-5526 (F. Batool); 0000-0001-7638-3624 (F. Concone); 0000-0002-8217-2230 (G. Lo Re); 0000-0002-5963-6236
                          (M. Morana)
                                      © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


adopt a particular style while writing, with the intent of convincing as many people as possible about
the veracity of a statement. In this scenario, features representing the text under analysis, its title, or
even the author are analysed to distinguish genuine and fake content. Lexical features, overemphasized
words, or the absence of the information source may also indicate the falsity of the news [10]. However,
recent studies have proven that it is possible to deceive AI models by adding a certain amount of
perturbation to the input, which causes the classifier to make an error in the final prediction [11]. This
logic falls under the field of Adversarial Machine Learning (AML), which defines a set of strategies
that aim to compromise the correct functioning of a machine learning algorithm. Regardless of the
application scenario they address, AML attacks can be classified as operating in a white-box or black-box
setting. The former guarantees a higher probability of success because it assumes that the attacker
knows everything about the model to be targeted; in contrast, the latter is much more complex and requires the attacker to query the model with some inputs in order to estimate its behavior.
   In the broader landscape of AML, researchers are making substantial efforts to understand and fortify machine learning models against deliberate manipulation attempts. However, amidst this progress, a noteworthy observation is that the literature on adversarial attacks on ML-based fake news detection is quite limited [12]. The proposed study focuses on this domain, examining the impact of adversarial machine learning attacks on fake news detection models. In particular, this paper presents a black-box attack against the unknown NLP algorithm adopted by the target online platform to identify fake content. A naive approach would require producing a number of malicious samples and testing whether the system classifies them as genuine or malicious [12]. However, such a strategy has two main limitations: making too many queries could be seen as an attempt to compromise the entire system, which would likely cause the account used by the attacker to be blocked, and the time required for the attack to succeed would be very long. To overcome these limitations, the proposed attack exploits a local model that is used to test the validity of the malicious samples offline, i.e., without querying the real target system.
   The remainder of this paper is organized as follows: related work is described in Section 2. The
background and methodology are discussed in Section 3. Experimental settings and results are discussed
in Section 4. Conclusions are given in Section 5.


2. Related Work
Amidst the vast dissemination of information on online platforms [13], the detection of unauthentic content has become a focal point of research, and various methodologies have emerged to discern truth from misinformation. To this aim, a Hybrid-CNN model achieving promising results on different datasets was introduced in [2], whereas the authors in [14] emphasize the flow of affective information for fake news detection. Another interesting work is described in [15], which proposes a novel framework preserving both domain-specific and cross-domain knowledge and introduces an unsupervised technique to choose informative unlabeled news records for manual labeling from a large pool of unlabeled records, thus reducing labeling costs. Although the model efficiently extracts the domain knowledge of news, it still requires human intervention for labeling the records selected by the model, demanding significant time and manpower. As the sophistication of fake news evolves, the demand for resilient detection systems
has led researchers to explore the realm of AML, where intentional attacks on models pose a new set of challenges. These attacks span various domains. Focusing on the realm of Natural Language Processing (NLP), the authors of [16] created a framework that performs adversarial training using text attacks; this framework includes various attack recipes such as TextBugger, DeepWordBug, and TextFooler. Another framework, called Malcom, is introduced in [17] for adversarial comment generation: the assumption is that the attacker is not allowed to modify the text's content, so the comments on the article are exploited to perform the attack.
   These adversarial attacks have been proposed to assess the vulnerabilities of ML models in different application contexts, from malware analysis [18] to intrusion detection [19]. The authors in [20] showed the possibility of attacking object detectors, an attack on intrusion detection systems (IDSs) is introduced in [21] by injecting a small perturbation into network traffic, and [22, 23] deal with adversarial attacks in the medical field.
   In the context of fake news detection, the authors in [24] conducted experiments to assess the vulner-
ability of automatic fake news detectors to adversarial attacks. They employed the TextAttack library,
conducting experiments on pretrained models, including RoBERTa, BERTweet, and FlairEmbeddings.
The results were evaluated based on the success rate of flipping labels achieved by the attack recipes.
However, a notable gap exists in the paper as it does not elaborate on the implications of these findings
for the overall performance of fake news detection models. Although the authors claim that their
process could potentially bypass fake news detection systems, a comprehensive understanding of the
vulnerability of these detectors necessitates an assessment of the impact of perturbed data on their
actual performance. It is crucial to bridge the gap between the efficacy of adversarial attacks and
their implications in the real world for fake news detection models. This issue has been addressed in
the research done in [12], which evaluated the robustness of fake news detectors using four distinct
architectures: multilayer perceptron (MLP), convolutional neural network (CNN), recurrent neural
network (RNN), and a hybrid CNN-RNN model, across diverse datasets including the Kaggle fake news, ISOT, and LIAR datasets. Employing adversarial attacks such as TextBugger, TextFooler,
PWWS, and DeepWordBug, they varied the complexity of detectors, experimented with different input
lengths, and explored loss functions. The outcome indicated that CNNs exhibit the highest level of
robustness, closely followed by RNNs.
The analysis of the related literature shows that, while considerable progress has been achieved in the domains of AML and fake news detection, a notable gap persists concerning their convergence. In particular, the experiments addressing this integration face a significant challenge related to the potential for denial of service (DoS): current methodologies involve directly querying the target model during testing, which in real-world scenarios poses the risk of being treated as a DoS attempt. Therefore, there is a need for an approach capable of avoiding this DoS vulnerability while effectively introducing adversarial samples into the target model.


3. Background and Methodology
The term Adversarial Machine Learning generally refers to a set of techniques that aim to compromise the proper functioning of ML systems through the use of malicious inputs called adversarial examples. In the case of a classification task, the goal is to fool the targeted ML model into producing an output different from the expected one.
   The guidelines of every AML strategy require the definition of the adversary model according to
three main aspects [23, 25].
   Firstly, the attacker's goal defines the expected result of the attack and the phase in which it must be launched. Multiple objectives may be targeted by the adversary, such as confidentiality (if the attacker aims to obtain private information), integrity (if the aim is to cause the malfunction of the target model), and availability (if it is to take the system offline). Since a piece of news is public and the attacker wants to deceive the platform in order to share and spread fake news, in the scenario considered here an integrity violation is the only objective, i.e., fooling the classifier at the platform's disposal without disrupting the platform itself. The proposed strategy can thus be categorized as a label-targeted attack, because the attacker wants to maximize the probability that adversarial examples are classified in a specific class, i.e., real news. This can be achieved during the inference phase, where the crafted sample is evaluated.
   Then, the attacker's capability defines which values of the original sample are to be perturbed by the attacker, and how. The perturbation process is a crucial aspect of the attack: the more the values are altered, the less similar they are to the original ones, making the adversarial examples useless for the attacker's goal. In the scenario considered here, the attack logic translates into perturbing the original sequence of words 𝑋 (related to a fake news item) with some noise 𝛿, so that the new sequence 𝑋˜ = 𝑋 + 𝛿 is misclassified as real news. To this aim, a set 𝑆𝑥 of important words, ordered by their influence, is identified. One by one, the attacker replaces these words with their perturbed version while maintaining semantic and grammatical similarity to the initial text.

[Figure 1 diagram: in the offline phase, an adversarial attack iteratively perturbs a fake statement (e.g., "Eating garlic can prevent infection with the coronavirus." becomes "E@ting g@rlic c@n prevent infection with the coron@virus.") against the surrogate model until it is classified as TRUE; in the online phase, the perturbed text is submitted to the target model.]

Figure 1: The proposed AML strategy.
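
To make this greedy logic concrete, the following is a purely illustrative sketch (not the exact procedure of any of the attacks listed in Section 3.1): word importance is estimated via leave-one-out score differences on a locally available surrogate, and the most influential words are replaced with simple homoglyph-style variants until the surrogate's label flips. The `surrogate` and `vectorizer` objects, the assumption that class 0 corresponds to fake news, and the homoglyph map are all illustrative choices.

```python
# Illustrative greedy word-replacement loop against a local surrogate classifier.
# Assumes a fitted scikit-learn style `surrogate` (with predict_proba) and a fitted
# text `vectorizer`; class 0 is assumed to be the "fake" class.

HOMOGLYPHS = {"a": "@", "o": "0", "e": "3", "i": "1"}   # toy character substitutions

def p_fake(text, surrogate, vectorizer):
    return surrogate.predict_proba(vectorizer.transform([text]))[0, 0]

def greedy_attack(text, surrogate, vectorizer, max_edits=5):
    words = text.split()
    base = p_fake(text, surrogate, vectorizer)
    # Leave-one-out importance: how much the fake score drops when a word is removed.
    importance = sorted(
        ((base - p_fake(" ".join(words[:i] + words[i + 1:]), surrogate, vectorizer), i)
         for i in range(len(words))),
        reverse=True,
    )
    # Perturb the most influential words one by one until the surrogate flips its label.
    for _, i in importance[:max_edits]:
        words[i] = "".join(HOMOGLYPHS.get(c, c) for c in words[i].lower())
        candidate = " ".join(words)
        if p_fake(candidate, surrogate, vectorizer) < 0.5:
            return candidate          # the surrogate now labels the text as real
    return None                       # attack failed within the edit budget
```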
   Last but not least, the attacker’s knowledge defines the background the attacker has about the
target ML model. The attack is based on the white-box threat model if complete knowledge about the target is available, including the dataset used during the training phase and the model's structure. In contrast, the gray-box threat model is used when the attacker has partial knowledge of the target. In
real applications, the black-box threat model is often adopted. This is the case of the scenario addressed
in this paper as the attacker has no information about the data, the inner ML model, or any other
element used by the target. The only way to interact with the remote model is to use it as an oracle to
be contacted to have the text classified.
   However, continuously querying the remote platform could trigger security mechanisms to the point of exposing the attacker. For example, on a social network platform, continuously posting news items to discover their assigned class could be detected as spamming activity, almost certainly leading to account banning or blacklisting. Similar considerations apply in other contexts, such as the fact-checking tool provided by Google; in this case, even more drastic measures could be taken, since the activity could be seen as a DoS attack.
   The limits introduced above are overcome by the proposed strategy, which follows the general structure of black-box adversarial attacks. As shown in Figure 1, the attack is divided into Offline and Online operations.
   In the first phase, the attacker trains a surrogate model 𝑆 so that it emulates the behavior of the target model 𝑇 . Note that the application scenario does not impose any special requirements on the surrogate model; however, as reported in the scientific literature [11], the adoption of low-complexity ML models (such as logistic regression or support vector machines) increases the probability that the attack is successful on the target, especially when the latter consists of convolutional or recurrent neural networks.
   To collect a valid dataset for 𝑆, the attacker may take advantage of different strategies discussed in the field of AML [26]. In addition, the attacker could independently assemble a dataset by exploiting well-known fact-checking sites. It is important to note that the adversary does not need to know the feature-space representation of the input samples: perturbations can be applied directly to the input text, which in turn is submitted to the system for classification. Once the model training is completed, the adversary performs the actual offline operation, which consists of locally querying 𝑆 until an erroneous label is returned.
   In the second phase, the perturbed text is passed as input to 𝑇 , and its transferability is evaluated. According to this property, it is possible to attack a machine learning system with no knowledge of the underlying model. If the attack transfers successfully, the target's output will be the same as the surrogate's.
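
A minimal sketch of the two phases under these assumptions follows: `load_attacker_dataset` and `query_target_platform` are hypothetical placeholders for the attacker's data collection and for the single online submission to the target, while `greedy_attack` refers to the illustrative loop sketched above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Offline: train a low-complexity surrogate S on attacker-collected data (0 = fake, 1 = real).
texts, labels = load_attacker_dataset()                    # hypothetical helper
vectorizer = TfidfVectorizer()
surrogate = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

# Offline: perturb a fake statement locally until S labels it as real.
adv_text = greedy_attack("Eating garlic can prevent infection with the coronavirus.",
                         surrogate, vectorizer)

# Online: a single query to the remote target T to check transferability.
if adv_text is not None:
    transferred = (query_target_platform(adv_text) == 1)   # hypothetical remote oracle
```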
                                                 TB    TF   PWWS       IR   DWB
                           Character removal     ✓     X      X         X    ✓
                        Character substitution   ✓     X      X         X    ✓
                              Character swap     ✓     X      X         X    ✓
                            Whitespace insert    ✓     X      X         X    X
                          Semantic similarity    ✓     ✓      ✓         X    X
                          Syntactic similarity    X    X      X         X    X
                               Word deletion      X    X      X        ✓     X
Table 1
Admissible perturbations for each of the considered attacks: TextBugger (TB), TextFooler (TF), PWWS, Input Reduction (IR), DeepWordBug (DWB).


3.1. Adversarial Attacks for Text Perturbation
The adversary's objective is to introduce perturbations into the corpus by altering a number of words. The goal is to impact the predictive capabilities of the model without significantly changing the semantic meaning of the text. To accomplish this, the adversary identifies pivotal words within the corpus and modifies them, for instance through character replacements or by substituting the entire word with a synonym. In particular, we considered the following five attacks (a usage sketch is provided below):

    • The Probability Weighted Word Saliency (PWWS) [27]: This attack preserves the lexical, grammatical, and semantic constraints of the input while performing the attack. The approach operates by evaluating the importance of individual words, called "word saliency", to establish their ranking based on this metric. Subsequently, the approach identifies the word with the highest saliency score and compiles a list of prospective substitutions for this word, typically encompassing synonyms or lexically and semantically similar terms. This substitution process is done to alter the model's predictive outcome.
    • Text-Bugger (TB) [28]: Identifies the most important sentences in the text and assigns each an importance value through a score function. Variations of these sentences are then computed to obtain new score values to be compared with the original one. The differences between the original score and those obtained from the text's variations are used to determine a set of keywords. This set represents the starting point of the method, as it is used to generate five perturbations: random removal of a character, swapping of characters, substitution of a character with a homoglyph, random insertion of a whitespace, and substitution of the word with a semantically similar one. The attacker then chooses the optimal perturbation for each keyword in order to reduce the classifier's output score.
    • Text Fooler (TF) [29]: The approach identifies keywords by computing the difference between the model's score before and after the deletion of each word from the input. The attacker then replaces every keyword with the words closest to it in a predefined embedding space and selects the best candidate, i.e., the one that reduces the output score the most.
    • Input Reduction (IR) [30]: In contrast to the other attacks, this one completely deletes the least important words. As a result, the remaining words may appear nonsensical to humans, while the model still interprets the removed words as unimportant. The iterative removal of words affects the performance of the model.
    • Deep Word Bug (DWB) [31]: This attack identifies the critical tokens using a unique scoring strategy, and then performs character-level perturbations, including character swaps, insertions, deletions, and substitutions, making the words unrecognizable and thus impacting the model's performance.

   For the sake of clarity, Table 1 summarizes the perturbations that each of the considered attacks can
make on the original text.
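
The following is a hedged sketch of how these recipes could be run against the local surrogate using the TextAttack library [16]; the recipe class names follow TextAttack's naming conventions and should be verified against the installed version, while `clf`, `vectorizer`, `test_texts`, and `test_labels` refer to the fitted surrogate and a labeled test split (assumptions carried over from the previous sketch).

```python
from textattack import Attacker, AttackArgs
from textattack.datasets import Dataset
from textattack.models.wrappers import SklearnModelWrapper
from textattack.attack_recipes import (
    PWWSRen2019, TextBuggerLi2018, TextFoolerJin2019,
    InputReductionFeng2018, DeepWordBugGao2018,
)

# Wrap the locally trained surrogate so the recipes query it instead of the remote target.
model_wrapper = SklearnModelWrapper(clf, vectorizer)
dataset = Dataset(list(zip(test_texts, test_labels)))

recipes = {"PWWS": PWWSRen2019, "TB": TextBuggerLi2018, "TF": TextFoolerJin2019,
           "IR": InputReductionFeng2018, "DWB": DeepWordBugGao2018}

for name, recipe in recipes.items():
    attack = recipe.build(model_wrapper)
    attacker = Attacker(attack, dataset, AttackArgs(num_examples=len(test_texts)))
    results = attacker.attack_dataset()   # each result is marked successful, failed, or skipped
```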
   The assessment criteria for the success, failure, or skipping of an attack are contingent on its ability
to modify the output label. An attack is called Successful when it makes a modification that results in a
[Figure 2 diagram: a Dataset and an Attack Library feed the Surrogate and Target models; the arrows labeled (a)-(e) correspond to the evaluation configurations listed in the caption.]

Figure 2: The experimental workflow to assess the performance of the proposed methodology. (a) Evaluation of
surrogate model without launching any attack. (b) Evaluation of target model without launching any attack. (c)
Analysis of surrogate model on data generated by the attacks. (d) Analysis of target model on data generated by
the attacks. (e) Evaluation of target model on data perturbed by surrogate model


change of the output label. Conversely, if the attack makes textual perturbations that are not capable of
changing the output label, it is categorized as Failed.


4. Experimental Analysis and Discussion
Experiments have been carried out on LIAR1 , a dataset consisting of 12.8K manually labeled short statements collected from Politifact.com. Each record encompasses various attributes, including the ID, label, statement, subject(s), speaker, speaker's job title, state info, party affiliation, barely-true counts, false counts, half-true counts, mostly-true counts, pants-on-fire counts, and venue. This dataset has been widely used for binary [32] as well as multi-label classification [33] for fake news detection.
   The dataset underwent preprocessing to convert the raw text into a numerical format that machine learning models can process. The statement column served as the input variable, while the output variable was transformed from multi-label categories into binary labels denoting either 0 (Fake) or 1 (Real). This conversion was implemented by mapping the labels {barely-true, false, pants-on-fire} to class 0, and the labels {true, mostly-true, half-true} to class 1. The statement column was further processed with the aim of improving the recognition performance. First, the special characters are removed from the news samples, retaining only alphanumeric values. Then, the words are reduced to their base form using the WordNetLemmatizer. Whereas stemming performs a rough reduction and often produces non-words, lemmatization works better by producing valid words and a more accurate representation.
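
A minimal sketch of this preprocessing step is given below; the exact label strings depend on the dataset export (some versions spell the last label "pants-fire"), and the regular expression is one possible way of retaining only alphanumeric content.

```python
import re
from nltk.stem import WordNetLemmatizer
# import nltk; nltk.download("wordnet")  # the lemmatizer requires the WordNet corpus once

# Multi-class LIAR labels collapsed into binary ones (0 = Fake, 1 = Real).
LABEL_MAP = {"barely-true": 0, "false": 0, "pants-on-fire": 0,
             "true": 1, "mostly-true": 1, "half-true": 1}

lemmatizer = WordNetLemmatizer()

def preprocess_statement(text):
    # Keep only alphanumeric characters and whitespace, then lemmatize each token.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text.lower())
    return " ".join(lemmatizer.lemmatize(tok) for tok in cleaned.split())
```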
   To simulate a real black-box scenario, two different representations were used for the surrogate and
target models. Specifically, we have adopted Term Frequency-Inverse Document Frequency (TF-IDF) for
the surrogate and the CountVectorizer technique for the target model.
   In the remainder of this section, we use Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM) as surrogate models and a Neural Network (NN) as the target model. In the latter case, the input data is passed through an embedding layer to convert words into the corresponding embeddings. Subsequently, the output is flattened and passed through a dense layer comprising 128 units, whose output is fed to another dense layer containing 64 units. Finally, the data is sent through a dense layer with two units, as the classification is binary. The ReLU activation function is used for the first two dense layers, while the last layer uses the Sigmoid function.
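
A sketch of the target network matching the layer sequence described above is shown below (Keras). The integer tokenization feeding the embedding layer, together with the vocabulary size, embedding dimension, sequence length, optimizer, loss, and training schedule, are assumptions not specified in the text; `train_texts` and `train_labels` denote the preprocessed statements and binary labels.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB, EMB_DIM, MAX_LEN = 10_000, 64, 60        # assumed hyperparameters

# Map each statement to a padded sequence of word indices for the embedding layer.
tok = Tokenizer(num_words=VOCAB)
tok.fit_on_texts(train_texts)
X_train = pad_sequences(tok.texts_to_sequences(train_texts), maxlen=MAX_LEN)

# Layer sequence as described: Embedding -> Flatten -> Dense(128, ReLU) -> Dense(64, ReLU) -> Dense(2, Sigmoid).
target = models.Sequential([
    layers.Embedding(VOCAB, EMB_DIM),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="sigmoid"),
])
# Training configuration below is an assumption (not stated in the paper).
target.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
target.fit(X_train, train_labels, epochs=5, batch_size=32)
```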
   The efficiency of the proposed methodology has been evaluated following the flow depicted in
Figure 2.

1
    https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset
                                                         Surrogate        Target
                                                   LR      SVM RF          NN
                                     Accuracy      .62      .62    .61      .57
                                     Precision     .61      .61    .60      .56
                                        Recall     .60      .60    .59      .55
                                     F1-Score      .60      .60    .59      .55
Table 2
Classification performance of the surrogates on 𝐷𝑆 and of the target on 𝐷𝑇 .

[Figure 3 bar chart: Accuracy Difference (y-axis, 0-20) for each model (LR, SVM, RF, Target) on the x-axis, with one bar per attack (DWB, IR, PWWS, TB, TF).]
Figure 3: Performance variations of models (target and surrogates) after each attack.


   The first set of experiments aims to evaluate the performance of the three surrogate models (configuration (a)) and of the target model (configuration (b)) without launching any kind of attack. This evaluation allows us to determine how the various models perform in the recognition task. The outcomes reported in Table 2 show that LR, SVM, and RF achieved similar performance in terms of the considered evaluation metrics, as did the NN. These values reveal that the different embedding techniques adopted in the ML models do not significantly affect the classification performance. The low recognition efficacy on the LIAR dataset is known in the literature [34], as this dataset contains mislabelled data.
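
For completeness, a minimal sketch of how the Table 2 metrics can be computed from a held-out test split is given below; the macro averaging is an assumption, as the paper does not specify it.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def classification_report_row(y_true, y_pred):
    # Accuracy, Precision, Recall, and F1-Score as reported in Table 2 (macro averaging assumed).
    return {"Accuracy": accuracy_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred, average="macro"),
            "Recall": recall_score(y_true, y_pred, average="macro"),
            "F1-Score": f1_score(y_true, y_pred, average="macro")}
```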
   More interesting findings can be deduced for configurations (c) and (d), where the attacks are launched directly against the models. Here, we measured the Accuracy Difference achieved by each model before and after the attack: the greater the difference, the more successful the attack. The outcomes shown in Figure 3 suggest that each model's performance is decreased by the introduction of adversarial samples. The most effective attacks are PWWS, TF, and TB, which on average show a significant difference in the models' performance. These three attacks share the same perturbation logic, as they exploit the semantic similarity of words; this observation implies that such attacks are more likely to impact the model's performance. In contrast, strategies that primarily focus on character-level perturbations might perturb words so that they go unrecognized by the tokenizer, and are possibly filtered out without causing a substantial impact on the model's performance. Other interesting insights concern the number of queries made by each attack procedure. In fact, to launch a successful attack, each time the technique modifies the original text, the target model is queried. This aspect is examined later in this section.
   Finally, configuration (e) considers the entire elaboration pipeline: the samples that fooled each surrogate under each attack are transferred to the target to evaluate the model's resilience to adversarial examples. In this experimental evaluation, the number of samples that fool the surrogate and the Accuracy Difference are examined. The first metric provides objective data on how many samples fool the surrogate. The second metric, on the other hand, provides information on the effectiveness of the proposed technique, demonstrating that the adoption of the surrogate still allows adversarial attacks capable of deceiving the target. In this regard, to conduct an effective evaluation, the difference in accuracy was calculated as the deviation between the target's
                        LR                  SVM                 RF                average
             Samples(%)  Acc.Diff.  Samples(%)  Acc.Diff.  Samples(%)  Acc.Diff.  Samples(%)  Acc.Diff.
   DWB           33          7          30          6          37          6         33.3        6.3
   IR            62          3          63          3          87          3         70.7        3.0
   PWWS          75         10          32          6          38          7         48.3        7.7
   TB            38          9          37          7          44          7         39.7        7.7
   TF            49          9          48          9          62         10         53.0        9.3
   average      51.4        7.6        42.0        6.2        53.6        6.6

Figure 4: Percentage of samples that fooled the target and Accuracy Difference for each setting in the configu-
ration (e).


accuracy on the original test set and that achieved by the same model when the test set contains the
perturbed samples.
   Figure 4 shows the number of samples that fooled the surrogate models and the accuracy difference achieved on the target for each of the considered settings. As already observed in Figure 3, here too the introduction of perturbed samples leads to a decline in the accuracy of the target model. In general, it is possible to note that all the surrogates achieve, on average (last row), similar performance when transferred to the target model (from 6.2 to 7.6). Among them, LR is the best, as it shows the best trade-off between the metrics considered. Looking at specific settings, it can be observed that accuracy differences greater than 10 are never achieved; the maximum of 10 occurs for the (surrogate, attack) pairs (LR, PWWS) and (RF, TF). In contrast, the worst strategy is IR, as the accuracy is decreased by only 3 regardless of the surrogate model adopted.
   More insights can be obtained by analyzing the right part of Figure 4, where the average of both metrics for each attack is reported. In terms of Accuracy Difference, the best attack is TF, which obtains a decrement of 9.3 on the target model; PWWS and TB are characterized by similar results. This outcome confirms the findings already obtained for configurations (c) and (d) (Figure 3), except for IR. In fact, IR has, on average, a very high percentage of samples that fool the surrogate (i.e., successful attacks on the surrogate), but limited effectiveness in decreasing the accuracy of the target. This result can be attributed to the random nature of IR, whose logic is to prioritize random word deletions over targeted perturbations, unlike the other attacks, which strategically modify the content of the text to deceive the model.
   The very last analysis concerns the focus of the proposed work, i.e., the number of queries made to the target model. Recall that, in a real-world scenario, making an excessive number of queries to the remote system could activate defense mechanisms that would compromise the attacker's purposes. Table 3 summarizes the average number of queries for the three configurations discussed in this section. For configuration (c), the highest and lowest numbers of queries are made by PWWS and IR respectively, while they are TF and IR for configuration (d). As might be expected (see also Figure 3), the techniques characterized by a higher number of queries are also those more likely to succeed in fooling the model, be it the surrogate or the target. The only exception is IR, which, as noted above, is characterized by a random behavior.
   To understand the benefits of the proposed solution, these values must then be compared with those of our proposal. In particular, the comparison must consider the results of configuration (c) against those of the "Surrogate" column in configuration (e) and, likewise, the values of configuration (d) against the "Target" column in configuration (e). In the first case, the findings from the comparison are not surprising since, on average, similar results are obtained as before: PWWS continues to be the
                           configuration (c)       configuration (d)      configuration (e)
              Attack      LR     SVM      RF              NN             Surrogate Target
              PWWS       95.77 132.82 90.44              79.36             104.32       1
                  TF     64.42 112.20 65.89             113.92              82.72       1
                  TB     39.22   64.99   40.08           39.70              49.11       1
                  IR     19.62   26.63   19.59           13.73              20.59       1
               DWB       23.22   36.43   24.10           23.70              28.11       1
Table 3
Average number of queries for configuration (c), configuration (d), and configuration (e). In the last case,
the queries on surrogate and target are considered separately.


technique with the most queries, and IR the fewest. In the second case, however, it can be seen that the proposed approach performs, on average, only one query to the target, as opposed to configuration (d). This result is justified by the inherent nature of the proposed approach, which (i) starts from the original test set of 𝑛 samples, (ii) tries to perturb each sample via the chosen attack strategy, and (iii) composes a new test set of 𝑛 samples consisting of both the texts that did not fool the surrogate and those that were perturbed. Finally, the generated test set is evaluated on the target, hence a single query per sample.
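
A minimal sketch of this configuration (e) evaluation follows, assuming a `surrogate_attack` callable that returns the perturbed text when the surrogate is fooled (and None otherwise) and a target pipeline that classifies raw texts; the returned value is the Accuracy Difference reported in Figure 4.

```python
from sklearn.metrics import accuracy_score

def configuration_e(surrogate_attack, test_texts, test_labels, target):
    # (i)-(iii): try to perturb each sample offline; keep the original text when the
    # surrogate could not be fooled, so the new test set still contains n samples.
    mixed = [surrogate_attack(t) or t for t in test_texts]

    # A single pass over the mixed set, i.e., one query per sample on the target;
    # the clean accuracy is computed once as the reference value.
    acc_clean = accuracy_score(test_labels, target.predict(test_texts))
    acc_adv = accuracy_score(test_labels, target.predict(mixed))
    return acc_clean - acc_adv          # Accuracy Difference of the target
```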


5. Conclusions
In this paper, we have presented a black-box approach for attacking machine learning models. Central to the proposed method is the introduction of a surrogate model placed between the attacker and the target model. This intermediary acts as a medium for transferring offline-crafted attacks with a minimal number of queries, enabling successful adversarial manipulation while minimizing resource consumption and potential disruption to operational systems.
   In the experimental phase, we leveraged various attack strategies introducing distinct perturbations, from the replacement of words with synonyms to nuanced character-level alterations. The latter can be mitigated by the spell-checking systems deployed on many platforms, while the former worked well and could be further extended by substituting antonyms instead of synonyms. Such an approach has the potential to fabricate news articles that deviate significantly from their original content, thereby supporting the generation of deceptive or counterfeit news while maintaining semantic correctness.
   For future investigations, the exploration or creation of alternative attack strategies is warranted. Furthermore, it should be acknowledged that the attacks encounter limitations, failing to perturb all the samples in our dataset; consequently, the performance of the attack on the target is reduced accordingly. As a general trend, the efficacy of the attack decreases as the fraction of perturbed samples decreases. Thus, for robust evaluations, comprehensive perturbation coverage is required.


References
 [1] F. Concone, A. De Paola, G. Lo Re, M. Morana, Twitter analysis for real-time malware discovery, in:
     2017 AEIT International Annual Conference, 2017, pp. 1–6. doi:10.23919/AEIT.2017.8240551.
 [2] J. A. Nasir, O. S. Khan, I. Varlamis, Fake news detection: A hybrid CNN-RNN based deep learning approach, International Journal of Information Management Data Insights 1 (2021) 100007.
 [3] A. Dharawat, I. Lourentzou, A. Morales, C. Zhai, Drink bleach or do what now? covid-hera:
     A study of risk-informed health decision making in the presence of covid-19 misinformation,
     Proceedings of the International AAAI Conference on Web and Social Media 16 (2022) 1218–1227.
 [4] S. Gaglio, G. Lo Re, M. Morana, Real-time detection of twitter social events from the user’s
     perspective, in: 2015 IEEE International Conference on Communications (ICC), 2015, pp. 1207–
     1212. doi:10.1109/ICC.2015.7248487.
 [5] F. Concone, G. Lo Re, M. Morana, S. K. Das, Spade: Multi-stage spam account detection for online
     social networks, IEEE Transactions on Dependable and Secure Computing 20 (2023) 3128–3143.
     doi:10.1109/TDSC.2022.3198830.
 [6] X. Zeng, A. S. Abumansour, A. Zubiaga, Automated fact-checking: A survey, Language and
     Linguistics Compass 15 (2021) e12438.
 [7] F. Concone, G. Lo Re, M. Morana, C. Ruocco, Assisted labeling for spam account detection on
     twitter, in: 2019 IEEE International Conference on Smart Computing (SMARTCOMP), 2019, pp.
     359–366. doi:10.1109/SMARTCOMP.2019.00073.
 [8] F. Concone, G. Lo Re, M. Morana, C. Ruocco, Twitter spam account detection by effective labeling,
     in: 2019 Italian Conference on CyberSecurity (ITASEC), volume 2315, CEUR-WS, 2019.
 [9] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data mining
     perspective, SIGKDD Explor. Newsl. 19 (2017) 22–36.
[10] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language
     in fake news and political fact-checking, in: Proceedings of the 2017 Conference on Empirical
     Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen,
     Denmark, 2017, pp. 2931–2937.
[11] B. Biggio, F. Roli, Wild patterns: Ten years after the rise of adversarial machine learning, in:
     Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security,
     CCS ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 2154–2156.
[12] H. Ali, M. S. Khan, A. AlGhadhban, M. Alazmi, A. Alzamil, K. Al-Utaibi, J. Qadir, All your fake
     detector are belong to us: evaluating adversarial robustness of fake-news detectors under black-box
     settings, IEEE Access 9 (2021) 81678–81692.
[13] S. Gaglio, G. Lo Re, M. Morana, A framework for real-time twitter data analysis, Computer Com-
     munications 73 (2016) 236–242. doi:https://doi.org/10.1016/j.comcom.2015.09.021,
     online Social Networks.
[14] B. Ghanem, S. P. Ponzetto, P. Rosso, F. Rangel, Fakeflow: Fake news detection by modeling the
     flow of affective information, in: Conference of the European Chapter of the Association for
     Computational Linguistics, 2021.
[15] Y. Zhu, Q. Sheng, J. Cao, Q. Nan, K. Shu, M. Wu, J. Wang, F. Zhuang, Memory-guided multi-view
     multi-domain fake news detection, IEEE Transactions on Knowledge and Data Engineering (2022).
[16] J. X. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, Y. Qi, Textattack: A framework for adversarial
     attacks, data augmentation, and adversarial training in nlp, 2020. arXiv:2005.05909.
[17] T. Le, S. Wang, D. Lee, Malcom: Generating malicious comments to attack neural fake news
     detection models, in: 2020 IEEE International Conference on Data Mining (ICDM), 2020, pp.
     282–291.
[18] L. Demetrio, B. Biggio, F. Roli, Practical attacks on machine learning: A case study on adversarial
     windows malware, IEEE Security & Privacy 20 (2022) 77–85.
[19] K. He, D. D. Kim, M. R. Asghar, Adversarial machine learning for network intrusion detection
     systems: A comprehensive survey, IEEE Communications Surveys & Tutorials (2023) 1–1.
[20] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, F. Tramèr, A. Prakash, T. Kohno, D. Song,
     Physical adversarial examples for object detectors, in: Proceedings of the 12th USENIX Conference
     on Offensive Technologies, WOOT’18, USENIX Association, USA, 2018, p. 1.
[21] M. Usama, M. Asim, S. Latif, J. Qadir, Ala-Al-Fuqaha, Generative adversarial networks for launch-
     ing and thwarting adversarial attacks on network intrusion detection systems, in: 2019 15th
     International Wireless Communications & Mobile Computing Conference (IWCMC), 2019, pp.
     78–83.
[22] S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, I. S. Kohane, Adversarial attacks on
     medical machine learning, Science 363 (2019) 1287–1289.
[23] S. Gaglio, A. Giammanco, G. Lo Re, M. Morana, Adversarial machine learning in e-health:
     Attacking a smart prescription system, in: AIxIA 2021 – Advances in Artificial Intelligence,
     Springer International Publishing, Cham, 2022, pp. 490–502.
[24] C. Koenders, J. Filla, N. Schneider, V. Woloszyn, How vulnerable are automatic fake news detection
     methods to adversarial attacks?, arXiv preprint arXiv:2107.07970 (2021).
[25] F. Concone, S. Gaglio, A. Giammanco, G. Lo Re, M. Morana, Adverspam: Adversarial spam
     account manipulation in online social networks, ACM Trans. Priv. Secur. 27 (2024). URL: https:
     //doi.org/10.1145/3643563. doi:10.1145/3643563.
[26] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, A. Swami, Practical black-box attacks
     against machine learning, in: Proceedings of the 2017 ACM on Asia Conference on Computer and
     Communications Security, ASIA CCS ’17, Association for Computing Machinery, New York, NY,
     USA, 2017, p. 506–519.
[27] S. Ren, Y. Deng, K. He, W. Che, Generating natural language adversarial examples through
     probability weighted word saliency, in: Proceedings of the 57th annual meeting of the association
     for computational linguistics, 2019, pp. 1085–1097.
[28] J. Li, S. Ji, T. Du, B. Li, T. Wang, Textbugger: Generating adversarial text against real-world
     applications, arXiv preprint arXiv:1812.05271 (2018).
[29] D. Jin, Z. Jin, J. T. Zhou, P. Szolovits, Is bert really robust? a strong baseline for natural language
     attack on text classification and entailment, in: Proceedings of the AAAI conference on artificial
     intelligence, volume 34, 2020, pp. 8018–8025.
[30] S. Feng, E. Wallace, A. Grissom II, M. Iyyer, P. Rodriguez, J. Boyd-Graber, Pathologies of neural
     models make interpretations difficult, arXiv preprint arXiv:1804.07781 (2018).
[31] J. Gao, J. Lanchantin, M. L. Soffa, Y. Qi, Black-box generation of adversarial text sequences to
     evade deep learning classifiers, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018,
     pp. 50–56.
[32] B. Bhutani, N. Rastogi, P. Sehgal, A. Purwar, Fake news detection using sentiment analysis, in:
     2019 twelfth international conference on contemporary computing (IC3), IEEE, 2019, pp. 1–5.
[33] T. Rasool, W. H. Butt, A. Shaukat, M. U. Akram, Multi-label fake news detection using multi-layered
     supervised learning, in: Proceedings of the 2019 11th international conference on computer and
     automation engineering, 2019, pp. 73–77.
[34] D. Choudhury, T. Acharjee, A novel approach to fake news detection in social networks us-
     ing genetic algorithm applying machine learning classifiers, Multimedia Tools and Appli-
     cations 82 (2023) 9029–9045. URL: https://doi.org/10.1007/s11042-022-12788-1. doi:10.1007/
     s11042-022-12788-1.