Gray-box Techniques for Adversarial Text Generation

Prithviraj Dasgupta, Joseph Collins and Anna Buhman∗

∗ P. Dasgupta and A. Buhman are with the Computer Science Dept., University of Nebraska at Omaha, USA. E-mail: {pdasgupta, abuhman}@unomaha.edu. J. Collins is with the Distributed Systems Section, Naval Research Laboratory, Washington D.C., USA. E-mail: joseph.collins@nrl.navy.mil

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October, 2018, published at http://ceur-ws.org

Abstract

We consider the problem of adversarial text generation in the context of cyber-security tasks such as email spam filtering and text classification for sentiment analysis on social media sites. In adversarial text generation, an adversary attempts to perturb valid text data to generate adversarial text such that the adversarial text ends up getting mis-classified by a machine classifier. Many existing techniques for perturbing text data use gradient-based or white-box methods, where the adversary observes the gradient of the loss function from the classifier for a given input sample, and uses this information to strategically select portions of the text to perturb. On the other hand, black-box methods, where the adversary does not have access to the gradient of the loss function from the classifier and has to probe the classifier with different input samples to generate successful adversarial samples, have been used less often for generating adversarial text. In this paper, we integrate black-box methods, where the adversary has a limited budget on the number of probes to the classifier, with white-box, gradient-based methods, and evaluate the effectiveness of the adversarially generated text in misleading a deep network classifier model.

Introduction

Machine learning-based systems are currently used widely for different cyber-security tasks including email spam filtering, network intrusion detection, sentiment analysis of posts on social media sites like Twitter, and validating the authenticity of news posts on social media sites. Most of these systems use supervised learning techniques where a learning algorithm called a classifier is trained to categorize data into multiple categories. For instance, an automated classifier for an email spam filter classifies incoming email into spam and non-spam categories. A critical vulnerability of classifier-based learning algorithms is that they are susceptible to adversarial attacks where a malicious adversary could send corrupted or poisoned data to the classifier that results in incorrect classification, or, for more calculated attacks, use corrupted data to alter the classification decisions of the classifier. Both of these attacks could compromise the operation of the classifier, making it behave in an unintended, possibly insecure and unsafe manner. As an example, a compromised email classifier could end up categorizing valid email messages as spam (false positives) while delivering spam email messages as non-spam (false negatives). An important defense against adversarial attacks is to build a model of the adversary, including the data that the adversary generates, and use this model as a virtual adversary to improve the learner's robustness to adversarial attacks. Towards this objective, in this paper, we investigate adversarial data generation techniques that could be used to model corrupted data generated by a virtual adversary. Our work focuses on adversarial text data generation, as many of the aforementioned cyber-security tasks operate mainly on text data.

Adversarial data generation techniques can be broadly classified into two categories called white-box and black-box. In white-box techniques, the adversary has access to information about the classifier, such as the parameters of the classifier's model, e.g., the weights in a deep neural network-based classifier, or to internal calculations of the classifier, such as the gradients of the loss function calculated by the classifier on a sample.
In contrast, in black-box techniques, the adversary does not have any information about the classifier and treats it like a black box, passing data samples as queries to the classifier and observing the classifier's output category or label for each sample. Additionally, in most practical black-box settings, the adversary can send only a finite number of queries to the classifier, due to the adversary's budget limitations in generating samples or due to restrictions in the number of queries accepted by the classifier. While white-box techniques such as gradient-based methods (Liang et al. 2018) have been proposed for generating adversarial text, black-box methods for generating adversarial text (Gao et al. 2018) are less investigated in the literature. In this paper, we describe gray-box techniques for generating adversarial text, in which gradient-based, white-box techniques are combined with budget-limited, black-box techniques for generating adversarial samples. We have evaluated the adversarial text data generated using our proposed technique on the DBPedia dataset, which contains short articles on 14 different categories extracted from Wikipedia, in terms of the difference from the original data used as a basis to generate the adversarial data, as well as the effectiveness of the adversarial data in fooling a classifier into making mis-classifications. Our results show that, using gray-box techniques, the divergence of the perturbed text from the original, unperturbed text increases with the amount of perturbation, and that more perturbation of the original text more often results in the perturbed text changing label with respect to the original text.

Related Work

Adversarial data generation techniques have received considerable attention from researchers in the recent past. The main concept in adversarial data generation is to add slight noise to valid data using a perturbation technique so that the corrupted data still appears legitimate to a human but gets mis-classified by a machine classifier. Most research on adversarial data generation has focused on image data, including gradient-based methods for perturbing valid data (Biggio et al. 2013), (Goodfellow, Shlens, and Szegedy 2014), (Papernot et al. 2016), recurrent neural networks (Gregor et al. 2015), and generative adversarial networks (GANs) (Goodfellow et al. 2014) and their extensions (Mirza and Osindero 2014), (Chen et al. 2016), (Arjovsky, Chintala, and Bottou 2017). However, rather than generating synthetic data towards misleading a machine classifier, the objective of GAN-enabled data generation has been to create synthetic data that looks convincing to humans.

Adversarial Text Generation. For adversarial image generation, image characteristics like RGB pixel values, HSV values, brightness, contrast, etc., are real-valued numbers that can be manipulated using mathematical operations to generate perturbed images that can fool a machine classifier while remaining imperceptible to humans. In contrast, for perturbing text, adding a real value to a numerical representation of a word, character, or token in an embedding space might result in a new word that is either nonsense or does not fit the context of the unperturbed, original text; in both cases, the adversarial text can be easily flagged by both a machine classifier and a human. To address this shortcoming, instead of directly using image perturbation techniques to generate adversarial text, researchers have proposed using auto-encoders (Bengio et al. 2013) and recurrent neural networks (RNN) with a long short term memory (LSTM) architecture to generate text as sequences of tokens (Mikolov et al. 2010), albeit with limitations when generating adversarial text that looks realistic to humans (Bengio et al. 2015), (Huszár 2015). Recently, GAN-based methods have been shown to be successful for adversarial text generation. In adversarial text generation GANs, a feedback signal (Yu et al. 2017), (Zhang et al. 2017), (Subramanian et al. 2017), such as a reward value within a reinforcement learning framework (Fedus, Goodfellow, and Dai 2018) or high-level features of the text identified by the classifier called leaked information (Guo et al. 2017), is evaluated by the discriminator or classifier and provided to the generator or adversary. The adversary treats this information as a feedback signal from the classifier about the quality of its adversarially generated text, and adapts future adversarial generations of words or phrases in the text towards improving the quality of its generated text. Additional methods for generating adversarial text are described in (Iyyer et al. 2018), (Jia and Liang 2017), (Shen et al. 2017). Following gradient-based techniques for image perturbation (Goodfellow, Shlens, and Szegedy 2014), (Ebrahimi et al. 2018) and (Liang et al. 2018) have used gradient-based methods to identify tokens in text that have the highest gradient and consequently the most influence in determining the text's label in the classifier.
These tokens are then strategically modified so that the text with modified tokens results in a different label when classified, as compared to the original text's label. Both of these methods require the gradient information for the unperturbed text when it is classified by the classifier in order to perturb the text. In contrast, in (Gao et al. 2018), Gao et al. use a similar idea, but instead of selecting words or characters to perturb based on classifier-calculated gradients, they first rank words in a piece of text to be perturbed based on a score assigned to each word's context, followed by changing the spelling of the highest-ranked words. Their technique is relevant when the gradient information from the classifier might not be readily available to perturb text, for example, in an online classification service where the classifier can be accessed only as a black box. Sethi and Kantardzic (Sethi and Kantardzic 2018) also described black-box methods for generating adversarial numerical and text data while considering budget limitations of the adversary in generating adversarial data. Our work is complementary to these gradient-based and black-box approaches, as it compares the effectiveness of generating adversarial text by integrating gradient-based methods with concepts used in budget-limited black-box methods.

Adversarial Text Perturbation

One of the main functions of an adversary in an adversarial learning setting is to generate, and possibly mislabel, adversarial text to fool the classifier. Adversarial text is usually generated by starting from valid text and changing certain words or characters in it. This process is also referred to as perturbing text. We define adversarial text perturbation using the following formalization: Let V denote a vocabulary of English words and X denote a corpus of English text. {X_i} ⊂ X denotes a dataset consisting of English text samples, where X_i denotes the i-th sample. Further, X_i = (X_i^(1), X_i^(2), ..., X_i^(W)), where X_i^(j) denotes the j-th word in X_i and W is the maximum number of words in a text sample. Each word X_i^(j) is represented by a feature vector X_i^(j) = f_{1:F} generated using some word embedding like word2Vec (Mikolov et al. 2013), Global Vector (GloVe) (Pennington, Socher, and Manning 2014), or fastText (Joulin et al. 2016), where F is the embedding space dimension. Each sample X_i belongs to a class with label l ∈ L, where L is a finite set of class labels. For notational convenience, we assume that l = y(X_i), where y : X → L is a relation that ascertains the ground truth label l for X_i. A classifier C is also used to determine a label for X_i; the classifier's output is given by y_C : X → L. We say that the classifier classifies X_i correctly only when y_C(X_i) = y(X_i). A valid example X_i is perturbed by altering a subset of its words {X_i^(j)} ⊆ X_i using a perturbation strategy π. The perturbation strategy π : X → X modifies the j-th word X_i^(j) to a word X̃_i^(j) ∈ V, 1 ≤ j ≤ W. Let n_π = |{X̃_i^(j)}| denote the number of words perturbed by the perturbation strategy π and X̃_{i,n_π} denote the text X_i after perturbing n_π words in it. Finally, let ∆ : X × X → [0, 1] denote a divergence measure between two pieces of text, with ∆(X_i, X_i) = 0. Within this formalization, the objective of the adversary is to determine a minimal perturbation n*_π to X_i satisfying:

    n_{\pi}^{*} = \arg\min_{n_{\pi}} \Delta(X_i, \tilde{X}_{i,n_{\pi}})
    \text{s.t. } y_C(\tilde{X}_{i,n_{\pi}}) \neq y(X_i), \text{ and } y_C(X_i) = y(X_i)    (1)

The objective function in Equation 1 finds the number of words to perturb that gives the minimum divergence between the original and perturbed text, while the two constraints respectively ensure that the classifier mis-classifies the perturbed text, X̃_{i,n_π}, giving it a different label than the ground truth y(X_i), but correctly classifies the original text X_i. In the next section, we describe the different perturbation strategies we have used to perturb valid text and generate adversarial text.
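Before turning to specific strategies, the following sketch illustrates the search implied by Equation 1 under a simplifying assumption that the divergence grows with the number of perturbed words, so the first label flip also minimizes the divergence. The helper names (perturb, divergence, classify) are hypothetical stand-ins for a perturbation strategy π, the divergence measure ∆, and the classifier y_C; the paper does not prescribe this particular search procedure.

```python
# A minimal sketch of the adversary's objective in Equation 1; all helper
# callables are assumptions supplied by the caller, not part of the paper.
from typing import Callable, List, Optional, Tuple

def minimal_perturbation(
    x: List[str],                                   # original text X_i as a word list
    y_true: str,                                    # ground-truth label y(X_i)
    classify: Callable[[List[str]], str],           # y_C: text -> label
    perturb: Callable[[List[str], int], List[str]], # pi: perturb n words of the text
    divergence: Callable[[List[str], List[str]], float],  # Delta in [0, 1]
    max_words: int = 30,
) -> Optional[Tuple[int, List[str], float]]:
    """Return the smallest n_pi that flips the classifier's label, with the
    perturbed text and its divergence from the original."""
    if classify(x) != y_true:
        return None  # constraint y_C(X_i) = y(X_i) is violated; skip this sample
    for n in range(1, max_words + 1):
        x_tilde = perturb(x, n)
        if classify(x_tilde) != y_true:             # constraint y_C(X~) != y(X_i)
            return n, x_tilde, divergence(x, x_tilde)
    return None                                     # no successful perturbation found
```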
Perturbation Methods for Text Data

An adversarial perturbation method adds a certain amount of noise to unperturbed data to generate perturbed data. There are two main questions in a perturbation method: how much noise to add, and where in the data to add the noise. For the first question, the two types of perturbation methods, white-box and black-box, both add a random, albeit small, amount of noise. But these two types of methods differ in their approach to answering the second question. Below, we describe the white-box and black-box perturbation methods, and propose a gray-box method that combines the advantages of the two.

White-box Gradient-based Perturbation. White-box methods first query the classifier with the unperturbed text and observe the gradient of the loss function of the classifier's output from the query for the different words in the queried text. Because the loss function expresses the difference of the classifier's determined label from the ground truth label for the queried text, the word in the text that corresponds to the most negative gradient, or change, of the loss function is most influential in changing the label for the text determined by the classifier. This word is selected as the word to perturb. Mathematically, the selected word, x*, having the most negative gradient, is given by:

    x^{*} = \arg\min_{j} \frac{\delta L}{\delta x^{(j)}}
          = \arg\min_{j} \sum_{k=1}^{F} \frac{\delta L}{\delta f_k} \frac{\delta f_k}{\delta x^{(j)}}
          = \arg\min_{j} \sum_{k=1}^{F} w_{j,k} \frac{\delta L}{\delta f_k},    (2)

where L is the loss function from the classifier, δL/δf_k is the gradient of the loss function with respect to the k-th feature of a word, and w_{j,k} is the weight in the embedding-layer neural network connecting the j-th word to its k-th feature in embedding space. x* is then perturbed by replacing it with x̃^(j), the word from the vocabulary that has the smallest positive gradient of the loss function and, consequently, the least influence in changing the label determined by the classifier. Mathematically, x̃^(j) is given by:

    \tilde{x}^{(j)} = \arg\min_{j} \sum_{k} w_{j,k} \frac{\delta L}{\delta f_k}, \quad \text{s.t. } \frac{\delta L}{\delta f_k} > 0    (3)

The general schematic of the white-box gradient-based perturbation is shown in Figure 1.

Figure 1: Example of white-box adversarial text generation via perturbing words using gradient information. (Top): The gradient of the loss function of the classifier from classifying the original text is available to the adversary. (Left): The adversary uses this information to select and perturb two words from the original text. (Bottom): The perturbed text, when classified, yields a different label ('Person') than the original text ('Company').
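The following numpy sketch illustrates one reading of the selection rules in Equations 2 and 3. The input arrays (the loss gradients with respect to the embedding features, the embedding weights, and the token ids) are assumed to be read out of the classifier by the white-box adversary; how they are obtained, and the exact handling of the positivity constraint in Equation 3, depend on the framework and are assumptions here.

```python
# A minimal sketch of the gradient-based word selection and replacement.
import numpy as np

def select_and_replace(grad_wrt_features: np.ndarray,  # shape (W, F): dL/df_k at each word position
                       embed_weights: np.ndarray,      # shape (V, F): w_{j,k} for each vocabulary word
                       token_ids: np.ndarray           # shape (W,): vocabulary index of each word in the text
                       ) -> tuple:
    """Pick the word position with the most negative loss gradient (Eq. 2) and
    a vocabulary word with the smallest positive gradient score (Eq. 3)."""
    # Eq. 2: gradient of the loss with respect to each word position, chained
    # through that word's embedding weights, summed over embedding features.
    word_grads = np.sum(embed_weights[token_ids] * grad_wrt_features, axis=1)  # shape (W,)
    pos_to_perturb = int(np.argmin(word_grads))        # most negative gradient

    # Eq. 3: score every vocabulary word against the feature gradient at the
    # selected position and keep the smallest strictly positive score.
    feature_grad = grad_wrt_features[pos_to_perturb]   # shape (F,)
    vocab_scores = embed_weights @ feature_grad        # shape (V,)
    positive = np.where(vocab_scores > 0)[0]
    replacement_id = int(positive[np.argmin(vocab_scores[positive])]) if positive.size else None
    return pos_to_perturb, replacement_id
```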
Black-box Perturbation. In contrast to white-box methods, black-box methods select the perturbation position randomly within the unperturbed text. Consequently, black-box methods could yield adversarial examples that are less effective in misguiding the classifier than white-box generated examples. Below, we describe two black-box methods used to perturb text.

• Black-box, Budget-Limited Perturbation (Anchor Points). The anchor points (AP) method, proposed in (Sethi and Kantardzic 2018), is prescribed when the adversary has a limited budget and can send fewer queries to the classifier. The adversary starts with a set of valid, unperturbed samples drawn randomly from the unperturbed dataset. AP adversarial data generation proceeds in two stages called exploration and exploitation, each with its respective budget. In the exploration stage, the adversary uses one unit of exploration budget to randomly select one sample from the unperturbed set and adds to it a perturbation vector drawn from a normal distribution N(0, R), where R ∈ [Rmin, Rmax) is a perturbation radius. The perturbed sample is sent as a query to the classifier and, if it is categorized with the same label as the original sample, it is retained as a candidate adversarial example. The perturbation radius is adjusted proportionally to the fraction of candidate adversarial examples, and these steps are repeated until the exploration budget is expended. During the exploitation phase, the adversary creates an adversarial example by generating a convex combination of a pair of randomly selected candidate adversarial examples created during the exploration phase. Generating each adversarial example consumes one unit of the adversary's exploitation budget (a simplified sketch of this exploration and exploitation loop is given after this list).

• Black-box, Budget-Lenient Perturbation (Reverse Engineering). The reverse engineering (RE) attack, also proposed in (Sethi and Kantardzic 2018), is again accomplished in two stages called exploration and exploitation. The adversary once again starts with a set of valid, unperturbed samples drawn randomly from the unperturbed dataset. In the exploration stage, the adversary trains its own stand-in classifier representing the real classifier. To get training data for its stand-in, it generates sample adversarial data by generating a random vector in a direction that is orthogonal to the difference vector between a pair of valid samples taken from the unperturbed set. The adversarial example is generated by adding the random vector and the average of the two valid samples used to generate the random vector. The adversarial example and its category, obtained by sending the example as a query to the classifier, are recorded as training data and used to train the adversary's stand-in classifier. In the exploitation stage, the stand-in classifier is used to generate samples that are predicted by it to produce a desired classification. The anchor points method described above could be used for the exploitation stage. The reader is referred to (Sethi and Kantardzic 2018) for further details about the AP and RE algorithms.
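The sketch below follows the AP description in the first bullet, operating on fixed-length embedding vectors rather than raw text. The radius-update rule, the budget handling, and the classifier interface are simplified assumptions; see (Sethi and Kantardzic 2018) for the full algorithm.

```python
# A minimal sketch of the anchor points (AP) exploration / exploitation loop.
import numpy as np

def anchor_points(seed_vecs, labels, classify, explore_budget=100, exploit_budget=100,
                  r_min=0.1, r_max=1.0, rng=None):
    """seed_vecs: (N, D) unperturbed samples in embedding space; labels: their
    labels; classify: maps a (D,) vector to a label."""
    rng = rng or np.random.default_rng()
    radius = r_min
    candidates, hits = [], 0
    for t in range(1, explore_budget + 1):          # exploration stage
        i = rng.integers(len(seed_vecs))
        noisy = seed_vecs[i] + rng.normal(0.0, radius, size=seed_vecs[i].shape)
        if classify(noisy) == labels[i]:            # label preserved: keep as candidate
            candidates.append(noisy)
            hits += 1
        # adjust the radius in proportion to the fraction of retained candidates
        radius = r_min + (r_max - r_min) * (hits / t)
    adversarial = []
    for _ in range(exploit_budget):                 # exploitation stage
        if len(candidates) < 2:
            break
        a, b = rng.choice(len(candidates), size=2, replace=False)
        lam = rng.uniform()
        adversarial.append(lam * candidates[a] + (1 - lam) * candidates[b])  # convex combination
    return adversarial
```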
Gray-box Perturbation. Despite their limitation of generating less effective adversarial text, black-box methods can generate a larger and more diverse set of adversarial examples with fewer queries or probes to the classifier than white-box methods. With this insight, we propose to combine white-box and black-box methods into a gray-box method to investigate perturbation methods that can generate effective yet diverse adversarial examples. In the gray-box method, instead of drawing unperturbed samples for the black-box methods randomly from the original dataset, we first use the white-box method to generate a small seed set of perturbed examples from samples that are randomly drawn from the original dataset. This seed set of perturbed examples is then used to create a larger set of perturbed examples using the AP and RE black-box methods.
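A minimal sketch of how the two stages of the gray-box pipeline could be glued together is shown below. The helper names (white_box_perturb, anchor_points, reverse_engineer, embed) refer to the sketches above or to assumed user-supplied callables; how the stages exchange data is an assumption, not a specification from the paper.

```python
# A minimal sketch of the gray-box pipeline: white-box seed set, then
# black-box expansion with AP and RE.
import numpy as np

def gray_box_generate(dataset, labels, classify, embed,
                      white_box_perturb, anchor_points, reverse_engineer,
                      seed_size=50, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(dataset), size=seed_size, replace=False)
    # Stage 1: white-box, gradient-guided perturbation of a small random subset.
    seed_texts = [white_box_perturb(dataset[i]) for i in idx]
    seed_vecs = np.stack([embed(t) for t in seed_texts])
    seed_labels = [labels[i] for i in idx]
    # Stage 2: black-box expansion of the seed set into a larger adversarial set.
    expanded = anchor_points(seed_vecs, seed_labels, classify)
    expanded += reverse_engineer(seed_vecs, seed_labels, classify)
    return expanded
```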
Experimental Evaluation

Word Embedding and Classifier Network. We have evaluated our proposed techniques on the DBPedia dataset. The dataset contains 700,000 short articles extracted from Wikipedia and categorized into 14 different categories. For generating word embeddings from English words we have used a widely used vector representation format called Word2Vec (Mikolov et al. 2013). Word2Vec trains a neural network on a vocabulary of 150,000 English words; each word in the vocabulary is given a unique one-hot identifier. Word2Vec gives a feature vector in a 300-dimension space for each word in the vocabulary. We have used a publicly available, pre-trained Word2Vec model for generating word embeddings for our experiments. Following recent research (Yu et al. 2017) that showed that the prefix of a long piece of text is important for determining its label, we have assumed each query is limited to the first 40 words in the text. Before sending a query text to the classifier, words in the text are given a unique integer identifier; words not in the vocabulary are given an id of zero. The classifier used for our experiments is based on (Zhang and Wallace 2017)¹. It uses a deep neural network with three convolutional layers with filter sizes of 3, 4, and 5, respectively. A rectified linear unit (ReLU) activation function and max pooling are used with each layer. The pooled outputs are combined by concatenating them, followed by a dropout layer with keep probability 0.5. The output of the dropout layer is a probability distribution indicating the confidence level of the classifier for the query text over the different data categories. The category with the maximum confidence level is output as the category of the query text. The loss function used for calculating gradients of the input is implemented using softmax cross entropy. The gradients are backpropagated through the classifier's network to calculate the gradients of the different features at the input. The feature gradients are again backpropagated through the Word2Vec network to calculate the gradients of the words, as shown in Figure 1.

¹ Code available at https://github.com/dennybritz/cnn-text-classification-tf
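The following is a Keras approximation of the classifier described above; the authors used the TensorFlow code referenced in the footnote. The vocabulary size, embedding dimension, 40-word sequence length, filter widths, dropout, and number of classes follow the text; the number of filters per convolutional branch and the optimizer are assumptions.

```python
# A hedged Keras sketch of a Zhang-and-Wallace-style sentence CNN.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, SEQ_LEN, NUM_CLASSES = 150_000, 300, 40, 14

def build_classifier(pretrained_word2vec=None):
    tokens = layers.Input(shape=(SEQ_LEN,), dtype="int32")   # word ids; id 0 = out of vocabulary
    embed = layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        weights=[pretrained_word2vec] if pretrained_word2vec is not None else None,
        trainable=False)(tokens)                              # frozen pre-trained Word2Vec vectors
    pooled = []
    for width in (3, 4, 5):                                   # three parallel convolutional branches
        conv = layers.Conv1D(128, width, activation="relu")(embed)   # 128 filters is an assumption
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    merged = layers.Concatenate()(pooled)
    dropped = layers.Dropout(0.5)(merged)                     # keep probability 0.5
    probs = layers.Dense(NUM_CLASSES, activation="softmax")(dropped)
    model = tf.keras.Model(tokens, probs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",     # softmax cross entropy
                  metrics=["accuracy"])
    return model
```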
For evaluating the effectiveness of the perturbation technique we have used the Bilingual Evaluation Understudy (BLEU) score (Papineni et al. 2002). BLEU is a widely used evaluation metric for automated machine translation of text and has recently been used to measure the difference between adversarially generated synthetic text and original text (Zhang, Xu, and Li 2017). It compares the similarity between a machine translation of a text and a professional human translation, without considering grammar or whether the text makes sense. We have used BLEU-4, which scores sets of four consecutive words, or 4-grams. When two pieces of text are identical their BLEU score is 1.0, and as the dissimilarity increases, the BLEU score approaches 0.
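A short sketch of the BLEU-4 comparison used above is given below, based on NLTK's sentence-level BLEU. The tokenization, the exact n-gram weighting, and the smoothing choice are assumptions rather than details specified in the paper.

```python
# A minimal sketch of the BLEU-4 divergence measure between original and
# perturbed text, using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(original: str, perturbed: str) -> float:
    """Return the 4-gram BLEU score of the perturbed text against the original."""
    reference = [original.lower().split()]   # BLEU expects a list of reference token lists
    hypothesis = perturbed.lower().split()
    return sentence_bleu(reference, hypothesis,
                         weights=(0.25, 0.25, 0.25, 0.25),           # cumulative 1- to 4-gram BLEU
                         smoothing_function=SmoothingFunction().method1)

# Identical texts give 1.0; increasingly different texts approach 0.
print(bleu4("the quick brown fox jumps", "the quick brown fox jumps"))  # 1.0
```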
In our first experiment, we analyzed the effect of the amount of perturbation of different text samples, measured in terms of BLEU score (x-axis), on the number of samples that are assigned the same label before and after perturbation (y-axis). Perturbed text was generated using the white-box gradient-based method, where the features of the words with the most negative gradients, calculated using Equation 2, were perturbed by a small random amount. The number of words to perturb was varied over {1, 10, 20, 30}. For each of these perturbation amounts, a batch of 1000 text samples was perturbed, and results were averaged over 4 batches. BLEU scores of the perturbed text were binned into intervals of 0.1 by rounding each BLEU score to the first decimal place. The results, shown in Figure 2, illustrate that as the BLEU score decreases (the perturbed text becomes more different from the original text), the fraction of samples that retain the same label before and after perturbation also decreases. This result appears intuitive because the more different a piece of text is from its original, unperturbed version, the less likely it is to retain the same label. However, for BLEU scores of 0.4 and lower, we observe that the fraction of samples retaining the same label as the original slightly increases when 10 or more words are perturbed. This is possibly due to the fact that when the perturbed text is very different from the original text, most of the perturbed words are out of context from the original text and the text appears as nonsense to a human reader. The machine classifier, however, is confounded into labeling the nonsensical, perturbed text with the same label as the original text, possibly due to one or more of the unperturbed words. Further investigation into this issue would lead to a better understanding of the degree of perturbation that converts text into rubbish for a human and possibly makes it unintelligible for the machine classifier as well.

Figure 2: Variation of the ratio of correct classifications for different values of BLEU score of the perturbed text.

For our next set of experiments, we analyzed the effect of the main algorithm parameters in the AP and RE algorithms, the minimum and maximum perturbation radii, Rmin and Rmax, on the amount of perturbation measured in terms of BLEU score, and on the fraction of samples that retain the same label as the original. Results are shown in Figures 3 through 5. We observe that for both the AP (Figure 3(a)) and RE (Figure 4(a)) algorithms, as the perturbation radius increases, the BLEU score reduces, which corroborates the fact that the degree of perturbation increases the divergence between the original and perturbed text. Correspondingly, the fraction of perturbed text that retains the same label as the original text generally decreases with the increase in perturbation radius for both AP and RE, as shown in Figure 3(b) and Figure 4(b), respectively. Figure 5 shows the BLEU score and the fraction of samples that retain the same label as the original versus the number of words perturbed using the white-box gradient-based approach, where the words with the highest negative gradient were replaced with the words with the smallest positive gradient calculated using Equations 2 and 3. As before, we observe that more perturbation results in lower BLEU scores, implying more divergent text after perturbation. More perturbation also reduces the fraction of perturbed samples that retain the same label as the original text.

Figure 3: Variation of BLEU score (left) and fraction of perturbed examples that have the same label as the unperturbed text (right) for different values of Rmin and Rmax using the anchor points (AP) algorithm.

Figure 4: Variation of BLEU score (left) and fraction of perturbed examples that have the same label as the unperturbed text (right) for different values of Rmin and Rmax using the reverse engineering (RE) algorithm.

Figure 5: Variation of BLEU score (left) and fraction of perturbed examples that have the same label as the unperturbed text (right) versus the number of words perturbed using the white-box gradient-based approach.

Conclusions and Future Work

In this paper we investigated gray-box techniques for generating adversarial text as a combination of white-box gradient-based techniques and black-box techniques. We validated the correct behavior of the gray-box techniques in generating perturbed text, while showing that more perturbation results in greater divergence as well as a greater degree of label changes in the perturbed text with respect to the original text. In the future, we plan to compare the proposed gray-box adversarial text generation methods with GAN-based and RNN-based synthetic text generation. While single words are usually treated as the basic lexical unit, it would be interesting to analyze how character-level perturbations and word sequence or n-gram-level perturbations affect adversarial text generation. We are also interested in getting a better understanding of how to determine a critical or minimal amount of perturbation that would be successful in generating adversarial text. We envisage that this work and its extensions will generate interesting results that would enable a better understanding of, and novel means for, adversarial text generation, and methods to make machine classifiers robust against adversarial text-based attacks.

References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Bengio, Y.; Yao, L.; Alain, G.; and Vincent, P. 2013. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, 899–907.

Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 1171–1179.

Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 387–402. Springer.

Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172–2180.

Ebrahimi, J.; Rao, A.; Lowd, D.; and Dou, D. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, 31–36.

Fedus, W.; Goodfellow, I.; and Dai, A. M. 2018. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736.

Gao, J.; Lanchantin, J.; Soffa, M. L.; and Qi, Y. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. arXiv preprint arXiv:1801.04354.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, 2672–2680. Curran Associates, Inc.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. CoRR abs/1412.6572.

Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; and Wierstra, D. 2015. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 1462–1471. Lille, France: PMLR.
Guo, J.; Lu, S.; Cai, H.; Zhang, W.; Yu, Y.; and Wang, J. 2017. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624.

Huszár, F. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101.

Iyyer, M.; Wieting, J.; Gimpel, K.; and Zettlemoyer, L. 2018. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059.

Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.

Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Liang, B.; Li, H.; Su, M.; Bian, P.; Li, X.; and Shi, W. 2018. Deep text classification can be fooled. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 4208–4215. International Joint Conferences on Artificial Intelligence Organization.

Mikolov, T.; Karafiát, M.; Burget, L.; Černockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, 3111–3119. Curran Associates, Inc.

Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z. B.; and Swami, A. 2016. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, 372–387. IEEE.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, 311–318. Stroudsburg, PA, USA: Association for Computational Linguistics.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Association for Computational Linguistics.
Sethi, T. S., and Kantardzic, M. 2018. Data driven exploratory attacks on black box classifiers in adversarial domains. Neurocomputing 289:129–143.

Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, 6830–6841.

Subramanian, S.; Rajeswar, S.; Dutil, F.; Pal, C.; and Courville, A. 2017. Adversarial generation of natural language. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 241–251.

Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2852–2858.

Zhang, Y., and Wallace, B. 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 253–263.

Zhang, Y.; Gan, Z.; Fan, K.; Chen, Z.; Henao, R.; Shen, D.; and Carin, L. 2017. Adversarial feature matching for text generation. In International Conference on Machine Learning, 4006–4015.

Zhang, H.; Xu, T.; and Li, H. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 5908–5916.