Gray-box Techniques for Adversarial Text Generation

Prithviraj Dasgupta, Joseph Collins and Anna Buhman∗

∗ P. Dasgupta and A. Buhman are with the Computer Science Dept., University of Nebraska at Omaha, USA. E-mail: {pdasgupta, abuhman}@unomaha.edu. J. Collins is with the Distributed Systems Section, Naval Research Laboratory, Washington D.C., USA. E-mail: joseph.collins@nrl.navy.mil

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October, 2018, published at http://ceur-ws.org

Abstract

We consider the problem of adversarial text generation in the context of cyber-security tasks such as email spam filtering and text classification for sentiment analysis on social media sites. In adversarial text generation, an adversary attempts to perturb valid text data to generate adversarial text such that the adversarial text ends up getting mis-classified by a machine classifier. Many existing techniques for perturbing text data use gradient-based or white-box methods, where the adversary observes the gradient of the loss function from the classifier for a given input sample, and uses this information to strategically select portions of the text to perturb. On the other hand, black-box methods, where the adversary does not have access to the gradient of the loss function from the classifier and has to probe the classifier with different input samples to generate successful adversarial samples, have been used less often for generating adversarial text. In this paper, we integrate black-box methods, where the adversary has a limited budget on the number of probes to the classifier, with white-box, gradient-based methods, and evaluate the effectiveness of the adversarially generated text in misleading a deep network classifier model.

Introduction

Machine learning-based systems are currently used widely for different cyber-security tasks including email spam filtering, network intrusion detection, sentiment analysis of posts on social media sites like Twitter, and validating the authenticity of news posts on social media sites. Most of these systems use supervised learning techniques where a learning algorithm called a classifier is trained to categorize data into multiple categories. For instance, an automated classifier for an email spam filter classifies incoming email into spam and non-spam categories. A critical vulnerability of classifier-based learning algorithms is that they are susceptible to adversarial attacks where a malicious adversary could send corrupted or poisoned data to the classifier that results in incorrect classification, or, for more calculated attacks, use corrupted data to alter the classification decisions of the classifier. Both of these attacks could compromise the operation of the classifier, making it behave in an unintended, possibly insecure and unsafe manner. As an example, a compromised email classifier could end up categorizing valid email messages as spam (false positives) while delivering spam email messages as non-spam (false negatives). An important defense against adversarial attacks is to build a model of the adversary, including the data that the adversary generates, and use this model as a virtual adversary to improve the learner's robustness to adversarial attacks. Towards this objective, in this paper, we investigate adversarial data generation techniques that could be used to model corrupted data generated by a virtual adversary. Our work focuses on adversarial text data generation, as many of the aforementioned cyber-security tasks operate mainly on text data.

Adversarial data generation techniques can be broadly classified into two categories called white-box and black-box. In white-box techniques, the adversary has access to information about the classifier, such as the parameters of the classifier's model, e.g., the weights in a deep neural network-based classifier, or to internal calculations of the classifier, such as the gradients of the loss function calculated by the classifier on a sample.
In contrast, in black-box techniques, the adversary does not have any information about the classifier and treats it like a black box, passing data samples as queries to the classifier and observing the classifier's output category or label for each sample. Additionally, in most practical black-box settings, the adversary can send only a finite number of queries to the classifier, due to the adversary's budget limitations in generating samples or due to restrictions in the number of queries accepted by the classifier. While white-box techniques such as gradient-based methods (Liang et al. 2018) have been proposed for generating adversarial text, black-box methods for generating adversarial text (Gao et al. 2018) are less investigated in the literature. In this paper, we describe gray-box techniques for generating adversarial text, in which gradient-based, white-box techniques are combined with budget-limited, black-box techniques for generating adversarial samples. We have evaluated the adversarial text data generated using our proposed technique on the DBPedia dataset, which contains short articles on 14 different categories extracted from Wikipedia, in terms of the difference from the original data used as a basis to generate the adversarial data, as well as the effectiveness of the adversarial data in fooling a classifier into making mis-classifications. Our results show that, using gray-box techniques, the divergence of the perturbed text from the original, unperturbed text increases with the amount of perturbation, and that more perturbation of the original text more often results in the perturbed text changing label with respect to the original text.

Related Work

Adversarial data generation techniques have received considerable attention from researchers in the recent past. The main concept in adversarial data generation is to add slight noise to valid data using a perturbation technique so that the corrupted data still appears legitimate to a human but gets mis-classified by a machine classifier. Most research on adversarial data generation has focused on image data, including gradient-based methods for perturbing valid data (Biggio et al. 2013), (Goodfellow, Shlens, and Szegedy 2014), (Papernot et al. 2016), recurrent neural networks (Gregor et al. 2015), and generative adversarial networks (GANs) (Goodfellow et al. 2014) and their extensions (Mirza and Osindero 2014), (Chen et al. 2016), (Arjovsky, Chintala, and Bottou 2017). However, rather than generating synthetic data towards misleading a machine classifier, the objective of GAN-enabled data generation has been to create synthetic data that looks convincing to humans.

Adversarial Text Generation. For adversarial image generation, image characteristics like RGB pixel values, HSV values, brightness, contrast, etc., are real-valued numbers that can be manipulated using mathematical operations to generate perturbed images that can fool a machine classifier while remaining imperceptible to humans. In contrast, for perturbing text, adding a real value to a numerical representation of a word, character, or token in an embedding space might result in a new word that is either nonsense or does not fit the context of the unperturbed, original text; in both cases, the adversarial text can be easily flagged by both a machine classifier and a human. To address this shortcoming, instead of directly using image perturbation techniques to generate adversarial text, researchers have proposed using auto-encoders (Bengio et al. 2013) and recurrent neural networks (RNN) with a long short term memory (LSTM) architecture to generate text as sequences of tokens (Mikolov et al. 2010), albeit with limitations when generating adversarial text that looks realistic to humans (Bengio et al. 2015), (Huszár 2015). Recently, GAN-based methods have been shown to be successful for adversarial text generation. In adversarial text generation GANs, a feedback signal (Yu et al. 2017), (Zhang et al. 2017), (Subramanian et al. 2017), such as a reward value within a reinforcement learning framework (Fedus, Goodfellow, and Dai 2018) or high-level features of the text identified by the classifier called leaked information (Guo et al. 2017), is evaluated by the discriminator or classifier and provided to the generator or adversary. The adversary treats this information as a feedback signal from the classifier about the quality of its adversarially generated text, and adapts future adversarial generations of words or phrases in the text towards improving the quality of its generated text. Additional methods for generating adversarial text are described in (Iyyer et al. 2018), (Jia and Liang 2017), (Shen et al. 2017). Following gradient-based techniques for image perturbation (Goodfellow, Shlens, and Szegedy 2014), (Ebrahimi et al. 2018) and (Liang et al. 2018) have used gradient-based methods to identify tokens in text that have the highest gradient and consequently the most influence in determining the text's label in the classifier.
These tokens are then strategically modified so that the text with modified tokens results in a different label when classified, as compared to the original text's label. Both of these methods require the gradient information for the unperturbed text when it is classified by the classifier in order to perturb the text. In contrast, in (Gao et al. 2018), Gao et al. use a similar idea, but instead of selecting words or characters to perturb based on classifier-calculated gradients, they first rank words in a piece of text to be perturbed based on a score assigned to each word's context, followed by changing the spelling of the highest-ranked words. Their technique is relevant when the gradient information from the classifier might not be readily available to perturb text, for example, in an online classification service where the classifier can be accessed only as a black box. Sethi and Kantardzic (Sethi and Kantardzic 2018) also described black-box methods for generating adversarial numerical and text data while considering budget limitations of the adversary in generating adversarial data. Our work is complementary to these gradient-based and black-box approaches, as it compares the effectiveness of generating adversarial text by integrating gradient-based methods with concepts used in budget-limited black-box methods.

Adversarial Text Perturbation

One of the main functions of an adversary in an adversarial learning setting is to generate, and possibly mislabel, adversarial text to fool the classifier. Adversarial text is usually generated by starting from valid text and changing certain words or characters in it. This process is also referred to as perturbing text. We define adversarial text perturbation using the following formalization: Let V denote a vocabulary of English words and X denote a corpus of English text. {X_i} ⊂ X denotes a dataset consisting of English text samples, where X_i denotes the i-th sample. Further, X_i = (X_i^(1), X_i^(2), ..., X_i^(W)), where X_i^(j) denotes the j-th word in X_i and W is the maximum number of words in a text sample. Each word X_i^(j) is represented by a feature vector X_i^(j) = f_{1:F} generated using some word embedding like word2Vec (Mikolov et al. 2013), Global Vector (GloVe) (Pennington, Socher, and Manning 2014), or fastText (Joulin et al. 2016), where F is the embedding space dimension. Each sample X_i belongs to a class with label l ∈ L, where L is a finite set of class labels. For notational convenience, we assume that l = y(X_i), where y : X → L is a relation that ascertains the ground truth label l for X_i. A classifier C is also used to determine a label for X_i; the classifier's output is given by y_C : X → L. We say that the classifier classifies X_i correctly only when y_C(X_i) = y(X_i). A valid example X_i is perturbed by altering a subset of its words {X_i^(j)} ⊆ X_i using a perturbation strategy π. The perturbation strategy π : X → X modifies the j-th word X_i^(j) to a word X̃_i^(j) ∈ V, 1 ≤ j ≤ W. Let n_π = |{X̃_i^(j)}| denote the number of words perturbed by the perturbation strategy π and X̃_{i,n_π} denote the text X_i after perturbing n_π words in it. Finally, let ∆ : X × X → [0, 1] denote a divergence measure between two pieces of text, with ∆(X_i, X_i) = 0. Within this formalization, the objective of the adversary is to determine a minimal perturbation n*_π to X_i satisfying:

    n_{\pi}^{*} = \arg\min_{n_{\pi}} \Delta(X_i, \tilde{X}_{i,n_{\pi}})
    \text{s.t. } y_C(\tilde{X}_{i,n_{\pi}}) \neq y(X_i), \text{ and } y_C(X_i) = y(X_i)    (1)

The objective function in Equation 1 finds the number of words to perturb that gives the minimum divergence between the original and perturbed text, while the two constraints respectively ensure that the classifier mis-classifies the perturbed text, X̃_{i,n_π}, giving it a different label than the ground truth y(X_i), but correctly classifies the original text X_i. In the next section, we describe the different perturbation strategies we have used to perturb valid text and generate adversarial text.
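Before turning to specific strategies, the following sketch illustrates the search implied by Equation 1 under a simplifying assumption that the divergence grows with the number of perturbed words, so the first label flip also minimizes the divergence. The helper names (perturb, divergence, classify) are hypothetical stand-ins for a perturbation strategy π, the divergence measure ∆, and the classifier y_C; the paper does not prescribe this particular search procedure.

```python
# A minimal sketch of the adversary's objective in Equation 1; all helper
# callables are assumptions supplied by the caller, not part of the paper.
from typing import Callable, List, Optional, Tuple

def minimal_perturbation(
    x: List[str],                                   # original text X_i as a word list
    y_true: str,                                    # ground-truth label y(X_i)
    classify: Callable[[List[str]], str],           # y_C: text -> label
    perturb: Callable[[List[str], int], List[str]], # pi: perturb n words of the text
    divergence: Callable[[List[str], List[str]], float],  # Delta in [0, 1]
    max_words: int = 30,
) -> Optional[Tuple[int, List[str], float]]:
    """Return the smallest n_pi that flips the classifier's label, with the
    perturbed text and its divergence from the original."""
    if classify(x) != y_true:
        return None  # constraint y_C(X_i) = y(X_i) is violated; skip this sample
    for n in range(1, max_words + 1):
        x_tilde = perturb(x, n)
        if classify(x_tilde) != y_true:             # constraint y_C(X~) != y(X_i)
            return n, x_tilde, divergence(x, x_tilde)
    return None                                     # no successful perturbation found
```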
Perturbation Methods for Text Data

An adversarial perturbation method adds a certain amount of noise to unperturbed data to generate perturbed data. There are two main questions in a perturbation method: how much noise to add, and where in the data to add the noise. For the first question, the two types of perturbation methods, white-box and black-box, both add a random, albeit small, amount of noise. But these two types of methods differ in their approach to answering the second question. Below, we describe the white-box and black-box perturbation methods, and propose a gray-box method that combines the advantages of the two.

White-box Gradient-based Perturbation. White-box methods first query the classifier with the unperturbed text and observe the gradient of the loss function of the classifier's output from the query for the different words in the queried text. Because the loss function expresses the difference of the classifier's determined label from the ground truth label for the queried text, the word in the text that corresponds to the most negative gradient, or change, of the loss function is most influential in changing the label for the text determined by the classifier. This word is selected as the word to perturb. Mathematically, the selected word, x*, having the most negative gradient, is given by:

    x^{*} = \arg\min_{j} \frac{\delta L}{\delta x^{(j)}}
          = \arg\min_{j} \sum_{k=1}^{F} \frac{\delta L}{\delta f_k} \frac{\delta f_k}{\delta x^{(j)}}
          = \arg\min_{j} \sum_{k=1}^{F} w_{j,k} \frac{\delta L}{\delta f_k},    (2)

where L is the loss function from the classifier, δL/δf_k is the gradient of the loss function with respect to the k-th feature of a word, and w_{j,k} is the weight in the embedding-layer neural network connecting the j-th word to its k-th feature in embedding space. x* is then perturbed by replacing it with x̃^(j), the word from the vocabulary that has the smallest positive gradient of the loss function and, consequently, the least influence in changing the label determined by the classifier. Mathematically, x̃^(j) is given by:

    \tilde{x}^{(j)} = \arg\min_{j} \sum_{k} w_{j,k} \frac{\delta L}{\delta f_k}, \quad \text{s.t. } \frac{\delta L}{\delta f_k} > 0    (3)

The general schematic of the white-box gradient-based perturbation is shown in Figure 1.

Figure 1: Example of white-box adversarial text generation via perturbing words using gradient information. (Top): The gradient of the loss function of the classifier from classifying the original text is available to the adversary. (Left): The adversary uses this information to select and perturb two words from the original text. (Bottom): The perturbed text, when classified, yields a different label ('Person') than the original text ('Company').
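The following numpy sketch illustrates one reading of the selection rules in Equations 2 and 3. The input arrays (the loss gradients with respect to the embedding features, the embedding weights, and the token ids) are assumed to be read out of the classifier by the white-box adversary; how they are obtained, and the exact handling of the positivity constraint in Equation 3, depend on the framework and are assumptions here.

```python
# A minimal sketch of the gradient-based word selection and replacement.
import numpy as np

def select_and_replace(grad_wrt_features: np.ndarray,  # shape (W, F): dL/df_k at each word position
                       embed_weights: np.ndarray,      # shape (V, F): w_{j,k} for each vocabulary word
                       token_ids: np.ndarray           # shape (W,): vocabulary index of each word in the text
                       ) -> tuple:
    """Pick the word position with the most negative loss gradient (Eq. 2) and
    a vocabulary word with the smallest positive gradient score (Eq. 3)."""
    # Eq. 2: gradient of the loss with respect to each word position, chained
    # through that word's embedding weights, summed over embedding features.
    word_grads = np.sum(embed_weights[token_ids] * grad_wrt_features, axis=1)  # shape (W,)
    pos_to_perturb = int(np.argmin(word_grads))        # most negative gradient

    # Eq. 3: score every vocabulary word against the feature gradient at the
    # selected position and keep the smallest strictly positive score.
    feature_grad = grad_wrt_features[pos_to_perturb]   # shape (F,)
    vocab_scores = embed_weights @ feature_grad        # shape (V,)
    positive = np.where(vocab_scores > 0)[0]
    replacement_id = int(positive[np.argmin(vocab_scores[positive])]) if positive.size else None
    return pos_to_perturb, replacement_id
```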
Black-box Perturbation. In contrast to white-box methods, black-box methods select the perturbation position randomly within the unperturbed text. Consequently, black-box methods could yield adversarial examples that are less effective in misguiding the classifier than white-box generated examples. Below, we describe two black-box methods used to perturb text.

• Black-box, Budget-Limited Perturbation (Anchor Points). The anchor points (AP) method, proposed in (Sethi and Kantardzic 2018), is prescribed when the adversary has a limited budget and can send fewer queries to the classifier. The adversary starts with a set of valid, unperturbed samples drawn randomly from the unperturbed dataset. AP adversarial data generation proceeds in two stages called exploration and exploitation, each with its respective budget. In the exploration stage, the adversary uses one unit of exploration budget to randomly select one sample from the unperturbed set and adds to it a perturbation vector drawn from a normal distribution N(0, R), where R ∈ [Rmin, Rmax) is a perturbation radius. The perturbed sample is sent as a query to the classifier and, if it is categorized with the same label as the original sample, it is retained as a candidate adversarial example. The perturbation radius is adjusted proportionally to the fraction of candidate adversarial examples, and these steps are repeated until the exploration budget is expended. During the exploitation phase, the adversary creates an adversarial example by generating a convex combination of a pair of randomly selected candidate adversarial examples created during the exploration phase. Generating each adversarial example consumes one unit of the adversary's exploitation budget (a simplified sketch of this exploration and exploitation loop is given after this list).

• Black-box, Budget-Lenient Perturbation (Reverse Engineering). The reverse engineering (RE) attack, also proposed in (Sethi and Kantardzic 2018), is again accomplished in two stages called exploration and exploitation. The adversary once again starts with a set of valid, unperturbed samples drawn randomly from the unperturbed dataset. In the exploration stage, the adversary trains its own stand-in classifier representing the real classifier. To get training data for its stand-in, it generates sample adversarial data by generating a random vector in a direction that is orthogonal to the difference vector between a pair of valid samples taken from the unperturbed set. The adversarial example is generated by adding the random vector and the average of the two valid samples used to generate the random vector. The adversarial example and its category, obtained by sending the example as a query to the classifier, are recorded as training data and used to train the adversary's stand-in classifier. In the exploitation stage, the stand-in classifier is used to generate samples that are predicted by it to produce a desired classification. The anchor points method described above could be used for the exploitation stage. The reader is referred to (Sethi and Kantardzic 2018) for further details about the AP and RE algorithms.
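The sketch below follows the AP description in the first bullet, operating on fixed-length embedding vectors rather than raw text. The radius-update rule, the budget handling, and the classifier interface are simplified assumptions; see (Sethi and Kantardzic 2018) for the full algorithm.

```python
# A minimal sketch of the anchor points (AP) exploration / exploitation loop.
import numpy as np

def anchor_points(seed_vecs, labels, classify, explore_budget=100, exploit_budget=100,
                  r_min=0.1, r_max=1.0, rng=None):
    """seed_vecs: (N, D) unperturbed samples in embedding space; labels: their
    labels; classify: maps a (D,) vector to a label."""
    rng = rng or np.random.default_rng()
    radius = r_min
    candidates, hits = [], 0
    for t in range(1, explore_budget + 1):          # exploration stage
        i = rng.integers(len(seed_vecs))
        noisy = seed_vecs[i] + rng.normal(0.0, radius, size=seed_vecs[i].shape)
        if classify(noisy) == labels[i]:            # label preserved: keep as candidate
            candidates.append(noisy)
            hits += 1
        # adjust the radius in proportion to the fraction of retained candidates
        radius = r_min + (r_max - r_min) * (hits / t)
    adversarial = []
    for _ in range(exploit_budget):                 # exploitation stage
        if len(candidates) < 2:
            break
        a, b = rng.choice(len(candidates), size=2, replace=False)
        lam = rng.uniform()
        adversarial.append(lam * candidates[a] + (1 - lam) * candidates[b])  # convex combination
    return adversarial
```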
Gray-box Perturbation. Despite their limitation of generating less effective adversarial text, black-box methods can generate a larger and more diverse set of adversarial examples with fewer queries or probes to the classifier than white-box methods. With this insight, we propose to combine white-box and black-box methods into a gray-box method to investigate perturbation methods that can generate effective yet diverse adversarial examples. In the gray-box method, instead of drawing unperturbed samples for the black-box methods randomly from the original dataset, we first use the white-box method to generate a small seed set of perturbed examples from samples that are randomly drawn from the original dataset. This seed set of perturbed examples is then used to create a larger set of perturbed examples using the AP and RE black-box methods.
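A minimal sketch of how the two stages of the gray-box pipeline could be glued together is shown below. The helper names (white_box_perturb, anchor_points, reverse_engineer, embed) refer to the sketches above or to assumed user-supplied callables; how the stages exchange data is an assumption, not a specification from the paper.

```python
# A minimal sketch of the gray-box pipeline: white-box seed set, then
# black-box expansion with AP and RE.
import numpy as np

def gray_box_generate(dataset, labels, classify, embed,
                      white_box_perturb, anchor_points, reverse_engineer,
                      seed_size=50, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(dataset), size=seed_size, replace=False)
    # Stage 1: white-box, gradient-guided perturbation of a small random subset.
    seed_texts = [white_box_perturb(dataset[i]) for i in idx]
    seed_vecs = np.stack([embed(t) for t in seed_texts])
    seed_labels = [labels[i] for i in idx]
    # Stage 2: black-box expansion of the seed set into a larger adversarial set.
    expanded = anchor_points(seed_vecs, seed_labels, classify)
    expanded += reverse_engineer(seed_vecs, seed_labels, classify)
    return expanded
```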
Experimental Evaluation

Word Embedding and Classifier Network. We have evaluated our proposed techniques on the DBPedia dataset. The dataset contains 700,000 short articles extracted from Wikipedia and categorized into 14 different categories. For generating word embeddings from English words we have used a widely used vector representation format called Word2Vec (Mikolov et al. 2013). Word2Vec trains a neural network on a vocabulary of 150,000 English words; each word in the vocabulary is given a unique one-hot identifier. Word2Vec gives a feature vector in a 300-dimension space for each word in the vocabulary. We have used a publicly available, pre-trained Word2Vec model for generating word embeddings for our experiments. Following recent research (Yu et al. 2017) that showed that the prefix of a long piece of text is important for determining its label, we have assumed each query is limited to the first 40 words in the text. Before sending a query text to the classifier, words in the text are given a unique integer identifier; words not in the vocabulary are given an id of zero. The classifier used for our experiments is based on (Zhang and Wallace 2017)¹. It uses a deep neural network with three convolutional layers with filter sizes of 3, 4, and 5, respectively. A rectified linear unit (ReLU) activation function and max pooling are used with each layer. The pooled outputs are combined by concatenating them, followed by a dropout layer with keep probability 0.5. The output of the dropout layer is a probability distribution indicating the confidence level of the classifier for the query text over the different data categories. The category with the maximum confidence level is output as the category of the query text. The loss function used for calculating gradients of the input is implemented using softmax cross entropy. The gradients are backpropagated through the classifier's network to calculate the gradients of the different features at the input. The feature gradients are again backpropagated through the Word2Vec network to calculate the gradients of the words, as shown in Figure 1.

¹ Code available at https://github.com/dennybritz/cnn-text-classification-tf
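The following is a Keras approximation of the classifier described above; the authors used the TensorFlow code referenced in the footnote. The vocabulary size, embedding dimension, 40-word sequence length, filter widths, dropout, and number of classes follow the text; the number of filters per convolutional branch and the optimizer are assumptions.

```python
# A hedged Keras sketch of a Zhang-and-Wallace-style sentence CNN.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, SEQ_LEN, NUM_CLASSES = 150_000, 300, 40, 14

def build_classifier(pretrained_word2vec=None):
    tokens = layers.Input(shape=(SEQ_LEN,), dtype="int32")   # word ids; id 0 = out of vocabulary
    embed = layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        weights=[pretrained_word2vec] if pretrained_word2vec is not None else None,
        trainable=False)(tokens)                              # frozen pre-trained Word2Vec vectors
    pooled = []
    for width in (3, 4, 5):                                   # three parallel convolutional branches
        conv = layers.Conv1D(128, width, activation="relu")(embed)   # 128 filters is an assumption
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    merged = layers.Concatenate()(pooled)
    dropped = layers.Dropout(0.5)(merged)                     # keep probability 0.5
    probs = layers.Dense(NUM_CLASSES, activation="softmax")(dropped)
    model = tf.keras.Model(tokens, probs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",     # softmax cross entropy
                  metrics=["accuracy"])
    return model
```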
For evaluating the effectiveness of the perturbation technique we have used the Bilingual Evaluation Understudy (BLEU) score (Papineni et al. 2002). BLEU is a widely used evaluation metric for automated machine translation of text and has recently been used to measure the difference between adversarially generated synthetic text and original text (Zhang, Xu, and Li 2017). It compares the similarity between a machine translation of a text and a professional human translation, without considering grammar or whether the text makes sense. We have used BLEU-4, which scores sets of four consecutive words, or 4-grams. When two pieces of text are identical their BLEU score is 1.0, and as the dissimilarity increases, the BLEU score approaches 0.
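A short sketch of the BLEU-4 comparison used above is given below, based on NLTK's sentence-level BLEU. The tokenization, the exact n-gram weighting, and the smoothing choice are assumptions rather than details specified in the paper.

```python
# A minimal sketch of the BLEU-4 divergence measure between original and
# perturbed text, using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(original: str, perturbed: str) -> float:
    """Return the 4-gram BLEU score of the perturbed text against the original."""
    reference = [original.lower().split()]   # BLEU expects a list of reference token lists
    hypothesis = perturbed.lower().split()
    return sentence_bleu(reference, hypothesis,
                         weights=(0.25, 0.25, 0.25, 0.25),           # cumulative 1- to 4-gram BLEU
                         smoothing_function=SmoothingFunction().method1)

# Identical texts give 1.0; increasingly different texts approach 0.
print(bleu4("the quick brown fox jumps", "the quick brown fox jumps"))  # 1.0
```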
In our first experiment, we analyzed the effect of the amount of perturbation of different text samples, measured in terms of BLEU score (x-axis), on the number of samples that are assigned the same label before and after perturbation (y-axis). Perturbed text was generated using the white-box gradient-based method, where the features of the words with the most negative gradients, calculated using Equation 2, were perturbed by a small random amount. The number of words to perturb was varied over {1, 10, 20, 30}. For each of these perturbation amounts, a batch of 1000 text samples was perturbed, and results were averaged over 4 batches. BLEU scores of the perturbed text were binned into intervals of 0.1 by rounding each BLEU score to the first decimal place. The results, shown in Figure 2, illustrate that as the BLEU score decreases (the perturbed text becomes more different from the original text), the fraction of samples that retain the same label before and after perturbation also decreases. This result appears intuitive because the more different a piece of text is from its original, unperturbed version, the less likely it is to retain the same label. However, for BLEU scores of 0.4 and lower, we observe that the fraction of samples retaining the same label as the original slightly increases when 10 or more words are perturbed. This is possibly due to the fact that when the perturbed text is very different from the original text, most of the perturbed words are out of context from the original text and the text appears as nonsense to a human reader. The machine classifier, however, is confounded into labeling the nonsensical, perturbed text with the same label as the original text, possibly due to one or more of the unperturbed words. Further investigation into this issue would lead to a better understanding of the degree of perturbation that converts text into rubbish for a human and possibly makes it unintelligible for the machine classifier as well.

Figure 2: Variation of the ratio of correct classifications for different values of BLEU score of the perturbed text.

For our next set of experiments, we analyzed the effect of the main algorithm parameters in the AP and RE algorithms, the minimum and maximum perturbation radii, Rmin and Rmax, on the amount of perturbation measured in terms of BLEU score, and on the fraction of samples that retain the same label as the original. Results are shown in Figures 3 through 5. We observe that for both the AP (Figure 3(a)) and RE (Figure 4(a)) algorithms, as the perturbation radius increases, the BLEU score reduces, which corroborates the fact that the degree of perturbation increases the divergence between the original and perturbed text. Correspondingly, the fraction of perturbed text that retains the same label as the original text generally decreases with the increase in perturbation radius for both AP and RE, as shown in Figure 3(b) and Figure 4(b), respectively. Figure 5 shows the BLEU score and the fraction of samples that retain the same label as the original versus the number of words perturbed using the white-box gradient-based approach, where the words with the highest negative gradient were replaced with the words with the smallest positive gradient calculated using Equations 2 and 3. As before, we observe that more perturbation results in lower BLEU scores, implying more divergent text after perturbation. More perturbation also reduces the fraction of perturbed samples that retain the same label as the original text.

Figure 3: Variation of BLEU score (left) and fraction of perturbed examples that have the same label as the unperturbed text (right) for different values of Rmin and Rmax using the anchor points (AP) algorithm.

Figure 4: Variation of BLEU score (left) and fraction of perturbed examples that have the same label as the unperturbed text (right) for different values of Rmin and Rmax using the reverse engineering (RE) algorithm.

Figure 5: Variation of BLEU score (left) and fraction of perturbed examples that have the same label as the unperturbed text (right) versus the number of words perturbed using the white-box gradient-based approach.

Conclusions and Future Work

In this paper we investigated gray-box techniques for generating adversarial text as a combination of white-box gradient-based techniques and black-box techniques. We validated the correct behavior of the gray-box techniques in generating perturbed text, while showing that more perturbation results in greater divergence as well as a greater degree of label changes in the perturbed text with respect to the original text. In the future, we plan to compare the proposed gray-box adversarial text generation methods with GAN-based and RNN-based synthetic text generation. While single words are usually treated as the basic lexical unit, it would be interesting to analyze how character-level perturbations and word sequence or n-gram-level perturbations affect adversarial text generation. We are also interested in getting a better understanding of how to determine a critical or minimal amount of perturbation that would be successful in generating adversarial text. We envisage that this work and its extensions will generate interesting results that would enable a better understanding of, and novel means for, adversarial text generation, and methods to make machine classifiers robust against adversarial text-based attacks.

References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Bengio, Y.; Yao, L.; Alain, G.; and Vincent, P. 2013. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, 899–907.

Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 1171–1179.

Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 387–402. Springer.

Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2172–2180.

Ebrahimi, J.; Rao, A.; Lowd, D.; and Dou, D. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, 31–36.

Fedus, W.; Goodfellow, I.; and Dai, A. M. 2018. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736.

Gao, J.; Lanchantin, J.; Soffa, M. L.; and Qi, Y. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. arXiv preprint arXiv:1801.04354.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, 2672–2680. Curran Associates, Inc.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. CoRR abs/1412.6572.

Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; and Wierstra, D. 2015. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 1462–1471. Lille, France: PMLR.
Guo, J.; Lu, S.; Cai, H.; Zhang, W.; Yu, Y.; and Wang, J. 2017. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624.

Huszár, F. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101.

Iyyer, M.; Wieting, J.; Gimpel, K.; and Zettlemoyer, L. 2018. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059.

Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.

Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Liang, B.; Li, H.; Su, M.; Bian, P.; Li, X.; and Shi, W. 2018. Deep text classification can be fooled. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 4208–4215. International Joint Conferences on Artificial Intelligence Organization.

Mikolov, T.; Karafiát, M.; Burget, L.; Černockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, 3111–3119. Curran Associates, Inc.

Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z. B.; and Swami, A. 2016. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, 372–387. IEEE.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, 311–318. Stroudsburg, PA, USA: Association for Computational Linguistics.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Association for Computational Linguistics.
Sethi, T. S., and Kantardzic, M. 2018. Data driven exploratory attacks on black box classifiers in adversarial domains. Neurocomputing 289:129–143.

Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, 6830–6841.

Subramanian, S.; Rajeswar, S.; Dutil, F.; Pal, C.; and Courville, A. 2017. Adversarial generation of natural language. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 241–251.

Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2852–2858.

Zhang, Y., and Wallace, B. 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 253–263.

Zhang, Y.; Gan, Z.; Fan, K.; Chen, Z.; Henao, R.; Shen, D.; and Carin, L. 2017. Adversarial feature matching for text generation. In International Conference on Machine Learning, 4006–4015.

Zhang, H.; Xu, T.; and Li, H. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 5908–5916.