=Paper=
{{Paper
|id=Vol-2269/FSS-18_paper_36
|storemode=property
|title=Adversarial Training on Word-Char Embedding
|pdfUrl=https://ceur-ws.org/Vol-2269/FSS-18_paper_36.pdf
|volume=Vol-2269
|authors=Abebaw Tadesse,Joseph B. Collins
|dblpUrl=https://dblp.org/rec/conf/aaaifs/TadesseC18
}}
==Adversarial Training on Word-Char Embedding==
Abebaw Tadesse∗ and Joseph B. Collins†

===Abstract===

In this work we propose a robust adversarial training model on hybrid word–char embeddings, as developed in (Rei, Crichton, and Pyysalo 2016), based on the recent work of (Miyato, Dai, and Goodfellow 2016). The proposed neural training model addresses the existing critical issues with word-only embeddings: poor vector representations for rare words, no representations for unseen words, and the lack of a proper mechanism to incorporate morpheme-level information that is not shared with the whole dictionary, which subsequently leads to poor-quality embeddings and hence low-quality examples and adversarial examples. We describe the proposed adversarial training model and architecture and address its implementation at the word–char level. Our preliminary results on a sequence labeling task on the First Certificate in English (FCE-PUBLIC) dataset (Yannakoudakis, Briscoe, and Medlock 2011) show an improvement in accuracy of adversarial (regularized) training on the word-char embedding over the baseline word-char embedding, as well as over the individual word-only, char-only, and concatenated embeddings, as expected. The preliminary results also show that perturbation at the word-char level yields better accuracy than individual word-only and char-only perturbations.

===Introduction===

In this article we investigate the impact of adversarial training (Miyato, Dai, and Goodfellow 2016) on hybrid word-char embeddings, as developed in (Rei, Crichton, and Pyysalo 2016), on the performance of Long Short-Term Memory (LSTM) based neural training models (Hochreiter and Schmidhuber 1997). In (Szegedy et al. 2013) and (Goodfellow, Shlens, and Szegedy 2014), it was shown that current neural models, particularly those that are linear or semi-linear with respect to the input, are vulnerable to adversarial examples, which are typically generated by simple linear but carefully tuned perturbations of the input dataset. Additionally, in (Goodfellow, Shlens, and Szegedy 2014) it was demonstrated that adversarial training improves model performance, at least in image classification tasks. In (Miyato, Dai, and Goodfellow 2016), the authors used adversarial and virtual adversarial (semi-supervised) training to improve text (RNN) models. Though word vector embeddings in general yield high-quality vector representations for frequently seen words, they tend to produce poor-quality word vectors for less frequent words and no embedding at all for previously unseen words (no out-of-vocabulary representation), and character-level information is not shared with the whole dictionary (Rei, Crichton, and Pyysalo 2016).
As a result, most of the time the generated example either does not change, because there is no neighbor near enough, or else the perturbed context is not adversarial enough. In this work, we attempt to address these issues through a hybrid implementation of word–char embedding under the settings described in (Rei, Crichton, and Pyysalo 2016), in order to develop a neural learning scheme for the generation and exploitation of adversarial examples in Natural Language Processing (NLP) contexts. We implement the proposed adversarial training model on an LSTM-based hybrid word-char embedding for a sequence labeling task on the FCE-PUBLIC (First Certificate in English) dataset (Yannakoudakis, Briscoe, and Medlock 2011). Section 2 gives a brief description of our proposed adversarial training model based on word–char embeddings, followed by preliminary experimental results and discussion in Section 3.

∗ A. Tadesse is with the Mathematics Dept., Langston University, Langston, Oklahoma, USA. E-mail: abebaw@langston.edu.

===Adversarial Training on the Word–Char Embedding Architecture===
† J. Collins is with the Information Technology Division, Naval Research Laboratory, Washington D.C., USA. E-mail: joseph.collins@nrl.navy.mil.

Copyright © by the papers' authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October, 2018, published at http://ceur-ws.org.

Word embeddings, in general, yield high-quality distributional vector representations for frequently seen words, with semantically and functionally similar words having similar representations. However, they tend to produce poor-quality word vectors for less frequent words and no embedding at all for previously unseen (Out-of-Vocabulary) words. Furthermore, there is no mechanism to exploit character-level patterns or sentiment-bearing words that are commonly unseen in sentiment datasets such as Twitter datasets (1), and no immunity to typos (Rei, Crichton, and Pyysalo 2016). Consequently, the quality of adversarial examples generated using word-level-only embeddings will inherit these weaknesses. In an attempt to address these critical issues we propose adversarial training on a bi-directional LSTM-based hybrid word-char architecture (Rei, Crichton, and Pyysalo 2016), as described in equation (1) below.

(1) Ashby, Charles, "TensorFlow tutorial: analyzing Tweets' sentiment with character-level LSTMs", Deep Learning Blog, https://charlesashby.github.io/2017/06/05/sentiment-analyssi-withchar-lstm/

In the word–char embedding setting (Rei, Crichton, and Pyysalo 2016), a given word w has dual vector representations, namely x_w and c_w, as modeled by word2vec and a bidirectional character-level LSTM embedding, respectively. The hybrid architecture has a gating mechanism, also referred to as attention, which allows the model to dynamically decide which level of information to tune into for each such word w in the dataset. This is achieved through two additional layers implementing the weight vector

    z = σ(W_z^(3) tanh(W_z^(1) x_w + W_z^(2) c_w)),    (1)

where W_z^(1), W_z^(2), W_z^(3) are the weight matrices for calculating z, and σ is the sigmoid function. The hybrid embedding vector x̃_w (x̃ in Figure 1) is then expressed as the z-weighted sum of x_w (x in Figure 1) and c_w (m in Figure 1):

    x̃_w = z ∗ x_w + (1 − z) ∗ c_w    (2)

(∗ denoting point-wise multiplication). The bidirectional LSTM realization of the character-based word embedding m (Figure 1) is given by m = tanh(W_m h∗), where h∗ = [h→_R ; h←_L], and h→_R and h←_L are the extreme right and left hidden vectors (respectively) from the two LSTM components, namely h→_i = LSTM(c_i, h→_{i−1}) and h←_i = LSTM(c_i, h←_{i+1}), i = 1, ..., length(w).

Figure 1: A bi-directional LSTM based hybrid word–char embedding (extracted from (Rei, Crichton, and Pyysalo 2016)).
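For concreteness, the gating combination of equations (1)–(2) can be sketched in a few lines of NumPy. The dimensions and random weights below are illustrative stand-ins only, not trained parameters of the model:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hybrid_embedding(x_w, c_w, Wz1, Wz2, Wz3):
    """Combine a word vector x_w and a char-based vector c_w via the
    attention gate of equations (1)-(2): z decides, per dimension, how
    much to rely on the word-level versus character-level embedding."""
    z = sigmoid(Wz3 @ np.tanh(Wz1 @ x_w + Wz2 @ c_w))  # equation (1)
    return z * x_w + (1.0 - z) * c_w                   # equation (2)

# Toy usage with random weights (dimension d chosen only for illustration).
rng = np.random.default_rng(0)
d = 4
x_w, c_w = rng.normal(size=d), rng.normal(size=d)
Wz1, Wz2, Wz3 = (rng.normal(size=(d, d)) for _ in range(3))
x_tilde = hybrid_embedding(x_w, c_w, Wz1, Wz2, Wz3)
```

Since each component of z lies in (0, 1), the result is a point-wise convex combination of the two embeddings, so every coordinate of x̃_w stays between the corresponding coordinates of x_w and c_w.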
Furthermore, the attention-based architecture requires that the learned features in the word vector x_w and the character-based vector c_w align. This is incorporated as an extra constraint on the loss function, encouraging agreement by optimizing

    J̃ = J + Σ_{k=1}^{T} g_k (1 − cos(c_{w_k}, x_{w_k})),    (3)

where J is the original embedding cost, J̃ is the modified cost function, and g_k is defined by g_k(w_k) = 0 for w_k an Out-of-Vocabulary word and g_k(w_k) = 1 otherwise, k = 1, ..., T (T is the size of the input sequence (text)).

Adversarial perturbation is then applied on x̃_w, as implemented in (Rei, Crichton, and Pyysalo 2016), to generate its adversarial counterpart x̃_w^adv = x̃_w + r̃_w^adv, where

    r̃_w^adv = ε ∇_{x̃_w} J(y|x̃_w, θ) / ||∇_{x̃_w} J(y|x̃_w, θ)||_2,

J(x̃_w, θ) is the loss function (the negative log-likelihood −log p(y|x, θ) for a classifier), θ is the parameter vector of the model (which should be viewed as a constant throughout the adversarial example generation process), and ε is the perturbation parameter. This needs to be done dynamically for each word vector x̃_w to generate the needed adversarial examples.

Figure 2: The proposed architecture for adversarial training on LSTM-based word–char embedding.
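The alignment constraint of equation (3) reduces to a cosine penalty masked by the out-of-vocabulary indicator g_k. A minimal sketch, with made-up vectors and vocabulary flags purely for illustration:

```python
import numpy as np

def alignment_penalty(word_vecs, char_vecs, in_vocab):
    """Extra cost term of equation (3): sum_k g_k * (1 - cos(c_wk, x_wk)),
    with g_k = 0 for out-of-vocabulary tokens and 1 otherwise, pushing the
    word-level and char-level vectors of in-vocabulary words to agree."""
    total = 0.0
    for x, c, g in zip(word_vecs, char_vecs, in_vocab):
        if g:  # g_k = 1 only for in-vocabulary words
            cos = (c @ x) / (np.linalg.norm(c) * np.linalg.norm(x))
            total += 1.0 - cos
    return total

# Identical vectors incur zero penalty; an OOV token is skipped entirely.
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
cs = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
penalty = alignment_penalty(xs, cs, in_vocab=[True, False])
```

Flipping the second flag to True would add a full unit of penalty, since the second word/char pair is orthogonal (cosine 0).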
The aggregated adversarial perturbation on the concatenated sequence s (the labeled input text) of the (normalized) embedding vectors [x_1, x_2, ..., x_T] is defined as

    r̃_s^adv = ε ∇_s J(y|s, θ) / ||∇_s J(y|s, θ)||_2,

and its corresponding adversarial loss is defined as

    J_adv(θ) = −(1/N) Σ_{n=1}^{N} J(y_n, s_n + r̃_n^adv, θ),    (4)

which ensures robustness to the specified adversarial perturbation. Here N denotes the number of labeled examples, and s_1, s_2, ..., s_N are the input text sequences with corresponding labels y_1, y_2, ..., y_N.

For virtual adversarial training (semi-supervised training), following the formalism in (Miyato et al. 2015), we define the virtual adversarial perturbation as

    r̃^vadv = ε ∇_{s+d} KL[p(·|s, θ) || p(·|s + d, θ)] / ||∇_{s+d} KL[p(·|s, θ) || p(·|s + d, θ)]||_2,    (5)
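The normalized-gradient perturbation used above can be illustrated with a toy binary logistic model, for which the input gradient of the negative log-likelihood has a closed form. The weights, input, and ε below are arbitrary illustrations, not the paper's trained model:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def adversarial_perturbation(x, y, w, eps):
    """L2-normalized gradient perturbation, as in the text:
    r_adv = eps * grad_x J / ||grad_x J||_2, here for a toy logistic
    model p(y=1|x) = sigmoid(w.x), whose gradient is (p - y) * w."""
    p = sigmoid(w @ x)
    grad = (p - y) * w  # d/dx of -log p(y|x) for the logistic loss
    return eps * grad / np.linalg.norm(grad)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.7])
r = adversarial_perturbation(x, y=1.0, w=w, eps=0.05)
```

By construction ||r||_2 = ε, and moving the input along the gradient direction increases the loss, which is exactly the "worst-case small perturbation" the adversarial loss of equation (4) averages over.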
where KL[p || q] denotes the KL divergence between distributions p and q. The associated virtual adversarial loss is then defined as

    J_vadv(θ) = (1/N′) Σ_{n=1}^{N′} KL[p(·|s_n, θ) || p(·|s_n + r̃_n^vadv, θ)],    (6)

where r̃_n^vadv is the adversarial perturbation for the nth unlabeled text and N′ is the number of such unlabeled texts (examples).

The crucial distinction between adversarial (supervised) and virtual adversarial (unsupervised) training is that the perturbation (equation 5) and the loss function (equation 6) do not depend on the input labels, which makes them applicable to unlabeled examples (semi-supervised adversarial training). Furthermore, to regularize the flow of adversarial examples we use the regularized adversarial loss of (Miyato, Dai, and Goodfellow 2016),

    J̃(x, θ) = α J(x, θ) + (1 − α) J(x + r̃^adv, θ)    (7)

(where 0 ≤ α ≤ 1 is the regularizing parameter), which effectively makes the generated examples resist, and keep up with, the current version of the model.
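The label-free criterion of equations (5)–(6) can be sketched directly: it only compares the model's own predictive distributions before and after the perturbation. The linear "model" and weight matrix below are stand-ins for illustration:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def kl(p, q):
    """KL[p || q] between two discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def virtual_adversarial_loss(logits_fn, s, r_vadv):
    """Virtual adversarial loss of equation (6) for a single unlabeled
    input: KL between the model's prediction at s and at s + r_vadv.
    No label appears anywhere, which is what makes the criterion usable
    for semi-supervised training. `logits_fn` stands in for the model."""
    return kl(softmax(logits_fn(s)), softmax(logits_fn(s + r_vadv)))

# Toy 3-class linear model; the weight matrix is illustrative only.
W = np.array([[1.0, -0.5], [0.2, 0.8], [-0.3, 0.4]])
logits_fn = lambda s: W @ s
s = np.array([0.5, -1.0])
loss = virtual_adversarial_loss(logits_fn, s, r_vadv=np.array([0.1, 0.0]))
```

The loss vanishes exactly when the perturbation is zero and grows as the perturbed prediction drifts; equation (7) then mixes such a perturbed loss with the clean loss via the convex weight α.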
The main contribution of the paper is the proposal and preliminary testing of an adversarial training architecture on hybrid word–char embedding, based on the existing frameworks (word–char embedding, and adversarial training for semi-supervised text classification) developed in (Rei, Crichton, and Pyysalo 2016) and (Miyato, Dai, and Goodfellow 2016). In the next section, we present the experimental settings and some preliminary results on a neural sequence labeling task on the FCE-PUBLIC dataset (Yannakoudakis, Briscoe, and Medlock 2011).

===Experiments on FCE–PUBLIC Dataset===

The FCE-PUBLIC (error detection) dataset (Yannakoudakis, Briscoe, and Medlock 2011; Rei and Yannakoudakis 2016) consists of 1141 examination scripts for training, 97 examination scripts for testing, 6 examination scripts for outlier experiments, and 80 randomly selected scripts for the development set. Tokens that have been annotated with an error tag are labeled as incorrect (i); otherwise, they are labeled as correct (c). The data is organized in the Conference on Natural Language Learning (CoNLL) tab-separated format: each line contains one token, followed by a tab and then the error label. In CoNLL format the dataset has 452833 training, 34599 development, and 41477 test tokens. The total parameter counts for the three representations are 2972052 (word-based), 3452052 (char concat), and 3152352 (char attention), of which only a small fraction of the embeddings is utilized at each iteration. We performed the proposed adversarial trainings for sequence labeling (bidirectional LSTM) on the word-char embedding on the FCE-PUBLIC dataset (2).

(2) We adopted the TensorFlow implementation of sequence labeling on the FCE-PUBLIC dataset available at https://github.com/marekrei/sequence-labeler.
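The tab-separated format described above can be read with a short parser. The two-sentence sample fragment is hypothetical, invented only to show the token/tab/label layout, and is not taken from the dataset:

```python
def parse_conll(text):
    """Parse the tab-separated CoNLL-style format described above: one
    token per line followed by a tab and its error label ('c' correct,
    'i' incorrect); blank lines separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

# Hypothetical two-sentence fragment, for illustration only.
sample = "I\tc\nhas\ti\na\tc\ndog\tc\n\nShe\tc\nruns\tc\n"
sents = parse_conll(sample)
```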
The preliminary experimental results are briefly shown in Table 1 and Table 2. Table 1 presents the performance of the regularized (α = 0.5) adversarial training on word-char embedding on the dataset. The F0.5-score metric was used as evaluation criterion, as established in earlier works (Rei, Crichton, and Pyysalo 2016). The preliminary results (Table 1) show an improvement in accuracy of adversarial (regularized) training on the word-char embedding over both the baseline word-char embedding and the individual word-only, char-only, and concatenated embeddings. Table 2 presents comparative accuracy results of the regularized adversarial training at the three representation levels (namely word-only, char-only, and word-char). These preliminary results show that perturbation at the word-char level yields better accuracy than individual word-only and char-only perturbations. Adversarial training at the word-char level (Table 1 and Table 2) also performs better than random perturbations, as expected.

Table 1: Performance of regularized adversarial training on word-char embedding on the FCE-PUBLIC dataset (F0.5-scores).

{| class="wikitable"
! Word embedding !! colspan="2" | Word-Only !! colspan="2" | Char-Only !! colspan="2" | Word-Char (concat.) !! colspan="2" | Word-Char (attention)
|-
! !! Devt. !! Test !! Devt. !! Test !! Devt. !! Test !! Devt. !! Test
|-
| Baseline || 49.57 || 46.91 || 41.45 || 37.50 || 51.88 || 48.24 || 50.08 || 47.78
|-
| Random Perturbation || 52.24 || 48.49 || 52.99 || 49.63 || 53.01 || 50.01 || 52.92 || 49.74
|-
| Adv. Training (Regularized) || 54.82 || 51.07 || 46.61 || 42.00 || 55.99 || 52.87 || 57.14 || 53.55
|}

Table 2: Comparisons of regularized adversarial trainings at various perturbation levels (modes) on the FCE-PUBLIC dataset (F0.5-scores).

{| class="wikitable"
! Perturbation mode !! colspan="2" | Word-Only !! colspan="2" | Char-Only !! colspan="2" | Word-Char (conc.) !! colspan="2" | Word-Char (attn.)
|-
! !! Devt. !! Test !! Devt. !! Test !! Devt. !! Test !! Devt. !! Test
|-
| Random Perturbation || 53.70 || 50.33 || 53.07 || 49.74 || 53.01 || 50.01 || 52.92 || 49.74
|-
| Adv. Training (Regularized) || 52.79 || 49.15 || 53.58 || 49.30 || 55.99 || 52.87 || 57.14 || 53.55
|}
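The F0.5 metric used in the tables weighs precision more heavily than recall (β = 0.5). A minimal implementation; the true-positive/false-positive/false-negative counts below are invented for illustration:

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F_beta score from raw counts; beta = 0.5 emphasizes precision
    over recall, the criterion used for FCE error detection."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts only (not results from the paper).
score = f_beta(tp=50, fp=20, fn=30)
```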
===Conclusion===

This work seeks to develop an improved adversarial training model acting on word–char embeddings. It is well known that word-only and char-only embeddings have major drawbacks in handling rare/unseen words and character-level information, which subsequently leads to poor representation of valid examples and hence of adversarial examples. The proposed adversarial training model is intended to overcome these challenges by applying the adversarial perturbation on word–char embeddings. It is envisioned that the proposed model, along with adversarial regularization (i.e., fine-tuning the parameter α), will bring significant improvements over the existing word-only/char-only adversarial training architectures. We performed some preliminary numerical experiments on the impact of regularized adversarial training on word-char embedding on a neural sequence labeling task on the FCE-PUBLIC dataset. Our preliminary results show an improvement in accuracy of adversarial (regularized) training on the word-char embedding over both the baseline word-char embedding and the individual word-only, char-only, and concatenated embeddings. These preliminary results also show that perturbation at the word-char level yields better accuracy than individual word-only and char-only perturbations. Further testing of the model needs to be performed on several representative neural sequence labeling and text classification tasks and on various datasets.

===Acknowledgment===

We would like to thank Dr. Prithviraj Dasgupta, Dr. Ira S. Moskowitz, and Espiritu Hugo for their invaluable comments, suggestions, and technical help during the progress of the research and the development of the paper.

===References===

* Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
* Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
* Miyato, T.; Maeda, S.-i.; Koyama, M.; Nakae, K.; and Ishii, S. 2015. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677.
* Miyato, T.; Dai, A. M.; and Goodfellow, I. 2016. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725.
* Rei, M., and Yannakoudakis, H. 2016. Compositional sequence labeling models for error detection in learner writing. arXiv preprint arXiv:1607.06153.
* Rei, M.; Crichton, G. K.; and Pyysalo, S. 2016. Attending to characters in neural sequence labeling models. arXiv preprint arXiv:1611.04361.
* Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
* Yannakoudakis, H.; Briscoe, T.; and Medlock, B. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, 180–189. Association for Computational Linguistics.