=Paper= {{Paper |id=Vol-2269/FSS-18_paper_36 |storemode=property |title=Adversarial Training on Word-Char Embedding |pdfUrl=https://ceur-ws.org/Vol-2269/FSS-18_paper_36.pdf |volume=Vol-2269 |authors=Abebaw Tadesse,Joseph B. Collins |dblpUrl=https://dblp.org/rec/conf/aaaifs/TadesseC18 }} ==Adversarial Training on Word-Char Embedding== https://ceur-ws.org/Vol-2269/FSS-18_paper_36.pdf
                            Adversarial Training on Word–Char Embedding


                                         Abebaw Tadesse∗ and Joseph B. Collins†




Abstract

In this work we propose a robust adversarial training model on hybrid word–char embeddings, as developed in (Rei, Crichton, and Pyysalo 2016), based on the recent work of (Miyato, Dai, and Goodfellow 2016). The proposed neural training model addresses the critical issues with word-only embeddings, which include: poor vector representations for rare words, no representations for unseen words, and the lack of a proper mechanism to incorporate morpheme-level information that is not shared with the whole dictionary; these issues subsequently lead to poor quality embeddings and hence to low quality examples and adversarial examples. We present a description of the proposed adversarial training model/architecture and address its implementation at the word–char level. Our preliminary results on a sequence labeling task on the First Certificate in English (FCE-PUBLIC) dataset (Yannakoudakis, Briscoe, and Medlock 2011) show an improvement in accuracy of adversarial (regularized) training on the word-char embedding over the baseline word-char embedding, as well as over the individual word-only, char-only, and concatenated embeddings, as expected. The preliminary results also show that perturbation at the word-char level yields better accuracy than individual word-only and char-only perturbations.

Introduction

In this article we investigate the impact of adversarial training (Miyato, Dai, and Goodfellow 2016) on hybrid word-char embeddings, as developed in (Rei, Crichton, and Pyysalo 2016), on the performance of Long Short-Term Memory (LSTM) based neural training models (Hochreiter and Schmidhuber 1997). In (Szegedy et al. 2013) and (Goodfellow, Shlens, and Szegedy 2014), it was shown that current neural models, particularly those that are linear or semi-linear w.r.t. the input, are vulnerable to adversarial examples, which are typically generated by simple linear, but carefully tuned, perturbations of the input dataset. Additionally, in (Goodfellow, Shlens, and Szegedy 2014), it was demonstrated that adversarial training improves model performance, at least in image classification tasks. In (Miyato, Dai, and Goodfellow 2016), the authors used adversarial and virtual adversarial (semi-supervised) training to improve text RNN models. Though word vector embeddings in general yield high quality vector representations for frequently seen words, they tend to produce poor quality word vectors for less frequent words and no embedding at all for previously unseen words (out-of-vocabulary words), and character-level information is not shared with the whole dictionary (Rei, Crichton, and Pyysalo 2016). As a result, most of the time the generated example either does not change, because there is no neighbor near enough, or else the perturbed context is not adversarial enough. In this work, we attempt to address these issues through a hybrid implementation of word–char embedding under the settings described in (Rei, Crichton, and Pyysalo 2016), to develop a neural learning scheme for the generation and exploitation of adversarial examples in Natural Language Processing (NLP) contexts. We implement the proposed adversarial training model on an LSTM-based hybrid word-char embedding for a sequence labeling task on the FCE-PUBLIC (First Certificate in English) dataset (Yannakoudakis, Briscoe, and Medlock 2011). Section 2 gives a brief description of our proposed adversarial training model based on word–char embeddings, followed by preliminary experimental results and discussion in Section 3.

∗ A. Tadesse is with the Mathematics Dept., Langston University, Langston, Oklahoma, USA. E-mail: abebaw@langston.edu.
† J. Collins is with the Information Technology Division, Naval Research Laboratory, Washington D.C., USA. E-mail: joseph.collins@nrl.navy.mil
Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October, 2018, published at http://ceur-ws.org

Adversarial Training on the Word–Char Embedding Architecture

Word embeddings in general yield high quality distributional vector representations for frequently seen words, with semantically and functionally similar words having similar representations. However, they tend to produce poor quality word vectors for less frequent words and no embedding at all for previously unseen words (out-of-vocabulary words). Furthermore, there is no mechanism to exploit character-level patterns; sentiment-bearing words are commonly unseen words in sentiment datasets such as Twitter datasets¹; and there is no immunity to typos (Rei, Crichton, and Pyysalo 2016). Consequently, the quality of adversarial examples generated using word-level-only embeddings will inherit these weaknesses. In an attempt to address these critical issues, we propose adversarial training on a bi-directional LSTM-based hybrid word-char architecture (Rei, Crichton, and Pyysalo 2016), as described in Equation 1 below.

¹ Ashby, Charles, TensorFlow tutorial: analyzing Tweet's sentiment with character-level LSTMs, Deep Learning Blog, https://charlesashby.github.io/2017/06/05/sentiment-analyssi-withchar-lstm/

Figure 1: A bi-directional LSTM based hybrid word–char embedding (extracted from (Rei, Crichton, and Pyysalo 2016)).

In the word–char embedding setting (Rei, Crichton, and Pyysalo 2016), a given word $w$ has dual vector representations, namely $x_w$ and $c_w$, as modeled by word2vec and by the bidirectional character-level LSTM embedding, respectively. The hybrid architecture has a gating mechanism, also referred to as attention, which allows the model to dynamically decide which level of information to tune into for each such word $w$ in the dataset. This is achieved through two additional layers implementing the weight vector

$$z = \sigma\big(W_z^{(3)} \tanh(W_z^{(1)} x_w + W_z^{(2)} c_w)\big), \qquad (1)$$

where $W_z^{(1)}$, $W_z^{(2)}$, $W_z^{(3)}$ are the weight matrices for calculating $z$, and $\sigma$ is the sigmoid function. The hybrid embedding vector $\tilde{x}_w$ ($\tilde{x}$ in Figure 1) is then expressed as the $z$-weighted sum of $x_w$ ($x$ in Figure 1) and $c_w$ ($m$ in Figure 1):

$$\tilde{x}_w = z \ast x_w + (1 - z) \ast c_w \qquad (2)$$

($\ast$ denotes point-wise multiplication). The bidirectional LSTM realization of the character-based word embedding $m$ (Figure 1) is given by $m = \tanh(W_m h^*)$, where $h^* = [\overrightarrow{h^*}; \overleftarrow{h^*}]$ concatenates the final hidden vectors of the two LSTM components, namely $\overrightarrow{h_i} = \mathrm{LSTM}(c_i, \overrightarrow{h_{i-1}})$ and $\overleftarrow{h_i} = \mathrm{LSTM}(c_i, \overleftarrow{h_{i+1}})$, $i = 1, \ldots, \mathrm{length}(w)$. Furthermore, the attention-based architecture requires that the learned features in the two word vectors $x_w$ and $c_w$ align. This is incorporated as an extra constraint on the loss function to encourage agreement, by optimizing

$$\tilde{J} = J + \sum_{k=1}^{T} g_k \big(1 - \cos(c_{w_k}, x_{w_k})\big), \qquad (3)$$

where $J$ is the original embedding cost, $\tilde{J}$ is the modified cost function, and $g_k$ is defined by $g_k(w_k) = 0$ if $w_k$ is out of vocabulary (OOV) and $g_k(w_k) = 1$ otherwise, $k = 1, \ldots, T$ ($T$ is the length of the input sequence (text)).

Figure 2: The proposed architecture for adversarial training on LSTM-based word–char embedding.

Adversarial perturbation is then applied on $\tilde{x}_w$, as implemented in (Miyato, Dai, and Goodfellow 2016), to generate its adversarial counterpart $\tilde{x}_w^{adv} = \tilde{x}_w + \tilde{r}_w^{adv}$, where

$$\tilde{r}_w^{adv} = \epsilon \, \frac{\nabla_{\tilde{x}_w} J(y \mid \tilde{x}_w, \theta)}{\|\nabla_{\tilde{x}_w} J(y \mid \tilde{x}_w, \theta)\|_2},$$

$J(y \mid \tilde{x}_w, \theta)$ is the loss function (the negative log-likelihood $-\log p(y \mid x, \theta)$ for a classifier), $\theta$ is the parameter vector of the model (which should be viewed as a constant throughout the adversarial example generation process), and $\epsilon$ is the perturbation parameter. This needs to be done dynamically for each word vector $\tilde{x}_w$ to generate the needed adversarial examples. The aggregated adversarial perturbation on a concatenated sequence $s$ (a labeled input text) of the (normalized) embedding vectors $[x_1, x_2, \ldots, x_T]$ is defined as

$$\tilde{r}_s^{adv} = \epsilon \, \frac{\nabla_s J(y \mid s, \theta)}{\|\nabla_s J(y \mid s, \theta)\|_2},$$

and its corresponding adversarial loss is defined as

$$J_{adv}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} J(y_n, s_n + \tilde{r}_n^{adv}, \theta), \qquad (4)$$

which will ensure robustness to the specified adversarial perturbation. Here $N$ denotes the number of labeled examples, and $s_1, s_2, \ldots, s_N$ are the input texts with corresponding labels $y_1, y_2, \ldots, y_N$. For virtual adversarial (semi-supervised) training, following the formalism in (Miyato et al. 2015), we define the virtual adversarial perturbation as

$$\tilde{r}^{vadv} = \epsilon \, \frac{\nabla_{s+d} \, \mathrm{KL}\big[p(\cdot \mid s, \theta) \,\|\, p(\cdot \mid s + d, \theta)\big]}{\big\|\nabla_{s+d} \, \mathrm{KL}\big[p(\cdot \mid s, \theta) \,\|\, p(\cdot \mid s + d, \theta)\big]\big\|_2}, \qquad (5)$$

where $\mathrm{KL}[p \,\|\, q]$ denotes the KL divergence between the distributions $p$ and $q$.
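The gating of Equations (1)–(2) and the normalized adversarial perturbation can be sketched in a few lines of plain Python. This is an illustrative toy, not the TensorFlow implementation used in the experiments; the weight matrices, embedding vectors, and the gradient values are made up for the example.

```python
import math

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-t)) for t in v]

def hybrid_embedding(x_w, c_w, W1, W2, W3):
    """Eqs. (1)-(2): gate z picks, per dimension, between the word-level
    vector x_w and the character-composed vector c_w."""
    pre = [a + b for a, b in zip(matvec(W1, x_w), matvec(W2, c_w))]
    z = sigmoid(matvec(W3, [math.tanh(t) for t in pre]))   # Eq. (1)
    return [zi * xi + (1.0 - zi) * ci                      # Eq. (2), point-wise
            for zi, xi, ci in zip(z, x_w, c_w)]

def adversarial_perturbation(grad, eps):
    """eps * grad / ||grad||_2: the normalized perturbation applied to
    the hybrid embedding (grad stands in for dJ/dx_tilde, taken as given)."""
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0      # guard against ||g|| = 0
    return [eps * g / norm for g in grad]

# Toy usage: 2-dimensional embeddings, identity weight matrices.
x_w = [0.5, -1.0]                  # word2vec-style vector (illustrative values)
c_w = [0.2, 0.3]                   # char-LSTM-composed vector (illustrative values)
I2 = [[1.0, 0.0], [0.0, 1.0]]
x_tilde = hybrid_embedding(x_w, c_w, I2, I2, I2)
r_adv = adversarial_perturbation([3.0, 4.0], eps=0.1)
x_adv = [a + b for a, b in zip(x_tilde, r_adv)]  # adversarial counterpart of x_tilde
```

Because the gate $z$ lies in $(0, 1)$, each coordinate of $\tilde{x}_w$ is a convex combination of the corresponding coordinates of $x_w$ and $c_w$, and $\|\tilde{r}^{adv}\|_2 = \epsilon$, so the perturbation budget is controlled entirely by $\epsilon$.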
The associated virtual adversarial loss is then defined by

$$J_{vadv}(\theta) = \frac{1}{N'} \sum_{n=1}^{N'} \mathrm{KL}\big[p(\cdot \mid s_n, \theta) \,\|\, p(\cdot \mid s_n + \tilde{r}_n^{vadv}, \theta)\big], \qquad (6)$$

where $\tilde{r}_n^{vadv}$ is the virtual adversarial perturbation for the $n$-th (unlabeled) text and $N'$ is the number of such unlabeled texts (examples). The crucial distinction between adversarial (supervised) and virtual adversarial (unsupervised) training is that the perturbation (Equation 5) and the loss function (Equation 6) do not depend on the input labels, which makes them applicable to unlabeled examples (semi-supervised adversarial training). Furthermore, to regularize the flow of adversarial examples we use the regularized adversarial loss of (Miyato, Dai, and Goodfellow 2016),

$$\tilde{J}(x, \theta) = \alpha J(x, \theta) + (1 - \alpha) J(x + \tilde{r}_{adv}, \theta) \qquad (7)$$

(where $0 \le \alpha \le 1$ is the regularization parameter), which will effectively make the adversarial examples resist, and keep up with, the current version of the model. The main contribution of the paper is the proposal and preliminary testing of the adversarial training architecture on hybrid word–char embeddings, based on the existing frameworks (word–char embedding, and adversarial training for semi-supervised text classification) as developed in (Rei, Crichton, and Pyysalo 2016) and (Miyato, Dai, and Goodfellow 2016). In the next section, we present the experimental settings and some preliminary results on a neural sequence labeling task on the FCE-PUBLIC dataset (Yannakoudakis, Briscoe, and Medlock 2011).

Experiments on FCE-PUBLIC Dataset

The FCE-PUBLIC (error detection) dataset (Yannakoudakis, Briscoe, and Medlock 2011; Rei and Yannakoudakis 2016) consists of 1141 examination scripts for training, 97 examination scripts for testing, 6 examination scripts for outlier experiments, and 80 randomly selected scripts for the development set. Tokens that have been annotated with an error tag are labeled as incorrect (i); otherwise, they are labeled as correct (c). The data is organized in the Conference on Natural Language Learning (CoNLL) tab-separated format: each line contains one token, followed by a tab and then the error label. In this format the dataset has 452,833 training, 34,599 development, and 41,477 test tokens. The total parameter counts for the three representations are 2,972,052 (word-based), 3,452,052 (char concatenation), and 3,152,352 (char attention), of which only a small fraction of the embeddings are utilized at each iteration. We performed the proposed adversarial training for sequence labeling (bidirectional LSTM) on the word-char embedding on the FCE-PUBLIC dataset².

The preliminary experimental results are briefly shown in Tables 1 and 2. Table 1 presents the performance of the regularized ($\alpha = 0.5$) adversarial training on the word-char embedding on this dataset. The $F_{0.5}$-score metric was used as the evaluation criterion, as established in earlier work (Rei, Crichton, and Pyysalo 2016). The preliminary results (Table 1) show an improvement in accuracy of adversarial (regularized) training on the word-char embedding over both the baseline word-char embedding and the individual word-only, char-only, and concatenated embeddings. Table 2 presents comparative accuracy results of the regularized adversarial training at the three representation levels (word-only, char-only, and word-char). These preliminary results show that perturbation at the word-char level yields better accuracy than individual word-only and char-only perturbations. Adversarial training at the word-char level (Tables 1 and 2) also performs better than random perturbation, as expected.

Table 1: Performance of regularized adversarial training on word-char embedding on the FCE-PUBLIC dataset ($F_{0.5}$-scores).

Embedding:                    Word-Only      Char-Only      Word-Char (concat.)   Word-Char (attention)
                              Dev.   Test    Dev.   Test    Dev.   Test           Dev.   Test
Baseline                      49.57  46.91   41.45  37.50   51.88  48.24          50.08  47.78
Random perturbation           52.24  48.49   52.99  49.63   53.01  50.01          52.92  49.74
Adv. training (regularized)   54.82  51.07   46.61  42.00   55.99  52.87          57.14  53.55

Table 2: Comparison of regularized adversarial training at various perturbation levels (modes) on the FCE-PUBLIC dataset ($F_{0.5}$-scores).

Perturbation mode:            Word-Only      Char-Only      Word-Char (concat.)   Word-Char (attention)
                              Dev.   Test    Dev.   Test    Dev.   Test           Dev.   Test
Random perturbation           53.70  50.33   53.07  49.74   53.01  50.01          52.92  49.74
Adv. training (regularized)   52.79  49.15   53.58  49.30   55.99  52.87          57.14  53.55

Conclusion

This work seeks to develop an improved adversarial training model acting on word–char embeddings. It is well known that word-only and char-only embeddings have major drawbacks in handling rare/unseen words and character-level information, which subsequently leads to poor representation of valid, and hence of adversarial, examples. The proposed adversarial training model is intended to overcome these challenges by applying the adversarial perturbation on word–char embeddings. It is envisioned that the proposed model, along with adversarial regularization (i.e., fine-tuning the parameter $\alpha$), will bring significant improvements over the existing word-only/char-only adversarial training architectures. We performed preliminary numerical experiments on the impact of regularized adversarial training on the word-char embedding on a neural sequence labeling task on the FCE-PUBLIC dataset. Our preliminary results show an improvement in accuracy of adversarial (regularized) training on the word-char embedding over both the baseline word-char embedding and the individual word-only, char-only, and concatenated embeddings. These preliminary results also show that perturbation at the word-char level yields better accuracy than individual word-only and char-only perturbation. Further testing of the model needs to be performed on several representative neural sequence labeling and text classification tasks and on various datasets.

² We adopted the TensorFlow implementation of sequence labeling on the FCE-PUBLIC dataset available at https://github.com/marekrei/sequence-labeler.

                   Acknowledgment
We would like to thank Dr. Prithviraj Dasgupta, Dr. Ira S. Moskowitz, and Espiritu Hugo for their invaluable comments, suggestions, and technical help during the course of this research and the development of the paper.

                        References
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explain-
ing and harnessing adversarial examples. arXiv preprint
arXiv:1412.6572.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term
memory. Neural computation 9(8):1735–1780.
Miyato, T.; Maeda, S.-i.; Koyama, M.; Nakae, K.; and Ishii,
S. 2015. Distributional smoothing with virtual adversarial
training. arXiv preprint arXiv:1507.00677.
Miyato, T.; Dai, A. M.; and Goodfellow, I. 2016. Adversar-
ial training methods for semi-supervised text classification.
arXiv preprint arXiv:1605.07725.
Rei, M., and Yannakoudakis, H. 2016. Compositional se-
quence labeling models for error detection in learner writ-
ing. arXiv preprint arXiv:1607.06153.
Rei, M.; Crichton, G. K.; and Pyysalo, S. 2016. Attend-
ing to characters in neural sequence labeling models. arXiv
preprint arXiv:1611.04361.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan,
D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing proper-
ties of neural networks. arXiv preprint arXiv:1312.6199.
Yannakoudakis, H.; Briscoe, T.; and Medlock, B. 2011.
A new dataset and method for automatically grading ESOL
texts. In Proceedings of the 49th Annual Meeting of the As-
sociation for Computational Linguistics: Human Language
Technologies-Volume 1, 180–189. Association for Compu-
tational Linguistics.