                Simple Data Augmentation for Multilingual NLU in
                        Task Oriented Dialogue Systems

                   Samuel Louvan                                Bernardo Magnini
                 University of Trento                        Fondazione Bruno Kessler
               Fondazione Bruno Kessler                        magnini@fbk.eu
                 slouvan@fbk.eu


                     Abstract

Data augmentation has shown potential in alleviating data scarcity for Natural Language Understanding (e.g. slot filling and intent classification) in task-oriented dialogue systems. As prior work has mostly experimented on English datasets, we focus on five different languages and consider a setting where limited data are available. We investigate the effectiveness of non-gradient based augmentation methods, involving simple text span substitutions and syntactic manipulations. Our experiments show that (i) augmentation is effective in all cases, particularly for slot filling; and (ii) it is beneficial for a joint intent-slot model based on multilingual BERT, both in limited data settings and when full training data is used.

1   Introduction

Natural Language Understanding (NLU) in task-oriented dialogue systems is responsible for parsing user utterances to extract the intent of the user and the arguments of the intent (i.e. slots) into a semantic representation, typically a semantic frame (Tur and De Mori, 2011). For example, the utterance "Play Jeff Pilson on Youtube" has the intent PLAYMUSIC and "Youtube" as the value of the slot SERVICE. As more skills are added to the dialogue system, the NLU model frequently needs to be updated to scale to new domains and languages, a situation which typically becomes problematic when labeled data are limited (data scarcity).

One way to combat data scarcity is through data augmentation (DA) techniques, which perform label-preserving operations to produce auxiliary training data. Recently, DA has shown potential in tasks such as machine translation (Fadaee et al., 2017), constituency and dependency parsing (Şahin and Steedman, 2018; Vania et al., 2019), and text classification (Wei and Zou, 2019; Kumar et al., 2020). For slot filling (SF) and intent classification (IC), a number of DA methods have been proposed to generate synthetic utterances using sequence-to-sequence models (Hou et al., 2018; Zhao et al., 2019), Conditional Variational Auto-Encoders (Yoo et al., 2019), or pre-trained NLG models (Peng et al., 2020). To date, most DA methods have been evaluated on English, and it is not clear whether the same findings apply to other languages.

In this paper, we study the effectiveness of DA on several non-English datasets for NLU in task-oriented dialogue systems. We experiment with existing lightweight, non-gradient based DA methods from Louvan and Magnini (2020), which produce varying slot values through substitution and manipulate sentence structure by leveraging syntactic information from a dependency parser. We evaluate the DA methods on NLU datasets from five languages: Italian, Hindi, Turkish, Spanish, and Thai. The contributions of our paper are as follows:
1. We assess the applicability of DA methods for NLU in task-oriented dialogue systems in five languages.
2. We demonstrate that simple DA can improve performance on all languages despite the different characteristics of the languages.
3. We show that a large pre-trained multilingual BERT (M-BERT) (Devlin et al., 2019) can still benefit from DA, in particular for slot filling.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2   Slot Filling and Intent Classification

The NLU component of a task-oriented dialogue system is responsible for parsing a user utterance into a semantic representation, such as a semantic frame.
Figure 1: Augmentation operations performed on an utterance, "Quali film animati stanno proiettando al cinema più vicino" ("Which animated films are showing at the nearest cinema"). The utterance is taken from the Italian SNIPS dataset.


The semantic frame conveys information, namely the user intent and the corresponding arguments of the intent. Extracting this information involves the slot filling (SF) and intent classification (IC) tasks.

Given an input utterance of n tokens, x = (x_1, x_2, ..., x_n), the system needs to assign a particular intent y^intent to the whole utterance x and the corresponding slots mentioned in the utterance, y^slot = (y_1^slot, y_2^slot, ..., y_n^slot). In practice, IC is typically modeled as text classification and SF as a sequence tagging problem. As an example, for the utterance "Play Jeff Pilson on Youtube", y^intent is PLAYMUSIC, as the intent of the user is to ask the system to play a song from a musician, and y^slot = (O, B-ARTIST, I-ARTIST, O, B-SERVICE), in which the artist is "Jeff Pilson" and the service is "Youtube". Slot labels are in BIO format: B indicates the start of a slot span, I the inside of a span, while O denotes that the word does not belong to any slot. Recent approaches for SF and IC are based on neural network methods that model SF and IC jointly (Goo et al., 2018; Chen et al., 2019) by sharing model parameters between both tasks.
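As an illustration of the BIO scheme, the following minimal sketch (ours, not from the paper) decodes a BIO-tagged token sequence into (slot label, slot value) pairs:

    def bio_to_slots(tokens, tags):
        """Decode a BIO tag sequence into (slot_label, slot_value) pairs."""
        slots, span, label = [], [], None
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if span:
                    slots.append((label, " ".join(span)))
                span, label = [token], tag[2:]
            elif tag.startswith("I-") and tag[2:] == label:
                span.append(token)
            else:  # "O", or an I- tag that does not continue the open span
                if span:
                    slots.append((label, " ".join(span)))
                span, label = [], None
        if span:
            slots.append((label, " ".join(span)))
        return slots

    tokens = ["Play", "Jeff", "Pilson", "on", "Youtube"]
    tags = ["O", "B-ARTIST", "I-ARTIST", "O", "B-SERVICE"]
    print(bio_to_slots(tokens, tags))
    # [('ARTIST', 'Jeff Pilson'), ('SERVICE', 'Youtube')]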
                                                                 In order to produce sentence variations, we apply
3   Data Augmentation (DA) Methods

DA aims to perform semantically preserving transformations on the training data D to produce auxiliary data D′. The union of D and D′ is then used to train a particular NLU model. For each utterance in D, we produce N augmented utterances by applying a specific augmentation operation. We adopt a subset of the existing augmentation methods from Louvan and Magnini (2020), which have shown promising results on English datasets. We describe the augmentation operations in the following sections.

3.1   Slot Substitution (SLOT-SUB)

SLOT-SUB (Figure 1 left) performs augmentation by substituting a particular text span (slot-value pair) in an utterance with a different text span that is semantically consistent, i.e., the slot label is the same. For example, in the utterance "Quali film animati stanno proiettando al cinema più vicino", one of the spans that can be substituted is the slot-value pair (più vicino, SPATIAL RELATION). We then collect other spans in D in which the slot values are different but the slot label is the same. For instance, we find the substitute candidates SP′ = {("distanza a piedi", SPATIAL RELATION), ("lontano", SPATIAL RELATION), ("nel quartiere", SPATIAL RELATION), ...}, and then we sample one span to replace the original span in the utterance. A sketch of this procedure is given below.
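The following sketch (our own illustration, not the authors' code) implements SLOT-SUB, assuming D is given as a list of (tokens, BIO tags) pairs; all function names are hypothetical:

    import random
    from collections import defaultdict

    def slot_spans(tags):
        """Return (start, end, label) for every BIO slot span; end is exclusive."""
        spans, start, label = [], None, None
        for i, tag in enumerate(tags + ["O"]):  # the "O" sentinel closes a trailing span
            if tag.startswith("B-") or tag == "O" or (label and tag[2:] != label):
                if label is not None:
                    spans.append((start, i, label))
                start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        return spans

    def build_slot_values(dataset):
        """Collect every slot value observed in D, grouped by slot label."""
        values = defaultdict(set)
        for tokens, tags in dataset:
            for s, e, label in slot_spans(tags):
                values[label].add(" ".join(tokens[s:e]))
        return values

    def slot_sub(tokens, tags, values):
        """SLOT-SUB: replace one slot span with a different value of the same label."""
        spans = slot_spans(tags)
        if not spans:
            return tokens, tags
        s, e, label = random.choice(spans)
        candidates = values[label] - {" ".join(tokens[s:e])}
        if not candidates:
            return tokens, tags
        new = random.choice(sorted(candidates)).split()
        new_tags = ["B-" + label] + ["I-" + label] * (len(new) - 1)
        return tokens[:s] + new + tokens[e:], tags[:s] + new_tags + tags[e:]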
3.2   CROP and ROTATE

In order to produce sentence variations, we apply the crop and rotate operations proposed in Şahin and Steedman (2018), which manipulate the sentence structure through its dependency parse tree. The goal of CROP (Figure 1 middle) is to simplify the sentence so that it focuses on a particular fragment (e.g. subject/object) by removing the other fragments in the sentence. CROP uses the dependency tree to identify each fragment to remove and then deletes it together with its children in the tree.
Dataset    Language   #slot   #intent   #train   #dev    #test   #SLOT-SUB   #CROP   #ROTATE
SNIPS-IT   Italian       39         7      574     700      698       5,404    1,431     1,889
ATIS-HI    Hindi         73        17      176     440      893       1,286      460       472
ATIS-TR    Turkish       70        17       99     248      715         144      161       194
FB-ES      Spanish       11        12      361   1,983    3,043       1,455      769     1,028
FB-TH      Thai           8        10      215   1,235    1,692         781        -         -

Table 1: Statistics on the datasets. #slot and #intent count labels; #train, #dev, and #test count utterances (D); the last three columns count augmented utterances (D′) per method. #train indicates our limited training data setup (10% of the full training data). D′ is produced by tuning the number of augmentations per utterance (N) on the dev set.

Model    DA         SNIPS-IT          ATIS-HI           ATIS-TR           FB-ES             FB-TH
                    Slot     Intent   Slot     Intent   Slot     Intent   Slot     Intent   Slot     Intent
M-BERT   None       78.25     94.99   69.57     86.57   64.36     78.98   84.13     97.68   56.06     89.80
         SLOT-SUB   81.97†    94.93   72.44†    87.29   66.60†    79.85   84.27     97.72   59.68†    91.42†
         CROP       80.12†    94.60   70.04     86.92   65.11     79.48   83.85     98.08†      -         -
         ROTATE     79.24†    95.37   70.69     87.60†  65.20     80.06   83.28     98.20†      -         -
         COMBINE    81.27†    95.00   72.13†    86.93   66.68†    81.12†  83.67     97.94       -         -

Table 2: Performance comparison of the baseline and the augmentation methods on the test set. F1 score is used for slot filling and accuracy for intent classification. Scores are the average of 10 different runs. † indicates a statistically significant improvement over the baseline (p-value < 0.05 according to the Wilcoxon signed-rank test).


The ROTATE operation (Figure 1 right) moves a particular fragment (including subject/object) around the root of the tree, typically the verb of the sentence. For each operation, all possible combinations are generated, and one of them is picked randomly as the augmented sentence. Both CROP and ROTATE rely on universal dependency labels (Nivre et al., 2017) to identify relevant fragments, such as NSUBJ (nominal subject), DOBJ (direct object), OBJ (object), and IOBJ (indirect object).
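A rough sketch of CROP using Stanza's dependency parser (which the paper uses; see Section 4). This is our own reconstruction of the idea, and the actual implementation of Şahin and Steedman (2018) differs in details such as how fragments are selected:

    import random
    from collections import defaultdict
    import stanza

    # stanza.download("it")  # one-time model download
    nlp = stanza.Pipeline(lang="it", processors="tokenize,pos,lemma,depparse")

    def crop(text, keep_labels={"nsubj", "obj", "iobj"}):
        """CROP: keep one randomly chosen core fragment and drop the other
        core fragments together with their dependency subtrees.
        ROTATE would instead reorder the fragments' subtrees around the root."""
        sent = nlp(text).sentences[0]
        children = defaultdict(list)
        for w in sent.words:
            children[w.head].append(w.id)  # w.head == 0 marks the root

        def subtree(i):
            ids = {i}
            for c in children[i]:
                ids |= subtree(c)
            return ids

        fragments = [w.id for w in sent.words if w.deprel in keep_labels]
        if len(fragments) < 2:
            return text  # nothing to crop away
        kept = random.choice(fragments)
        dropped = set()
        for f in fragments:
            if f != kept:
                dropped |= subtree(f)
        return " ".join(w.text for w in sent.words if w.id not in dropped)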
4   Experiments

Our primary goal is to verify the effectiveness of data augmentation on Italian, Hindi, Turkish, Spanish, and Thai NLU datasets with limited labeled data. To this end, we compare the performance of a baseline NLU model trained on the original training data (D) with an NLU model that incorporates the augmented data as additional training instances (D + D′). To simulate the limited labeled data situation, we randomly sample 10% of the training data of each dataset.

Baseline and Data Augmentation (DA) Methods. We use the state-of-the-art BERT-based joint intent and slot filling model (Chen et al., 2019) as the baseline model. We leverage pre-trained multilingual BERT (M-BERT), which is trained on 104 languages. During training, M-BERT is fine-tuned on the slot filling and intent classification tasks. Given a sentence representation x = ([CLS] t_1 t_2 ... t_L), we use the hidden state h_[CLS] to predict the intent and the hidden state h_ti to predict the slot label of token t_i; a sketch of this joint head is given below. As for the DA methods, in addition to the methods described in Section 3, we add one configuration, COMBINE, which combines the output of SLOT-SUB and ROTATE, as ROTATE obtains better results than CROP on the development set.
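The joint architecture just described can be written in a few lines with the HuggingFace transformers library. This is a minimal sketch of our own, and the implementation of Chen et al. (2019) may differ in details such as subword handling and loss weighting:

    import torch.nn as nn
    from transformers import BertModel

    class JointMBert(nn.Module):
        """Joint IC and SF on top of M-BERT: the intent is predicted from
        h_[CLS], and one slot label per token is predicted from h_ti."""
        def __init__(self, n_intents, n_slots,
                     model_name="bert-base-multilingual-cased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(model_name)
            hidden = self.bert.config.hidden_size
            self.dropout = nn.Dropout(0.1)
            self.intent_head = nn.Linear(hidden, n_intents)
            self.slot_head = nn.Linear(hidden, n_slots)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            h = self.dropout(out.last_hidden_state)    # (batch, seq_len, hidden)
            intent_logits = self.intent_head(h[:, 0])  # position 0 is [CLS]
            slot_logits = self.slot_head(h)            # per-token slot logits
            return intent_logits, slot_logits

Training then typically sums a cross-entropy loss over the intent logits and a token-level cross-entropy loss over the slot logits.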
Settings. The model is trained with the BertAdam optimizer for 30 epochs with early stopping. The learning rate is set to 10^-5 and the batch size to 16. All hyperparameters are listed in Appendix A. For SLOT-SUB, the number of augmentations per sentence N is tuned on the development set. To produce the dependency tree, we parse each sentence using Stanza (Qi et al., 2020). For both CROP and ROTATE we follow the default hyperparameters from Şahin and Steedman (2018). We did not experiment with CROP and ROTATE for Thai, as Thai is not supported by Stanza. The number of augmented sentences (D′) for each method is listed in Table 1. As evaluation metrics, we use the standard CoNLL script to compute the F1 score for slot filling, and accuracy for intent classification.
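The CoNLL script computes span-level F1; as a quick sanity check, the Python seqeval package computes the same metric (our suggestion, not part of the authors' setup):

    from seqeval.metrics import f1_score

    gold = [["O", "B-ARTIST", "I-ARTIST", "O", "B-SERVICE"]]
    pred = [["O", "B-ARTIST", "I-ARTIST", "O", "O"]]
    print(f1_score(gold, pred))  # ~0.67: one of the two gold spans is matched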
Datasets. For Italian, we use the data from Bellomaria et al. (2019), translated from the English SNIPS dataset (Coucke et al., 2018). SNIPS has been widely used for evaluating NLU models and consists of utterances in multiple domains. For Hindi and Turkish, we use the ATIS dataset from Upadhyay et al. (2018), derived from Hemphill et al. (1990). ATIS is a well-known NLU dataset on the flight domain. For Spanish and Thai, we use the FB dataset from Schuster et al. (2019), which contains utterances in the alarm, weather, and reminder domains. The overall statistics of the datasets are shown in Table 1.

5   Results

The overall results reported in Table 2 show that applying DA improves performance on slot filling and intent classification across all languages. In particular, for SF, the SLOT-SUB method yields the best results, while for IC, ROTATE obtains better performance than CROP in most cases. These results are consistent with the findings of Louvan and Magnini (2020) on the English datasets, where SLOT-SUB improves SF and CROP or ROTATE improve IC. In general, ROTATE is better than CROP in most cases on IC, and we think this is because CROP may change the intent of the original sentence. Intents typically depend on the occurrence of specific slots, so when the cropped part is a slot value, it may change the sentence's overall semantics.

We can see that languages with different typological features (e.g. subject/verb/object ordering)¹ benefit from the ROTATE operation for IC. This result suggests that augmentation can produce useful noise (regularization) that helps the model alleviate overfitting when labeled data is limited. When we use COMBINE, it still helps the performance of both SF and IC, although the improvements are not as high as when only one of the augmentation methods is applied. The language that benefits the most from COMBINE is Turkish. We hypothesize that, as Turkish has a more flexible word order than the other languages, it benefits the most when ROTATE is performed.

Figure 2: Improvement (ΔF1) obtained by SLOT-SUB (SS) for different training data sizes. Positive numbers mean that the model with SS yields a gain.

Performance for varying data sizes. To better understand the effectiveness of SLOT-SUB, we perform further analysis with different training data sizes (see Figure 2). Overall, we observe that as we increase the training size, the benefit of SLOT-SUB decreases for all datasets. For some datasets, namely ATIS-HI and FB-ES, SLOT-SUB can cause a performance drop for larger data sizes, although it is reasonably small (less than 1 F1 point). FB-TH consistently benefits from SLOT-SUB even when the full training data is used. The training data size up to which the improvement is significant varies across datasets². For SNIPS-IT, the improvement is clear for all training data sizes, and it is statistically significant up to a training data size of 80%. For ATIS-HI, improvements are significant up to a data size of 40%. For the FB datasets, improvements are significant only up to a training data size of 10%. Overall, we can see that SLOT-SUB is effective in cases where data is scarce (5%, 10%), while it is still relatively robust for larger data sizes on all datasets.

Figure 3: Gain (ΔF1) obtained by SLOT-SUB (SS) for various numbers of augmented sentences (N). Positive numbers mean that the model with SS yields a gain.

¹ Italian, Spanish, and Thai are SVO languages while Hindi and Turkish are SOV languages.
² For details of the p-values of the statistical tests, please refer to Appendix B.
Performance for different numbers of augmentations per utterance (N). We examine the effect of a larger number of augmentations per utterance (N) on model performance, specifically for SF (see Figure 3). For FB-ES, similarly to the results in Table 2, increasing N does not affect performance. For the other datasets, increasing N brings performance improvements. For ATIS-HI, SNIPS-IT, and FB-TH, the trend is that as we increase N, performance goes up and then plateaus. For ATIS-TR, changing N does not really affect the gain, as the performance trend is quite steady across the numbers of augmentations. For most values of N in each dataset (except FB-ES), the difference between the performance of the model that uses SLOT-SUB and the model that does not is significant³.

³ For details of the p-values of the statistical tests, please refer to Appendix B.
                                                                through substitution mechanisms and varying sen-
6   Related Work

Data augmentation methods proposed in NLP aim to automatically produce additional training data through different kinds of methods, ranging from simple word substitution (Wei and Zou, 2019) to more complex methods that aim to produce semantically preserving sentence generation (Hou et al., 2018; Gao et al., 2020). In the context of slot filling and intent classification, recent augmentation methods typically apply deep learning models to produce augmented utterances.

Hou et al. (2018) propose a two-stage method consisting of delexicalized utterance generation and slot value realization. Their method is based on a sequence-to-sequence model (Sutskever et al., 2014) that produces a paraphrase of an utterance with slot value placeholders (delexicalized) for a given intent. For slot value lexicalization, they use the slot values in the training data that occur in similar contexts. Zhao et al. (2019) train a sequence-to-sequence model with training instances that consist of pairs of atomic templates of dialogue acts and their sentence realizations. Yoo et al. (2019) propose a solution that extends the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) into a Conditional VAE (CVAE) to generate synthetic utterances. The CVAE controls the utterance generation by conditioning on the intent and slot labels during model training. Recent work from Peng et al. (2020) makes use of a Transformer-based (Vaswani et al., 2017) pre-trained NLG model, namely GPT-2 (Radford et al., 2019), and fine-tunes it on slot filling datasets to produce synthetic utterances. We consider these deep learning based approaches heavyweight, as they often require several stages in the augmentation process, namely generating augmentation candidates, then ranking and filtering the candidates before producing the final augmented data. Consequently, the computation time of these approaches is generally higher, as separate training is required for the augmentation and joint SF-IC models. Recent work from Louvan and Magnini (2020) applies a set of lightweight methods in which most of the augmentation methods do not require model training. The augmentation methods focus on varying the slot values through substitution mechanisms and varying the sentence structure through dependency tree manipulation. While the methods are relatively simple, they obtain results competitive with deep learning based approaches on the standard English slot filling benchmarks, namely the ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018), and FB (Schuster et al., 2019) datasets.

Existing methods mostly evaluate their approaches on English datasets, and little work has been done on other languages. Our work focuses on investigating the effect of data augmentation on five non-English languages. We apply a subset of the lightweight augmentation methods from Louvan and Magnini (2020) that do not require separate model training to produce augmentation data.

7   Conclusion

We evaluate the effectiveness of data augmentation for slot filling and intent classification tasks in five typologically diverse languages. Our results show that by applying simple augmentation, namely slot value substitutions and dependency tree manipulations, we can obtain substantial improvements in most cases when only a small amount of training data is available. We also show that a large pre-trained multilingual BERT benefits from data augmentation.

Acknowledgments

We thank Valentina Bellomaria for providing the Italian SNIPS dataset. We thank Clara Vania for the feedback on the early draft of the paper.
References

Valentina Bellomaria, Giuseppe Castellucci, Andrea Favalli, and Raniero Romagnoli. 2019. Almawave-SLU: A new dataset for SLU in Italian. In Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv, abs/1805.10190.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573. Association for Computational Linguistics.

Silin Gao, Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020. Paraphrase augmented task-oriented dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 639–649. Association for Computational Linguistics.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 753–757.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, USA, June 24-27, 1990. Morgan Kaufmann.

Yutai Hou, Yijia Liu, Wanxiang Che, and Ting Liu. 2018. Sequence-to-sequence data augmentation for dialogue language understanding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1234–1245, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, Conference Track Proceedings.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.

Samuel Louvan and Bernardo Magnini. 2020. Simple is better! Lightweight data augmentation for low resource slot filling and intent classification. In PACLIC 2020 - The 34th Pacific Asia Conference on Language, Information and Computation. arXiv preprint arXiv:2009.03695.

Joakim Nivre, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, et al. 2017. Universal Dependencies 2.1.

Baolin Peng, Chenguang Zhu, Michael Zeng, and Jianfeng Gao. 2020. Data augmentation for spoken language understanding via pretrained models. CoRR, abs/2004.13952.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Gözde Gül Şahin and Mark Steedman. 2018. Data augmentation via dependency tree morphing for low-resource languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5004–5009, Brussels, Belgium. Association for Computational Linguistics.

Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3795–3805, Minneapolis, Minnesota. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3104–3112, Montreal, Quebec, Canada.

Gokhan Tur and Renato De Mori. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley & Sons.

Shyam Upadhyay, Manaal Faruqui, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2018. (Almost) zero-shot cross-lingual spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6034–6038. IEEE.

Clara Vania, Yova Kementchedjhieva, Anders Søgaard, and Adam Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116, Hong Kong, China. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 5998–6008, Long Beach, CA, USA.

Jason W. Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6381–6387, Hong Kong, China. Association for Computational Linguistics.

Kang Min Yoo, Youhyun Shin, and Sang-goo Lee. 2019. Data augmentation for spoken language understanding via joint variational generation. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pages 7402–7409. AAAI Press.

Zijian Zhao, Su Zhu, and Kai Yu. 2019. Data augmentation with atomic templates for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3635–3641, Hong Kong, China. Association for Computational Linguistics.
Appendix A. Hyperparameters

Hyperparameter     Value
Learning rate      10^-5
Dropout            0.1
Mini-batch size    16
Optimizer          BertAdam
Number of epochs   30
Early stopping     10
N                  Tuned on {2, 5, 10}
Max rotation       3
Max crop           3

Table 3: List of hyperparameters used for the BERT model and the data augmentation methods.
Appendix B. Statistical Significance

Dataset     5%              10%             20%             40%             80%             100%
ATIS-HI     0.04311444678   0.005062032126  0.04311444678   0.04311444678   0.1380107376    0.2733216783
ATIS-TR     0.224915884     0.005062032126  0.7150006547    0.1797124949    0.1797124949    0.1797124949
SNIPS-IT    0.04311444678   0.005062032126  0.04311444678   0.04311444678   0.04311444678   0.04311444678
FB-ES       0.04311444678   0.02831405495   0.1797124949    0.1755543028    0.1380107376    0.1797124949
FB-TH       0.04311444678   0.005062032126  0.1797124949    0.1797124949    0.1797124949    0.10880943

Table 4: The p-values of the statistical tests for the experiments in Figure 2, by training data size (%).

Dataset     N=2             N=5             N=10            N=20            N=25
ATIS-TR     0.005062032126  0.01251531869   0.006910429808  0.5001842571    0.07961580146
ATIS-HI     0.1097446387    0.005062032126  0.005062032126  0.04311444678   0.04311444678
SNIPS-IT    0.005062032126  0.005062032126  0.005062032126  0.04311444678   0.04311444678
FB-ES       0.0663160313    0.02831405495   0.09260069782   0.3452310718    0.07961580146
FB-TH       0.03665792867   0.005062032126  0.005062032126  0.04311444678   0.04311444678

Table 5: The p-values of the statistical tests for the experiments in Figure 3, by number of augmentations per utterance (N).