                Simple Data Augmentation for Multilingual NLU in
                        Task Oriented Dialogue Systems

                   Samuel Louvan                                Bernardo Magnini
                 University of Trento                        Fondazione Bruno Kessler
               Fondazione Bruno Kessler                        magnini@fbk.eu
                 slouvan@fbk.eu


                     Abstract

Data augmentation has shown potential in alleviating data scarcity for Natural Language Understanding (e.g. slot filling and intent classification) in task-oriented dialogue systems. As prior work has mostly experimented on English datasets, we focus on five different languages and consider a setting where limited data are available. We investigate the effectiveness of non-gradient based augmentation methods, involving simple text span substitutions and syntactic manipulations. Our experiments show that (i) augmentation is effective in all cases, particularly for slot filling; and (ii) it is beneficial for a joint intent-slot model based on multilingual BERT, both in limited data settings and when full training data is used.

1   Introduction

Natural Language Understanding (NLU) in task-oriented dialogue systems is responsible for parsing user utterances to extract the intent of the user and the arguments of the intent (i.e. slots) into a semantic representation, typically a semantic frame (Tur and De Mori, 2011). For example, the utterance "Play Jeff Pilson on Youtube" has the intent PLAYMUSIC and "Youtube" as the value of the slot SERVICE. As more skills are added to the dialogue system, the NLU model frequently needs to be updated to scale to new domains and languages, a situation which typically becomes problematic when labeled data are limited (data scarcity).

One way to combat data scarcity is through data augmentation (DA) techniques, which perform label-preserving operations to produce auxiliary training data. Recently, DA has shown potential in tasks such as machine translation (Fadaee et al., 2017), constituency and dependency parsing (Şahin and Steedman, 2018; Vania et al., 2019), and text classification (Wei and Zou, 2019; Kumar et al., 2020). For slot filling (SF) and intent classification (IC), a number of DA methods have been proposed to generate synthetic utterances using sequence-to-sequence models (Hou et al., 2018; Zhao et al., 2019), Conditional Variational Auto-Encoders (Yoo et al., 2019), or pre-trained NLG models (Peng et al., 2020). To date, most DA methods have been evaluated on English, and it is not clear whether the same findings apply to other languages.

In this paper, we study the effectiveness of DA on several non-English datasets for NLU in task-oriented dialogue systems. We experiment with existing lightweight, non-gradient based DA methods from Louvan and Magnini (2020), which produce varying slot values through substitution and manipulate sentence structure by leveraging syntactic information from a dependency parser. We evaluate the DA methods on NLU datasets from five languages: Italian, Hindi, Turkish, Spanish, and Thai. The contributions of our paper are as follows:
1. We assess the applicability of DA methods for NLU in task-oriented dialogue systems in five languages.
2. We demonstrate that simple DA can improve performance on all languages despite the different characteristics of the languages.
3. We show that a large pre-trained multilingual BERT (M-BERT) (Devlin et al., 2019) can still benefit from DA, in particular for slot filling.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2   Slot Filling and Intent Classification

The NLU component of a task-oriented dialogue system is responsible for parsing a user utterance into a semantic representation, such as a semantic frame.
Figure 1: Augmentation operations performed on an utterance, "Quali film animati stanno proiettando al cinema più vicino" ("Which animated films are showing at the nearest cinema"). The utterance is taken from the Italian SNIPS dataset.


The semantic frame conveys information, namely the user intent and the corresponding arguments of the intent. Extracting this information involves the slot filling (SF) and intent classification (IC) tasks.

Given an input utterance of n tokens, x = (x_1, x_2, ..., x_n), the system needs to assign a particular intent y^intent to the whole utterance x and the corresponding slots mentioned in the utterance, y^slot = (y_1^slot, y_2^slot, ..., y_n^slot). In practice, IC is typically modeled as text classification and SF as a sequence tagging problem. As an example, for the utterance "Play Jeff Pilson on Youtube", y^intent is PLAYMUSIC, as the intent of the user is to ask the system to play a song from a musician, and y^slot = (O, B-ARTIST, I-ARTIST, O, B-SERVICE), in which the artist is "Jeff Pilson" and the service is "Youtube". Slot labels are in BIO format: B indicates the start of a slot span, I the inside of a span, while O denotes that the word does not belong to any slot. Recent approaches for SF and IC are based on neural network methods that model SF and IC jointly (Goo et al., 2018; Chen et al., 2019) by sharing model parameters between both tasks.
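As an illustration of the BIO scheme, the following minimal sketch (ours, not from the paper) decodes a BIO-tagged token sequence into (slot label, slot value) pairs:

    def bio_to_slots(tokens, tags):
        """Decode a BIO tag sequence into (slot_label, slot_value) pairs."""
        slots, span, label = [], [], None
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):
                if span:
                    slots.append((label, " ".join(span)))
                span, label = [token], tag[2:]
            elif tag.startswith("I-") and tag[2:] == label:
                span.append(token)
            else:  # "O", or an I- tag that does not continue the open span
                if span:
                    slots.append((label, " ".join(span)))
                span, label = [], None
        if span:
            slots.append((label, " ".join(span)))
        return slots

    tokens = ["Play", "Jeff", "Pilson", "on", "Youtube"]
    tags = ["O", "B-ARTIST", "I-ARTIST", "O", "B-SERVICE"]
    print(bio_to_slots(tokens, tags))
    # [('ARTIST', 'Jeff Pilson'), ('SERVICE', 'Youtube')]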
                                                                 In order to produce sentence variations, we apply
3   Data Augmentation (DA) Methods

DA aims to perform semantically preserving transformations on the training data D to produce auxiliary data D′. The union of D and D′ is then used to train a particular NLU model. For each utterance in D, we produce N augmented utterances by applying a specific augmentation operation. We adopt a subset of the existing augmentation methods from Louvan and Magnini (2020), which have shown promising results on English datasets. We describe the augmentation operations in the following sections.

3.1   Slot Substitution (SLOT-SUB)

SLOT-SUB (Figure 1 left) performs augmentation by substituting a particular text span (slot-value pair) in an utterance with a different text span that is semantically consistent, i.e., the slot label is the same. For example, in the utterance "Quali film animati stanno proiettando al cinema più vicino", one of the spans that can be substituted is the slot-value pair (più vicino, SPATIAL RELATION). We then collect other spans in D in which the slot values are different but the slot label is the same. For instance, we find the substitute candidates SP′ = {("distanza a piedi", SPATIAL RELATION), ("lontano", SPATIAL RELATION), ("nel quartiere", SPATIAL RELATION), ...}, and then we sample one span to replace the original span in the utterance. A sketch of this procedure is given below.
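The following sketch (our own illustration, not the authors' code) implements SLOT-SUB, assuming D is given as a list of (tokens, BIO tags) pairs; all function names are hypothetical:

    import random
    from collections import defaultdict

    def slot_spans(tags):
        """Return (start, end, label) for every BIO slot span; end is exclusive."""
        spans, start, label = [], None, None
        for i, tag in enumerate(tags + ["O"]):  # the "O" sentinel closes a trailing span
            if tag.startswith("B-") or tag == "O" or (label and tag[2:] != label):
                if label is not None:
                    spans.append((start, i, label))
                start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        return spans

    def build_slot_values(dataset):
        """Collect every slot value observed in D, grouped by slot label."""
        values = defaultdict(set)
        for tokens, tags in dataset:
            for s, e, label in slot_spans(tags):
                values[label].add(" ".join(tokens[s:e]))
        return values

    def slot_sub(tokens, tags, values):
        """SLOT-SUB: replace one slot span with a different value of the same label."""
        spans = slot_spans(tags)
        if not spans:
            return tokens, tags
        s, e, label = random.choice(spans)
        candidates = values[label] - {" ".join(tokens[s:e])}
        if not candidates:
            return tokens, tags
        new = random.choice(sorted(candidates)).split()
        new_tags = ["B-" + label] + ["I-" + label] * (len(new) - 1)
        return tokens[:s] + new + tokens[e:], tags[:s] + new_tags + tags[e:]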
3.2   CROP and ROTATE

In order to produce sentence variations, we apply the crop and rotate operations proposed in Şahin and Steedman (2018), which manipulate the sentence structure through its dependency parse tree. The goal of CROP (Figure 1 middle) is to simplify the sentence so that it focuses on a particular fragment (e.g. subject/object) by removing the other fragments in the sentence. CROP uses the dependency tree to identify each fragment to remove and then deletes it together with its children in the tree.
Dataset    Language   #slot   #intent   #train   #dev    #test   #SLOT-SUB   #CROP   #ROTATE
SNIPS-IT   Italian       39         7      574     700      698       5,404    1,431     1,889
ATIS-HI    Hindi         73        17      176     440      893       1,286      460       472
ATIS-TR    Turkish       70        17       99     248      715         144      161       194
FB-ES      Spanish       11        12      361   1,983    3,043       1,455      769     1,028
FB-TH      Thai           8        10      215   1,235    1,692         781        -         -

Table 1: Statistics on the datasets. #slot and #intent count labels; #train, #dev, and #test count utterances (D); the last three columns count augmented utterances (D′) per method. #train indicates our limited training data setup (10% of the full training data). D′ is produced by tuning the number of augmentations per utterance (N) on the dev set.

Model    DA         SNIPS-IT          ATIS-HI           ATIS-TR           FB-ES             FB-TH
                    Slot     Intent   Slot     Intent   Slot     Intent   Slot     Intent   Slot     Intent
M-BERT   None       78.25     94.99   69.57     86.57   64.36     78.98   84.13     97.68   56.06     89.80
         SLOT-SUB   81.97†    94.93   72.44†    87.29   66.60†    79.85   84.27     97.72   59.68†    91.42†
         CROP       80.12†    94.60   70.04     86.92   65.11     79.48   83.85     98.08†      -         -
         ROTATE     79.24†    95.37   70.69     87.60†  65.20     80.06   83.28     98.20†      -         -
         COMBINE    81.27†    95.00   72.13†    86.93   66.68†    81.12†  83.67     97.94       -         -

Table 2: Performance comparison of the baseline and the augmentation methods on the test set. F1 score is used for slot filling and accuracy for intent classification. Scores are the average of 10 different runs. † indicates a statistically significant improvement over the baseline (p-value < 0.05 according to the Wilcoxon signed-rank test).


The ROTATE operation (Figure 1 right) moves a particular fragment (including subject/object) around the root of the tree, typically the verb of the sentence. For each operation, all possible combinations are generated, and one of them is picked randomly as the augmented sentence. Both CROP and ROTATE rely on universal dependency labels (Nivre et al., 2017) to identify relevant fragments, such as NSUBJ (nominal subject), DOBJ (direct object), OBJ (object), and IOBJ (indirect object).
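A rough sketch of CROP using Stanza's dependency parser (which the paper uses; see Section 4). This is our own reconstruction of the idea, and the actual implementation of Şahin and Steedman (2018) differs in details such as how fragments are selected:

    import random
    from collections import defaultdict
    import stanza

    # stanza.download("it")  # one-time model download
    nlp = stanza.Pipeline(lang="it", processors="tokenize,pos,lemma,depparse")

    def crop(text, keep_labels={"nsubj", "obj", "iobj"}):
        """CROP: keep one randomly chosen core fragment and drop the other
        core fragments together with their dependency subtrees.
        ROTATE would instead reorder the fragments' subtrees around the root."""
        sent = nlp(text).sentences[0]
        children = defaultdict(list)
        for w in sent.words:
            children[w.head].append(w.id)  # w.head == 0 marks the root

        def subtree(i):
            ids = {i}
            for c in children[i]:
                ids |= subtree(c)
            return ids

        fragments = [w.id for w in sent.words if w.deprel in keep_labels]
        if len(fragments) < 2:
            return text  # nothing to crop away
        kept = random.choice(fragments)
        dropped = set()
        for f in fragments:
            if f != kept:
                dropped |= subtree(f)
        return " ".join(w.text for w in sent.words if w.id not in dropped)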
4   Experiments

Our primary goal is to verify the effectiveness of data augmentation on Italian, Hindi, Turkish, Spanish, and Thai NLU datasets with limited labeled data. To this end, we compare the performance of a baseline NLU model trained on the original training data (D) with an NLU model that incorporates the augmented data as additional training instances (D + D′). To simulate the limited labeled data situation, we randomly sample 10% of the training data of each dataset.

Baseline and Data Augmentation (DA) Methods. We use the state-of-the-art BERT-based joint intent and slot filling model (Chen et al., 2019) as the baseline model. We leverage pre-trained multilingual BERT (M-BERT), which is trained on 104 languages. During training, M-BERT is fine-tuned on the slot filling and intent classification tasks. Given a sentence representation x = ([CLS] t_1 t_2 ... t_L), we use the hidden state h_[CLS] to predict the intent and the hidden state h_ti to predict the slot label of token t_i; a sketch of this joint head is given below. As for the DA methods, in addition to the methods described in Section 3, we add one configuration, COMBINE, which combines the output of SLOT-SUB and ROTATE, as ROTATE obtains better results than CROP on the development set.
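The joint architecture just described can be written in a few lines with the HuggingFace transformers library. This is a minimal sketch of our own, and the implementation of Chen et al. (2019) may differ in details such as subword handling and loss weighting:

    import torch.nn as nn
    from transformers import BertModel

    class JointMBert(nn.Module):
        """Joint IC and SF on top of M-BERT: the intent is predicted from
        h_[CLS], and one slot label per token is predicted from h_ti."""
        def __init__(self, n_intents, n_slots,
                     model_name="bert-base-multilingual-cased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(model_name)
            hidden = self.bert.config.hidden_size
            self.dropout = nn.Dropout(0.1)
            self.intent_head = nn.Linear(hidden, n_intents)
            self.slot_head = nn.Linear(hidden, n_slots)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            h = self.dropout(out.last_hidden_state)    # (batch, seq_len, hidden)
            intent_logits = self.intent_head(h[:, 0])  # position 0 is [CLS]
            slot_logits = self.slot_head(h)            # per-token slot logits
            return intent_logits, slot_logits

Training then typically sums a cross-entropy loss over the intent logits and a token-level cross-entropy loss over the slot logits.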
Settings. The model is trained with the BertAdam optimizer for 30 epochs with early stopping. The learning rate is set to 10^-5 and the batch size to 16. All hyperparameters are listed in Appendix A. For SLOT-SUB, the number of augmentations per sentence N is tuned on the development set. To produce the dependency tree, we parse each sentence using Stanza (Qi et al., 2020). For both CROP and ROTATE we follow the default hyperparameters from Şahin and Steedman (2018). We did not experiment with CROP and ROTATE for Thai, as Thai is not supported by Stanza. The number of augmented sentences (D′) for each method is listed in Table 1. As evaluation metrics, we use the standard CoNLL script to compute the F1 score for slot filling, and accuracy for intent classification.
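The CoNLL script computes span-level F1; as a quick sanity check, the Python seqeval package computes the same metric (our suggestion, not part of the authors' setup):

    from seqeval.metrics import f1_score

    gold = [["O", "B-ARTIST", "I-ARTIST", "O", "B-SERVICE"]]
    pred = [["O", "B-ARTIST", "I-ARTIST", "O", "O"]]
    print(f1_score(gold, pred))  # ~0.67: one of the two gold spans is matched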
Datasets. For Italian, we use the data from Bellomaria et al. (2019), translated from the English SNIPS dataset (Coucke et al., 2018). SNIPS has been widely used for evaluating NLU models and consists of utterances in multiple domains. For Hindi and Turkish, we use the ATIS dataset from Upadhyay et al. (2018), derived from Hemphill et al. (1990). ATIS is a well-known NLU dataset on the flight domain. For Spanish and Thai, we use the FB dataset from Schuster et al. (2019), which contains utterances in the alarm, weather, and reminder domains. The overall statistics of the datasets are shown in Table 1.

5   Results

The overall results reported in Table 2 show that applying DA improves performance on slot filling and intent classification across all languages. In particular, for SF, the SLOT-SUB method yields the best results, while for IC, ROTATE obtains better performance than CROP in most cases. These results are consistent with the findings of Louvan and Magnini (2020) on the English datasets, where SLOT-SUB improves SF and CROP or ROTATE improve IC. In general, ROTATE is better than CROP in most cases on IC, and we think this is because CROP may change the intent of the original sentence. Intents typically depend on the occurrence of specific slots, so when the cropped part is a slot value, it may change the sentence's overall semantics.

We can see that languages with different typological features (e.g. subject/verb/object ordering)¹ benefit from the ROTATE operation for IC. This result suggests that augmentation can produce useful noise (regularization) that helps the model alleviate overfitting when labeled data is limited. When we use COMBINE, it still helps the performance of both SF and IC, although the improvements are not as high as when only one of the augmentation methods is applied. The language that benefits the most from COMBINE is Turkish. We hypothesize that, as Turkish has a more flexible word order than the other languages, it benefits the most when ROTATE is performed.

Figure 2: Improvement (ΔF1) obtained by SLOT-SUB (SS) for different training data sizes. Positive numbers mean that the model with SS yields a gain.

Performance for varying data sizes. To better understand the effectiveness of SLOT-SUB, we perform further analysis with different training data sizes (see Figure 2). Overall, we observe that as we increase the training size, the benefit of SLOT-SUB decreases for all datasets. For some datasets, namely ATIS-HI and FB-ES, SLOT-SUB can cause a performance drop for larger data sizes, although it is reasonably small (less than 1 F1 point). FB-TH consistently benefits from SLOT-SUB even when the full training data is used. The training data size up to which the improvement is significant varies across datasets². For SNIPS-IT, the improvement is clear for all training data sizes, and it is statistically significant up to a training data size of 80%. For ATIS-HI, improvements are significant up to a data size of 40%. For the FB datasets, improvements are significant only up to a training data size of 10%. Overall, we can see that SLOT-SUB is effective in cases where data is scarce (5%, 10%), while it is still relatively robust for larger data sizes on all datasets.

Figure 3: Gain (ΔF1) obtained by SLOT-SUB (SS) for various numbers of augmented sentences (N). Positive numbers mean that the model with SS yields a gain.

¹ Italian, Spanish, and Thai are SVO languages while Hindi and Turkish are SOV languages.
² For details of the p-values of the statistical tests, please refer to Appendix B.
Performance for different numbers of augmentations per utterance (N). We examine the effect of a larger number of augmentations per utterance (N) on model performance, specifically for SF (see Figure 3). For FB-ES, similarly to the results in Table 2, increasing N does not affect performance. For the other datasets, increasing N brings performance improvements. For ATIS-HI, SNIPS-IT, and FB-TH, the trend is that as we increase N, performance goes up and then plateaus. For ATIS-TR, changing N does not really affect the gain, as the performance trend is quite steady across the numbers of augmentations. For most values of N in each dataset (except FB-ES), the difference between the performance of the model that uses SLOT-SUB and the model that does not is significant³.

³ For details of the p-values of the statistical tests, please refer to Appendix B.
                                                                through substitution mechanisms and varying sen-
6   Related Work

Data augmentation methods proposed in NLP aim to automatically produce additional training data through different kinds of methods, ranging from simple word substitution (Wei and Zou, 2019) to more complex methods that aim to produce semantically preserving sentence generation (Hou et al., 2018; Gao et al., 2020). In the context of slot filling and intent classification, recent augmentation methods typically apply deep learning models to produce augmented utterances.

Hou et al. (2018) propose a two-stage method consisting of delexicalized utterance generation and slot value realization. Their method is based on a sequence-to-sequence model (Sutskever et al., 2014) that produces a paraphrase of an utterance with slot value placeholders (delexicalized) for a given intent. For slot value lexicalization, they use the slot values in the training data that occur in similar contexts. Zhao et al. (2019) train a sequence-to-sequence model with training instances that consist of pairs of atomic templates of dialogue acts and their sentence realizations. Yoo et al. (2019) propose a solution that extends the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) into a Conditional VAE (CVAE) to generate synthetic utterances. The CVAE controls the utterance generation by conditioning on the intent and slot labels during model training. Recent work from Peng et al. (2020) makes use of a Transformer-based (Vaswani et al., 2017) pre-trained NLG model, namely GPT-2 (Radford et al., 2019), and fine-tunes it on slot filling datasets to produce synthetic utterances. We consider these deep learning based approaches heavyweight, as they often require several stages in the augmentation process, namely generating augmentation candidates, then ranking and filtering the candidates before producing the final augmented data. Consequently, the computation time of these approaches is generally higher, as separate training is required for the augmentation and joint SF-IC models. Recent work from Louvan and Magnini (2020) applies a set of lightweight methods in which most of the augmentation methods do not require model training. The augmentation methods focus on varying the slot values through substitution mechanisms and varying the sentence structure through dependency tree manipulation. While the methods are relatively simple, they obtain results competitive with deep learning based approaches on the standard English slot filling benchmarks, namely the ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018), and FB (Schuster et al., 2019) datasets.

Existing methods mostly evaluate their approaches on English datasets, and little work has been done on other languages. Our work focuses on investigating the effect of data augmentation on five non-English languages. We apply a subset of the lightweight augmentation methods from Louvan and Magnini (2020) that do not require separate model training to produce augmentation data.

7   Conclusion

We evaluate the effectiveness of data augmentation for slot filling and intent classification tasks in five typologically diverse languages. Our results show that by applying simple augmentation, namely slot value substitutions and dependency tree manipulations, we can obtain substantial improvements in most cases when only a small amount of training data is available. We also show that a large pre-trained multilingual BERT benefits from data augmentation.

Acknowledgments

We thank Valentina Bellomaria for providing the Italian SNIPS dataset. We thank Clara Vania for the feedback on the early draft of the paper.
References

Valentina Bellomaria, Giuseppe Castellucci, Andrea Favalli, and Raniero Romagnoli. 2019. Almawave-SLU: A new dataset for SLU in Italian. In Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv, abs/1805.10190.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573. Association for Computational Linguistics.

Silin Gao, Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020. Paraphrase augmented task-oriented dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 639–649. Association for Computational Linguistics.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 753–757.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, USA, June 24-27, 1990. Morgan Kaufmann.

Yutai Hou, Yijia Liu, Wanxiang Che, and Ting Liu. 2018. Sequence-to-sequence data augmentation for dialogue language understanding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1234–1245, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, Conference Track Proceedings.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.

Samuel Louvan and Bernardo Magnini. 2020. Simple is better! Lightweight data augmentation for low resource slot filling and intent classification. In PACLIC 2020 - The 34th Pacific Asia Conference on Language, Information and Computation. arXiv preprint arXiv:2009.03695.

Joakim Nivre, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, et al. 2017. Universal Dependencies 2.1.

Baolin Peng, Chenguang Zhu, Michael Zeng, and Jianfeng Gao. 2020. Data augmentation for spoken language understanding via pretrained models. CoRR, abs/2004.13952.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Gözde Gül Şahin and Mark Steedman. 2018. Data augmentation via dependency tree morphing for low-resource languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5004–5009, Brussels, Belgium. Association for Computational Linguistics.

Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3795–3805, Minneapolis, Minnesota. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3104–3112, Montreal, Quebec, Canada.

Gokhan Tur and Renato De Mori. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley & Sons.

Shyam Upadhyay, Manaal Faruqui, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2018. (Almost) zero-shot cross-lingual spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6034–6038. IEEE.

Clara Vania, Yova Kementchedjhieva, Anders Søgaard, and Adam Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116, Hong Kong, China. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 5998–6008, Long Beach, CA, USA.

Jason W. Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6381–6387, Hong Kong, China. Association for Computational Linguistics.

Kang Min Yoo, Youhyun Shin, and Sang-goo Lee. 2019. Data augmentation for spoken language understanding via joint variational generation. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pages 7402–7409. AAAI Press.

Zijian Zhao, Su Zhu, and Kai Yu. 2019. Data augmentation with atomic templates for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3635–3641, Hong Kong, China. Association for Computational Linguistics.
Appendix A. Hyperparameters

Hyperparameter     Value
Learning rate      10^-5
Dropout            0.1
Mini-batch size    16
Optimizer          BertAdam
Number of epochs   30
Early stopping     10
N                  Tuned on {2, 5, 10}
Max rotation       3
Max crop           3

Table 3: List of hyperparameters used for the BERT model and the data augmentation methods.
Appendix B. Statistical Significance

Dataset     5%              10%             20%             40%             80%             100%
ATIS-HI     0.04311444678   0.005062032126  0.04311444678   0.04311444678   0.1380107376    0.2733216783
ATIS-TR     0.224915884     0.005062032126  0.7150006547    0.1797124949    0.1797124949    0.1797124949
SNIPS-IT    0.04311444678   0.005062032126  0.04311444678   0.04311444678   0.04311444678   0.04311444678
FB-ES       0.04311444678   0.02831405495   0.1797124949    0.1755543028    0.1380107376    0.1797124949
FB-TH       0.04311444678   0.005062032126  0.1797124949    0.1797124949    0.1797124949    0.10880943

Table 4: The p-values of the statistical tests for the experiments in Figure 2, by training data size (%).

Dataset     N=2             N=5             N=10            N=20            N=25
ATIS-TR     0.005062032126  0.01251531869   0.006910429808  0.5001842571    0.07961580146
ATIS-HI     0.1097446387    0.005062032126  0.005062032126  0.04311444678   0.04311444678
SNIPS-IT    0.005062032126  0.005062032126  0.005062032126  0.04311444678   0.04311444678
FB-ES       0.0663160313    0.02831405495   0.09260069782   0.3452310718    0.07961580146
FB-TH       0.03665792867   0.005062032126  0.005062032126  0.04311444678   0.04311444678

Table 5: The p-values of the statistical tests for the experiments in Figure 3, by number of augmentations per utterance (N).