Almawave-SLU: a New Dataset for SLU in Italian

Valentina Bellomaria, Giuseppe Castellucci, Andrea Favalli and Raniero Romagnoli
Language Technology Lab, Almawave srl
[first name initial].[last name]@almawave.it

(Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).)

Abstract

The widespread use of conversational and question answering systems has made it necessary to improve the performance of speaker intent detection and the understanding of the related semantic slots, i.e., Spoken Language Understanding (SLU). These tasks are often approached with supervised learning methods, which need considerable labeled datasets. This paper presents the first Italian dataset for SLU in the voice assistant scenario. It is the product of a semi-automatic procedure and is used as a benchmark for various open source and commercial systems.

1 Introduction

Conversational interfaces, e.g., Google's Home or Amazon's Alexa, are becoming pervasive in daily life. As an important part of any conversation, language understanding aims at extracting the meaning a partner is trying to convey. Spoken Language Understanding (SLU) plays a fundamental role in such a scenario. Generally speaking, in SLU a spoken utterance is first transcribed, and then semantic information is extracted from it. Language understanding, i.e., extracting a semantic "frame" from a transcribed user utterance, typically involves: i) Intent Detection (ID) and ii) Slot Filling (SF) (Tur et al., 2010). The former classifies a user utterance into an intent, i.e., the purpose of the user. The latter finds the "arguments" of that intent. As an example, consider Figure 1, where the user asks to play a song (Intent=PlayMusic; "with or without you", Slot=song) by an artist ("U2", Slot=artist).

[Figure 1: An example of Slot Filling in IOB format for a sentence with intent PlayMusic.]
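To make the ID/SF formalism concrete, the following is a minimal sketch of the IOB (Inside-Outside-Beginning) encoding behind Figure 1; the exact tokenization and tag names are our reconstruction of the figure, not an excerpt from the dataset.

```python
# Hypothetical IOB encoding of the Figure 1 utterance: each token is
# tagged as Beginning or Inside a slot, or as Outside any slot.
utterance = ["play", "with", "or", "without", "you", "by", "u2"]
iob_tags = ["O", "B-song", "I-song", "I-song", "I-song", "O", "B-artist"]
intent = "PlayMusic"

for token, tag in zip(utterance, iob_tags):
    print(f"{token}\t{tag}")
print(f"intent: {intent}")
```

ID amounts to predicting `intent` for the whole utterance, while SF amounts to predicting one IOB tag per token.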
Usually, supervised learning methods are adopted for SLU. Their efficacy strongly depends on the availability of labeled data. There are various approaches to the production of labeled data, depending on the intricacy of the problem, on the characteristics of the data, and on the available resources (e.g., annotators, time and budget). When the reuse of existing public data is not feasible, manual labeling must be carried out, possibly automating part of the labeling process.

In this work, we present the first public dataset for SLU in the Italian language. It is generated by a semi-automatic procedure from an existing English dataset annotated with intents and slots. We translated the sentences into Italian and transferred the annotations with a token span algorithm. Then, the translations, the spans and the consistency of the Italian entities were manually validated. Finally, the dataset is used as a benchmark for NLU systems. In particular, we compare a recent state-of-the-art (SOTA) approach (Castellucci et al., 2019) with Rasa (ras, 2019), taken from the open source world, and with IBM Watson Assistant (wat, 2019), Google DialogFlow (dia, 2019) and Microsoft LUIS (msl, 2019), some commercial solutions in use.

In the following, section 2 discusses related work; section 3 describes the dataset generation; section 4 presents the experiments; finally, section 5 draws the conclusions.

2 Related Work

SLU has been addressed in the Natural Language Processing community mainly for the English language. A well-known dataset used to demonstrate and benchmark various NLU algorithms is the Airline Travel Information System (ATIS) dataset (Hemphill et al., 1990), which consists of spoken queries on flight-related information. In (Braun et al., 2017) three datasets for the intent classification task were presented: the AskUbuntu Corpus and the Web Application Corpus were extracted from StackExchange, while the third one, i.e., the Chatbot Corpus, originated from a Telegram chatbot. The newer multi-intent SNIPS dataset (Coucke et al., 2018) is the starting point for the work presented in this paper. An alternative approach to manual or semi-automatic labeling is the one proposed by the data scientists of the Snorkel project with Snorkel DryBell (Bach et al., 2018), which aims at automating the labeling through data programming. Other works have explored the possibility of creating datasets in one language starting from datasets in other languages, such as (Jabaian et al., 2010) and (Stepanov et al., 2013). Regarding the Italian language, two main works can be pointed out (Raymond et al., 2008; Vanzo et al., 2016). Our work differs mainly in the application domain (i.e., we focus on the voice assistant scenario). In particular, (Raymond et al., 2008) mainly focuses on dialogues in a customer service scenario, while (Vanzo et al., 2016) focuses on Human-Robot interaction.

3 Almawave-SLU: A new dataset for Italian SLU

We created the new dataset starting from the SNIPS dataset (Coucke et al., 2018), which is in English. (The Almawave-SLU dataset is available for download; to obtain it, please send an e-mail to the authors.) It contains 14,484 annotated examples (13,084, 700 and 700 for training, validation and test, respectively) with respect to 7 intents and 39 slots. Table 1 shows an excerpt of the dataset. We started from this dataset because: i) it contains a reasonable amount of examples; ii) it is multi-domain; iii) we believe it represents a more realistic setting in today's voice assistant scenario.

AddToPlaylist         Add the song virales de siempre by the cary brothers to my gym playlist.
BookRestaurant        I want to book a top-rated brasserie for 7 people.
GetWeather            What kind of weather will be in Ukraine one minute from now?
PlayMusic             Play Subconscious Lobotomy from Jennifer Paull.
RateBook              Rate The children of Niobe 1 out of 6 points.
SearchCreativeWork    Looking for a creative work called Plant Ecology
SearchScreeningEvent  Is Bartok the Magnificent playing at seven AM?

Table 1: Examples from the SNIPS dataset. The first column indicates the intent, the second column contains an example.

We performed a semi-automatic procedure consisting of two phases: an automatic translation with contextual alignment of intents and slots, and a manual validation of the translations and annotations. The resulting dataset, i.e., Almawave-SLU, has fewer training examples than the original, for a total of 7,142 sentences, with the same number of validation and test examples as the original dataset. Again, 7 intents and 39 slots are annotated. Table 2 shows the distribution of examples for each intent.

                      Train  Train-R  Valid  Test
AddToPlaylist           744      185    100   124
BookRestaurant          967      250    100    92
GetWeather              791      195    100   104
PlayMusic               972      240    100    86
RateBook                765      181    100    80
SearchCreativeWork      752      172    100   107
SearchScreeningEvent    751      202    100   107

Table 2: Almawave-SLU dataset statistics. Train-R is the reduced training set.

3.1 Translation and Annotation

In the first phase, we translated each English example into Italian using the Translator Text API, part of the Microsoft Azure Cognitive Services. In order to create a more valuable resource in Italian, we also performed an automatic substitution of the names of movies, movie theatres, books, restaurants and locations with Italian counterparts. First, we collected from the Web a set E of about 20,000 Italian versions of such entities; then, we substituted each entity in the sentences of the dataset with one randomly chosen from E.

After the translation, an automatic annotation was performed. The intent associated with the English sentence was copied to its Italian counterpart. Slots were transferred by aligning the source and target tokens (the alignment is provided by the Translator API) and copying the corresponding slot annotation. In case of exceptions, e.g., multiple alignments on the same token or a missing alignment, we left the token without annotation. A sketch of this step is shown below.
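The following is a minimal sketch of the two automatic steps just described, i.e., the entity substitution and the alignment-based slot projection. It is our illustration, not the actual pipeline: the span and alignment formats are assumptions, and in the real procedure the alignment comes from the Translator API.

```python
import random

def substitute_entity(tokens, span, italian_entities):
    """Replace the entity tokens in [start, end) with a randomly chosen
    Italian counterpart from the collected set E (hypothetical format)."""
    start, end = span
    replacement = random.choice(italian_entities).split()
    return tokens[:start] + replacement + tokens[end:]

def project_slots(src_tags, alignment, n_tgt_tokens):
    """Copy IOB slot tags from source to target tokens via an alignment
    mapping each source-token index to a list of target-token indices.
    Target tokens with no alignment, or hit by multiple alignments,
    are left unannotated ('O'), as described in Section 3.1."""
    tgt_tags = ["O"] * n_tgt_tokens
    hits = [0] * n_tgt_tokens  # how many source tokens map to each target
    for src_idx, tgt_indices in alignment.items():
        for tgt_idx in tgt_indices:
            hits[tgt_idx] += 1
            tgt_tags[tgt_idx] = src_tags[src_idx]
    for i, count in enumerate(hits):
        if count > 1:  # multiple alignments on the same token
            tgt_tags[i] = "O"
    return tgt_tags

# Toy usage: "play with or without you by u2" ->
#            "riproduci with or without you degli u2"
src_tags = ["O", "B-song", "I-song", "I-song", "I-song", "O", "B-artist"]
alignment = {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6]}
print(project_slots(src_tags, alignment, 7))
```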
3.2 Human Revision

In the second phase, the dataset was divided into 6 different sets, each containing about 1,190 sentences. Each set was assigned to 2 annotators (a total of 6 annotators were available), and each annotator was asked to review the translation from English to Italian and the reliability of the automatic annotation. The guideline was to consider an annotation valid only when both the alignment and the semantic slots were correct. Moreover, a semantic consistency check was also performed, e.g., between served dish and restaurant type, between city and region, or between song and singer. The 2 annotators were used to cross-check the annotations, in order to provide more reliable revisions. When the 2 annotators disagreed, the annotations were validated by a third, different annotator (a schematic sketch of this protocol is given at the end of this subsection).

During the validation phase some interesting phenomena emerged (some inconsistencies were present in the original dataset as well). For example, there were cases of inconsistency between the restaurant name and the type of served dish, when the name of the restaurant mentioned the kind of food served, e.g., "Prenota un tavolo da Pizza Party per mangiare noodles" ("Book a table at Pizza Party to eat noodles"). There were also wrong associations between the type of restaurant and the service requested, e.g., "Prenota nell'area piscina per 4 persone in un camion-ristorante" ("Book the pool area for 4 people at a food truck"); a food truck is actually a van equipped for serving fast food in the street. Again, among the cases of unlikely associations resulting from the automatic replacement, there is the inconsistency between temperatures and cities, in cases like "snow in the Sahara". Another type of problem occurred when the same slot was used to identify very different objects. For example, for the intent SearchCreativeWork, the slot object_name was used for paintings, games, movies, and so on. Consider a couple of examples for this intent: "Can you find me the work, The Curse of Oak Island?" and "Can you find me, Hey Man?". The first contains The Curse of Oak Island, which is a television series, while the second refers to Hey Man, which is a music album; both are labeled as object_name, although the object_type values are different and not specified. In all these cases, the annotators were asked to correct the sentences and the annotations accordingly. Finally, in the case of the BookRestaurant intent, a manual revision was made when a city and its state coexisted in the same sentence: to make the data more relevant to the Italian language, the region corresponding to the city was changed, e.g., "I need a table for 5 at a highly rated gastropub in Saint Paul, MN" is translated and adapted for Italian as "Vorrei prenotare un tavolo per 5 in un gastropub molto apprezzato a Biella, Piemonte".
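As referenced above, the revision flow can be summarized as follows. This is a schematic sketch of the protocol only; the callables stand for human annotators, not real tooling.

```python
def adjudicate(example, annotator_a, annotator_b, third_annotator):
    """Two independent revisions are cross-checked; on disagreement,
    a third annotator validates the example (schematic, Section 3.2)."""
    revision_a = annotator_a(example)
    revision_b = annotator_b(example)
    if revision_a == revision_b:
        return revision_a
    return third_annotator(example)
```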
3.3 Automatic Translation Analysis

In many cases, the machine translation lacked context awareness: this is not an easy task, due to phenomena such as polysemy, homonymy, metaphors and idioms. Lexical ambiguities arise when a word has more than one meaning, which can produce wrong interpretations. For example, the verb "to play" can mean "spend time doing enjoyable things", as in "using toys and taking part in games", but also "perform music" or "perform the part of a character".

Human intervention was needed to preserve the meaning of text that depends on cultural and situational contexts, and the annotators fixed several kinds of translation errors. For example, the automatic translation of the sentence "Play Have You Met Miss Jones by Nicole from Google Music." was "Gioca hai incontrato Miss Jones di Nicole da Google Music.", but the correct Italian version is "Riproduci Have You Met Miss Jones di Nicole da Google Music.". In this case, the wrong translation of the verb "play" produces a meaningless sentence.

Often, translation errors are due to the presence of prepositions, which have the same function in Italian as in English but cannot be translated one-to-one. Each preposition covers a group of related senses, some of which are very close, while others are rather weak and distant. For example, the Italian preposition "di" can have six different English counterparts: of, by, about, from, at, and than. In the SNIPS dataset, the sentence "I need a table for 2 on feb. 18 at Main Deli Steak House" was translated as "Ho bisogno di un tavolo per 2 su Feb. 18 presso Main Deli Steak House". Here, the translation of "on" is wrong: the correct Italian version should render it as "il". Another example of a wrong preposition translation is the sentence "What will the weather be one month from now in Chad?": the automatic translation of "one month from now" is "un mese da ora", but the correct translation is "tra un mese".

Common errors also concerned the translation of temporal expressions, which differ between Italian and English. For example, the translation of the sentence "Book a table in Fiji for zero a.m." was "Prenotare un tavolo in Fiji per zero a.m.", but in Italian "zero a.m." is "mezzanotte" (midnight).

Other errors were specific to some intents, as their sentences tend to contain more slang. For example, the translation of GetWeather sentences was problematic because the main verb is often misinterpreted, while in the sentences related to the BookRestaurant intent a frequent failure occurred in the interpretation of prepositions. For example, the sentence "Will it get chilly in North Creek Forest?" was translated as "Otterrà freddo in North Creek Forest?", while the correct translation is "Farà freddo a North Creek Forest?". In this case, the system misinterpreted the context, assigning the wrong meaning to "get".
4 Benchmarking SLU Systems

Nowadays there are several human-machine interaction platforms, both commercial and open source. Machine learning algorithms enable these systems to understand natural language utterances, match them to intents, and extract structured data. We used the Almawave-SLU dataset with the following SLU systems.

4.1 SLU Systems

RASA. RASA (ras, 2019) is an open source alternative to popular NLP tools for the classification of intents and the extraction of entities. Rasa provides a set of high-level APIs to produce a language parser through the use of NLP and ML libraries, via the configuration of the pipeline and of the embeddings. It is very fast to train and does not require great computing power, yet it obtains excellent results.

LUIS. The Language Understanding service (msl, 2019) allows the construction of applications that receive natural language input and extract its meaning through Machine Learning algorithms. LUIS was chosen because it also provides an easy-to-use graphical interface dedicated to less experienced users. For this system the computation is done entirely remotely and no configuration is needed.

Watson Assistant. IBM's Watson Assistant (wat, 2019) is a white-label cloud service that allows software developers to embed a virtual assistant, based on Watson AI machine learning and NLU, in their software. Watson Assistant allows customers to protect the information gathered through user interactions in a private cloud. It was chosen because it was conceived for an industrial market and for its long tradition in this task.

DialogFlow. Dialogflow (dia, 2019) is a Google service to build engaging voice and text-based conversational interfaces, powered by a natural language understanding (NLU) engine. Dialogflow makes it easy to connect the bot service to a number of channels and runs on the Google Cloud Platform, so it can scale to hundreds of millions of users. DialogFlow was chosen due to its wide distribution and the ease of use of its interface.

Bert-Joint. It is a SOTA approach to SLU adopting a joint Deep Learning architecture within an attention-based framework (Castellucci et al., 2019). It exploits the successful Bidirectional Encoder Representations from Transformers (BERT) model to pre-train language representations. In (Castellucci et al., 2019), the authors extend the BERT model in order to perform the two tasks of ID and SF jointly. In particular, two classifiers are trained jointly on top of the BERT representations by means of a specific loss function.
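To illustrate the joint formulation, here is a minimal PyTorch sketch of such a head under our assumptions (one linear classifier per task, summed cross-entropy losses, the [CLS] vector for the intent). It is a reconstruction of the idea, not the implementation released with (Castellucci et al., 2019); a real setup would also mask padding and sub-word positions in the slot loss.

```python
import torch.nn as nn

class JointIntentSlotHead(nn.Module):
    """Schematic joint ID/SF head on top of BERT encodings of shape
    (batch, seq_len, hidden_size), with [CLS] at position 0."""

    def __init__(self, hidden_size, n_intents, n_slot_tags, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # on the final hidden states
        self.intent_classifier = nn.Linear(hidden_size, n_intents)
        self.slot_classifier = nn.Linear(hidden_size, n_slot_tags)

    def forward(self, hidden, intent_labels=None, slot_labels=None):
        hidden = self.dropout(hidden)
        intent_logits = self.intent_classifier(hidden[:, 0])  # [CLS] token
        slot_logits = self.slot_classifier(hidden)            # every token
        if intent_labels is None:
            return intent_logits, slot_logits
        ce = nn.CrossEntropyLoss()
        # The "specific loss function": both objectives optimized jointly.
        return (ce(intent_logits, intent_labels)
                + ce(slot_logits.view(-1, slot_logits.size(-1)),
                     slot_labels.view(-1)))
```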
4.2 Experimental Setup

Almawave-SLU was used to train and evaluate Rasa, LUIS, Watson Assistant, DialogFlow and Bert-Joint. A second evaluation was carried out on 3 different training sets of reduced size with respect to Almawave-SLU, i.e., Train-R, each containing about 1,400 sentences equally distributed over the intents.

The train/validation/test split used for the evaluations is 5,742 (1,400 for Train-R), 700 and 700 examples, respectively. Regarding Rasa, we used version 1.0.7 and adopted the standard "supervised embeddings" pipeline, since it is the one recommended in the official documentation. This pipeline consists of a WhitespaceTokenizer (which we modified to avoid filtering out punctuation tokens), a regex featurizer, a Conditional Random Field to extract entities, a bag-of-words featurizer and an intent classifier. LUIS was tested against the api v2.0, and the training data was loaded with LUIS APP VERSION 0.1. Unfortunately, Watson Assistant supports only English models for the annotation of contextual entities, i.e., slots; therefore, for this system we only measured the intents (see Table 3; entity feature support details at https://cloud.ibm.com/docs/services/assistant?topic=assistant-language-support). Regarding DialogFlow, a "Standard" (free) agent was created with API version 2, and the Python library "dialogflow" was used for the predictions (https://cloud.google.com/dialogflow/docs/reference/rest/v2/projects.agent.intents#Part). DialogFlow allows the choice between a pure ML mode ("ML only") and a hybrid rule-based and ML mode ("match mode"); we chose the ML mode. Regarding the Bert-Joint system, a pre-trained BERT model is adopted, which is available on the BERT authors' website (https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip). This model is composed of 12 layers, and the size of the hidden state is 768. The multi-head self-attention has 12 heads, for a total of 110M parameters. As suggested in (Castellucci et al., 2019), we adopted a dropout strategy applied to the final hidden states before the intent/slot classifiers. We tuned the following hyper-parameters over the validation set: (i) the number of epochs, among (5, 10, 20, 50); (ii) the dropout keep probability, among (0.5, 0.7, 0.9). We adopted the Adam optimizer (Kingma and Ba, 2015) with parameters β1 = 0.9, β2 = 0.999, L2 weight decay 0.01 and learning rate 2e-5, over batches of size 64.

4.3 Experimental Results

Table 3 shows the performance of the systems. The SF performance is measured with F1, while the ID and Sentence performances are measured with accuracy. We also show an evaluation carried out with models trained on the three different reduced-size splits derived from the whole dataset; in that case, the reported value is the average of the measurements obtained separately on the entire test dataset.

                    Eval-1 with Train set     Eval-2 with Train-R set
System              Intent   Slot  Sentence   Intent   Slot  Sentence
Rasa                 96.42  85.40     65.76    93.84  78.58     52.25
LUIS                 95.99  79.47     50.57    94.46  72.51     35.53
Watson Assistant     96.56      -         -    95.03      -         -
Dialogflow           95.56  74.62     46.16    93.60  65.23     36.68
Bert-Joint           97.60  90.00     77.10    96.13  83.04     65.23

Table 3: Overall scores for Intent and Slot.
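For clarity on how we read these measures: the Sentence column is an exact-match accuracy, counting an utterance as correct only when both its intent and all of its slots are predicted correctly. A minimal sketch of the three scores under this reading (the span-level F1 is a simplified stand-in for the usual CoNLL-style chunk scorer):

```python
def intent_accuracy(gold_intents, pred_intents):
    """ID score: fraction of utterances with the correct intent."""
    return sum(g == p for g, p in zip(gold_intents, pred_intents)) \
        / len(gold_intents)

def slot_f1(gold_slots, pred_slots):
    """SF score: micro-averaged F1 over sets of (slot type, span)
    tuples, one set per utterance."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_slots, pred_slots):
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def sentence_accuracy(gold_intents, pred_intents, gold_slots, pred_slots):
    """Sentence score: intent and all slots must both be correct."""
    hits = sum(gi == pi and gs == ps
               for gi, pi, gs, ps in zip(gold_intents, pred_intents,
                                         gold_slots, pred_slots))
    return hits / len(gold_intents)
```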
Regarding the ID task, all models perform similarly, but the Bert-Joint score is slightly higher than the others. For the SF task, notice that there are significant differences among the LUIS, DialogFlow and Rasa performances.

Finally, Bert-Joint achieved the top score on the joint classification in the assessments with both sizes of the training set. The adaptation of nominal entities to Italian may have amplified the problem for the other models.

5 Conclusion

The contributions of this work are two-fold. First, we presented and released the first Italian SLU dataset (Almawave-SLU) in the voice assistant context. It is composed of 7,142 sentences annotated with respect to intents and slots, almost equally distributed over the 7 different intents. The effort spent on the construction of this new resource, following the semi-automatic procedure described, is about 24 FTE (Full Time Equivalent), with an average production of about 300 examples per day. We consider this effort lower than the typical effort needed to create linguistic resources from scratch.

Second, we compared some of the most popular NLU services on this data. The results show that they all have similar features and performance. However, compared to a specific architecture for SLU, i.e., Bert-Joint, they perform worse. This was expected, and it demonstrates that Almawave-SLU can be a valuable dataset to train and test SLU systems for the Italian language. In the future, we hope to continuously improve the data and to extend the dataset.

6 Acknowledgment

The authors would like to thank David Alessandrini, Silvana De Benedictis, Raffaele Mazzocca, Roberto Pellegrini and Federico Wolenski for their support in the annotation, revision and evaluation phases.

References

Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander J. Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher Re, and Rob Malkin. 2018. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. CoRR, abs/1812.00417.

Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. 2017. Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174-185, Saarbrucken, Germany, August. Association for Computational Linguistics.

Giuseppe Castellucci, Valentina Bellomaria, Andrea Favalli, and Raniero Romagnoli. 2019. Multi-lingual intent detection and slot filling in a joint BERT-based model. CoRR, abs/1907.02884.

Alice Coucke, Alaa Saade, Adrien Ball, Theodore Bluche, Alexandre Caulier, David Leroy, Clement Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Mael Primet, and Joseph Dureau. 2018. Snips Voice Platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR, abs/1805.10190.

2019. Google Dialogflow. https://dialogflow.com.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley.

2019. IBM Watson Assistant v1. https://cloud.ibm.com/apidocs/assistant.

Bassam Jabaian, Laurent Besacier, and Fabrice Lefèvre. 2010. Investigating multiple approaches for SLU portability to a new language. In INTERSPEECH.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

2019. Microsoft LUIS on Azure. https://azure.microsoft.com/it-it/services/cognitive-services/language-understanding-intelligent-service/.

2019. Rasa: Open source conversational AI. https://rasa.com/.

Christian Raymond, Kepa Joseba Rodriguez, and Giuseppe Riccardi. 2008. Active annotation in the LUNA Italian corpus of spontaneous dialogues. In LREC 2008.

Evgeny Stepanov, Ilya Kashkarev, Ali Orkan Bayer, Giuseppe Riccardi, and Arindam Ghosh. 2013. Language style and domain adaptation for cross-language SLU porting. Pages 144-149, December.

G. Tur, D. Hakkani-Tur, and L. Heck. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop, pages 19-24, December.

Andrea Vanzo, Danilo Croce, Giuseppe Castellucci, Roberto Basili, and Daniele Nardi. 2016. Spoken language understanding for service robotics in Italian. In Giovanni Adorni, Stefano Cagnoni, Marco Gori, and Marco Maratea, editors, AI*IA 2016 Advances in Artificial Intelligence, pages 477-489, Cham. Springer International Publishing.