Almawave-SLU: a New Dataset for SLU in Italian

Valentina Bellomaria, Giuseppe Castellucci, Andrea Favalli and Raniero Romagnoli
Language Technology Lab, Almawave srl
[first name initial].[last name]@almawave.it

(Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).)

Abstract

The widespread use of conversational and question answering systems has made it necessary to improve the performance of speaker intent detection and the understanding of the related semantic slots, i.e., Spoken Language Understanding (SLU). These tasks are often approached with supervised learning methods, which need considerable labeled datasets. This paper presents the first Italian dataset for SLU in the voice assistant scenario. It is the product of a semi-automatic procedure and is used as a benchmark for various open source and commercial systems.

1 Introduction

Conversational interfaces, e.g., Google's Home or Amazon's Alexa, are becoming pervasive in daily life. As an important part of any conversation, language understanding aims at extracting the meaning a partner is trying to convey. Spoken Language Understanding (SLU) plays a fundamental role in such a scenario. Generally speaking, in SLU a spoken utterance is first transcribed, and then semantic information is extracted from it. Language understanding, i.e., extracting a semantic "frame" from a transcribed user utterance, typically involves: i) Intent Detection (ID) and ii) Slot Filling (SF) (Tur et al., 2010). The former classifies a user utterance into an intent, i.e., the purpose of the user. The latter finds the "arguments" of that intent. As an example, consider Figure 1, where the user asks to play a song (Intent=PlayMusic; "with or without you", Slot=song) by an artist ("U2", Slot=artist).

[Figure 1: An example of Slot Filling in IOB format for a sentence with intent PlayMusic.]
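To make the ID/SF formalism concrete, the following is a minimal sketch of the IOB (Inside-Outside-Beginning) encoding behind Figure 1; the exact tokenization and tag names are our reconstruction of the figure, not an excerpt from the dataset.

```python
# Hypothetical IOB encoding of the Figure 1 utterance: each token is
# tagged as Beginning or Inside a slot, or as Outside any slot.
utterance = ["play", "with", "or", "without", "you", "by", "u2"]
iob_tags = ["O", "B-song", "I-song", "I-song", "I-song", "O", "B-artist"]
intent = "PlayMusic"

for token, tag in zip(utterance, iob_tags):
    print(f"{token}\t{tag}")
print(f"intent: {intent}")
```

ID amounts to predicting `intent` for the whole utterance, while SF amounts to predicting one IOB tag per token.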
Usually, supervised learning methods are adopted for SLU. Their efficacy strongly depends on the availability of labeled data. There are various approaches to the production of labeled data, depending on the intricacy of the problem, on the characteristics of the data, and on the available resources (e.g., annotators, time and budget). When the reuse of existing public data is not feasible, manual labeling must be carried out, possibly automating part of the labeling process.

In this work, we present the first public dataset for SLU in the Italian language. It is generated by a semi-automatic procedure from an existing English dataset annotated with intents and slots. We translated the sentences into Italian and transferred the annotations with a token span algorithm. Then, the translations, the spans and the consistency of the Italian entities were manually validated. Finally, the dataset is used as a benchmark for NLU systems. In particular, we compare a recent state-of-the-art (SOTA) approach (Castellucci et al., 2019) with Rasa (ras, 2019), taken from the open source world, and with IBM Watson Assistant (wat, 2019), Google DialogFlow (dia, 2019) and Microsoft LUIS (msl, 2019), some commercial solutions in use.

In the following, section 2 discusses related work; section 3 describes the dataset generation; section 4 presents the experiments; finally, section 5 draws the conclusions.

2 Related Work

SLU has been addressed in the Natural Language Processing community mainly for the English language. A well-known dataset used to demonstrate and benchmark various NLU algorithms is the Airline Travel Information System (ATIS) dataset (Hemphill et al., 1990), which consists of spoken queries on flight-related information. In (Braun et al., 2017) three datasets for the intent classification task were presented: the AskUbuntu Corpus and the Web Application Corpus were extracted from StackExchange, while the third one, i.e., the Chatbot Corpus, originated from a Telegram chatbot. The newer multi-intent SNIPS dataset (Coucke et al., 2018) is the starting point for the work presented in this paper. An alternative approach to manual or semi-automatic labeling is the one proposed by the data scientists of the Snorkel project with Snorkel DryBell (Bach et al., 2018), which aims at automating the labeling through data programming. Other works have explored the possibility of creating datasets in one language starting from datasets in other languages, such as (Jabaian et al., 2010) and (Stepanov et al., 2013). Regarding the Italian language, two main works can be pointed out (Raymond et al., 2008; Vanzo et al., 2016). Our work differs mainly in the application domain (i.e., we focus on the voice assistant scenario). In particular, (Raymond et al., 2008) mainly focuses on dialogues in a customer service scenario, while (Vanzo et al., 2016) focuses on Human-Robot interaction.

3 Almawave-SLU: A new dataset for Italian SLU

We created the new dataset starting from the SNIPS dataset (Coucke et al., 2018), which is in English. (The Almawave-SLU dataset is available for download; to obtain it, please send an e-mail to the authors.) It contains 14,484 annotated examples (13,084, 700 and 700 for training, validation and test, respectively) with respect to 7 intents and 39 slots. Table 1 shows an excerpt of the dataset. We started from this dataset because: i) it contains a reasonable amount of examples; ii) it is multi-domain; iii) we believe it represents a more realistic setting in today's voice assistant scenario.

AddToPlaylist         Add the song virales de siempre by the cary brothers to my gym playlist.
BookRestaurant        I want to book a top-rated brasserie for 7 people.
GetWeather            What kind of weather will be in Ukraine one minute from now?
PlayMusic             Play Subconscious Lobotomy from Jennifer Paull.
RateBook              Rate The children of Niobe 1 out of 6 points.
SearchCreativeWork    Looking for a creative work called Plant Ecology
SearchScreeningEvent  Is Bartok the Magnificent playing at seven AM?

Table 1: Examples from the SNIPS dataset. The first column indicates the intent, the second column contains an example.

We performed a semi-automatic procedure consisting of two phases: an automatic translation with contextual alignment of intents and slots, and a manual validation of the translations and annotations. The resulting dataset, i.e., Almawave-SLU, has fewer training examples than the original, for a total of 7,142 sentences, with the same number of validation and test examples as the original dataset. Again, 7 intents and 39 slots are annotated. Table 2 shows the distribution of examples for each intent.

                      Train  Train-R  Valid  Test
AddToPlaylist           744      185    100   124
BookRestaurant          967      250    100    92
GetWeather              791      195    100   104
PlayMusic               972      240    100    86
RateBook                765      181    100    80
SearchCreativeWork      752      172    100   107
SearchScreeningEvent    751      202    100   107

Table 2: Almawave-SLU dataset statistics. Train-R is the reduced training set.

3.1 Translation and Annotation

In the first phase, we translated each English example into Italian using the Translator Text API, part of the Microsoft Azure Cognitive Services. In order to create a more valuable resource in Italian, we also performed an automatic substitution of the names of movies, movie theatres, books, restaurants and locations with Italian counterparts. First, we collected from the Web a set E of about 20,000 Italian versions of such entities; then, we substituted each entity in the sentences of the dataset with one randomly chosen from E.

After the translation, an automatic annotation was performed. The intent associated with the English sentence was copied to its Italian counterpart. Slots were transferred by aligning the source and target tokens (the alignment is provided by the Translator API) and copying the corresponding slot annotation. In case of exceptions, e.g., multiple alignments on the same token or a missing alignment, we left the token without annotation. A sketch of this step is shown below.
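The following is a minimal sketch of the two automatic steps just described, i.e., the entity substitution and the alignment-based slot projection. It is our illustration, not the actual pipeline: the span and alignment formats are assumptions, and in the real procedure the alignment comes from the Translator API.

```python
import random

def substitute_entity(tokens, span, italian_entities):
    """Replace the entity tokens in [start, end) with a randomly chosen
    Italian counterpart from the collected set E (hypothetical format)."""
    start, end = span
    replacement = random.choice(italian_entities).split()
    return tokens[:start] + replacement + tokens[end:]

def project_slots(src_tags, alignment, n_tgt_tokens):
    """Copy IOB slot tags from source to target tokens via an alignment
    mapping each source-token index to a list of target-token indices.
    Target tokens with no alignment, or hit by multiple alignments,
    are left unannotated ('O'), as described in Section 3.1."""
    tgt_tags = ["O"] * n_tgt_tokens
    hits = [0] * n_tgt_tokens  # how many source tokens map to each target
    for src_idx, tgt_indices in alignment.items():
        for tgt_idx in tgt_indices:
            hits[tgt_idx] += 1
            tgt_tags[tgt_idx] = src_tags[src_idx]
    for i, count in enumerate(hits):
        if count > 1:  # multiple alignments on the same token
            tgt_tags[i] = "O"
    return tgt_tags

# Toy usage: "play with or without you by u2" ->
#            "riproduci with or without you degli u2"
src_tags = ["O", "B-song", "I-song", "I-song", "I-song", "O", "B-artist"]
alignment = {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6]}
print(project_slots(src_tags, alignment, 7))
```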
3.2 Human Revision

In the second phase, the dataset was divided into 6 different sets, each containing about 1,190 sentences. Each set was assigned to 2 annotators (a total of 6 annotators were available), and each annotator was asked to review the translation from English to Italian and the reliability of the automatic annotation. The guideline was to consider an annotation valid only when both the alignment and the semantic slots were correct. Moreover, a semantic consistency check was also performed, e.g., between served dish and restaurant type, between city and region, or between song and singer. The 2 annotators were used to cross-check the annotations, in order to provide more reliable revisions. When the 2 annotators disagreed, the annotations were validated by a third, different annotator (a schematic sketch of this protocol is given at the end of this subsection).

During the validation phase some interesting phenomena emerged (some inconsistencies were present in the original dataset as well). For example, there were cases of inconsistency between the restaurant name and the type of served dish, when the name of the restaurant mentioned the kind of food served, e.g., "Prenota un tavolo da Pizza Party per mangiare noodles" ("Book a table at Pizza Party to eat noodles"). There were also wrong associations between the type of restaurant and the service requested, e.g., "Prenota nell'area piscina per 4 persone in un camion-ristorante" ("Book the pool area for 4 people at a food truck"); a food truck is actually a van equipped for serving fast food in the street. Again, among the cases of unlikely associations resulting from the automatic replacement, there is the inconsistency between temperatures and cities, in cases like "snow in the Sahara". Another type of problem occurred when the same slot was used to identify very different objects. For example, for the intent SearchCreativeWork, the slot object_name was used for paintings, games, movies, and so on. Consider a couple of examples for this intent: "Can you find me the work, The Curse of Oak Island?" and "Can you find me, Hey Man?". The first contains The Curse of Oak Island, which is a television series, while the second refers to Hey Man, which is a music album; both are labeled as object_name, although the object_type values are different and not specified. In all these cases, the annotators were asked to correct the sentences and the annotations accordingly. Finally, in the case of the BookRestaurant intent, a manual revision was made when a city and its state coexisted in the same sentence: to make the data more relevant to the Italian language, the region corresponding to the city was changed, e.g., "I need a table for 5 at a highly rated gastropub in Saint Paul, MN" is translated and adapted for Italian as "Vorrei prenotare un tavolo per 5 in un gastropub molto apprezzato a Biella, Piemonte".
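As referenced above, the revision flow can be summarized as follows. This is a schematic sketch of the protocol only; the callables stand for human annotators, not real tooling.

```python
def adjudicate(example, annotator_a, annotator_b, third_annotator):
    """Two independent revisions are cross-checked; on disagreement,
    a third annotator validates the example (schematic, Section 3.2)."""
    revision_a = annotator_a(example)
    revision_b = annotator_b(example)
    if revision_a == revision_b:
        return revision_a
    return third_annotator(example)
```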
3.3 Automatic Translation Analysis

In many cases, the machine translation lacked context awareness: this is not an easy task, due to phenomena such as polysemy, homonymy, metaphors and idioms. Lexical ambiguities arise when a word has more than one meaning, which can produce wrong interpretations. For example, the verb "to play" can mean "spend time doing enjoyable things", as in "using toys and taking part in games", but also "perform music" or "perform the part of a character".

Human intervention was needed to preserve the meaning of text that depends on cultural and situational contexts, and the annotators fixed several kinds of translation errors. For example, the automatic translation of the sentence "Play Have You Met Miss Jones by Nicole from Google Music." was "Gioca hai incontrato Miss Jones di Nicole da Google Music.", but the correct Italian version is "Riproduci Have You Met Miss Jones di Nicole da Google Music.". In this case, the wrong translation of the verb "play" produces a meaningless sentence.

Often, translation errors are due to the presence of prepositions, which have the same function in Italian as in English but cannot be translated one-to-one. Each preposition covers a group of related senses, some of which are very close, while others are rather weak and distant. For example, the Italian preposition "di" can have six different English counterparts: of, by, about, from, at, and than. In the SNIPS dataset, the sentence "I need a table for 2 on feb. 18 at Main Deli Steak House" was translated as "Ho bisogno di un tavolo per 2 su Feb. 18 presso Main Deli Steak House". Here, the translation of "on" is wrong: the correct Italian version should render it as "il". Another example of a wrong preposition translation is the sentence "What will the weather be one month from now in Chad?": the automatic translation of "one month from now" is "un mese da ora", but the correct translation is "tra un mese".

Common errors also concerned the translation of temporal expressions, which differ between Italian and English. For example, the translation of the sentence "Book a table in Fiji for zero a.m." was "Prenotare un tavolo in Fiji per zero a.m.", but in Italian "zero a.m." is "mezzanotte" (midnight).

Other errors were specific to some intents, as their sentences tend to contain more slang. For example, the translation of GetWeather sentences was problematic because the main verb is often misinterpreted, while in the sentences related to the BookRestaurant intent a frequent failure occurred in the interpretation of prepositions. For example, the sentence "Will it get chilly in North Creek Forest?" was translated as "Otterrà freddo in North Creek Forest?", while the correct translation is "Farà freddo a North Creek Forest?". In this case, the system misinterpreted the context, assigning the wrong meaning to "get".
4 Benchmarking SLU Systems

Nowadays there are several human-machine interaction platforms, both commercial and open source. Machine learning algorithms enable these systems to understand natural language utterances, match them to intents, and extract structured data. We used the Almawave-SLU dataset with the following SLU systems.

4.1 SLU Systems

RASA. RASA (ras, 2019) is an open source alternative to popular NLP tools for the classification of intents and the extraction of entities. Rasa provides a set of high-level APIs to produce a language parser through the use of NLP and ML libraries, via the configuration of the pipeline and of the embeddings. It is very fast to train and does not require great computing power, yet it obtains excellent results.

LUIS. The Language Understanding service (msl, 2019) allows the construction of applications that receive natural language input and extract its meaning through Machine Learning algorithms. LUIS was chosen because it also provides an easy-to-use graphical interface dedicated to less experienced users. For this system the computation is done entirely remotely and no configuration is needed.

Watson Assistant. IBM's Watson Assistant (wat, 2019) is a white-label cloud service that allows software developers to embed a virtual assistant, based on Watson AI machine learning and NLU, in their software. Watson Assistant allows customers to protect the information gathered through user interactions in a private cloud. It was chosen because it was conceived for an industrial market and for its long tradition in this task.

DialogFlow. Dialogflow (dia, 2019) is a Google service to build engaging voice and text-based conversational interfaces, powered by a natural language understanding (NLU) engine. Dialogflow makes it easy to connect the bot service to a number of channels and runs on the Google Cloud Platform, so it can scale to hundreds of millions of users. DialogFlow was chosen due to its wide distribution and the ease of use of its interface.

Bert-Joint. It is a SOTA approach to SLU adopting a joint Deep Learning architecture within an attention-based framework (Castellucci et al., 2019). It exploits the successful Bidirectional Encoder Representations from Transformers (BERT) model to pre-train language representations. In (Castellucci et al., 2019), the authors extend the BERT model in order to perform the two tasks of ID and SF jointly. In particular, two classifiers are trained jointly on top of the BERT representations by means of a specific loss function.
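To illustrate the joint formulation, here is a minimal PyTorch sketch of such a head under our assumptions (one linear classifier per task, summed cross-entropy losses, the [CLS] vector for the intent). It is a reconstruction of the idea, not the implementation released with (Castellucci et al., 2019); a real setup would also mask padding and sub-word positions in the slot loss.

```python
import torch.nn as nn

class JointIntentSlotHead(nn.Module):
    """Schematic joint ID/SF head on top of BERT encodings of shape
    (batch, seq_len, hidden_size), with [CLS] at position 0."""

    def __init__(self, hidden_size, n_intents, n_slot_tags, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # on the final hidden states
        self.intent_classifier = nn.Linear(hidden_size, n_intents)
        self.slot_classifier = nn.Linear(hidden_size, n_slot_tags)

    def forward(self, hidden, intent_labels=None, slot_labels=None):
        hidden = self.dropout(hidden)
        intent_logits = self.intent_classifier(hidden[:, 0])  # [CLS] token
        slot_logits = self.slot_classifier(hidden)            # every token
        if intent_labels is None:
            return intent_logits, slot_logits
        ce = nn.CrossEntropyLoss()
        # The "specific loss function": both objectives optimized jointly.
        return (ce(intent_logits, intent_labels)
                + ce(slot_logits.view(-1, slot_logits.size(-1)),
                     slot_labels.view(-1)))
```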
4.2 Experimental Setup

Almawave-SLU was used to train and evaluate Rasa, LUIS, Watson Assistant, DialogFlow and Bert-Joint. A second evaluation was carried out on 3 different training sets of reduced size with respect to Almawave-SLU, i.e., Train-R, each containing about 1,400 sentences equally distributed over the intents.

The train/validation/test split used for the evaluations is 5,742 (1,400 for Train-R), 700 and 700 examples, respectively. Regarding Rasa, we used version 1.0.7 and adopted the standard "supervised embeddings" pipeline, since it is the one recommended in the official documentation. This pipeline consists of a WhitespaceTokenizer (which we modified to avoid filtering out punctuation tokens), a regex featurizer, a Conditional Random Field to extract entities, a bag-of-words featurizer and an intent classifier. LUIS was tested against the api v2.0, and the training data was loaded with LUIS APP VERSION 0.1. Unfortunately, Watson Assistant supports only English models for the annotation of contextual entities, i.e., slots; therefore, for this system we only measured the intents (see Table 3; entity feature support details at https://cloud.ibm.com/docs/services/assistant?topic=assistant-language-support). Regarding DialogFlow, a "Standard" (free) agent was created with API version 2, and the Python library "dialogflow" was used for the predictions (https://cloud.google.com/dialogflow/docs/reference/rest/v2/projects.agent.intents#Part). DialogFlow allows the choice between a pure ML mode ("ML only") and a hybrid rule-based and ML mode ("match mode"); we chose the ML mode. Regarding the Bert-Joint system, a pre-trained BERT model is adopted, which is available on the BERT authors' website (https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip). This model is composed of 12 layers, and the size of the hidden state is 768. The multi-head self-attention has 12 heads, for a total of 110M parameters. As suggested in (Castellucci et al., 2019), we adopted a dropout strategy applied to the final hidden states before the intent/slot classifiers. We tuned the following hyper-parameters over the validation set: (i) the number of epochs, among (5, 10, 20, 50); (ii) the dropout keep probability, among (0.5, 0.7, 0.9). We adopted the Adam optimizer (Kingma and Ba, 2015) with parameters β1 = 0.9, β2 = 0.999, L2 weight decay 0.01 and learning rate 2e-5, over batches of size 64.

4.3 Experimental Results

Table 3 shows the performance of the systems. The SF performance is measured with F1, while the ID and Sentence performances are measured with accuracy. We also show an evaluation carried out with models trained on the three different reduced-size splits derived from the whole dataset; in that case, the reported value is the average of the measurements obtained separately on the entire test dataset.

                    Eval-1 with Train set     Eval-2 with Train-R set
System              Intent   Slot  Sentence   Intent   Slot  Sentence
Rasa                 96.42  85.40     65.76    93.84  78.58     52.25
LUIS                 95.99  79.47     50.57    94.46  72.51     35.53
Watson Assistant     96.56      -         -    95.03      -         -
Dialogflow           95.56  74.62     46.16    93.60  65.23     36.68
Bert-Joint           97.60  90.00     77.10    96.13  83.04     65.23

Table 3: Overall scores for Intent and Slot.
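For clarity on how we read these measures: the Sentence column is an exact-match accuracy, counting an utterance as correct only when both its intent and all of its slots are predicted correctly. A minimal sketch of the three scores under this reading (the span-level F1 is a simplified stand-in for the usual CoNLL-style chunk scorer):

```python
def intent_accuracy(gold_intents, pred_intents):
    """ID score: fraction of utterances with the correct intent."""
    return sum(g == p for g, p in zip(gold_intents, pred_intents)) \
        / len(gold_intents)

def slot_f1(gold_slots, pred_slots):
    """SF score: micro-averaged F1 over sets of (slot type, span)
    tuples, one set per utterance."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_slots, pred_slots):
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def sentence_accuracy(gold_intents, pred_intents, gold_slots, pred_slots):
    """Sentence score: intent and all slots must both be correct."""
    hits = sum(gi == pi and gs == ps
               for gi, pi, gs, ps in zip(gold_intents, pred_intents,
                                         gold_slots, pred_slots))
    return hits / len(gold_intents)
```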
Regarding the ID task, all models perform similarly, but the Bert-Joint score is slightly higher than the others. For the SF task, notice that there are significant differences among the LUIS, DialogFlow and Rasa performances.

Finally, Bert-Joint achieved the top score on the joint classification in the assessments with both sizes of the training set. The adaptation of nominal entities to Italian may have amplified the problem for the other models.

5 Conclusion

The contributions of this work are two-fold. First, we presented and released the first Italian SLU dataset (Almawave-SLU) in the voice assistant context. It is composed of 7,142 sentences annotated with respect to intents and slots, almost equally distributed over the 7 different intents. The effort spent on the construction of this new resource, following the semi-automatic procedure described, is about 24 FTE (Full Time Equivalent), with an average production of about 300 examples per day. We consider this effort lower than the typical effort needed to create linguistic resources from scratch.

Second, we compared some of the most popular NLU services on this data. The results show that they all have similar features and performance. However, compared to a specific architecture for SLU, i.e., Bert-Joint, they perform worse. This was expected, and it demonstrates that Almawave-SLU can be a valuable dataset to train and test SLU systems for the Italian language. In the future, we hope to continuously improve the data and to extend the dataset.

6 Acknowledgment

The authors would like to thank David Alessandrini, Silvana De Benedictis, Raffaele Mazzocca, Roberto Pellegrini and Federico Wolenski for their support in the annotation, revision and evaluation phases.

References

Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander J. Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher Re, and Rob Malkin. 2018. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. CoRR, abs/1812.00417.

Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. 2017. Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174-185, Saarbrucken, Germany, August. Association for Computational Linguistics.

Giuseppe Castellucci, Valentina Bellomaria, Andrea Favalli, and Raniero Romagnoli. 2019. Multi-lingual intent detection and slot filling in a joint BERT-based model. CoRR, abs/1907.02884.

Alice Coucke, Alaa Saade, Adrien Ball, Theodore Bluche, Alexandre Caulier, David Leroy, Clement Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Mael Primet, and Joseph Dureau. 2018. Snips Voice Platform: an embedded spoken language understanding system for private-by-design voice interfaces. CoRR, abs/1805.10190.

2019. Google Dialogflow. https://dialogflow.com.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley.

2019. IBM Watson Assistant v1. https://cloud.ibm.com/apidocs/assistant.

Bassam Jabaian, Laurent Besacier, and Fabrice Lefèvre. 2010. Investigating multiple approaches for SLU portability to a new language. In INTERSPEECH.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

2019. Microsoft LUIS on Azure. https://azure.microsoft.com/it-it/services/cognitive-services/language-understanding-intelligent-service/.

2019. Rasa: Open source conversational AI. https://rasa.com/.

Christian Raymond, Kepa Joseba Rodriguez, and Giuseppe Riccardi. 2008. Active annotation in the LUNA Italian corpus of spontaneous dialogues. In LREC 2008.

Evgeny Stepanov, Ilya Kashkarev, Ali Orkan Bayer, Giuseppe Riccardi, and Arindam Ghosh. 2013. Language style and domain adaptation for cross-language SLU porting. Pages 144-149, December.

G. Tur, D. Hakkani-Tur, and L. Heck. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop, pages 19-24, December.

Andrea Vanzo, Danilo Croce, Giuseppe Castellucci, Roberto Basili, and Daniele Nardi. 2016. Spoken language understanding for service robotics in Italian. In Giovanni Adorni, Stefano Cagnoni, Marco Gori, and Marco Maratea, editors, AI*IA 2016 Advances in Artificial Intelligence, pages 477-489, Cham. Springer International Publishing.