The Perfect Recipe: Add SUGAR, Add Data

Simone Magnolini (1,2), Vevake Balaraman (1,3), Marco Guerini (1), Bernardo Magnini (1)
(1) Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
(2) AdeptMind Scholar
(3) University of Trento, Italy
{magnolini, balaraman, guerini, magnini}@fbk.eu

Abstract

English. We present the FBK participation in the EVALITA 2018 Shared Task "SUGAR – Spoken Utterances Guiding Chef's Assistant Robots". The task has two peculiar, and challenging, characteristics: first, the amount of available training data is very limited; second, training consists of pairs [audio-utterance, system-action], without any intermediate representation. Given these characteristics, we experimented with two different approaches: (i) designing and implementing a neural architecture that can use as little training data as possible, and (ii) using a state-of-the-art tagging system and augmenting the initial training set with synthetically generated data. In the paper we present the two approaches and show the results obtained by their respective runs.

Italiano. We present the FBK participation in the shared task "SUGAR – Spoken Utterances Guiding Chef's Assistant Robots" at EVALITA 2018. The task has two peculiar characteristics: first, the amount of training data is very limited; second, training consists of pairs [audio-utterance, system-action], without any intermediate representation. Given the characteristics of the task, we experimented with two different approaches: (i) the design and implementation of a neural architecture that can use as little training data as possible; (ii) the use of a state-of-the-art tagging system, augmented with synthetically generated data. In this contribution we present the two approaches and show the results obtained in their respective runs.

1 Introduction

In the last few years, voice-controlled systems have attracted great interest, both in research and in industrial projects, resulting in many applications such as Virtual Assistants and Conversational Agents. Voice control makes it possible to develop solutions for contexts where the user is busy and cannot operate a traditional graphical interface, for instance while driving a car or while cooking, as in the SUGAR task.

The traditional approach to Spoken Language Understanding (SLU) is based on a pipeline that combines several components:

• An automatic speech recognizer (ASR), which converts the spoken user utterance into text.

• A Natural Language Understanding (NLU) component, which takes the ASR output as input and produces a set of instructions to be used to operate on the system backend (e.g. a knowledge base).

• A Dialogue Manager (DM), which selects the appropriate state of the dialogue, based on the context of previous interactions.

• A domain Knowledge Base (KB), which is accessed in order to retrieve information relevant to the user request.

• An utterance generation component, which produces a text in natural language from the dialogue state and the KB response.

• Finally, a text-to-speech (TTS) component, which generates a spoken response to the user on the basis of the text produced by the utterance generation component.

While the pipeline approach has proven very effective in a large range of task-oriented applications, in the last years several deep learning architectures have been experimented with, resulting in a strong push toward so-called end-to-end approaches (Graves and Jaitly, 2014; Zeghidour et al., 2018). One of the main advantages of end-to-end approaches is that they avoid the independent training of the various components of the SLU pipeline, thus reducing both the need for human annotations and the risk of error propagation among components. However, despite their encouraging results, end-to-end approaches still need a significant amount of training data, which is often not available for the task at hand. This is also the case in the SUGAR task: since training data are rather limited, end-to-end approaches are not directly applicable.

Our contribution to the SUGAR task focuses mainly on the NLU component, since we make use of an off-the-shelf ASR component. In particular, we experimented with two approaches: (i) the implementation of a neural NLU architecture that can use as little training data as possible (described in Section 4), and (ii) the use of a state-of-the-art neural tagging system whose initial training data have been augmented with synthetically generated data (described in Sections 5 and 6).

2 Task and Data description

In the SUGAR task (Di Maro et al., 2018) the system's goal is to understand a set of commands in the context of a voice-controlled robotic agent that acts as a cooking assistant. In this scenario the user cannot interact through a "classical" interface, because he/she is supposed to be cooking. The training data set is a corpus of annotated utterances; spoken sentences are annotated only with the appropriate command for the robot. Transcriptions from speech to text are not available.

The corpus was collected in a 3D virtual environment, designed as a real kitchen, where users give commands to the robot assistant to accomplish some recipes. During data collection users are inspired by silent cooking videos, which should ensure a more natural spoken production. Videos are segmented into short portions (frames) that contain a single action and are sequentially shown to users, who have to utter a single sentence after each frame. The user's goal is to guide the robot to accomplish the same action seen in the frame. The resulting dataset is a list of utterances describing the actions needed to prepare three different recipes. While utterances are totally free, the commands are selected from a finite set of possible actions, which may refer either to ingredients or to tools. Audio files were recorded in a real acoustic environment, with a microphone placed at about 1 m from the different speakers. The final corpus contains audio files for the three recipes, grouped by speaker and segmented into sentences representing isolated commands (although a few audio files may contain multiple actions, e.g. "add while mixing").

3 Data Pre-processing

The SUGAR dataset consists of a collection of audio files that needs to be pre-processed in several ways. The first step is ASR, i.e., transcription from audio to text. For this step we made use of an external ASR, selected among the ones easily available with a Python implementation. We chose the Google API, based on a comparative study of different ASR systems (Këpuska and Bohouta, 2017); we conducted some sample tests to make sure that the ASR ranking is reasonable for Italian as well, and we confirmed our choice.

After this step, we split the dataset into training, development and test sets; in fact, the SUGAR corpus is a single collection and comes with no train-dev-test split. Although two rounds of 80-20 splits are quite standard (80% of the dataset forms the training and development set, which is split 80-20 again, while 20% forms the test set), in the SUGAR task we split the dataset in a more complex way. The dataset is composed of only three different recipes (i.e. a small set of ingredients and similar sequences of operations), and with a classical 80-20 split the training, development and test sets would have been too different from the final set (the one used to evaluate the system), which is composed of new recipes, with new ingredients and new sequences of operations. To deal with this peculiar characteristic, we decided to use the first recipe as test set and the other two as train-dev sets. The final split of the data resulted in 1142 utterance-command pairs for training, 291 pairs for development and 286 pairs for test.

Finally, we substituted all the prepositions with an apostrophe in the corpus (e.g. "d'", "l'", "un'") with their corresponding form without apostrophe (e.g. "di", "lo", "una"). This substitution helps the classifiers to correctly tokenize the utterances.

In order to take advantage of the structure of the dialogue in the dataset, we added up to three previous interactions to every line of the corpus. Such previous interactions are supposed to be useful to correctly label a sample, because either an ingredient or a verb can appear in a previous utterance while being implied in the current one. The implication is formalized in the dataset: implied entities (actions or arguments) are surrounded by ∗.
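The history construction just described can be sketched as follows (a minimal sketch: the "#" separator and the most-recent-first ordering follow the dataset format described in this section, while the function name and the list-based interface are illustrative):

```python
def add_history(utterances, max_history=3, sep=" # "):
    """For each utterance in a recipe, prepend up to `max_history`
    previous utterances (most recent first), joined by `sep`."""
    samples = []
    for i, current in enumerate(utterances):
        # take up to max_history previous utterances, most recent first
        history = utterances[max(0, i - max_history):i][::-1]
        samples.append(sep.join(history + [current]))
    return samples

# Toy example with four consecutive commands from one recipe
utts = ["un filo di olio nella padella",
        "e poi verso lo uovo nella padella",
        "gira la frittata",
        "togli la frittata dal fuoco"]
print(add_history(utts)[-1])
```

Utterances at the beginning of a recipe simply get a shorter history, so the first line of each recipe is left unchanged.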
The decision to keep a "conversation history" of at most three utterances derives from a first formalization of the task, in which the maximum history for every utterance was set to three previous interactions. Even though this constraint was relaxed in the final version of the task, we kept it in our system. In addition, a sample test on the data confirms the intuition that a history of three utterances is usually enough to understand a new utterance. For the sake of clarity, we report below a line of the pre-processed dataset:

un filo di olio nella padella # e poi verso lo uovo nella padella # gira la frittata # togli la frittata dal fuoco

where the first three utterances are the history in reverse order, and the last one is the current utterance.

4 System 1: Memory + Pointer Networks

The first system presented by FBK is based on a neural model similar to the architecture proposed by (Madotto et al., 2018), which implements an encoder-decoder approach. The encoder consists of a Gated Recurrent Unit (GRU) (Cho et al., 2014) that encodes the user sentence into a latent representation. The decoder consists of a combination of (i) a MemNN that generates tokens from the output vocabulary, and (ii) a Pointer network (Vinyals et al., 2015) that chooses which token from the input is to be copied to the output.

4.1 Encoder

Each word of the input sentence x from the user is represented in a high-dimensional space using an embedding matrix A. These representations are encoded by a Gated Recurrent Unit. The GRU takes in the current word at time t and the previous hidden state of the encoder to yield the representation at time t. Formally,

h_t = GRU(h_{t-1}, x_t)

where x_t is the current word at time t and h_{t-1} is the previous hidden state of the network. The final hidden state of the network is then passed on to the decoder.

4.2 Decoder

The input sentences, denoted by x_1, x_2, ..., x_n, are represented as memories r_1, r_2, ..., r_n using an embedding matrix R. A query h_t at time t is generated by a Gated Recurrent Unit (GRU) (Cho et al., 2014) that takes as input the previously generated output word y^_{t-1} and the previous query h_{t-1}. Formally:

h_t = GRU(y^_{t-1}, h_{t-1})

The initial query h_0 is the final output vector o produced by the encoder. The query h_t is then used as the reading head over the memories. At each time step t, the model generates two probabilities, namely P_vocab and P_ptr. P_vocab denotes the probability over all the words in the vocabulary and is defined as follows:

P_vocab(y^_t) = Softmax(W h_t)

where W is a parameter learned during training. The probability over the input words is denoted by P_ptr and is calculated using the attention weights of the MemNN network. Formally:

P_ptr(y^_t) = a_t        a_{t,i} = Softmax(h_t^T r_i)

By generating the two probabilities P_vocab and P_ptr, the model learns both how to generate words from the output vocabulary and how to copy words from the input sequence. Though it is possible to learn a gating function to combine the two distributions, as in (Merity et al., 2016), this model uses a hard gate. A sentinel token $ is appended to the input sequence during training, and the pointer network is trained to maximize the P_ptr probability of the sentinel for tokens that should be generated from the output vocabulary. If the sentinel token is chosen by P_ptr, the model switches to P_vocab to generate a token; otherwise the input token selected by P_ptr is emitted as the output token. Though the MemNN can be modelled with n hops, the nature of the SUGAR task and several experiments we carried out showed that adding more hops is not useful; as a consequence, the model is implemented with a single hop, as explained above. We use the pre-trained embeddings from (Bojanowski et al., 2016) to train the model.
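A single decoding step with the hard gate described above can be sketched as follows (a minimal numpy sketch: the sentinel convention follows the paper, while the greedy argmax decoding, the array shapes and the function names are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(h_t, memories, W, input_tokens, vocab, sentinel="$"):
    """One hard-gated decoding step.

    h_t:          query vector from the decoder GRU, shape (d,)
    memories:     embedded input tokens r_1..r_n (sentinel included), shape (n, d)
    W:            output projection, shape (|V|, d)
    input_tokens: the n input tokens aligned with `memories`
    vocab:        output vocabulary (list of words)
    """
    # P_ptr: attention of the query over the input memories
    p_ptr = softmax(memories @ h_t)
    ptr_choice = input_tokens[int(p_ptr.argmax())]
    if ptr_choice == sentinel:
        # sentinel selected: fall back to generating from the vocabulary
        p_vocab = softmax(W @ h_t)
        return vocab[int(p_vocab.argmax())]
    # otherwise copy the input token selected by the pointer
    return ptr_choice
```

At training time the pointer distribution is pushed toward the sentinel whenever the gold token is not present in the input, so that at inference the hard gate reproduces the copy-or-generate behaviour.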
5 System 2: Fairseq

The second system experimented by FBK is based on the work in (Gehring et al., 2017). In particular, we make use of the Python implementation of the toolkit known as Fairseq(-py), available at https://github.com/pytorch/fairseq. The toolkit is implemented in PyTorch and provides reference implementations of various sequence-to-sequence models, with configurations for several tasks, including translation, language modeling and story generation. In our experiments we use the toolkit as a black box, since our goal is to obtain a dataset that can be used with this system; hence we use the generic model (not designed for any specific task) without fine-tuning. Moreover, we do not add any specific feature or tuning for the implicit arguments (the ones surrounded by ∗), but let the system learn the rule by itself.

A common approach in sequence learning is to encode the input sequence with a series of bi-directional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks, and to generate a variable-length output with another set of decoder RNNs, not necessarily of the same type; encoder and decoder interface via an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015). Convolutional networks, on the other hand, create representations for fixed-size contexts, which can be seen as a disadvantage compared to RNNs. However, the context size of a convolutional network can be expanded by stacking layers on top of each other, which makes it possible to control the maximum length of the dependencies to be modeled. Furthermore, convolutional networks allow parallelization over the elements of a sequence, because they do not depend on the computations of the previous time step; this contrasts with RNNs, which maintain a hidden state of the entire past that prevents parallel computation within a sequence. This can dramatically reduce the training time of the system without reducing performance, as shown in (Gehring et al., 2017).

The weak point of the system is that it needs a considerable amount of training data to create reasonable models. In fact, Fairseq(-py) trained on the SUGAR dataset alone does not converge: it gets stuck after some epochs, producing pseudo-random sequences. Due to the small size of the SUGAR training set, combined with its low variability (training data consist of possible variations of only two recipes), it is impossible for the system to learn the correct structure of the commands (e.g. balancing the parentheses) or to generalize arguments. In order to use this system effectively, we expanded the SUGAR dataset with the data augmentation techniques presented in Section 6.

6 Data augmentation

Overfitting is still an open issue in neural models, especially in situations of data sparsity. In NLP, regularization methods are typically applied to the network (Srivastava et al., 2014; Le et al., 2015) rather than to the training data. However, in some application fields data augmentation has proven fundamental in improving the performance of neural models facing insufficient data. The first fields to explore data augmentation were computer vision and speech recognition, where well-established techniques for synthesizing data now exist: in the former, rescaling or affine distortions (LeCun et al., 1998; Krizhevsky et al., 2012); in the latter, adding background noise or applying small time shifts (Deng et al., 2000; Hannun et al., 2014). In NLP, data augmentation has received little attention so far, some notable exceptions being feature noising (Wang et al., 2013) and noising schemes inspired by Kneser-Ney smoothing (Xie et al., 2017). Additionally, negative example generation has been used in (Guerini et al., 2018).

In this paper we build upon the ideas of the aforementioned papers, moving a step forward by taking advantage of the structured nature of the SUGAR task and of some domain/linguistic knowledge.
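As a concrete illustration of the substitution strategies listed below, most-similar token substitution can be sketched as follows (a minimal sketch with toy vectors: in the actual experiments the neighbours come from a pre-trained embedding space and the substitution probability is 30%, while the function name and the brute-force neighbour search are illustrative):

```python
import random

import numpy as np

def most_similar_substitution(tokens, embeddings, p=0.3, k=5, seed=0):
    """Replace each token, with probability `p`, by one of its `k`
    nearest neighbours (cosine similarity) in the embedding space."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in embeddings and rng.random() < p:
            v = embeddings[tok]
            sims = {}
            for other, w in embeddings.items():
                if other == tok:
                    continue
                sims[other] = float(np.dot(v, w) /
                                    (np.linalg.norm(v) * np.linalg.norm(w)))
            # k nearest neighbours, then pick one at random for variability
            nearest = sorted(sims, key=sims.get, reverse=True)[:k]
            out.append(rng.choice(nearest))
        else:
            # out-of-vocabulary tokens and unsampled tokens are kept as-is
            out.append(tok)
    return out
```

Sampling among the top k neighbours, rather than always taking the single closest one, adds variability when the same token occurs many times in the training data.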
In particular, we used the following methods to expand the vocabulary and the size of the training data, applying substitution strategies to the original data:

• most-similar token substitution: based on a similarity mechanism (i.e. embeddings);

• synonym token substitution: synonymy relations taken from an online dictionary and applied to specific tokens;

• entity substitution: entities in the examples are replaced with random entities of the same type taken from available gazetteers.

The first approach substitutes a token of a training example with one of its five most similar tokens (chosen at random), found through cosine similarity in the embedding space described in (Pennington et al., 2014). We use the top five candidates in order to add variability, since many tokens appear multiple times in the training data. If the token also appeared as an argument in the command, it was substituted there as well, while if it appeared as an action it was left unchanged. This approach was applied with a probability of 30% to each token of the utterances in the training data.

The second approach was applied to verbs recognized in the training utterances with the TextPro PoS tagger (Pianta et al., 2008). Such verbs were substituted with one possible synonym taken from an electronic dictionary (http://www.sinonimi-contrari.it/). Also in this case, the action in the command was kept the same (in fact the verbs in the utterance are usually paired with the action in the command). The third approach substitutes ingredients in the text with other random ingredients from a list of foods (Magnini et al., 2018). In this case the ingredient was modified accordingly in the annotation of the sentence as well.

These methodologies make it possible to generate several variants starting from a single sentence. While the first approach was used in isolation, the second and the third were used together to generate additional artificial training data. Doing so, we obtained two different datasets: the first is composed of 45680 utterance-command pairs (most-similar token substitution applied forty times per example, 1142 x 40); the second contains 500916 pairs (each original sentence had each verb replaced at least 3 times, and for each of these variants ingredients were randomly substituted twice); the high number of variants is due to the inclusion of the history of three previous utterances in the process.

7 Results

                                 Actions   Arguments
Memory + Pointer Networks
  - Data Augmentation            65.091    30.856
  + Data Augmentation            65.396    35.786
  Fine Tuning                    66.158    36.102
Fairseq
  + Data Augmentation            66.361    46.221

Table 1: Accuracy of the two experimented approaches in recognizing actions and their arguments.

Results of the two approaches are reported in Table 1. Both approaches obtain higher accuracy in recognizing actions than in recognizing arguments. Fairseq trained with augmented data is the top performer of the task, outperforming the other approach by more than 10 accuracy points on arguments. The ablation test on Memory + Pointer Networks also shows the importance of data augmentation for tasks with low resources, and in particular of fine-tuning the classifier with the new data.

8 Conclusion and Future Work

We presented the FBK participation in the EVALITA 2018 Shared Task "SUGAR – Spoken Utterances Guiding Chef's Assistant Robots". Given the characteristics of the task, we experimented with two different approaches: (i) a neural architecture based on memory and pointer networks that can use as little training data as possible, and (ii) a state-of-the-art tagging system, Fairseq, trained with several augmentation techniques that expand the initial training set with synthetically generated data. The second approach seems promising, and in future work we want to investigate more deeply the effect of the different data augmentation techniques on performance.

Acknowledgments

This work has been partially supported by the AdeptMind scholarship.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Li Deng, Alex Acero, Mike Plumpe, and Xuedong Huang. 2000. Large-vocabulary speech recognition under adverse acoustic environments. In Sixth International Conference on Spoken Language Processing.

Maria Di Maro, Antonio Origlia, and Francesco Cutugno. 2018. Overview of the EVALITA 2018 Spoken Utterances Guiding Chef's Assistant Robots (SUGAR) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.

Marco Guerini, Simone Magnolini, Vevake Balaraman, and Bernardo Magnini. 2018. Toward zero-shot entity recognition in task-oriented conversational agents. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 317–326, Melbourne, Australia, July.

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

Veton Këpuska and Gamal Bohouta. 2017. Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx). Journal of Engineering Research and Application, 7(3):20–24.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv preprint arXiv:1804.08217.

Bernardo Magnini, Vevake Balaraman, Mauro Dragoni, Marco Guerini, Simone Magnolini, and Valerio Piccioni. 2018. CH1: A conversational system to calculate carbohydrates in a meal. In Proceedings of the 17th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2018).

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May. European Language Resources Association (ELRA).

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Sida Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D. Manning. 2013. Feature noising for log-linear structured prediction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1170–1179.

Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.

Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, and Emmanuel Dupoux. 2018. End-to-end speech recognition from the raw waveform. In Interspeech 2018, September.