=Paper=
{{Paper
|id=Vol-1649/63
|storemode=property
|title=Recurrent Neural Networks for Dialogue State Tracking
|pdfUrl=https://ceur-ws.org/Vol-1649/63.pdf
|volume=Vol-1649
|authors=Ondřej Plátek, Petr Bělohlávek, Vojtěch Hudeček, Filip Jurčíček
|dblpUrl=https://dblp.org/rec/conf/itat/PlatekBHJ16
}}
==Recurrent Neural Networks for Dialogue State Tracking==
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 63–67 http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, c 2016 O. Plátek, P. Bělohlávek, V. Hudeček, F. Jurčíček Recurrent Neural Networks for Dialogue State Tracking Ondřej Plátek, Petr Bělohlávek, Vojtěch Hudeček, and Filip Jurčíček Charles University in Prague, Faculty of Mathematics and Physics {oplatek,jurcicek}@ufal.mff.cuni.cz, me@petrbel.cz, vojta.hudecek@gmail.com, http://ufal.mff.cuni.cz/ondrej-platek ... Abstract: This paper discusses models for dialogue state Dial. state n: food:None, area:None, pricerange:None tracking using recurrent neural networks (RNN). We System: What part of town do you have in mind? present experiments on the standard dialogue state track- User: West part of town. ing (DST) dataset, DSTC2 [7]. On the one hand, RNN Dial. state n+1: food:None, area:west, pricerange:None models became the state of the art models in DST, on System: What kind of food would you like? the other hand, most state-of-the-art DST models are only User: Indian turn-based and require dataset-specific preprocessing (e.g. Dial. state n+2: food:Indian, area:west, pricerange:None DSTC2-specific) in order to achieve such results. We im- System: India House is a nice place in the west of town plemented two architectures which can be used in an in- serving tasty Indian food. cremental setting and require almost no preprocessing. We ... compare their performance to the benchmarks on DSTC2 and discuss their properties. With only trivial preprocess- Figure 1: Example of golden annotation of Dialogue Act ing, the performance of our models is close to the state-of- Items (DAIs). The dialogue act items comprise from act the-art results.1 type (all examples have type inform) and slots (food, area, pricerange) and their values (e.g. Indian, west, None). 1 Introduction This paper compares two different RNN architectures for dialogue state tracking (see Section 3). We describe Dialogue state tracking (DST) is a standard and important state-of-the art word-by-word dialogue state tracker archi- task for evaluating task-oriented conversational agents [18, tectures and propose to use a new encoder-decoder archi- 7, 8]. Such agents play the role of a domain expert in a nar- tecture for the DST task (see Section 4.2). row domain, and users ask for information through con- We focus only on the goal slot predictions because the versation in natural language (see the example system and other groups are trivial to predict2 . user responses in Figure 1). A dialogue state tracker sum- We also experiment with re-splitting of the DSTC2 data marizes the dialogue history and maintains a probability because there are considerable differences between the distribution over the (possible) user’s goals (see annotation standard train and test datasets [7]. Since the training, de- in Figure 1). Dialogue agents as introduced in [20] decide velopment and test set data are distributed differently, the about the next action based on the dialogue state distribu- resulting performance difference between training and test tion given by the tracker. User’s goals are expressed in data is rather high. Based on our experiments, we con- a formal language, typically represented as a dialogue act clude that DSTC2 might suggest a too pessimistic view items (DAIs) (see Section 2) and the tracker updates prob- of the state-of-the-art methods in dialogue state tracking ability for each item. The dialogue state is a latent vari- caused by the data distribution mismatch. able [20] and one needs to label the conversations in order to train a dialogue state tracker using supervised learning. It was shown that with a better dialogue state tracker, con- 2 Dialogue state tracking on DSTC2 dataset versation agents achieve better success rate in overall com- pletion of the their task [11]. Dialogue state trackers maintain their believes beliefs about users’ goals by updating probabilities of dialogue 1 Acknowledgment: We thank Mirek Vodolán and Ondřej Dušek history representations. In the DSTC2 dataset, the his- for useful comments. This research was partly funded by the Ministry tory is captured by dialogue act items and their probabili- of Education, Youth and Sports of the Czech Republic under the grant ties. A Dialogue act item is a triple of the following form agreement LK11221, core research funding, grant GAUK 1915/2015, and also partially supported by SVV project number 260 333. We grate- (actionType, slotName, slotValue). fully acknowledge the support of NVIDIA Corporation with the donation The DSTC2 is a standard dataset for DST, and most of of the Tesla K40c GPU used for this research. Computational resources the state-of-the-art systems in DST have reported their per- were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme “Projects of Large 2 The slots Requested and Method have accuracies 0.95 and 0.95 on Research, Development, and Innovations Infrastructures”. the test set according to the state-of-the-art [19]. 64 O. Plátek, P. Bělohlávek, V. Hudeček, F. Jurčíček formance on this dataset [7]. The full dataset is freely (price range, food, area) available since January 2014 and it contains 1612 dia- logues in the training set, 506 dialogues in the develop- ment set and 1117 dialogues in the test set.3 The con- versations are manually annotated at the turn level where h1 h2 hT the hidden information state is expressed in form of (actionType, slotName, slotValue) based on the domain X1 X2 XT ontology. The task of the domain is defined by a database of restaurants and their properties4 . The database and Figure 2: The joint label predictions using RNN from last the manually designed ontology that captures a restaurant hidden state hT . The hT represents the whole dialog his- domain are both distributed with the dataset. tory of T words. The RNN takes as input for each word i an embedding and binary features concatenated to vec- 3 Models tor Xi . Our models are all based on a RNN encoder [17]. The food models update their hidden states h after processing each price range area word similarly to the RNN encoder of Žilka and Jurčíček [16]. The encoder takes as inputs the previous state ht−1 , representing history for first t − 1 words, and features Xt for the current word wt . It outputs the current state ht rep- h1 h2 hT resenting the whole dialogue history up to current word. We use a Gated Recurrent Unit (GRU) cell [5] as the up- X1 X2 XT date function instead of a simple RNN cell because it does not suffer from the vanishing gradient problem [10]. The Figure 3: The RNN encodes the word history into dialogue model optimizes its parameters including word embed- state hT and predicts slot values independently. dings [3] during training. For each input token, our RNN encoder reads the word embedding of this token along with several binary fea- 3.1 Independent classifiers for each label tures. The binary features for each word are: The independent model (see Figure 3.1) consists of three • the speaker role, representing either user or system, models which predict f ood, area and pricerange based on • and also indicators describing whether the word is the last hidden state hT independently. The independent part of a named entity representing a value from the slot prediction that uses one classifier per slot is straight- database. forward to implement, but the model introduces an unre- alistic assumption of uncorrelated slot properties. In case Since the DSTC2 database is a simple table with six of DSTC2 and the Cambridge restaurant domain, it is hard columns, we introduce six binary features firing if the to believe that, e.g., the slots area and pricerange are not word is a substring of named entity from the given col- correlated. umn. For example, the word indian will not only trigger We also experimented with a single classifier which pre- the feature for column f ood and its value indian but also dicts the labels jointly (see Figure 3) but it suffers from for column restaurant name and its value indian heaven. data sparsity of the predicted tuples, so we focused only The features make the data dense by abstracting the mean- on the independent label prediction and encoder-decoder ing from the lexical level. models. Our model variants differ only in the way they pre- dict goal labels, i.e., f ood, area and pricerange from the RNN’s last encoded state.5 The first model predicts the 3.2 Encoder-decoder framework output slot labels independently by employing three inde- We cast the slot predictions problem as a sequence-to- pendent classifiers (see Section 3.1). The second model sequence predictions task and we use a encoder-decoder uses a decoder in order to predict values one after the other model with attention [2] to learn this representation to- from the hT (see Section 3.2). gether with slot predictions (see Figure 4). To our knowl- The models were implemented using the Tensor- edge, we are the first who used this model for dialogue Flow [1] framework. state tracking. The model is successfully used in machine 3 Available online at http://camdial.org/~mh521/dstc/. translation where it is able to handle long sequences with 4 There are six columns in the database: name, food, price_range, good accuracy [2]. In DST, it captures correlation between area, telephone, address. 5 Accuracy measure with schedule 2 on slot f ood, area and the decoded slots easily. By introducing the encoder- pricerange about which users can inform the system is a featured metric decoder architecture, we aim to overcome the data sparsity for DSTC2 challenge [7]. problem and the incorrect independence assumptions. Recurrent Neural Networks for Dialogue State Tracking 65 Joint labels distribution a1 a2 a3 a4 600 500 400 # of occurences h1 h2 hT h'1 h'2 h'3 h'4 300 200 X1 X2 XT Slot1 Slot2 Slot3 EOS 100 Figure 4: Encoder decoder with attention predicts goals. 0 0 100 200 300 400 500 600 700 800 900 Labels sorted according # of occurences Figure 5: The number occurrences of labels in form of We employ the encoder RNN cell that captures the his- ( f ood, area, pricerange) triples from the least to the most tory of the dialogue which is represented as a sequence of frequent. words from the user and the system. The words are fed to the encoder as they appear in the dialogue - turn by turn - where the user and the system responses switch regularly. 4.1 Training The encoder updates its internal state hT after each pro- cessed word. The RNN decoder model is used when the The training procedure minimizes the cross-entropy loss system needs to generate its output, in our case it is at the function using the Adam optimizer [12] with a batch size end of the user response. The decoder generates arbitrary of 10. We train predicting the goal slot values for each length sequences of words given the encoded state hT step turn. We treat each dialogue turn as a separate training by step. In each step, an output word and a new hidden example, feeding the whole dialogue history up to the cur- state hT +1 is generated. The generation process is finished rent turn into the encoder and predicting the slot labels for when a special token End of Sequence (EOS) is decoded. the current turn. This mechanism allows the model to terminate the output We use early stopping with patience [14], validating on sequence. The attention part of model is used at decoding the development set after each epoch and stopping if the time for weighting the importance of the history. three top models does not change for four epochs. The disadvantage of this model is its complexity. The predicted labels in DST task depend not only on Firstly, the model is not trivial to implement6 . Secondly, the last turn, but on the dialogue full history as well. Since the decoding time is asymptotically quadratic in the length the lengths of dialogue histories vary a lot7 and we batch of the decoded sequences, but our target sequences are al- our inputs, we separated the dialogues into ten buckets ac- ways four tokens long nevertheless. cordingly to their lengths in order to provide a computa- tional speed-up. We reshuffle the data after each epoch only within each bucket. 4 Experiments In informal experiments, we tried to speed-up the train- ing by optimizing the parameters only on the last turn8 but The results are reported on the standard DSTC2 data split the performance dropped relatively by more than 40%. where we used 516 dialogues as a validation set for early stopping [14] and the remaining 1612 dialogues for train- ing. We use 1-best Automatic Speech Recognition (ASR) 4.2 Comparing models transcriptions of conversation history of the input and mea- Predicting the labels jointly is quite challenging because sure the joint slot accuracy. The models are evaluated us- the distribution of the labels is skewed as demonstrated ing the recommended measure accuracy [7] with schedule in Figure 5. Some of the labels combinations are very 2 which skips the first turns where the believe tracker does rare, and they occur only in the development and test set not track any values. In addition, our models are also eval- so the joint model is not able to predict them. During first uated on a randomly split DSTC2 dataset (see Section 4.3). informal experiments the joint model performed poorly For all our experiments, we train word embeddings of arguably due to data sparsity of slot triples. We further size 100 and use the encoder state size of size 100, together focus on model with independent classifiers and encoder- with a dropout keep probability of 0.7 for both encoder decoder architecture. inputs and outputs. These parameters are selected by a grid The model with independent label prediction is a strong search over the hyper-parameters on development data. baseline which was used, among others, in work of Žilka and Jurčíček [16]. The model suffers less from dataset 7 The maximum dialogue history length is 487 words and 95% per- centile is 205 words for the training set. 8 The prediction was conditioned on the full history but we back- 6 We modified code from the TensorFlow ‘seq2seq‘ module. propagated the error only in words within the last turn. 66 O. Plátek, P. Bělohlávek, V. Hudeček, F. Jurčíček Model Dev set Test set 5 Related work Indep 0.892 0.727 Since there are numerous systems which reported on the EncDec 0.867 0.730 DSTC2 dataset, we discuss only the systems which use Vodolán et al. [15] - 0.745 RNNs. In general, the RNN systems achieved excellent Žilka and Jurčíček [16] 0.69 0.72 results. Henderson et al. [6] - 0.737 Our system is related to the RNN tracker of Žilka and DSTC2 stacking ensemble [7] - 0.789 Jurčíček [16], which reported near state-of-the art results on the DSTC2 dataset and introduced the first incremental Table 1: Accuracy on DSTC2 dataset. The first group con- system which was able to update the dialogue state word- tains our systems which use ASR output as input, the sec- by-word with such accuracy. In contrast to work of Žilka ond group lists systems using also ASR hypothesis as in- and Jurčíček [16], we use no abstraction of slot values. In- put. The third group shows the results for ensemble model stead, we add the additional features as described in Sec- using ASR output nd also live language understanding an- tion 3. The first system which used a neural network for notations. dialogue state tracking [6] used a feed-forward network and more than 10 manually engineered features across dif- ferent levels of abstraction of the user input, including Model Dev set Test set the outputs of the spoken language understanding com- Indep 0.87 0.89 ponent (SLU). In our work, we focus on simplifying the EncDec 0.94 0.91 architecture, hence we used only features which were ex- plicitly given by the dialogue history word representation Table 2: Accuracy of our models on the re-split DSTC2 and the database. data. The system of Henderson et al. [9] achieves the state- of-the-art results and, similarly to our system, it predicts mismatch because it does not model the correlation be- the dialogue state from words by employing a RNN. On tween predicted labels. This property can explain a smaller the other hand, their system heavily relies on the user in- performance drop between the test set from reshuffled data put abstraction. Another dialogue state tracker with LSTM and the official test set in comparison to encoder-decoder was used in the reinforcement setting but the authors also model. used information from the SLU pipeline [13]. Since the encoder-decoder architecture is very general An interesting approach is presented in the work and can predict arbitrary output sequences, it also needs of Vodolán et al. [15], who combine a rule-based and a to learn how to predict only three slot labels in the correct machine learning based approach. The handcrafted fea- order. It turned out that the architecture learned to pre- tures are fed to an LSTM-based RNN which performs a dict quadruples with three slot values and the EOS symbol dialog-state update. However, unlike our work, their sys- quickly, even before seeing a half of the training data in tem requires SLU output on its input. the first epoch.9 At the end of the first epoch, the system It is worth noting that there are first attempts to train an made no more mistakes on predicting slot values in the in- end-to-end dialogue system even without explicitly mod- correct order. The encoder-decoder system is competitive eling the dialogue state [4], which further simplifies the with state-of-the art architectures and the time needed for architecture of a dialogue system. However, the reported learning the output structure was surprisingly short.10 end-to-end model was evaluated only on artificial dataset and cannot be compared to DSTC2 dataset directly. 4.3 Data preparation experiments The data for the DSTC2 test set were collected using a 6 Conclusion different spoken dialogue system configuration than the data for the validation and the training set.[7]. We in- We presented and compared two dialogue state tracking tended to investigate the influence of the complexity of models which are based on state-of-the-art architectures the task, hence we merged all DSTC2 data together and using recurrent neural networks. To our knowledge, we created splits of 80%, 10% and 10% for the training, de- are the first to use an encoder-decoder model for the dia- velopment and test sets. The results in Table 2 show that logue state tracking task, and we encourage others to do so the complexity of the task dropped significantly. because it is competitive with the standard RNN model.11 The models are comparable to the state-of-the-art models. We evaluate the models on DSTC2 dataset containing 9 We could have modified the decoder to always predict three sym- task-oriented dialogues in the restaurant domain. The bols for our three slots, but our experiments showed that the encoder- decoder architecture does not make mistakes at predicting the order of 11 The presented experiments are published at https://github. the three slots and EOS symbol. com/oplatek/e2end/ under Apache license. Informal experiments 10 The best model weights were found after 18 to 23 epochs for all were conducted during the Statistical Dialogue Systems course at Charles model architectures. University (see https://github.com/oplatek/sds-tracker). Recurrent Neural Networks for Dialogue State Tracking 67 models are trained using only ASR 1-best transcrip- [9] Matthew Henderson, Blaise Thomson, and Steve tions and task-specific lexical features defined by the task Young. Word-based dialog state tracking with re- database. We observe that dialogue state tracking on current neural networks. In Proceedings of the 15th DSTC2 test set is notoriously hard and that the task be- Annual Meeting of the Special Interest Group on comes substantially easier if the data is reshuffled. Discourse and Dialogue (SIGDIAL), pages 292–299, As future work, we plan to investigate the influence of 2014. the introduced database features on models accuracy. To our knowledge there is no dataset which can be used for [10] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, evaluating incremental dialogue state trackers, so it would and Jürgen Schmidhuber. Gradient flow in recurrent be beneficial to collect the word-level annotations so one nets: the difficulty of learning long-term dependen- can evaluate incremental DST models. cies, 2001. [11] Filip Jurčíček, Blaise Thomson, and Steve Young. Reinforcement learning for parameter estimation References in statistical spoken dialogue systems. Computer Speech & Language, 26(3):168–192, 2012. [1] Martın Abadi, Ashish Agarwal, Paul Barham, Eu- gene Brevdo, Zhifeng Chen, Craig Citro, Greg S [12] Diederik Kingma and Jimmy Ba. Adam: A Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, method for stochastic optimization. arXiv preprint et al. TensorFlow: Large-scale machine learning on arXiv:1412.6980, 2014. heterogeneous systems, 2015. Software available [13] Byung-Jun Lee and Kee-Eung Kim. Dialog History from tensorflow. org. Construction with Long-Short Term Memory for Ro- [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua bust Generative Dialog State Tracking. Dialogue & Bengio. Neural machine translation by jointly Discourse, 7(3):47–64, 2016. learning to align and translate. arXiv preprint [14] Lutz Prechelt. Early stopping-but when? In Neural arXiv:1409.0473, 2014. Networks: Tricks of the trade, pages 55–69. Springer, 1998. [3] Yoshua Bengio, Rejean Ducharme, and Pascal Vin- cent. A Neural probabilistic language model. Jour- [15] Miroslav Vodolán, Rudolf Kadlec, and Jan Klein- nal of Machine Learning Research, 3:1137–1155, dienst. Hybrid Dialog State Tracker. CoRR, 2003. abs/1510.03710, 2015. URL http://arxiv.org/ abs/1510.03710. [4] Antoine Bordes and Jason Weston. Learning End-to-End Goal-Oriented Dialog. arXiv preprint [16] Lukás Žilka and Filip Jurčíček. Incremental arXiv:1605.07683, 2016. LSTM-based dialog state tracker. arXiv preprint arXiv:1507.03471, 2015. [5] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and [17] Paul J Werbos. Backpropagation through time: what Yoshua Bengio. Learning Phrase Representations us- it does and how to do it. Proceedings of the IEEE, 78 ing RNN Encoder-Decoder for Statistical Machine (10):1550–1560, 1990. Translation. CoRR, abs/1406.1078, 2014. URL [18] Jason Williams, Antoine Raux, Deepak Ramachan- http://arxiv.org/abs/1406.1078. dran, and Alan Black. The dialog state tracking chal- [6] Matthew Henderson, Blaise Thomson, and Steve lenge. In Proceedings of the SIGDIAL 2013 Confer- Young. Deep neural network approach for the dia- ence, pages 404–413, 2013. log state tracking challenge. Proceedings of the SIG- [19] Jason D Williams. Web-style ranking and SLU com- DIAL 2013 Conference, pages 467–471, 2013. bination for dialog state tracking. In 15th Annual Meeting of the Special Interest Group on Discourse [7] Matthew Henderson, Blaise Thomson, and Jason and Dialogue, volume 282, 2014. Williams. The second dialog state tracking chal- lenge. In 15th Annual Meeting of the Special Inter- [20] Steve Young, Milica Gašić, Simon Keizer, François est Group on Discourse and Dialogue, volume 263, Mairesse, Jost Schatzmann, Blaise Thomson, and 2014. Kai Yu. The hidden information state model: A prac- tical framework for POMDP-based spoken dialogue [8] Matthew Henderson, Blaise Thomson, and Jason D management. Computer Speech & Language, 24(2): Williams. The third dialog state tracking challenge. 150–174, 2010. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 324–329. IEEE, 2014.