ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 63–67
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, c 2016 O. Plátek, P. Bělohlávek, V. Hudeček, F. Jurčíček



                          Recurrent Neural Networks for Dialogue State Tracking

                                     Ondřej Plátek, Petr Bělohlávek, Vojtěch Hudeček, and Filip Jurčíček

                                          Charles University in Prague, Faculty of Mathematics and Physics
                                                   {oplatek,jurcicek}@ufal.mff.cuni.cz,
                                                                  me@petrbel.cz,
                                                           vojta.hudecek@gmail.com,
                                                  http://ufal.mff.cuni.cz/ondrej-platek
Abstract: This paper discusses models for dialogue state tracking using recurrent neural networks (RNN). We present experiments on the standard dialogue state tracking (DST) dataset, DSTC2 [7]. On the one hand, RNN models have become the state-of-the-art models in DST; on the other hand, most state-of-the-art DST models are turn-based and require dataset-specific preprocessing (e.g., DSTC2-specific) to achieve such results. We implemented two architectures which can be used in an incremental setting and require almost no preprocessing. We compare their performance to the benchmarks on DSTC2 and discuss their properties. With only trivial preprocessing, the performance of our models is close to the state-of-the-art results.1

   1 Acknowledgment: We thank Mirek Vodolán and Ondřej Dušek for useful comments. This research was partly funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221, core research funding, grant GAUK 1915/2015, and also partially supported by SVV project number 260 333. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c GPU used for this research. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures".

    ...
    Dial. state n:   food:None, area:None, pricerange:None
    System: What part of town do you have in mind?
    User:   West part of town.
    Dial. state n+1: food:None, area:west, pricerange:None
    System: What kind of food would you like?
    User:   Indian
    Dial. state n+2: food:Indian, area:west, pricerange:None
    System: India House is a nice place in the west of town serving tasty Indian food.
    ...

Figure 1: Example of the gold annotation with Dialogue Act Items (DAIs). The dialogue act items consist of an act type (all examples here have the type inform) and slots (food, area, pricerange) with their values (e.g., Indian, west, None).
1    Introduction

Dialogue state tracking (DST) is a standard and important task for evaluating task-oriented conversational agents [18, 7, 8]. Such agents play the role of a domain expert in a narrow domain, and users ask for information through conversation in natural language (see the example system and user responses in Figure 1). A dialogue state tracker summarizes the dialogue history and maintains a probability distribution over the (possible) user goals (see the annotation in Figure 1). Dialogue agents as introduced in [20] decide about the next action based on the dialogue state distribution given by the tracker. The user's goals are expressed in a formal language, typically represented as dialogue act items (DAIs) (see Section 2), and the tracker updates the probability of each item. The dialogue state is a latent variable [20], and one needs to label the conversations in order to train a dialogue state tracker using supervised learning. It was shown that with a better dialogue state tracker, conversational agents achieve a better success rate in the overall completion of their task [11].

This paper compares two different RNN architectures for dialogue state tracking (see Section 3). We describe state-of-the-art word-by-word dialogue state tracker architectures and propose to use a new encoder-decoder architecture for the DST task (see Section 4.2).

We focus only on the goal slot predictions because the other groups of slots are trivial to predict.2

We also experiment with re-splitting the DSTC2 data because there are considerable differences between the standard train and test datasets [7]. Since the training, development and test data are distributed differently, the resulting performance difference between training and test data is rather high. Based on our experiments, we conclude that DSTC2 might suggest a too pessimistic view of the state-of-the-art methods in dialogue state tracking, caused by the data distribution mismatch.

   2 The slots Requested and Method have accuracies 0.95 and 0.95 on the test set according to the state of the art [19].

2    Dialogue state tracking on the DSTC2 dataset

Dialogue state trackers maintain their beliefs about the users' goals by updating probabilities of dialogue history representations. In the DSTC2 dataset, the history is captured by dialogue act items and their probabilities. A dialogue act item is a triple of the form (actionType, slotName, slotValue).

DSTC2 is a standard dataset for DST, and most of the state-of-the-art systems in DST have reported their performance on this dataset [7]. The full dataset has been freely available since January 2014, and it contains 1612 dialogues in the training set, 506 dialogues in the development set and 1117 dialogues in the test set.3 The conversations are manually annotated at the turn level, where the hidden information state is expressed in the form of (actionType, slotName, slotValue) triples based on the domain ontology. The task of the domain is defined by a database of restaurants and their properties.4 The database and the manually designed ontology that captures the restaurant domain are both distributed with the dataset.

   3 Available online at http://camdial.org/~mh521/dstc/.
   4 There are six columns in the database: name, food, price_range, area, telephone, address.
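To make the tracked quantities concrete, the following minimal Python sketch (purely illustrative, not the DSTC2 reference implementation or the trackers described below) represents a dialogue state as one marginal distribution per goal slot and updates it from (actionType, slotName, slotValue) triples; the interpolation weight is a hypothetical choice.

    GOAL_SLOTS = ("food", "area", "pricerange")

    def empty_state():
        # 'None' stands for "value not mentioned yet", as in Figure 1.
        return {slot: {None: 1.0} for slot in GOAL_SLOTS}

    def update_state(state, dais, weight=0.9):
        """Update the marginal slot distributions from (actionType, slotName,
        slotValue) triples such as ("inform", "area", "west")."""
        for act_type, slot, value in dais:
            if act_type != "inform" or slot not in state:
                continue
            dist = state[slot]
            for v in dist:                 # move `weight` of the mass to the new value
                dist[v] *= (1.0 - weight)
            dist[value] = dist.get(value, 0.0) + weight
        return state

    state = empty_state()
    state = update_state(state, [("inform", "area", "west")])
    state = update_state(state, [("inform", "food", "Indian")])
    print({slot: max(dist, key=dist.get) for slot, dist in state.items()})
    # {'food': 'Indian', 'area': 'west', 'pricerange': None}

Applied to the turns in Figure 1, the most probable values evolve from all None to food:Indian, area:west, pricerange:None.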
3    Models

Our models are all based on an RNN encoder [17]. The models update their hidden state h after processing each word, similarly to the RNN encoder of Žilka and Jurčíček [16]. The encoder takes as inputs the previous state h_{t-1}, representing the history of the first t-1 words, and the features X_t of the current word w_t. It outputs the current state h_t, representing the whole dialogue history up to the current word. We use a Gated Recurrent Unit (GRU) cell [5] as the update function instead of a simple RNN cell because it does not suffer from the vanishing gradient problem [10]. The model optimizes its parameters, including the word embeddings [3], during training.

For each input token, our RNN encoder reads the word embedding of this token along with several binary features. The binary features for each word are:

   • the speaker role, representing either user or system,
   • indicators describing whether the word is part of a named entity representing a value from the database.

Since the DSTC2 database is a simple table with six columns, we introduce six binary features firing if the word is a substring of a named entity from the given column. For example, the word indian will not only trigger the feature for the column food and its value indian but also the feature for the restaurant name column and its value indian heaven. The features make the data dense by abstracting the meaning from the lexical level.
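The following sketch shows one way such per-word input vectors X_i could be assembled; it is illustrative only (the database excerpt and names such as DB_COLUMNS are hypothetical), while the actual features are built inside the authors' TensorFlow implementation.

    import numpy as np

    # Illustrative excerpt of the six database columns (see footnote 4); the real
    # values come from the DSTC2 database distributed with the dataset.
    DB_COLUMNS = {
        "name": ["india house", "indian heaven"],
        "food": ["indian", "chinese"],
        "price_range": ["cheap", "expensive"],
        "area": ["west", "centre"],
        "telephone": [],
        "address": [],
    }

    def binary_features(word, speaker_is_user):
        feats = [1.0 if speaker_is_user else 0.0]          # speaker role
        for column, values in DB_COLUMNS.items():
            # Fires when the word is a substring of a named entity in this column,
            # e.g. "indian" fires for food ("indian") and for name ("indian heaven").
            feats.append(1.0 if any(word in v for v in values) else 0.0)
        return np.array(feats, dtype=np.float32)

    def input_vector(word, speaker_is_user, embeddings, dim=100):
        # embeddings: dict word -> np.ndarray of size `dim`, trained with the model
        emb = embeddings.get(word, np.zeros(dim, dtype=np.float32))
        return np.concatenate([emb, binary_features(word, speaker_is_user)])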
Our model variants differ only in the way they predict the goal labels, i.e., food, area and pricerange, from the RNN's last encoded state.5 The first model predicts the output slot labels independently by employing three independent classifiers (see Section 3.1). The second model uses a decoder in order to predict the values one after the other from h_T (see Section 3.2).

The models were implemented using the TensorFlow [1] framework.

   5 Accuracy measured with schedule 2 on the slots food, area and pricerange, about which users can inform the system, is a featured metric of the DSTC2 challenge [7].

[Figure 2: diagram of an RNN reading inputs X_1, X_2, ..., X_T with hidden states h_1, h_2, ..., h_T; the last state h_T jointly predicts (pricerange, food, area).]

Figure 2: The joint label prediction using an RNN from the last hidden state h_T. The state h_T represents the whole dialogue history of T words. For each word i, the RNN takes as input an embedding and binary features concatenated into a vector X_i.

[Figure 3: diagram of the same RNN encoder; the last state h_T feeds three separate predictors for food, area and pricerange.]

Figure 3: The RNN encodes the word history into the dialogue state h_T and predicts the slot values independently.

3.1    Independent classifiers for each label

The independent model (see Figure 3) consists of three models which predict food, area and pricerange independently based on the last hidden state h_T. The independent slot prediction that uses one classifier per slot is straightforward to implement, but the model introduces an unrealistic assumption of uncorrelated slot properties. In the case of DSTC2 and the Cambridge restaurant domain, it is hard to believe that, e.g., the slots area and pricerange are not correlated.

We also experimented with a single classifier which predicts the labels jointly (see Figure 2), but it suffers from the data sparsity of the predicted tuples, so we focused only on the independent label prediction and the encoder-decoder models.
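A minimal sketch of the independent-classifier tracker follows, written in plain numpy and assuming a single-layer GRU encoder with one softmax layer per slot; the parameter dictionary `params` and its contents are hypothetical placeholders, and the real system is the TensorFlow implementation mentioned above.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    def gru_step(h_prev, x, p):
        # Gated Recurrent Unit update [5] (one common formulation).
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])      # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])      # reset gate
        h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
        return z * h_prev + (1.0 - z) * h_new

    def track(word_vectors, params, hidden_size=100):
        h = np.zeros(hidden_size)
        for x in word_vectors:            # the whole dialogue history, word by word
            h = gru_step(h, x, params["gru"])
        # One independent softmax classifier per goal slot, all reading h_T.
        return {slot: softmax(params[slot]["W"] @ h + params[slot]["b"])
                for slot in ("food", "area", "pricerange")}

The same encoder state h_T is shared by all three classifiers, which makes the uncorrelated-slots assumption explicit: each slot distribution is computed without looking at the other two.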
3.2    Encoder-decoder framework

We cast the slot prediction problem as a sequence-to-sequence prediction task, and we use an encoder-decoder model with attention [2] to learn this representation together with the slot predictions (see Figure 4). To our knowledge, we are the first to use this model for dialogue state tracking. The model has been used successfully in machine translation, where it is able to handle long sequences with good accuracy [2]. In DST, it easily captures the correlation between the decoded slots. By introducing the encoder-decoder architecture, we aim to overcome the data sparsity problem and the incorrect independence assumptions.

[Figure 4: diagram of the encoder-decoder model; encoder states h_1, ..., h_T over inputs X_1, ..., X_T, attention weights a_1, ..., a_4, and decoder states h'_1, ..., h'_4 emitting Slot1, Slot2, Slot3 and EOS.]

Figure 4: Encoder-decoder with attention predicts the goals.

We employ an encoder RNN cell that captures the history of the dialogue, which is represented as a sequence of words from the user and the system. The words are fed to the encoder as they appear in the dialogue, turn by turn, where the user and the system responses switch regularly. The encoder updates its internal state h_T after each processed word. The RNN decoder model is used when the system needs to generate its output; in our case this is at the end of the user response. The decoder generates arbitrary-length sequences of words given the encoded state h_T, step by step. In each step, an output word and a new hidden state are generated. The generation process is finished when a special End of Sequence (EOS) token is decoded. This mechanism allows the model to terminate the output sequence. The attention part of the model is used at decoding time for weighting the importance of the history.

The disadvantage of this model is its complexity. Firstly, the model is not trivial to implement.6 Secondly, the decoding time is asymptotically quadratic in the length of the decoded sequences; however, our target sequences are always only four tokens long.

   6 We modified code from the TensorFlow seq2seq module.
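The decoding loop could look roughly like the sketch below, which uses additive attention [2] over the encoder states and stops at EOS; all parameter names (Wa, Ua, v, Wd, Ud, Wo, bo, the embedding table and the output vocabulary) are hypothetical placeholders, not the names used in the modified TensorFlow seq2seq code.

    import numpy as np

    def attend(encoder_states, h_dec, p):
        # Additive attention [2]: score every history state against the decoder state.
        scores = np.array([p["v"] @ np.tanh(p["Wa"] @ h + p["Ua"] @ h_dec)
                           for h in encoder_states])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ np.stack(encoder_states)        # context vector

    def decode_slots(encoder_states, p, max_len=4):
        h_dec = encoder_states[-1]                        # start from h_T
        prev = p["embed"]["GO"]                           # hypothetical start token
        output = []
        for _ in range(max_len):                          # food, area, pricerange, EOS
            context = attend(encoder_states, h_dec, p)
            h_dec = np.tanh(p["Wd"] @ np.concatenate([prev, context]) + p["Ud"] @ h_dec)
            logits = p["Wo"] @ h_dec + p["bo"]
            token = p["vocab"][int(np.argmax(logits))]
            if token == "EOS":
                break
            output.append(token)
            prev = p["embed"][token]
        return output                                     # e.g. ["Indian", "west", "None"]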
4    Experiments

The results are reported on the standard DSTC2 data split, where we used 516 dialogues as a validation set for early stopping [14] and the remaining 1612 dialogues for training. We use the 1-best Automatic Speech Recognition (ASR) transcriptions of the conversation history as the input and measure the joint slot accuracy. The models are evaluated using the recommended accuracy measure [7] with schedule 2, which skips the first turns, where the belief tracker does not track any values yet. In addition, our models are also evaluated on a randomly re-split DSTC2 dataset (see Section 4.3).

For all our experiments, we train word embeddings of size 100 and use an encoder state of size 100, together with a dropout keep probability of 0.7 for both the encoder inputs and outputs. These parameters were selected by a grid search over the hyper-parameters on the development data.

4.1    Training

The training procedure minimizes the cross-entropy loss function using the Adam optimizer [12] with a batch size of 10. We train by predicting the goal slot values for each turn: we treat each dialogue turn as a separate training example, feeding the whole dialogue history up to the current turn into the encoder and predicting the slot labels of the current turn.

We use early stopping with patience [14], validating on the development set after each epoch and stopping if the three top models do not change for four epochs.

The predicted labels in the DST task depend not only on the last turn but on the full dialogue history as well. Since the lengths of the dialogue histories vary a lot7 and we batch our inputs, we separated the dialogues into ten buckets according to their lengths in order to provide a computational speed-up. We reshuffle the data after each epoch only within each bucket.

In informal experiments, we tried to speed up the training by optimizing the parameters only on the last turn,8 but the performance dropped relatively by more than 40%.

   7 The maximum dialogue history length is 487 words, and the 95% percentile is 205 words for the training set.
   8 The prediction was conditioned on the full history, but we back-propagated the error only through the words of the last turn.
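A simplified sketch of this training regime follows; `model.fit_batch` and `model.accuracy` are hypothetical stand-ins for the TensorFlow training and evaluation ops (Adam, cross-entropy), and the early-stopping criterion is simplified to best-model patience rather than tracking the three best models.

    import random

    def turn_examples(dialogue):
        # dialogue: list of (turn_words, goal_labels); the history grows turn by turn.
        history = []
        for words, labels in dialogue:
            history.extend(words)
            yield list(history), labels

    def make_buckets(examples, n_buckets=10):
        examples = sorted(examples, key=lambda ex: len(ex[0]))
        size = max(1, len(examples) // n_buckets)
        return [examples[i:i + size] for i in range(0, len(examples), size)]

    def train(model, train_dialogues, dev_data, batch_size=10, patience=4):
        buckets = make_buckets([ex for d in train_dialogues for ex in turn_examples(d)])
        best_acc, stale_epochs = 0.0, 0
        while stale_epochs < patience:
            for bucket in buckets:
                random.shuffle(bucket)                    # reshuffle within each bucket
                for i in range(0, len(bucket), batch_size):
                    model.fit_batch(bucket[i:i + batch_size])   # Adam + cross-entropy
            acc = model.accuracy(dev_data)                # validate after every epoch
            if acc > best_acc:
                best_acc, stale_epochs = acc, 0
            else:
                stale_epochs += 1
        return best_acc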
4.2    Comparing models

Predicting the labels jointly is quite challenging because the distribution of the labels is skewed, as demonstrated in Figure 5. Some of the label combinations are very rare, and they occur only in the development and test sets, so the joint model is not able to predict them. In the first informal experiments, the joint model performed poorly, arguably due to the data sparsity of the slot triples. We therefore focus on the model with independent classifiers and on the encoder-decoder architecture.

[Figure 5: histogram titled "Joint labels distribution"; x-axis: labels sorted according to the number of occurrences, y-axis: number of occurrences (0 to 600).]

Figure 5: The number of occurrences of labels in the form of (food, area, pricerange) triples, from the least to the most frequent.

The model with independent label prediction is a strong baseline which was used, among others, in the work of Žilka and Jurčíček [16]. The model suffers less from the dataset mismatch because it does not model the correlation between the predicted labels. This property can explain its smaller performance drop between the test set of the reshuffled data and the official test set in comparison to the encoder-decoder model.

Since the encoder-decoder architecture is very general and can predict arbitrary output sequences, it also needs to learn how to predict exactly three slot labels in the correct order. It turned out that the architecture learned to predict quadruples with three slot values and the EOS symbol quickly, even before seeing half of the training data in the first epoch.9 At the end of the first epoch, the system made no more mistakes in the order of the predicted slot values. The encoder-decoder system is competitive with the state-of-the-art architectures, and the time needed for learning the output structure was surprisingly short.10

   9 We could have modified the decoder to always predict three symbols for our three slots, but our experiments showed that the encoder-decoder architecture does not make mistakes at predicting the order of the three slots and the EOS symbol.
   10 The best model weights were found after 18 to 23 epochs for all model architectures.

   Model                        | Dev set | Test set
   Indep                        | 0.892   | 0.727
   EncDec                       | 0.867   | 0.730
   Vodolán et al. [15]          | -       | 0.745
   Žilka and Jurčíček [16]      | 0.69    | 0.72
   Henderson et al. [6]         | -       | 0.737
   DSTC2 stacking ensemble [7]  | -       | 0.789

Table 1: Accuracy on the DSTC2 dataset. The first group contains our systems, which use the ASR output as input; the second group lists other systems that also use the ASR hypotheses as input; the third group shows the result of an ensemble model using the ASR output and also the live language understanding annotations.
4.3    Data preparation experiments

The data for the DSTC2 test set were collected using a different spoken dialogue system configuration than the data for the validation and the training sets [7]. We intended to investigate the influence of the complexity of the task, hence we merged all the DSTC2 data together and created splits of 80%, 10% and 10% for the training, development and test sets. The results in Table 2 show that the complexity of the task dropped significantly.

   Model   | Dev set | Test set
   Indep   | 0.87    | 0.89
   EncDec  | 0.94    | 0.91

Table 2: Accuracy of our models on the re-split DSTC2 data.
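The re-split itself is straightforward; a sketch of the pooling and the 80/10/10 division (illustrative only, with a hypothetical random seed) is:

    import random

    def resplit(dialogues, seed=0):
        dialogues = list(dialogues)
        random.Random(seed).shuffle(dialogues)            # pool and shuffle all dialogues
        n = len(dialogues)
        n_train, n_dev = int(0.8 * n), int(0.1 * n)
        return (dialogues[:n_train],                      # 80% training
                dialogues[n_train:n_train + n_dev],       # 10% development
                dialogues[n_train + n_dev:])              # 10% test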
5    Related work

Since there are numerous systems which have reported results on the DSTC2 dataset, we discuss only the systems which use RNNs. In general, the RNN systems achieved excellent results.

Our system is related to the RNN tracker of Žilka and Jurčíček [16], which reported near state-of-the-art results on the DSTC2 dataset and introduced the first incremental system able to update the dialogue state word by word with such accuracy. In contrast to the work of Žilka and Jurčíček [16], we use no abstraction of slot values. Instead, we add the additional features described in Section 3. The first system which used a neural network for dialogue state tracking [6] used a feed-forward network and more than 10 manually engineered features across different levels of abstraction of the user input, including the outputs of the spoken language understanding (SLU) component. In our work, we focus on simplifying the architecture; hence we used only the features which were explicitly given by the word representation of the dialogue history and by the database.

The system of Henderson et al. [9] achieves state-of-the-art results and, similarly to our system, it predicts the dialogue state from words by employing an RNN. On the other hand, their system heavily relies on an abstraction of the user input. Another dialogue state tracker with an LSTM was used in a reinforcement learning setting, but the authors also used information from the SLU pipeline [13].

An interesting approach is presented in the work of Vodolán et al. [15], who combine a rule-based and a machine learning based approach. Handcrafted features are fed to an LSTM-based RNN which performs the dialogue state update. However, unlike our work, their system requires the SLU output on its input.

It is worth noting that there are first attempts to train an end-to-end dialogue system without explicitly modeling the dialogue state at all [4], which further simplifies the architecture of a dialogue system. However, the reported end-to-end model was evaluated only on an artificial dataset and cannot be compared to DSTC2 directly.

6    Conclusion

We presented and compared two dialogue state tracking models which are based on state-of-the-art architectures using recurrent neural networks. To our knowledge, we are the first to use an encoder-decoder model for the dialogue state tracking task, and we encourage others to do so because it is competitive with the standard RNN model.11 Both models are comparable to the state-of-the-art models.

   11 The presented experiments are published at https://github.com/oplatek/e2end/ under the Apache license. Informal experiments were conducted during the Statistical Dialogue Systems course at Charles University (see https://github.com/oplatek/sds-tracker).

We evaluate the models on the DSTC2 dataset containing task-oriented dialogues in the restaurant domain. The models are trained using only the ASR 1-best transcriptions and the task-specific lexical features defined by the task database. We observe that dialogue state tracking on the DSTC2 test set is notoriously hard and that the task becomes substantially easier if the data are reshuffled.

As future work, we plan to investigate the influence of the introduced database features on the models' accuracy. To our knowledge, there is no dataset which can be used for evaluating incremental dialogue state trackers, so it would be beneficial to collect word-level annotations so that one can evaluate incremental DST models.

References
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

 [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

 [3] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.

 [4] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683, 2016.

 [5] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.

 [6] Matthew Henderson, Blaise Thomson, and Steve Young. Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 467-471, 2013.

 [7] Matthew Henderson, Blaise Thomson, and Jason Williams. The second dialog state tracking challenge. In 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, volume 263, 2014.

 [8] Matthew Henderson, Blaise Thomson, and Jason D. Williams. The third dialog state tracking challenge. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 324-329. IEEE, 2014.

 [9] Matthew Henderson, Blaise Thomson, and Steve Young. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 292-299, 2014.

[10] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

[11] Filip Jurčíček, Blaise Thomson, and Steve Young. Reinforcement learning for parameter estimation in statistical spoken dialogue systems. Computer Speech & Language, 26(3):168-192, 2012.

[12] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Byung-Jun Lee and Kee-Eung Kim. Dialog history construction with long-short term memory for robust generative dialog state tracking. Dialogue & Discourse, 7(3):47-64, 2016.

[14] Lutz Prechelt. Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55-69. Springer, 1998.

[15] Miroslav Vodolán, Rudolf Kadlec, and Jan Kleindienst. Hybrid dialog state tracker. CoRR, abs/1510.03710, 2015. URL http://arxiv.org/abs/1510.03710.

[16] Lukáš Žilka and Filip Jurčíček. Incremental LSTM-based dialog state tracker. arXiv preprint arXiv:1507.03471, 2015.

[17] Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.

[18] Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404-413, 2013.

[19] Jason D. Williams. Web-style ranking and SLU combination for dialog state tracking. In 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, volume 282, 2014.

[20] Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150-174, 2010.