<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Recurrent Neural Networks for Dialogue State Tracking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ondřej Plátek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Bělohlávek</string-name>
          <email>me@petrbel.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtěch Hudeček</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filip Jurčíček</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>63</fpage>
      <lpage>67</lpage>
      <abstract>
        <p>This paper discusses models for dialogue state tracking using recurrent neural networks (RNNs). We present experiments on the standard dialogue state tracking (DST) dataset, DSTC2 [7]. On the one hand, RNN models have become the state-of-the-art models in DST; on the other hand, most state-of-the-art DST models are only turn-based and require dataset-specific preprocessing (e.g. DSTC2-specific) in order to achieve such results. We implemented two architectures which can be used in an incremental setting and require almost no preprocessing. We compare their performance to the benchmarks on DSTC2 and discuss their properties. With only trivial preprocessing, the performance of our models is close to the state-of-the-art results.1</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Dialogue state tracking (DST) is a standard and important
task for evaluating task-oriented conversational agents [
        <xref ref-type="bibr" rid="ref18 ref7 ref8">18,
7, 8</xref>
        ]. Such agents play the role of a domain expert in a
narrow domain, and users ask for information through
conversation in natural language (see the example system and
user responses in Figure 1). A dialogue state tracker
summarizes the dialogue history and maintains a probability
distribution over the (possible) user’s goals (see annotation
in Figure 1). Dialogue agents as introduced in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] decide
about the next action based on the dialogue state
distribution given by the tracker. User’s goals are expressed in
a formal language, typically represented as dialogue act
items (DAIs) (see Section 2), and the tracker updates
the probability of each item. The dialogue state is a latent
variable [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and one needs to label the conversations in order
to train a dialogue state tracker using supervised learning.
It was shown that with a better dialogue state tracker,
conversation agents achieve a better success rate in the overall
completion of their task [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>1 Acknowledgment: We thank Mirek Vodolán and Ondřej Dušek
for useful comments. This research was partly funded by the Ministry
of Education, Youth and Sports of the Czech Republic under the grant
agreement LK11221, core research funding, grant GAUK 1915/2015,
and also partially supported by SVV project number 260 333. We
gratefully acknowledge the support of NVIDIA Corporation with the donation
of the Tesla K40c GPU used for this research. Computational resources
were provided by the CESNET LM2015042 and the CERIT Scientific
Cloud LM2015085, provided under the programme “Projects of Large
Research, Development, and Innovations Infrastructures”.</p>
      <p>This paper compares two different RNN architectures
for dialogue state tracking (see Section 3). We describe
state-of-the art word-by-word dialogue state tracker
architectures and propose to use a new encoder-decoder
architecture for the DST task (see Section 4.2).</p>
      <p>We focus only on the goal slot predictions because the
other slot groups are trivial to predict.2</p>
      <p>
        We also experiment with re-splitting the DSTC2 data
because there are considerable differences between the
standard train and test datasets [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Since the training,
development, and test data are distributed differently, the
resulting performance difference between the training and test
data is rather high. Based on our experiments, we
conclude that DSTC2 might suggest a too pessimistic view
of the state-of-the-art methods in dialogue state tracking,
caused by the data distribution mismatch.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dialogue state tracking on DSTC2 dataset</title>
      <p>Dialogue state trackers maintain their beliefs
about users’ goals by updating probabilities of dialogue
history representations. In the DSTC2 dataset, the
history is captured by dialogue act items and their
probabilities. A dialogue act item is a triple of the form
(actionType, slotName, slotValue).</p>
      <p>
        The DSTC2 is a standard dataset for DST, and most of
the state-of-the-art systems in DST have reported their
performance on this dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The full dataset has been freely
available since January 2014, and it contains 1612
dialogues in the training set, 506 dialogues in the
development set, and 1117 dialogues in the test set.3 The
conversations are manually annotated at the turn level, where
the hidden information state is expressed in the form of
(actionType, slotName, slotValue) triples based on the domain
ontology. The task of the domain is defined by a database
of restaurants and their properties.4 The database and
the manually designed ontology that captures the restaurant
domain are both distributed with the dataset.
      </p>
      <p>
        2 The slots Requested and Method both have accuracy 0.95 on
the test set according to the state of the art [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Models</title>
      <p>
        Our models are all based on an RNN encoder [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The
models update their hidden states h after processing each
word, similarly to the RNN encoder of Žilka and Jurčíček
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The encoder takes as inputs the previous state ht−1,
representing the history of the first t − 1 words, and the features Xt
of the current word wt. It outputs the current state ht,
representing the whole dialogue history up to the current word.
We use a Gated Recurrent Unit (GRU) cell [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as the
update function instead of a simple RNN cell because it does
not suffer from the vanishing gradient problem [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
model optimizes its parameters including word
embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] during training.
      </p>
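The GRU-based encoder update described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's TensorFlow implementation; the dimensions, initialization, and names are assumptions made for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: h_t = GRU(h_{t-1}, x_t)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per gate, acting on the concatenation [h_{t-1}; x_t].
        shape = (hidden_dim, hidden_dim + input_dim)
        self.W_z = rng.normal(0, 0.1, shape)  # update gate
        self.W_r = rng.normal(0, 0.1, shape)  # reset gate
        self.W_h = rng.normal(0, 0.1, shape)  # candidate state

    def step(self, h_prev, x_t):
        hx = np.concatenate([h_prev, x_t])
        z = sigmoid(self.W_z @ hx)            # how much of the state to update
        r = sigmoid(self.W_r @ hx)            # how much history to reset
        h_cand = np.tanh(self.W_h @ np.concatenate([r * h_prev, x_t]))
        return (1 - z) * h_prev + z * h_cand

def encode(cell, xs, hidden_dim):
    """Read the dialogue word by word and return the final state h_T."""
    h = np.zeros(hidden_dim)
    for x in xs:
        h = cell.step(h, x)
    return h
```

Because the state is updated once per word, the same encoder can be run incrementally as words arrive, which is what makes the trackers word-by-word.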
      <p>For each input token, our RNN encoder reads the word
embedding of the token along with several binary
features. The binary features for each word are:
• the speaker role, representing either the user or the system,
• indicators describing whether the word is
part of a named entity representing a value from the
database.</p>
      <p>Since the DSTC2 database is a simple table with six
columns, we introduce six binary features, each firing if the
word is a substring of a named entity from the given
column. For example, the word indian will not only trigger
the feature for the column food and its value indian but also
for the column restaurant name and its value indian heaven.
The features make the data dense by abstracting the
meaning away from the lexical level.</p>
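The six per-column substring features can be sketched as follows. Only the column names come from the paper (see footnote 4); the database excerpt below is a hypothetical toy stand-in for the real DSTC2 database:

```python
# Columns of the DSTC2 restaurant database (see footnote 4).
DB_COLUMNS = ["name", "food", "price_range", "area", "telephone", "address"]

# A toy excerpt; the real database is distributed with the dataset.
DB = {
    "name": ["india house", "indian heaven"],
    "food": ["indian", "chinese"],
    "price_range": ["cheap", "moderate", "expensive"],
    "area": ["west", "east", "centre"],
    "telephone": [],
    "address": [],
}

def db_features(word):
    """One binary feature per column: 1 if the word is a substring
    of any named entity (database value) in that column."""
    return [
        1 if any(word in value for value in DB[col]) else 0
        for col in DB_COLUMNS
    ]
```

With this excerpt, `db_features("indian")` fires both for the food column (value indian) and for the name column (value indian heaven), matching the example in the text.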
      <p>Our model variants differ only in the way they
predict the goal labels, i.e., food, area and pricerange, from the
RNN’s last encoded state.5 The first model predicts the
output slot labels independently by employing three
independent classifiers (see Section 3.1). The second model
uses a decoder in order to predict the values one after another
from hT (see Section 3.2).</p>
      <p>
        The models were implemented using the
TensorFlow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] framework.
      </p>
      <p>3 Available online at http://camdial.org/~mh521/dstc/.</p>
      <p>4 There are six columns in the database: name, food, price_range,
area, telephone, address.</p>
      <p>
        5 Accuracy with schedule 2 on the slots food, area and
pricerange, about which users can inform the system, is the featured metric
of the DSTC2 challenge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>Independent classifiers for each label</title>
        <p>The independent model (see Figure 3.1) consists of three
classifiers which predict food, area and pricerange based on
the last hidden state hT independently. The independent
slot prediction that uses one classifier per slot is
straightforward to implement, but the model introduces an
unrealistic assumption of uncorrelated slot properties. In the case
of DSTC2 and the Cambridge restaurant domain, it is hard
to believe that, e.g., the slots area and pricerange are not
correlated.</p>
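The independent prediction scheme amounts to three softmax heads reading the same encoded state, as in this sketch. The slot-value vocabularies and weight shapes below are illustrative assumptions; in the real system they come from the domain ontology:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical label vocabularies; the real ones come from the ontology.
SLOT_VALUES = {
    "food": ["None", "indian", "chinese"],
    "area": ["None", "west", "east", "centre"],
    "pricerange": ["None", "cheap", "moderate", "expensive"],
}

class IndependentSlotClassifiers:
    """One softmax classifier per goal slot, all reading the same h_T."""
    def __init__(self, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = {
            slot: rng.normal(0, 0.1, (len(values), hidden_dim))
            for slot, values in SLOT_VALUES.items()
        }

    def predict(self, h_T):
        # Each slot is predicted independently from the encoded history,
        # which is exactly the uncorrelated-slots assumption discussed above.
        return {
            slot: SLOT_VALUES[slot][int(np.argmax(softmax(W @ h_T)))]
            for slot, W in self.weights.items()
        }
```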
        <p>We also experimented with a single classifier which
predicts the labels jointly (see Figure 3), but it suffers from
the data sparsity of the predicted tuples, so we focused only
on the independent label prediction and the encoder-decoder
models.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Encoder-decoder framework</title>
        <p>
          We cast the slot prediction problem as a
sequence-to-sequence prediction task and use an encoder-decoder
model with attention [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to learn this representation
together with the slot predictions (see Figure 4). To our
knowledge, we are the first to use this model for dialogue
state tracking. The model is successfully used in machine
translation, where it is able to handle long sequences with
good accuracy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. In DST, it easily captures the correlation between
the decoded slots. By introducing the
encoder-decoder architecture, we aim to overcome the data sparsity
problem and the incorrect independence assumptions.
        </p>
        <p>We employ an encoder RNN cell that captures the
history of the dialogue, which is represented as a sequence of
words from the user and the system. The words are fed to
the encoder as they appear in the dialogue, turn by turn,
where the user and the system responses alternate regularly.
The encoder updates its internal state after each
processed word. The RNN decoder is used when the
system needs to generate its output; in our case, this is at the
end of the user response. The decoder generates an arbitrary-length
sequence of tokens from the encoded state hT, step
by step. In each step, an output token and a new hidden
state are generated. The generation process finishes
when a special End of Sequence (EOS) token is decoded.
This mechanism allows the model to terminate the output
sequence. The attention part of the model is used at decoding
time for weighting the importance of parts of the history.</p>
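The EOS-terminated generation loop described above can be sketched as greedy decoding. The `step_fn` interface and the toy output vocabulary are hypothetical stand-ins for the trained attention decoder cell:

```python
import numpy as np

EOS = "EOS"
OUTPUT_VOCAB = ["indian", "west", "cheap", EOS]  # toy output vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(h_T, step_fn, max_steps=10):
    """Generate output tokens from the encoded state h_T until EOS.
    `step_fn(h, prev_token) -> (new_h, logits)` stands in for the
    trained decoder cell (with attention over the encoder states)."""
    h, prev, out = h_T, None, []
    for _ in range(max_steps):
        h, logits = step_fn(h, prev)
        token = OUTPUT_VOCAB[int(np.argmax(softmax(logits)))]
        if token == EOS:
            break                     # EOS terminates the output sequence
        out.append(token)
        prev = token
    return out
```

In the tracker, the decoded tokens are the three goal slot values followed by EOS, so a well-trained model always stops after four steps.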
        <p>The disadvantage of this model is its complexity.
Firstly, the model is not trivial to implement.6 Secondly,
the decoding time is asymptotically quadratic in the length
of the decoded sequence; however, our target sequences are
always four tokens long.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        The results are reported on the standard DSTC2 data split
where we used 516 dialogues as a validation set for early
stopping [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the remaining 1612 dialogues for
training. We use the 1-best Automatic Speech Recognition (ASR)
transcriptions of the conversation history as input and
measure the joint slot accuracy. The models are evaluated
using the recommended accuracy measure [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] with schedule
2, which skips the first turns, where the belief tracker does
not yet track any values. In addition, our models are also
evaluated on a randomly split DSTC2 dataset (see Section 4.3).
      </p>
      <p>
        For all our experiments, we train word embeddings of
size 100 and use an encoder state of size 100, together
with a dropout keep probability of 0.7 for both the encoder
inputs and outputs. These parameters were selected by a grid
search over the hyper-parameters on the development data.
The training procedure minimizes the cross-entropy loss
function using the Adam optimizer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] with a batch size
of 10. We train by predicting the goal slot values for each
turn. We treat each dialogue turn as a separate training
example, feeding the whole dialogue history up to the
current turn into the encoder and predicting the slot labels of
the current turn.
      </p>
      <p>6 We modified code from the TensorFlow seq2seq module.</p>
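The one-example-per-turn construction can be sketched as follows; the dialogue representation (a list of per-turn word lists with goal labels) is a simplification assumed for the example:

```python
# Each dialogue yields one training example per turn: the encoder input
# is the full word history up to that turn, and the target is that
# turn's goal slot labels.
def turn_examples(dialogue):
    """dialogue: list of (turn_words, goal_labels) pairs, in turn order."""
    history, examples = [], []
    for words, labels in dialogue:
        history.extend(words)                  # history grows turn by turn
        examples.append((list(history), labels))
    return examples
```

Note that the encoder input of the k-th example is a prefix of the (k+1)-th, which is what makes the per-turn training consistent with incremental, word-by-word tracking.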
      <p>
        We use early stopping with patience [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], validating on
the development set after each epoch and stopping if the
three top models do not change for four epochs.
      </p>
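One plausible reading of this stopping rule (keep the top-3 validation scores and stop once they have been unchanged for four consecutive epochs) can be sketched as:

```python
class EarlyStopping:
    """Stop when the set of best validation scores has not changed
    for `patience` consecutive epochs (here: top-3, patience 4).
    The exact bookkeeping is an assumption made for illustration."""
    def __init__(self, top_k=3, patience=4):
        self.top_k, self.patience = top_k, patience
        self.best, self.stale_epochs = [], 0

    def update(self, val_score):
        """Call once per epoch; returns True when training should stop."""
        new_best = sorted(self.best + [val_score], reverse=True)[: self.top_k]
        if new_best == self.best:
            self.stale_epochs += 1       # the top-k list did not change
        else:
            self.best, self.stale_epochs = new_best, 0
        return self.stale_epochs >= self.patience
```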
      <p>The predicted labels in the DST task depend not only on
the last turn but on the full dialogue history as well. Since
the lengths of dialogue histories vary a lot7 and we batch
our inputs, we separated the dialogues into ten buckets
according to their lengths in order to provide a
computational speed-up. We reshuffle the data after each epoch
only within each bucket.</p>
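The length-based bucketing can be sketched as below; splitting the sorted dialogues into equally sized groups is an assumption, since the paper only states that ten buckets were used:

```python
def make_buckets(dialogues, n_buckets=10):
    """Group dialogues into n_buckets of similar history length so that
    padding inside a batch stays short; later shuffling is then done
    only within each bucket."""
    by_len = sorted(dialogues, key=len)
    size = max(1, -(-len(by_len) // n_buckets))   # ceiling division
    return [by_len[i:i + size] for i in range(0, len(by_len), size)]
```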
      <p>In informal experiments, we tried to speed up the
training by optimizing the parameters only on the last turn,8 but
the performance dropped relatively by more than 40%.
Predicting the labels jointly is quite challenging because
the distribution of the labels is skewed, as demonstrated
in Figure 5. Some of the label combinations are very
rare and occur only in the development and test sets,
so the joint model is not able to predict them. During the first
informal experiments, the joint model performed poorly,
arguably due to the data sparsity of the slot triples. We therefore further
focus on the model with independent classifiers and the
encoder-decoder architecture.</p>
      <p>
        The model with independent label prediction is a strong
baseline which was used, among others, in the work of Žilka
and Jurčíček [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The model suffers less from the dataset
mismatch because it does not model the correlation
between the predicted labels. This property can explain the smaller
performance drop between the test set from the reshuffled data
and the official test set in comparison to the encoder-decoder
model.
      </p>
      <p>7 The maximum dialogue history length is 487 words, and the 95%
percentile is 205 words for the training set.</p>
      <p>8 The prediction was conditioned on the full history, but we
backpropagated the error only through the words of the last turn.</p>
      <p>Since the encoder-decoder architecture is very general
and can predict arbitrary output sequences, it also needs
to learn how to predict exactly three slot labels in the correct
order. It turned out that the architecture learned to
predict quadruples of three slot values and the EOS symbol
quickly, even before seeing half of the training data in
the first epoch.9 By the end of the first epoch, the system
made no more mistakes in predicting the slot values in an
incorrect order. The encoder-decoder system is competitive
with state-of-the-art architectures, and the time needed for
learning the output structure was surprisingly short.10</p>
      <sec id="sec-4-1">
        <title>Data preparation experiments</title>
        <p>
          The data for the DSTC2 test set were collected using a
different spoken dialogue system configuration than the
data for the validation and training sets [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We
intended to investigate the influence of the complexity of
the task, hence we merged all the DSTC2 data together and
created splits of 80%, 10%, and 10% for the training,
development, and test sets. The results in Table 2 show that
the complexity of the task dropped significantly.
        </p>
        <p>9 We could have modified the decoder to always predict three
symbols for our three slots, but our experiments showed that the
encoder-decoder architecture does not make mistakes in predicting the order of
the three slots and the EOS symbol.</p>
        <p>10 The best model weights were found after 18 to 23 epochs for all
model architectures.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related work</title>
      <p>Since there are numerous systems which reported on the
DSTC2 dataset, we discuss only the systems which use
RNNs. In general, the RNN systems achieved excellent
results.</p>
      <p>
        Our system is related to the RNN tracker of Žilka and
Jurčíček [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which reported near state-of-the-art results
on the DSTC2 dataset and introduced the first incremental
system able to update the dialogue state
word-by-word with such accuracy. In contrast to the work of Žilka
and Jurčíček [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we use no abstraction of slot values.
Instead, we add the additional features described in
Section 3. The first system which used a neural network for
dialogue state tracking [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used a feed-forward network
and more than 10 manually engineered features across
different levels of abstraction of the user input, including
the outputs of the spoken language understanding (SLU)
component. In our work, we focus on simplifying the
architecture, hence we used only features which are
explicitly given by the word-level representation of the dialogue history
and by the database.
      </p>
      <p>
        The system of Henderson et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] achieves
state-of-the-art results and, similarly to our system, it predicts
the dialogue state from words by employing an RNN. On
the other hand, their system heavily relies on the user
input abstraction. Another dialogue state tracker, based on an LSTM,
was used in the reinforcement learning setting, but the authors also
used information from the SLU pipeline [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        An interesting approach is presented in the work
of Vodolán et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], who combine a rule-based and a
machine-learning-based approach. The handcrafted
features are fed to an LSTM-based RNN which performs the
dialogue-state update. However, unlike our work, their
system requires SLU output on its input.
      </p>
      <p>
        It is worth noting that there are first attempts to train an
end-to-end dialogue system even without explicitly
modeling the dialogue state [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which further simplifies the
architecture of a dialogue system. However, the reported
end-to-end model was evaluated only on an artificial dataset
and cannot be compared to the DSTC2 dataset directly.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We presented and compared two dialogue state tracking
models which are based on state-of-the-art architectures
using recurrent neural networks. To our knowledge, we
are the first to use an encoder-decoder model for the
dialogue state tracking task, and we encourage others to do so,
because it is competitive with the standard RNN model.11
The models are comparable to the state-of-the-art models.</p>
      <p>We evaluate the models on the DSTC2 dataset containing
task-oriented dialogues in the restaurant domain. The
models are trained using only the ASR 1-best
transcriptions and task-specific lexical features defined by the task
database. We observe that dialogue state tracking on the
DSTC2 test set is notoriously hard and that the task
becomes substantially easier if the data is reshuffled.</p>
      <p>11 The presented experiments are published at https://github.com/oplatek/e2end/ under the Apache license. Informal experiments
were conducted during the Statistical Dialogue Systems course at Charles
University (see https://github.com/oplatek/sds-tracker).</p>
      <p>As future work, we plan to investigate the influence of
the introduced database features on the models’ accuracy. To
our knowledge, there is no dataset which can be used for
evaluating incremental dialogue state trackers, so it would
be beneficial to collect word-level annotations so that
incremental DST models can be evaluated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Martín</given-names>
            <surname>Abadi</surname>
          </string-name>
          , Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis,
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthieu</given-names>
            <surname>Devin</surname>
          </string-name>
          , et al.
          <source>TensorFlow: Large-scale machine learning on heterogeneous systems</source>
          ,
          <year>2015</year>
          . Software available from tensorflow.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <surname>Yoshua Bengio.</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Rejean Ducharme, and
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Vincent</surname>
          </string-name>
          .
          <article-title>A Neural probabilistic language model</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <article-title>Learning End-to-End Goal-Oriented Dialog</article-title>
          .
          <source>arXiv preprint arXiv:1605.07683</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          . CoRR, abs/1406.1078,
          <year>2014</year>
          . URL http://arxiv.org/abs/1406.1078.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Henderson</surname>
          </string-name>
          , Blaise Thomson, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Deep neural network approach for the dialog state tracking challenge</article-title>
          .
          <source>Proceedings of the SIGDIAL 2013 Conference</source>
          , pages
          <fpage>467</fpage>
          -
          <lpage>471</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Henderson</surname>
          </string-name>
          , Blaise Thomson, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>The second dialog state tracking challenge</article-title>
          .
          <source>In 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue</source>
          , volume
          <volume>263</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Henderson</surname>
          </string-name>
          , Blaise Thomson, and
          <string-name>
            <given-names>Jason D</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>The third dialog state tracking challenge</article-title>
          .
          <source>In Spoken Language Technology Workshop (SLT)</source>
          ,
          <year>2014</year>
          IEEE, pages
          <fpage>324</fpage>
          -
          <lpage>329</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Henderson</surname>
          </string-name>
          , Blaise Thomson, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Word-based dialog state tracking with recurrent neural networks</article-title>
          .
          <source>In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)</source>
          , pages
          <fpage>292</fpage>
          -
          <lpage>299</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          , Yoshua Bengio, Paolo Frasconi, and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Gradient flow in recurrent nets: the difficulty of learning long-term dependencies</article-title>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Filip</given-names>
            <surname>Jurčíček</surname>
          </string-name>
          , Blaise Thomson, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Reinforcement learning for parameter estimation in statistical spoken dialogue systems</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <volume>26</volume>
          (
          <issue>3</issue>
          ):
          <fpage>168</fpage>
          -
          <lpage>192</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Diederik</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Byung-Jun Lee</surname>
          </string-name>
          and
          <string-name>
            <surname>Kee-Eung Kim</surname>
          </string-name>
          .
          <article-title>Dialog History Construction with Long-Short Term Memory for Robust Generative Dialog State Tracking</article-title>
          .
          <source>Dialogue &amp; Discourse</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>47</fpage>
          -
          <lpage>64</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Lutz</given-names>
            <surname>Prechelt</surname>
          </string-name>
          .
          <article-title>Early stopping - but when?</article-title>
          . In
          <source>Neural Networks: Tricks of the Trade</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>69</lpage>
          . Springer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Miroslav</given-names>
            <surname>Vodolán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Kadlec</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Kleindienst</surname>
          </string-name>
          .
          <article-title>Hybrid Dialog State Tracker</article-title>
          .
          <source>CoRR, abs/1510.03710</source>
          ,
          <year>2015</year>
          . URL http://arxiv.org/abs/1510.03710.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Lukáš</given-names>
            <surname>Žilka</surname>
          </string-name>
          and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Jurčíček</surname>
          </string-name>
          .
          <article-title>Incremental LSTM-based dialog state tracker</article-title>
          .
          <source>arXiv preprint arXiv:1507.03471</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Paul J</given-names>
            <surname>Werbos</surname>
          </string-name>
          .
          <article-title>Backpropagation through time: what it does and how to do it</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>78</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1550</fpage>
          -
          <lpage>1560</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Raux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Deepak</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alan</given-names>
            <surname>Black</surname>
          </string-name>
          .
          <article-title>The dialog state tracking challenge</article-title>
          .
          <source>In Proceedings of the SIGDIAL 2013 Conference</source>
          , pages
          <fpage>404</fpage>
          -
          <lpage>413</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Jason D</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Web-style ranking and SLU combination for dialog state tracking</article-title>
          .
          <source>In 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue</source>
          , volume
          <volume>282</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Milica</given-names>
            <surname>Gašić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Simon</given-names>
            <surname>Keizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>François</given-names>
            <surname>Mairesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jost</given-names>
            <surname>Schatzmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Blaise</given-names>
            <surname>Thomson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>The hidden information state model: A practical framework for POMDP-based spoken dialogue management</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <volume>24</volume>
          (
          <issue>2</issue>
          ):
          <fpage>150</fpage>
          -
          <lpage>174</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>