<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Method for Dataset Creation for Dialogue State Classification in Voice Control Systems for the Internet of Things</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Shilin shilinivan@corp.ifmo.ru</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Mouromtsev mouromtsev@mail.ifmo.ru</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Ivanitskiy litemn@yandex.ru</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerhard Wohlgenannt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liubov Kovriguina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>Saint-Petersburg, Russian Federation
        </aff>
      </contrib-group>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>In recent years, speech-based interaction has become an important method of
communication with devices in the Internet of Things (IoT). Voice control interfaces
involve all the challenges and difficulties of natural language understanding and
human-computer communication. In this paper, we present a methodology to
create initial training data for voice-controlled devices which helps to design and track
dialogue system states. Using crowdsourcing, in a first step we collect simple
commands that users might give to devices. These commands are analyzed and manually
classified into 50 user-system interaction scenarios. In a second step, we design a
set of potential system states after processing the initial user commands, and crowd
workers are asked to provide multi-turn dialogues between a user and the device,
which simulate the process of resolving a system state towards completion. The
resulting dataset contains 320 commands and their classification into interaction
scenarios for the first step, and 640 multi-turn dialogues for step two, generated given
12 potential system states. Finally, we present a baseline for automatic classification
of utterance type and slot types in user commands, which is important for dialogue
state detection. The proposed methodology allows collecting dialogues for IoT
devices which cover a variety of system states and interaction patterns.</p>
      <p>Keywords: voice control systems for the Internet of Things, slot type classification,
command type classification, dataset for voice-controlled devices.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction and Related Work</title>
      <p>Gartner, Inc. forecasts that 8.4 billion connected things will be in use worldwide in
2017, up 31 percent from 2016, and that this number will reach 20.4 billion by 2020. The total spending
on endpoints and services will reach almost $2 trillion in 20171. The pool of devices
and of Internet of Things (IoT) architectures and protocols grows rapidly; however, this
domain lacks communication instruments, regarding both interaction between devices and
human-machine interaction and cooperation. Communication interfaces for IoT devices
are typically specific to producers; thus, the development of conversational agents and
voice control systems for IoT is in high demand due to the universal nature of natural
language and speech communication [Portet et al., 2013, Aldrich, 2003].</p>
      <p>There are two different paradigms for the design of dialogue systems and voice
interfaces. On the one hand, in the modular approach, the system is assembled using various
knowledge extraction algorithms and typically integrates a number of modules for natural
language understanding, a dialogue manager, knowledge bases, rule bases, ontologies and
other models [Jurafsky and Martin, 2017].</p>
      <p>On the other hand, end-to-end systems are trained directly from conversational data.
This approach requires no hand-crafted feature engineering and annotation, but large
dialogue datasets for training [Lowe et al., 2017, Serban et al., 2015]. This requirement
restricts end-to-end systems to certain domains and natural languages.</p>
      <p>In the case of voice-controlled systems for IoT devices, the problems are aggravated.
Here, often both specific training data and rule or knowledge bases are missing, and
furthermore there is a large diversity of IoT architectures and protocols. Moreover, linguistic
resources describing devices, technologies, and communication between users and IoT systems
are limited or do not exist. For example, the Google Speech Commands dataset2 includes
65 000 short, one-second commands (like "left!", "stop!"). Another source, the
Mozilla Common Voice dataset3, comprising 500 hours of speech from 20 000 different
volunteers, was developed for keyword spotting tasks and web application control. These
and similar datasets are intended to train speech recognition systems.</p>
      <p>In this paper, we propose a first step to address the problem of missing datasets
for training end-to-end systems. We introduce a methodology for dataset creation using
a combination of crowdsourcing and domain experts, and implement and evaluate the
method by creating initial small-scale datasets of IoT dialogues for the Russian language.
This dataset is primarily intended for the development of voice-controlled systems for IoT;
such systems fall within the emergent paradigm of artificial cognitive systems. In this
paper, we focus not only on dataset creation, but also on automatic morphosyntactic
annotation and classification of user commands and on slot type identification.</p>
      <p>The methodology can be summarized as follows: the first step involves crowdsourcing
to create first-turn data, which contains initial commands or requests of a user to an
IoT device. We then analyze the first-turn commands and elaborate 50 scenarios
of user-system interaction. Independently, we design an initial set of system states
(12 states), to which the system can transition after understanding the command and
collecting context knowledge.</p>
      <p>1 https://www.gartner.com/en/newsroom/press-releases/2017-02-07-gartner-says-8-billion-connected-thin
2 https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html
3 https://voice.mozilla.org/ru/data</p>
      <p>In a second iteration of crowdsourcing, the users are asked to generate natural
language responses to the respective system state, and to simulate a dialogue which resolves
the situation to a successful final state. Moreover, we present and evaluate a baseline
approach to classify the first-turn user commands into command types and a set of
5 binary features ("slots"), which can be used to track the dialogue state in future work.</p>
      <p>Applying this methodology, we collect 320 first-turn command items, and 640
multiturn dialogues from the second crowdsourcing step. The proposed methodology allows for
collecting dialogues of the necessary variety and representativeness for the described task,
and it can be used to extend the dataset created in our experiments. To the best of our
knowledge, there is no previous work which provides a comparable methodology or datasets.
As a final remark about the dataset, it also contains automatically created annotations
according to the Universal Dependencies 2.0 standard (for morphology and syntax). The
morphological and syntactic annotations were performed with the Russian language models
for the Stanford CoreNLP library [Kovriguina et al., 2017]. The current demonstration
files of the dataset with the corresponding annotations are published in the project
repository4.</p>
      <p>The paper is structured as follows: Section 2 explains core aspects of this work, such
as the methodology used for creating the dataset and to design the interaction scenarios
and dialogue manager states. In Section 3 we present the experiments to automatically
classify first-turn user commands, and we conclude with Section 4.
</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset Creation for Dialogue State Tracking</title>
      <p>In this section, we first discuss some fundamentals of dialogue managers and their influence
on the data necessary for the development of conversational intelligence for IoT, and then
present the two steps of our methodology to create datasets for dialogue state identification
in IoT systems.</p>
      <sec id="sec-3-1">
        <title>Fundamentals of Dialogue State Tracking for IoT Devices</title>
        <p>Typical modern architectures of conversational agents include dialogue managers as
a core module. Dialogue managers can be implemented with the use of several
approaches, such as finite-state automata, frame-based methods, the information state
update (ISU) approach, the belief-desire-intention (BDI) model, reinforcement learning,
etc. [Sungjin et al., 2016].</p>
        <p>Dialogue state trackers aim to identify the current conversation state based on a
user’s input and the previous conversation history, so that the dialogue manager can
choose the best next action. Typically, a slot-filling algorithm in the natural language
understanding module identifies relevant objects in the user’s utterance and tries to find
an appropriate slot, which is then processed by the dialogue manager. Analysis of human
conversations with smart environments has shown that resolving coreference, especially its
abstract type, is a much harder task, because users are inclined to use abstract lexis (light,
sound) to denote the device (lamp, audio system, correspondingly). Therefore, a natural
language understanding module needs algorithms which can find empirical referents for
abstract and new, previously unseen, concepts [Jia et al., 2017].</p>
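          <p>The resolution of abstract lexis to concrete devices described above can be sketched as a lookup step preceding slot filling. The mapping table and function below are illustrative assumptions, not part of the paper's system:</p>

```python
# Illustrative sketch: map abstract lexis ("light", "sound") to concrete
# device types ("lamp", "audio system") before slot filling.
# The mapping table and names are assumptions for illustration only.
ABSTRACT_TO_DEVICE = {
    "light": "lamp",
    "sound": "audio system",
    "heat": "heater",
}

def resolve_referent(token):
    """Return a concrete device type for an abstract term, or the token itself."""
    return ABSTRACT_TO_DEVICE.get(token.lower(), token)

# "Turn off the light" -> the slot filler should receive "lamp"
device = resolve_referent("light")
```

          <p>A static table of course cannot handle previously unseen concepts; that is exactly the harder problem the cited work addresses.</p>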
        <sec id="sec-3-1-1">
          <p>4 https://github.com/organizations/MANASLU8/VoiceIoT</p>
          <p>Conceptualizing conversational agents for the IoT domain as emergent types
of artificial cognitive architectures [Vernon, 2014, Profanter, 2012], we believe that it is
critical to incorporate context-sensitive knowledge, obtained from devices, data storages, and
the device knowledge base, into the space of dialogue states.</p>
          <p>In a first step towards the goal of dialogue state tracking models for Internet of
Things, it is necessary to isolate and analyze user request patterns for different devices.
From this data, potential dialogue states can be designed. Furthermore, it is necessary to
obtain users' responses to given dialogue states, as well as natural language
equivalents for the system states and the user interactions. In the following two-step dataset
creation process, we present the methodology and first dataset creation experiments which
tackle the described problems.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Step I: First-Turn Dataset Creation and Scenario Identification</title>
        <p>In the first step of dataset creation, we collect initial commands given by the device user. We
employ a crowdsourcing strategy and provide each worker with a task description which,
in summary, asks the crowd worker to choose an arbitrary smart device from any smart
environment and provide 2-4 operation commands for the chosen device. For example,
such a command might be “Wash at 40 degrees” (given to a washing machine). In this
step, 29 crowd workers were involved.</p>
        <p>The crowd workers proposed a vast list of devices, from smart lamps, smart
electricity/water meters to smart bread makers, smart bathroom sensors and garden paths with
heating. For those devices, the crowd workers were instructed to formulate several natural
language requests (or commands). As there were no restrictions on device selection, some
devices were chosen by several users.</p>
        <p>After collecting the results from the crowd workers, we analyzed and merged the
commands and ended up with 50 scenarios, each containing from 2 to 23 natural
language commands, 320 commands in total. All counts are given after the removal of
duplicate commands and scenarios from the dataset.</p>
        <p>A scenario is a string quadruple (D; C; P; L), where D is a device, which is typically
represented by a list of sensors. C is an array of potential commands, connected to this
device. P is a multidimensional array of parameters, associated with each command, and
L is a set of arbitrary natural language commands, generated by informants (in our case:
crowd workers). Therefore, a scenario contains starting phrases (first-turn utterances),
which are typically used as commands for the device.</p>
        <p>An example of a scenario5 is as follows:
- Device D: a washing machine that has a sensor for load detection;
- Commands C: wash6;
- Parameters P: washing temperature of type integer; this parameter also has a default
value.</p>
        <p>5In the text of the paper, all examples are given in English translation. The original dataset is in
the Russian language.</p>
        <p>6This command starts the washing process, but does not turn the washing machine on.</p>
        <p>- Natural language commands L:
1. Start washing at 23:05!
2. Wash the clothes!
3. Wash at 90 degrees!</p>
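        <p>The scenario quadruple (D, C, P, L) can be represented, for instance, as a plain data structure; the field names and the default temperature below are our own illustrative choices, not the dataset's serialization format:</p>

```python
# Sketch of a scenario quadruple (D, C, P, L); field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Scenario:
    device: str                     # D: the device, with its sensors
    commands: list                  # C: potential commands for the device
    parameters: dict                # P: parameters per command (with defaults)
    utterances: list = field(default_factory=list)  # L: crowd-worker commands

washing = Scenario(
    device="washing machine (load-detection sensor)",
    commands=["wash"],
    parameters={"wash": {"temperature": {"type": "integer", "default": 40}}},
    utterances=[
        "Start washing at 23:05!",
        "Wash the clothes!",
        "Wash at 90 degrees!",
    ],
)
```

        <p>The starting phrases in L are the first-turn utterances used in step II.</p>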
        <p>As a side note, the crowd workers produced paraphrases for some commands,
which are preserved in the dataset and can be used in future work, for example for
paraphrase detection.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Step II: Dialogue Manager States and Multi-turn Dialog Generation</title>
        <p>Given the first-turn commands and the collected scenarios from step I, we analyzed
potential system response states based on those commands. Assuming a correct understanding
of the command by the dialogue system, this led to a list of 12 possible system reactions
to the first-turn requests, which correspond to dialogue manager states.</p>
        <p>To make those dialogue manager states more intuitive, we present some examples
of system reaction codes and natural language responses for a scenario with a washing
machine:
1. System State / Response code 1. All necessary devices, commands and parameters
have been found, there are no conflicts, the command will be run now.</p>
        <p>Corresponding natural language response: Washing starts!
2. System State / Response code 2. All necessary devices, commands and parameters
have been found, but a multiple choice problem exists (several entities/devices satisfy
the query).</p>
        <p>Corresponding natural language response: Which washing machine should be used:
the one in the kitchen or the one in the bathroom?
3. System State / Response code 3. A parameter or parameter value was set incorrectly.</p>
        <p>Corresponding natural language response: The washing temperature cannot be set
to 15 degrees. Please choose between 30 and 90 degrees!</p>
        <p>As mentioned, the system reaction codes correspond to the states of the dialogue
manager, given that the user request was correctly understood by the system. The proposed list of
12 dialogue manager states includes states for parameter conflicts, scheduled requests,
unknown or fuzzy parameters, multiple-choice cases, non-saturated frames,
mistakes, and the reporting of side effects.</p>
        <p>After the definition of the 12 dialogue manager states, we performed a second round of
crowdsourcing. This time, the crowd workers were given the list of first-turn natural
language utterances collected in step I, and the list of 12 system reaction codes. The
workers were asked to verbalize the system reaction code in natural language, and create a
dialogue of user responses and system reactions which completes the dialogue successfully.</p>
        <p>The final dataset of dialogues generated in step II includes 640 dialogues, and the number
of turns per dialogue varies from 2 to 5. A sample dialogue is given below (U - user, S -
system):
– U: Turn on the light!
– S: Do you want to turn on the light in the kitchen or in the whole flat?
– U: Only in the kitchen.
– S: Ok, done!</p>
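        <p>A step-II dialogue can be stored, for example, as a list of turns tagged with the speaker and, for system turns, the response code from the list of 12; the tuple layout and helper below are a sketch under that assumption, not the dataset's actual format:</p>

```python
# Sketch of one multi-turn dialogue (2-5 turns) with system response codes.
# The tuple layout (speaker, text, code) is an illustrative assumption.
dialogue = [
    ("U", "Turn on the light!", None),
    ("S", "In the kitchen or in the whole flat?", 2),  # code 2: multiple choice
    ("U", "Only in the kitchen.", None),
    ("S", "Ok, done!", 1),  # code 1: command will be run now (success)
]

def is_resolved(turns):
    """A dialogue is resolved when the last system turn carries code 1."""
    system_codes = [code for speaker, _, code in turns if speaker == "S"]
    return bool(system_codes) and system_codes[-1] == 1
```

        <p>Every dialogue collected in step II ends in such a successful final state by construction of the crowdsourcing task.</p>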
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments on Command Type and Slot Identification</title>
      <p>In this section we describe two experiments performed on the commands collected in step
I. The raw data comprises the 320 first-turn natural language requests to the smart
environment. The goal of the experiments is to automatically classify the commands (a) by
the type of command and (b) by slots, which represent various characteristics of the command,
for example whether the command contains a condition or specifies some device parameters. This
classification of basic command type and slots will help in the construction of a dialogue
system, which is part of future work.</p>
      <sec id="sec-4-1">
        <title>Description of Command Classes</title>
        <p>For the purpose of classifying the first-turn user commands, we use six features. The first
feature is the command type:</p>
        <p>Command type: We currently distinguish three command types: request – 1 (e.g.,
"How much water did I spend last month?"), explicit command – 2, which can be directly
mapped to an entity in a knowledge base via its label, and implicit command – 3, which
cannot be directly mapped, so that the system has to guess what to do (e.g., "Reduce the
temperature!").</p>
        <p>These three types have little in common with typical speech act classifications and
were introduced to distinguish implicit commands, explicit commands, and requests
(typically, to databases external to the IoT system) from one another.</p>
        <p>The other 5 features are the slots describing important command characteristics7:
1. Device, 1 - if mentioned, 0 - otherwise ("Turn on the desk lamp.");
2. Condition, 1 - if mentioned, 0 - otherwise ("If the outdoor temperature is higher than
25 C, set the split system to +18 C.");
3. Location, 1 - if mentioned, 0 - otherwise ("Start watering in greenhouse 17.");
4. Parameter, 1 - if mentioned, 0 - otherwise ("Set the brightness of the chandelier to
600 Lumen.");
5. Schedule, 1 - if mentioned, 0 - otherwise ("Turn on the oven at 9 pm.").
7Examples of requests are given in brackets, and the entity corresponding to the slot type is italicized.</p>
        <p>Thus, a vector of labels characterizing a command looks as follows:
(2, 0, 1, 1, 0, 0) for the command "Turn off the light, when I leave the room".
This is an explicit command (2), a device is not mentioned explicitly (0), a condition (1)
and a location (1) are present, no parameters are specified (0), and the command is not
scheduled (0).</p>
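        <p>The six-element label vector can be assembled, for instance, as follows; the annotation itself was done manually, so the helper function is purely illustrative:</p>

```python
# Sketch: assemble the 6-element label vector (command type + 5 binary slots).
# Annotation in the paper was manual; this helper is purely illustrative.
def label_vector(command_type, device, condition, location, parameter, schedule):
    assert command_type in (1, 2, 3)   # 1: request, 2: explicit, 3: implicit
    slots = (device, condition, location, parameter, schedule)
    assert all(s in (0, 1) for s in slots)
    return (command_type,) + slots

# "Turn off the light, when I leave the room":
# explicit command, no explicit device, condition and location present
v = label_vector(2, 0, 1, 1, 0, 0)
```
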
      </sec>
      <sec id="sec-4-2">
        <title>Model Setup</title>
        <p>Given those six features, we created a gold-standard dataset by manually annotating each
of the 320 commands collected in step I with a vector of 6 features: the element at position '0'
denotes the command type, and the elements at positions '1-5' denote the slot types. The slot-type
positions form a binary vector: if objects corresponding to a particular slot are present
in the request, the value is '1', else '0'.</p>
        <p>For model training and evaluation, we used word embedding representations of
the user commands. First, we trained word vectors on the Aranea Russicum Maius corpus8.
The size of the corpus is 1,200,001,911 tokens. The model was trained on the raw text
without preprocessing using the word2vec library [Mikolov et al., 2013], with the
skip-gram algorithm and a vector size of 100 dimensions. With those word embeddings, we
created a vector representation of a whole user request by averaging the word vectors
of the words in the given request. In future work, we will experiment with more elaborate
methods to represent the commands, for example weighting schemes for words, sentence
embeddings, etc.</p>
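        <p>The request representation just described (the mean of the request's word vectors) can be sketched as follows; the toy 4-dimensional vocabulary stands in for the 100-dimensional skip-gram model trained on Aranea Russicum Maius:</p>

```python
import numpy as np

# Toy stand-in vocabulary; the paper uses 100-dimensional skip-gram vectors
# trained on the Aranea Russicum Maius corpus.
EMB = {
    "wash":    np.array([0.2, 0.1, 0.0, 0.4]),
    "at":      np.array([0.0, 0.0, 0.1, 0.0]),
    "90":      np.array([0.4, 0.3, 0.2, 0.1]),
    "degrees": np.array([0.2, 0.2, 0.1, 0.3]),
}

def request_vector(tokens, emb, dim=4):
    """Average the word vectors of the tokens found in the vocabulary."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

vec = request_vector("wash at 90 degrees".split(), EMB)
```

        <p>Out-of-vocabulary tokens are simply skipped here; how the actual experiments handled them is not specified in the paper.</p>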
      </sec>
      <sec id="sec-4-3">
        <title>Experiment Results for Command Type Classification</title>
        <p>First, we classify the commands into the three command types described above. In the
gold data labelled by experts, the command types are distributed as follows: 64% of
utterances are explicit commands to the system, 25% are requests, and 11% are implicit
commands. The classification of command types into those three classes was performed
with the Weka machine learning library9, using the word embedding representations of
user commands.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Experimental Results for Slot Type Classification</title>
        <p>The same classification algorithms and word embedding representations were used in the
slot type classification tasks. According to manual annotation, the slots in user utterances
are distributed as follows: Devices are mentioned in 31% of utterances, conditions - in
27%, locations - in 24%, and parameters and scheduling - in 14% each. Here, we have five
binary classification tasks per user command, one per slot type.</p>
        <p>Classification results for the Device, Condition and Location slots can be adopted as a
baseline and used in future work. However, as indicated by the prediction accuracy for
the Parameter and Scheduling slots, the baseline classifier seems not to have learned
to discriminate those. For the Parameter and Scheduling slots, a similar prediction</p>
        <sec id="sec-4-4-1">
          <p>8 http://aranea.juls.savba.sk/aranea_about/index.html 9 https://www.cs.waikato.ac.nz/ml/weka/</p>
          <p>accuracy can be reached by always predicting the value 0. Moreover, we plan to evaluate the
classifiers on more balanced datasets in future work.</p>
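          <p>The always-predict-0 remark can be made concrete: given the reported slot distribution, the majority-class baseline already reaches the accuracies computed below, which any useful classifier has to beat:</p>

```python
# Majority-class (always predict 0) baseline accuracy per slot,
# computed from the positive-class rates reported for the 320 commands.
slot_positive_rate = {
    "device": 0.31,
    "condition": 0.27,
    "location": 0.24,
    "parameter": 0.14,
    "schedule": 0.14,
}

# Accuracy of always predicting 0 is one minus the positive rate.
baseline_accuracy = {s: round(1.0 - p, 2) for s, p in slot_positive_rate.items()}
# Parameter and Scheduling both sit at 0.86, which is why a classifier
# with similar accuracy has not learned anything useful for those slots.
```
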
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The paper presents and applies a methodology for creating datasets for voice
communication with IoT devices. The method focuses on user commands which lead to specific
dialogue manager states, and are finally resolved in a dialogue between the IoT system
and the device user. The datasets created with crowdsourcing reflect those dialogue
elements, and were also applied to classify user commands according to six features which
will be used for dialogue state tracking in future work.</p>
      <p>The main contributions of the paper include (i) a procedure for crowdsourcing voice
control data from end users, (ii) the provision of datasets, (iii) automatic morphosyntactic
annotation of the datasets, and (iv) a baseline evaluation of various machine learning
algorithms. The datasets created, especially the natural language responses which users formulated
on behalf of the system, may also be re-used in other tasks, for example as patterns in
natural language generation tasks.</p>
      <p>Finally, there are multiple directions for future work. We plan to extend the
dialogue datasets with crowdsourcing techniques, to improve on the classification baselines
presented in the experiments section, and to apply the collected data for the creation of
IoT dialogue systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>L. Kovriguina acknowledges support from the Russian Foundation for Basic Research (RFBR),
Grant No. 16-36-60055. G. Wohlgenannt acknowledges support from the Government of
the Russian Federation (Grant 074-U01) through the ITMO Fellowship and Professorship
Program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[Aldrich, 2003] Aldrich, F. K. (2003) Smart Homes: Past, Present and Future // Inside the Smart Home, Springer, London, 2003, pp. 17-39.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[Jia et al., 2017] Jia, R., Heck, L., Hakkani-Tür, D., and Nikolov, G. (2017) Learning concepts through conversations in spoken dialogue systems // Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, IEEE, 2017, 5725-5729.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[Jurafsky and Martin, 2017] Jurafsky, D., and Martin, J. H. (2017) Dialogue Systems and Chatbots // Speech and Language Processing, 2017, 11(2): 121-137.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[Kovriguina et al., 2017] Kovriguina, L., Shilin, I., Putintseva, A., and Shipilo, A. (2017) Russian Tagging and Dependency Parsing Models for Stanford CoreNLP Natural Language Toolkit // International Conference on Knowledge Engineering and the Semantic Web, Springer, 2017, 101-111.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[Lowe et al., 2017] Lowe, R. T., Pow, N., Serban, I. V., Charlin, L., Liu, C.-W., and Pineau, J. (2017) Training End-to-end Dialogue Systems with the Ubuntu Dialogue Corpus // Dialogue &amp; Discourse, 2017, 8(1), 31-65.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[Mesnil et al., 2015] Mesnil, G., Dauphin, Y., Yao, K., Bengio, Y., Deng, L., Hakkani-Tur, D., He, X., Heck, L., Tur, G., Yu, D., et al. (2015) Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding // IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2015, 23(3), 530-539.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Mikolov et al.,
          <year>2013</year>
          ] Mikolov,
          <string-name>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Efficient estimation of word representations in vector space</article-title>
          // arXiv preprint arXiv:1301.3781,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Portet et al.,
          <year>2013</year>
          ] Portet,
          <string-name>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vacher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golanski</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roux</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Meillon</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Design and evaluation of a smart home voice interface for the elderly: acceptability and objection aspects</article-title>
          //
          <source>Personal and Ubiquitous Computing</source>
          ,
          <year>2013</year>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>127</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Profanter,
          <year>2012</year>
          ] Profanter,
          <string-name>
            <surname>S.</surname>
          </string-name>
          (
          <year>2012</year>
          )
          <article-title>Cognitive architectures</article-title>
          //
          <source>HauptSeminar Human Robot Interaction</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Serban et al.,
          <year>2015</year>
          ] Serban,
          <string-name>
            <given-names>I. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>R. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charlin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pineau</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>A Survey of Available Corpora For Building Data-Driven Dialogue Systems</article-title>
          // arXiv preprint arXiv:1512.05742,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Sungjin et al.,
          <year>2016</year>
          ] Sungjin,
          <string-name>
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Stent</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Task Lineages: Dialog State Tracking for Flexible Interaction</article-title>
          //
          <source>Proceedings of the SIGDIAL 2016 Conference</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Vernon,
          <year>2014</year>
          ] Vernon,
          <string-name>
            <surname>D.</surname>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Artificial cognitive systems: A primer</article-title>
          //
          <source>MIT Press</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>