<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the EVALITA 2018 Spoken Utterances Guiding Chef's Assistant Robots (SUGAR) Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Di Maro</string-name>
          <email>maria.dimaro2@unina.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Origlia</string-name>
          <email>antonio.origlia@unina.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Cutugno</string-name>
          <email>cutugno@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Napoli 'Federico II', Department of Electrical Engineering and Information Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Napoli 'Federico II', Department of Humanities</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Napoli 'Federico II', URBAN/ECO Research Center</institution>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>The SUGAR task aims at developing a
baseline for training a robotic chef's
assistant controlled by voice commands.
The starting point is, therefore, to provide
authentic spoken material collected in a
simulated natural context, from which
semantic predicates are extracted in order
to classify the actions to be performed.
Three different approaches were used by
the two participants to solve the task. The
results show the actual critical issues
underlying the task.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        In the last few years, Human-Machine interaction
systems have been in the spotlight, as far as
computer science and linguistics are concerned,
resulting in many applications such as Virtual Assistants
and Conversational Agents
        <xref ref-type="bibr" rid="ref1 ref4 ref5 ref7">(Cassell et al., 2000;
Cauell et al., 2000; Dzikovska et al., 2003; Allen
et al., 2007)</xref>
        . The possibility of using such
Artificial Intelligence technologies in domestic
environments is increasingly becoming a reality
        <xref ref-type="bibr" rid="ref17 ref6 ref7">(Darby,
2018; Ziefle and Valdez, 2017)</xref>
        . In order to ensure
that such systems can become even
more intelligent, further research is needed. As
has been the case with Apple SIRI and Google
Assistant technologies, recent approaches have
transformed former dialogue systems into direct
action actuators, removing or reducing, as much as
possible, clarification requests that may arise in the
presence of ambiguous commands. In this view,
Spoken Language Understanding (SLU) is
nowadays one of the major challenges of the field.
Making a system able to truly understand the
intention of the speaker in different contexts and react
correctly, even in the presence of Automatic Speech
Recognition (ASR) errors, is the ultimate purpose
to pursue in the field. In this context, the
application of various semantic annotation schemata
and criteria of knowledge modelling are of
particular interest. Among different techniques used
to model the interpretation process we cite: (i)
semantic-frame parsing, where frame
classification, together with the recognition of its attributes,
can improve the information retrieval process for a more
precise domain-specific answer
        <xref ref-type="bibr" rid="ref15">(Wang, 2010)</xref>
        ;
(ii) semantic interpretation, for which
semantic-syntactic trees can be used to extract basic
semantic units and their relationships
        <xref ref-type="bibr" rid="ref10">(Miller et al.,
1996)</xref>
        ; (iii) intent classification, for which
structures comprising generic predicates working as
semantic primitives
        <xref ref-type="bibr" rid="ref16">(Wierzbicka, 1972)</xref>
        and
domain-dependent arguments can be used to represent a
specific intent
        <xref ref-type="bibr" rid="ref11 ref13">(Tur and Deng, 2011; Serban et al.,
2018)</xref>
        . With this task, we propose a
possible framework for semantic classification to be
tested, relying on state-of-the-art SLU systems
participating in the EVALITA-SUGAR challenge
        <xref ref-type="bibr" rid="ref3">(Caselli et al., 2018)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2a">
      <title>2 Task Description</title>
      <p>
        The task consists in interpreting spoken
commands given to a cooking assistant. For this purpose, a
training corpus of annotated spoken commands was
collected. To collect the corpus, we designed a
3D virtual environment reconstructing and
simulating a real kitchen where users could
interact with a robot (named Bastian) which received
commands to be performed in order to
accomplish some recipes. Users’ orders were inspired by
silent cooking videos shown in the 3D scene, thus
ensuring the naturalness of the spoken production.
Videos were segmented into elementary portions
(frames) and sequentially proposed to the
speakers who uttered a single sentence after each seen
frame. In this view, speakers watched video
portions and then gave instructions to the robot to
emulate what was seen in the frame (Figure 1). The
collected corpus then consists of a set of spoken
commands, whose meaning derives from the
various combinations of actions, items (i.e.
ingredients), tools and different modifiers.
      </p>
      <p>Audio files were captured in a real acoustic
environment, with a microphone placed at about 1 m
from the speakers. The resulting
corpus contains audio files for each speaker. These
files were then segmented into sentences
representing isolated commands. Orthographic
transcriptions of the audio files were not provided.
Consequently, participants could use whichever
ASR they preferred, as its performance was not
under assessment. Nevertheless, the developed
systems were expected to remain efficient despite
possible ASR deficiencies. Each resulting
audio file was paired with a textual one containing the
corresponding action annotation.</p>
      <p>Training set. Actions are represented as a finite
set of generic predicates accepting an open set of
parameters. For example, the action of putting
may refer to a pot being placed on the fire
or to an egg being put in a bowl:
put(pot; fire)
put(egg; bowl)
The annotation process resulted in determining
the optimal action predicate corresponding to each
command.</p>
      <p>The training set consists of audio files paired with
predicate descriptions, where the predicate serves
as an interpretation of the intention to be
performed by the robot. In these scenarios, each audio
file is always mapped onto a single interpretative
predicate. The training set consists of 1721
utterances (and therefore 1721 audio files) produced by
36 different speakers and annotated by two linguistic
experts. The action templates, which have been
inferentially defined through the video collection,
are shown in Table 1, where [ ] indicates a list of
ingredients, / the alternative among possible
arguments, quantity and modality are not mandatory
arguments, and * is used when the argument is
recoverable from the context (i.e. previously
instantiated arguments, which are not uttered, not even
by means of clitics or other pronouns) or from the
semantics of the verb.</p>
      <p>For instance,
friggere(fiori) (En. fry(flowers))
is represented as
aggiungere(fiori, *olio*) (En. add(flowers, *oil*)),
because olio (En. oil) is implicitly expressed in the
semantics of the verb friggere (En. to fry) as an
instrument to accomplish the action. Among other
phenomena, it is worth mentioning the presence of
actions paired with templates even when the
syntactic structure needs a reconstruction, as in
coprire(ciotola, pellicola) (En. cover(bowl, wrap)),
which is annotated with the generic template as
mettere(pellicola, ciotola) (En. put(wrap, bowl)).</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Action templates and their admissible arguments.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Predicate</th><th>Arguments</th></tr>
          </thead>
          <tbody>
            <tr><td>prendere</td><td>quantità, [ingredienti]/recipiente</td></tr>
            <tr><td>aprire</td><td>quantità, [ingredienti], recipiente</td></tr>
            <tr><td>mettere</td><td>quantità, utensile/[ingredienti], elettrodomestico, modalità</td></tr>
            <tr><td>sbucciare</td><td>quantità, [ingredienti], utensile</td></tr>
            <tr><td>schiacciare</td><td>[ingredienti], utensile</td></tr>
            <tr><td>passare</td><td>[ingredienti], utensile</td></tr>
            <tr><td>grattare</td><td>[ingredienti], utensile</td></tr>
            <tr><td>girare</td><td>[ingredienti], utensile</td></tr>
            <tr><td>togliere</td><td>utensile/prodotto, elettrodomestico</td></tr>
            <tr><td>aggiungere</td><td>quantità, [ingredienti], utensile/recipiente/elettrodomestico/[ingredienti], modalità</td></tr>
            <tr><td>mescolare</td><td>[ingredienti], utensile, modalità</td></tr>
            <tr><td>impastare</td><td>[ingredienti]</td></tr>
            <tr><td>separare</td><td>parte/[ingredienti], ingrediente/utensile</td></tr>
            <tr><td>coprire</td><td>recipiente/[ingredienti], strumento</td></tr>
            <tr><td>scoprire</td><td>recipiente/[ingredienti]</td></tr>
            <tr><td>controllare</td><td>temperatura, ingrediente</td></tr>
            <tr><td>cuocere</td><td>quantità, [ingredienti], utensile, modalità</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In other cases, the uttered action represents the
consequence of the action reported in the template,
as in
separare(parte, fiori) (En. separate(part, flowers))
and
pulire(fiori) (En. clean(flowers)),
or
mescolare([lievito, acqua]) (En. stir([yeast, water]))
and
sciogliere(lievito, acqua) (En. melt(yeast, water)).
The argument order does not reflect the one in the
audio files, but the following:
azione(quantità, oggetto, complemento, modalità)
(En. action(quantity, object, complement, modality)).
The quantity always precedes the noun it refers to;
therefore, it can also come before the complement.
The modality arguments are of different types and
their order is adverb, cooking modality, temperature
and time.</p>
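      <p>As an illustration only, the following minimal Python sketch
(ours, not part of the task material) shows one possible
in-memory representation of an action predicate and its rendering
in the azione(quantità, oggetto, complemento, modalità) format,
with asterisks marking implicit arguments:</p>
      <preformat>
from dataclasses import dataclass, field


@dataclass
class Argument:
    text: str               # surface form, e.g. "olio"
    implicit: bool = False  # True when recoverable from context or verb semantics

    def render(self):
        # Implicit arguments are wrapped in asterisks, e.g. *olio*
        return f"*{self.text}*" if self.implicit else self.text


@dataclass
class ActionPredicate:
    action: str                                    # e.g. "aggiungere"
    arguments: list = field(default_factory=list)  # ordered as in the templates

    def render(self):
        inner = ", ".join(arg.render() for arg in self.arguments)
        return f"{self.action}({inner})"


# friggere(fiori) is annotated with the generic template aggiungere(fiori, *olio*),
# since oil is implicit in the semantics of the verb friggere.
pred = ActionPredicate("aggiungere", [Argument("fiori"), Argument("olio", implicit=True)])
print(pred.render())  # aggiungere(fiori, *olio*)
      </preformat>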
      <p>Test set. The test set consists of about 572 audio
files containing uttered commands without
annotations. Task participants were asked to provide,
for each target command, the correct action
predicate following the above-described format.
Although single actions are of the same kind as the
ones found in the training set and in the template
file, the objects to which such actions may be
applied vary (i.e. different recipes,
ingredients, tools...). Participants have been evaluated on
the basis of correctly interpreted commands,
represented in the form of predicates.</p>
      <p>The task could be carried out either by using
only the provided linguistic information of the
training set or by means of other external
linguistic tools, such as ontologies, specialised lexicons,
and external reasoners.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Evaluation Protocol</title>
      <p>The evaluation protocol covered the following
possibilities:</p>
      <list list-type="bullet">
        <list-item><p>the proposed system correctly detects the
requested action and all its parameters;</p></list-item>
        <list-item><p>the proposed system correctly detects the
requested action but assigns wrong parameters;</p></list-item>
        <list-item><p>the proposed system misses the action;</p></list-item>
        <list-item><p>the proposed system asks for repetition.</p></list-item>
      </list>
      <p>The possibility of asking for repetitions was left
to participants to avoid forcing them to provide an
answer in uncertain conditions. In this case, the
evaluation protocol assigned a weaker
penalisation than the one considered for missing the
arguments or the action. The collected corpus did
not, however, contain situations in which the
system asks for repetitions.</p>
      <p>The designed evaluation procedure outputted
the following pieces of information:</p>
      <list list-type="order">
        <list-item><p>an id comprising the listing number of the
recognised predicate and the number of
actions, in case of pluri-action predicates (1_1,
1_2, 2_1, etc.);</p></list-item>
        <list-item><p>a Boolean value (1: True, 0: False) indicating
whether the predicate has been recognised; when
the predicate was not recognised, the
argument number was also set to 0;</p></list-item>
        <list-item><p>the number of expected arguments as
indicated in the reference annotation files (annotation
files created for the test set, although not made
available to participants);</p></list-item>
        <list-item><p>the distance between the participating
systems’ output file and the reference file,
computed by means of the Levenshtein distance
          <xref ref-type="bibr" rid="ref8">(Levenshtein, 1966)</xref>
          ; the higher the computed
distance, the more mistakes the system had
made;</p></list-item>
        <list-item><p>the number of arguments for which the
system asked for repetition.</p></list-item>
      </list>
      <p>Suppose the action in the reference file is annotated as
1; [prendere(500 g, latte), aggiungere(latte, pentola)]
(En. 1; [take(500 g, milk), add(milk, pot)])
and the recognition procedure outputs
1; prendere(500 g, panna)
(En. 1; take(500 g, cream)).
Instead of returning the following result,
indicating a correct recognition,
1_1 (1, 2, 0, 0) (first predicate)
1_2 (1, 2, 0, 0) (second predicate)
the evaluation outputs
1_1 (1, 2, 1, 0)
1_2 (0, 0, 0, 0)
where the first predicate is recognised despite one
mistaken argument (two arguments were expected
but one of them was wrong), whereas the second
predicate is not recognised at all.</p>
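      <p>The following Python sketch (our own illustrative
reconstruction, not the official evaluation script) computes, for
a single reference predicate, the described tuple of recognition
flag, expected argument count, Levenshtein distance and number of
repetition requests:</p>
      <preformat>
def levenshtein(a, b):
    # Standard dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score(reference, hypothesis, repetitions=0):
    # reference/hypothesis are (action, [arguments]) pairs, e.g.
    # ("prendere", ["500 g", "latte"]); a missed predicate scores (0, 0, 0, 0).
    if hypothesis is None or hypothesis[0] != reference[0]:
        return (0, 0, 0, 0)
    distance = levenshtein(reference[1], hypothesis[1])
    return (1, len(reference[1]), distance, repetitions)


print(score(("prendere", ["500 g", "latte"]), ("prendere", ["500 g", "panna"])))
# (1, 2, 1, 0): predicate recognised, two expected arguments, one of them wrong
print(score(("aggiungere", ["latte", "pentola"]), None))
# (0, 0, 0, 0): the predicate was not recognised at all
      </preformat>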
      <p>The output format had to follow the one
provided for the training data. For instance,
asterisks indicating the implicitness of the arguments
had to be included in the output file. As a matter
of fact, retrieving the implicit function of a
reconstructed argument serves to gauge the degree of
understanding of the system, along with making
this information usable for the
improvement of fine-grained action detection tasks.
On the other hand, the choice between alternative
arguments (separated by a slash in the reference
files) does not invalidate the results. In fact, to
execute an action, only one of the uttered alternatives
must be chosen. Therefore, when one of the
alternatives was recognised, the resulting output did
not contain recognition errors. On the contrary,
when the system reported both alternatives in the
output file, the Levenshtein distance increased. In
the reference files, alternatives also occurred
as implicit arguments, when an utterance could be
completed by more than one possible argument.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Participating Systems</title>
      <p>In this section, we report the results collected
from testing the two participants’ systems: the
first (Section 4.1) was developed at
Fondazione Bruno Kessler (FBK), while the second
by an Italian company which decided to
remain anonymous (Section 4.2). In Table 2, results
are summarised, showing that FBK obtained better
performances in terms of correct predicate and
argument recognition for the intent classification
with its second system (Figure 2).
On the other hand, its first system produced worse
results, despite the introduction of the argument
repetition request. In this phase, the argument
repetition percentage was not weighted in the
accuracy rate of the system, which would have resulted
in a slight increase of the accuracy itself, but we
reported it as an additional performance measure of the
participating system. For the anonymous system, the
action recognition is slightly above 50%, but
the argument recognition shows some issues
(Figure 2) related to an over-fitting problem (see
Section 4.2). For all three systems, recognition
errors seemed to be random and not justifiable as
semantically-related word selections.</p>
      <sec id="sec-4-1">
        <title>4.1 FBK-HLT-NLP</title>
        <p>
          To solve the proposed task, two different
approaches were introduced. The first system was
similar to the architecture proposed in
        <xref ref-type="bibr" rid="ref9">(Madotto et
al., 2018)</xref>
        and was based on an encoder-decoder
approach. The encoder consisted of a MemNN
network
        <xref ref-type="bibr" rid="ref12">(Sukhbaatar et al., 2015)</xref>
        that stored each
previous sentence in memory, from which
relevant information was retrieved for the current
sentence. The decoder was a combination of i) a
MemNN to decode the input to an instruction
containing tokens from output vocabulary and ii) a
Pointer network
        <xref ref-type="bibr" rid="ref12 ref14">(Vinyals et al., 2015)</xref>
        that chose
which token from the input was to be copied to
the output instruction. This system was used to
classify the SUGAR corpus intents after an ASR
transcription (System 1).
      </p>
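      <p>A minimal PyTorch sketch of the copy step described above,
in which a pointer-style gate decides between generating a token
from the output vocabulary and copying one from the input, could
look as follows (all names and dimensions are our own; this is
not the FBK implementation):</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F


class CopyGate(nn.Module):
    """Blend a generation distribution over the output vocabulary with a
    copy distribution over the input tokens, pointer-network style."""

    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden, vocab_size)  # generation scores
        self.gate = nn.Linear(hidden, 1)           # copy-vs-generate switch

    def forward(self, dec_state, enc_states, src_token_ids):
        # dec_state: (batch, hidden); enc_states: (batch, src_len, hidden);
        # src_token_ids: (batch, src_len) ids of the input tokens.
        scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)
        copy_dist = F.softmax(scores, dim=1)               # over input positions
        gen_dist = F.softmax(self.proj(dec_state), dim=1)  # over the vocabulary
        p_copy = torch.sigmoid(self.gate(dec_state))       # (batch, 1)
        # Scatter copy probabilities onto the vocabulary entries of the
        # corresponding input tokens and mix the two distributions.
        out = (1 - p_copy) * gen_dist
        return out.scatter_add(1, src_token_ids, p_copy * copy_dist)


gate = CopyGate(hidden=32, vocab_size=50)
dist = gate(torch.randn(2, 32), torch.randn(2, 7, 32), torch.randint(0, 50, (2, 7)))
print(dist.shape)  # torch.Size([2, 50]); each row sums to 1
      </preformat>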
      <p>The second approach consisted of modeling
the task as a sequence to sequence problem.
Rather than implementing a new system, Fairseq
        <xref ref-type="bibr" rid="ref18">(Gehring et al., 2017)</xref>
        , a fully convolutional
architecture for sequence to sequence modeling, was
used. Instead of relying on Recurrent Neural
Networks (RNNs) to compute intermediate encoder
states z and decoder states h, convolutional neural
networks (CNNs) were adopted. Since the amount
of training data was not big enough to train the
model with such a system, written synthetic data
were generated. To generate new data, two main
methodologies were adopted: on the one hand,
random words were substituted with similar words
based on similarity mechanisms, such as word
embeddings; on the other hand, training sentences were
generated by replacing verbs and nouns with
synonyms extracted from an online dictionary
(System 2).</p>
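      <p>The following Python sketch illustrates this kind of data
augmentation under our own assumptions: the get_substitutes hook
and the toy synonym table are hypothetical stand-ins for an
embedding-based neighbour lookup or an online dictionary:</p>
      <preformat>
import random


def augment(sentence, get_substitutes, p=0.2, seed=None):
    """Create a synthetic variant of a training sentence by replacing
    random words with similar ones."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        candidates = get_substitutes(word)
        if candidates and rng.random() &lt; p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


# Hypothetical substitute source: a toy synonym table standing in for
# word-embedding nearest neighbours or an online dictionary lookup.
synonyms = {"mescola": ["mischia", "amalgama"], "pentola": ["tegame"]}
print(augment("mescola la farina nella pentola",
              lambda w: synonyms.get(w, []), p=0.5, seed=3))
      </preformat>
      </sec>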
      <sec id="sec-4-2">
        <title>4.2 Deep neural network for SUGAR</title>
        <p>
          The anonymous participant built a deep neural
network system to tackle this task. (The following
description results from a conversation with the
participant, whose report was not officially
submitted to EVALITA 2018 in order to remain
anonymous.) First of all, to
convert the spoken utterances into text, the Google
Speech API was used. The neural network used
a word embeddings lexicon trained on a corpus of
recipes crawled on the web (4.5 million words) as
features. The word embeddings, with vectors
having 100 dimensions, were trained with the
skip-gram algorithm of fastText (https://fasttext.cc/)
        <xref ref-type="bibr" rid="ref2">(Bojanowski et al.,
2016)</xref>
        .
      </p>
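      <p>Assuming the crawled corpus is available as a plain-text
file, such embeddings could be trained with the fastText Python
bindings roughly as follows (the file names are hypothetical):</p>
      <preformat>
import fasttext

# "recipes.txt" is a hypothetical plain-text dump of the crawled recipe
# corpus, one sentence per line.
model = fasttext.train_unsupervised("recipes.txt", model="skipgram", dim=100)
vector = model.get_word_vector("soffriggere")  # a 100-dimensional embedding
model.save_model("recipes.bin")
      </preformat>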
      <p>As a preliminary step, an autoencoder to embed
the predicates in a vector was built. The encoder
was made of two Bi-LSTM layers. The first one
was in charge of processing the token sequences
of each predicate. The second layer processed
the sequence of predicates and embedded it into
a vector called the predicates embedding. This
vector was then split into n parts, where n was the
maximum number of predicates. The decoder was
made of two Bi-LSTM layers, where the first layer
was in charge of decoding the sequence of
predicates and the second layer was in charge of
decoding the sequence of tokens for each predicate. To
test the autoencoder, a development set was
extracted from the training set. The autoencoder
was able to encode and decode 96.73% of the
predicates in the development set without changes.</p>
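      <p>The encoder half of such an architecture can be sketched
in PyTorch as follows (an illustrative reconstruction with
placeholder dimensions, not the participant's code; the decoder
would mirror it with two further Bi-LSTM layers):</p>
      <preformat>
import torch
import torch.nn as nn


class PredicatesEncoder(nn.Module):
    """Two-level Bi-LSTM encoder: token sequences are encoded per predicate,
    then the predicate sequence is encoded into a single vector (the
    'predicates embedding')."""

    def __init__(self, vocab_size, emb=64, hidden=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb)
        self.tok_lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.pred_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, predicates):
        # predicates: (batch, n_preds, n_tokens) tensor of token ids
        b, n, t = predicates.shape
        x = self.tok_emb(predicates.view(b * n, t))
        _, (h, _) = self.tok_lstm(x)                 # h: (2, b*n, hidden)
        pred_vecs = torch.cat([h[0], h[1]], dim=1).view(b, n, -1)
        _, (h, _) = self.pred_lstm(pred_vecs)        # encode the predicate sequence
        return torch.cat([h[0], h[1]], dim=1)        # (batch, 2*hidden)


encoder = PredicatesEncoder(vocab_size=100)
z = encoder(torch.randint(0, 100, (8, 3, 5)))  # 8 commands, 3 predicates, 5 tokens each
print(z.shape)  # torch.Size([8, 128])
      </preformat>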
      <p>The different possible actions were
represented as classes in a one-hot encoded vector, and for
each action a binary flag was used to
represent whether the action was implicit or not. The
predicates were encoded into a vector, using
the aforementioned encoder, and for each
predicate a flag was used to represent its alleged
implicitness.</p>
      <p>A multitask neural network was used to classify
the actions, to detect whether they were implicit,
and to predict the predicates. The network took
as input a recipe as a list of commands, each of
which was encoded by a Bi-LSTM layer. A
second Bi-LSTM layer processed the command
sequence and outputted a list of command
embeddings. Each embedding was split into n parts
which identified the actions included in the
command. Each of these actions was passed to four dense
layers that predicted the action class, the
implicitness of the action, and the predicates embedding.
Finally, the above-described decoder translated the
predicates embedding into actual predicates.</p>
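      <p>A minimal PyTorch sketch of the multitask heads described
above (names and dimensions are our own assumptions; the 17
action classes mirror the templates in Table 1):</p>
      <preformat>
import torch
import torch.nn as nn


class MultitaskHeads(nn.Module):
    """Given the slice of a command embedding corresponding to one action,
    jointly predict the action class, its implicitness flag and the
    predicates embedding to be decoded into actual predicates."""

    def __init__(self, slice_dim, n_actions, pred_emb_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(slice_dim, 128), nn.ReLU())
        self.action = nn.Linear(128, n_actions)       # one-hot action class
        self.implicit = nn.Linear(128, 1)             # binary implicitness flag
        self.pred_emb = nn.Linear(128, pred_emb_dim)  # fed to the decoder

    def forward(self, action_slice):
        h = self.shared(action_slice)
        return self.action(h), torch.sigmoid(self.implicit(h)), self.pred_emb(h)


# 17 action classes, mirroring the templates of Table 1 (dimensions are placeholders).
heads = MultitaskHeads(slice_dim=64, n_actions=17, pred_emb_dim=128)
action_logits, implicit_prob, pred_embedding = heads(torch.randn(4, 64))
      </preformat>
      </sec>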
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>With this task, we proposed a field of
application for spoken language understanding research
concerned with intent classification in a
domain-dependent system using a limited amount of
training data. The results show that further analysis
should be carried out to solve such semantic
recognition problems, starting with an analysis of the
errors that occurred in the participating systems and an
enlargement of the reference corpus, up to
finding a suitable pipeline for data processing,
including a rule-based module to model issues such as
argument implicitness, both in anaphoric- and
semantic-dependent situations. This task is
therefore intended as a first reflection, whose next
developments would include the creation of a
corpus for the English language and the introduction
of multimodality. As a matter of fact, pointing
gestures or mimed actions and movements, which
the interlocutor should be capable of
re-performing with actual tools and
ingredients, are multimodal activities that are of interest
for this field of application, as for any other
spoken language understanding task where a shared
context of interaction is expected.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank the EVALITA 2018 organisers and the
SUGAR participants for the interest expressed. A
special thanks also goes to Claudia Tortora, who
helped us collect recipes and annotate our training
set, and, last but not least, to the numerous testers
who had fun talking with our dear Bastian.</p>
      <p>This work is funded by the Italian PRIN
project Cultural Heritage Resources
Orienting Multimodal Experience (CHROME)
#B52F15000450001.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>James</given-names>
            <surname>Allen</surname>
          </string-name>
          , Mehdi Manshadi, Myroslava Dzikovska, and
          <string-name>
            <given-names>Mary</given-names>
            <surname>Swift</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Deep linguistic processing for spoken dialogue systems</article-title>
          .
          <source>In Proceedings of the Workshop on Deep Linguistic Processing</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Caselli</surname>
          </string-name>
          , Nicole Novielli, Viviana Patti, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Evalita 2018: Overview of the 6th evaluation campaign of natural language processing and speech tools for italian</article-title>
          .
          <source>In Tommaso Caselli</source>
          , Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ), Turin, Italy. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Justine</given-names>
            <surname>Cassell</surname>
          </string-name>
          , Joseph Sullivan,
          <string-name>
            <given-names>Elizabeth</given-names>
            <surname>Churchill</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Scott</given-names>
            <surname>Prevost</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Embodied conversational agents</article-title>
          . MIT press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Justine</given-names>
            <surname>Cassell</surname>
          </string-name>
          , Tim Bickmore, Lee Campbell, and
          <string-name>
            <given-names>Hannes</given-names>
            <surname>Vilhjálmsson</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Designing embodied conversational agents</article-title>
          .
          <source>Embodied conversational agents</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Sarah J</given-names>
            <surname>Darby</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Smart technology in the home: time for more clarity</article-title>
          .
          <source>Building Research &amp; Information</source>
          ,
          <volume>46</volume>
          (
          <issue>1</issue>
          ):
          <fpage>140</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Myroslava O Dzikovska</surname>
            , James F Allen, and
            <given-names>Mary D</given-names>
          </string-name>
          <string-name>
            <surname>Swift</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Integrating linguistic and domain knowledge for spoken dialogue systems in multiple Jonas Gehring</article-title>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Auli</surname>
          </string-name>
          , David Grangier,
          <string-name>
            <given-names>Denis</given-names>
            <surname>Yarats</surname>
          </string-name>
          , and
          <string-name>
            <surname>Yann N Dauphin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Convolutional sequence to sequence learning</article-title>
          .
          <source>arXiv preprint arXiv:1705</source>
          .
          <fpage>03122</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Vladimir I</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          .
          <year>1966</year>
          .
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>In Soviet physics doklady</source>
          , volume
          <volume>10</volume>
          , pages
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chien-Sheng</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pascale</given-names>
            <surname>Fung</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems</article-title>
          .
          <source>arXiv preprint arXiv:1804.08217</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Scott</given-names>
            <surname>Miller</surname>
          </string-name>
          , David Stallard,
          <string-name>
            <given-names>Robert</given-names>
            <surname>Bobrow</surname>
          </string-name>
          , and Richard Schwartz.
          <year>1996</year>
          .
          <article-title>A fully statistical approach to natural language interfaces</article-title>
          .
          <source>In Proceedings of the 34th annual meeting on Association for Computational Linguistics</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>61</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Ryan Lowe,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Henderson</surname>
          </string-name>
          , Laurent Charlin, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A survey of available corpora for building data-driven dialogue systems: The journal version</article-title>
          .
          <source>Dialogue &amp; Discourse</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Sainbayar</given-names>
            <surname>Sukhbaatar</surname>
          </string-name>
          , Jason Weston,
          <string-name>
            <given-names>Rob</given-names>
            <surname>Fergus</surname>
          </string-name>
          , et al.
          <year>2015</year>
          .
          <article-title>End-to-end memory networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>2440</fpage>
          -
          <lpage>2448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Gokhan</given-names>
            <surname>Tur</surname>
          </string-name>
          and
          <string-name>
            <given-names>Li</given-names>
            <surname>Deng</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Intent determination and spoken utterance classification</article-title>
          .
          <source>Spoken Language Understanding: Systems for Extracting Semantic Information from Speech</source>
          , pages
          <fpage>93</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Meire Fortunato, and
          <string-name>
            <given-names>Navdeep</given-names>
            <surname>Jaitly</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Pointer networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2692</fpage>
          -
          <lpage>2700</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Ye-Yi</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Strategies for statistical spoken language understanding with small amount of data - an empirical study</article-title>
          .
          <source>In Eleventh Annual Conference of the International Speech Communication Association.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Anna</given-names>
            <surname>Wierzbicka</surname>
          </string-name>
          .
          <year>1972</year>
          .
          <article-title>Semantic primitives</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Martina</given-names>
            <surname>Ziefle</surname>
          </string-name>
          and André Calero Valdez.
          <year>2017</year>
          .
          <article-title>Domestic robots for homecare: A technology acceptance perspective</article-title>
          .
          <source>In International Conference on Human Aspects of IT for the Aged Population</source>
          , pages
          <fpage>57</fpage>
          -
          <lpage>74</lpage>
          . Springer.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>