<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Perfect Recipe: Add SUGAR, Add Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Magnolini</string-name>
          <email>magnolini@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vevake Balaraman</string-name>
          <email>balaraman@fbk.eu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Guerini</string-name>
          <email>guerini@fbk.eu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <email>magnini@fbk.eu</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fondazione Bruno Kessler</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Via Sommarive</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trento - Italy</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AdeptMind Scholar</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present the FBK participation in the EVALITA 2018 Shared Task “SUGAR - Spoken Utterances Guiding Chef's Assistant Robots”. The task has two peculiar, and challenging, characteristics: first, the amount of available training data is very limited; second, training consists of pairs [audio-utterance, system-action], without any intermediate representation. Given these characteristics, we experimented with two different approaches: (i) designing and implementing a neural architecture that can use as little training data as possible, and (ii) using a state-of-the-art tagging system and then augmenting the initial training set with synthetically generated data. In the paper we present the two approaches and show the results obtained by their respective runs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>We present the FBK participation in the shared task “SUGAR – Spoken Utterances Guiding Chef's Assistant Robots” at EVALITA 2018. The task has two peculiar characteristics: first, the amount of training data is very limited; second, training consists of pairs [audio-utterance, system-action], without any intermediate representation. Given the characteristics of the task, we experimented with two different approaches: (i) the design and implementation of a neural architecture that can manage with as little training data as possible; (ii) the use of a state-of-the-art tagging system, augmented with synthetically generated data. In this contribution we present the two approaches and show the results obtained in their respective runs.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>In the last few years, voice-controlled systems have attracted great interest, both in research and in industrial projects, resulting in many applications such as virtual assistants and conversational agents. Voice-controlled systems make it possible to develop solutions for contexts where the user is busy and cannot operate a traditional graphical interface, for instance while driving a car or while cooking, as suggested by the SUGAR task.</p>
      <p>The traditional approach to Spoken Language
Understanding (SLU) is based on a pipeline that
combines several components:</p>
      <p>An automatic speech recognizer (ASR), which is in charge of converting the spoken user utterance into text.</p>
      <p>A Natural Language Understanding (NLU)
component, which takes as input the ASR
output and produces a set of instructions to be
used to operate on the system backend (e.g. a
knowledge base).</p>
      <p>A Dialogue Manager (DM), which selects the
appropriate state of the dialogue, based on the
context of previous interactions.</p>
      <p>A domain Knowledge Base (KB), which is
accessed in order to retrieve relevant
information for the user request.</p>
      <p>An utterance generation component, which
produces a text in natural language by taking
the dialogue state and the KB response.</p>
      <p>Finally, a text-to-speech (TTS) component is responsible for generating a spoken response to the user, based on the text produced by the utterance generation component.</p>
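      <p>As an illustration, the pipeline above can be sketched as a chain of components. All function names, return values and the example frame below are hypothetical placeholders, not taken from any specific toolkit.</p>

```python
# Illustrative sketch of the classical SLU pipeline described above.
# Every component body is a hypothetical stand-in for a real implementation.

def asr(audio: bytes) -> str:
    """Automatic speech recognition: audio -> transcription."""
    return "aggiungi lo zucchero"      # stand-in for a real ASR call

def nlu(text: str) -> dict:
    """Natural language understanding: text -> backend instructions."""
    return {"action": "add", "argument": "zucchero"}

def dialogue_manager(frame: dict, history: list) -> dict:
    """Select the next dialogue state from the parsed frame and context."""
    return {"state": "execute", "frame": frame}

def kb_lookup(state: dict) -> dict:
    """Retrieve relevant information from the domain knowledge base."""
    return {"tool": "bowl"}

def generate_utterance(state: dict, kb_response: dict) -> str:
    """Produce a natural-language response text."""
    return "Aggiungo lo zucchero."

def tts(text: str) -> bytes:
    """Text-to-speech: response text -> audio."""
    return text.encode("utf-8")        # stand-in for a real synthesizer

def slu_turn(audio: bytes, history: list) -> bytes:
    """One full turn through the pipeline: audio in, audio out."""
    text = asr(audio)
    frame = nlu(text)
    state = dialogue_manager(frame, history)
    kb_response = kb_lookup(state)
    response = generate_utterance(state, kb_response)
    return tts(response)
```

      <p>An end-to-end approach, by contrast, would replace this whole chain with a single trained model.</p>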
      <p>
        While the pipeline approach has proven to be very effective in a large range of task-oriented applications, in recent years several deep learning architectures have been explored, resulting in a strong push toward so-called end-to-end approaches
        <xref ref-type="bibr" rid="ref1 ref14 ref8">(Graves and Jaitly, 2014; Zeghidour
et al., 2018)</xref>
        . One of the main advantages of end-to-end approaches is that they avoid the independent training of the various components of the SLU pipeline, thereby reducing the need for human annotations and the risk of error propagation among components. However, despite their encouraging results, end-to-end approaches still need significant amounts of training data, which are often not available for the task at hand. This is the case for the SUGAR task: as training data are rather limited, end-to-end approaches are not directly applicable.
      </p>
      <p>Our contribution to the SUGAR task mainly focuses on the NLU component, since we make use of an ‘off the shelf’ ASR component. In particular, we experimented with two approaches: (i) the implementation of a neural NLU architecture that can use as little training data as possible (described in Section 4), and (ii) the use of a state-of-the-art neural tagging system, where the initial training data have been augmented with synthetically generated data (described in Sections 5 and 6).</p>
    </sec>
    <sec id="sec-3">
      <title>2 Task and Data description</title>
      <p>
        In the SUGAR task
        <xref ref-type="bibr" rid="ref6">(Maro et al., 2018)</xref>
        the system's goal is to understand a set of commands in the context of a voice-controlled robotic agent that acts as a cooking assistant. In this scenario the user cannot interact through a "classical" interface because he/she is supposed to be cooking. The training data set is a corpus of annotated utterances; spoken sentences are annotated only with the appropriate command for the robot. Transcriptions from speech to text are not available.
      </p>
      <p>The corpus was collected in a 3D virtual environment, designed as a real kitchen, where users give commands to the robot assistant to accomplish some recipes. During data collection users are inspired by silent cooking videos, which should ensure a more natural spoken production. Videos are segmented into short portions (frames) that contain a single action and are sequentially shown to users, who have to utter a single sentence after each frame. The user's goal is to guide the robot to accomplish the same action seen in the frame. The resulting dataset is a list of utterances describing the actions needed to prepare three different recipes. While the utterances are totally free, the commands are selected from a finite set of possible actions, which may refer either to ingredients or to tools. Audio files were recorded in a real acoustic environment, with a microphone placed at about 1 m from the different speakers. The final corpus contains the audio files for the three recipes, grouped by speaker and segmented into sentences representing isolated commands (although a few audio files may contain multiple actions, e.g. "add while mixing").</p>
    </sec>
    <sec id="sec-4">
      <title>3 Data Pre-processing</title>
      <p>The SUGAR dataset consists of a collection of audio files that need to be pre-processed in several ways. The first step is ASR, i.e., transcription from audio to text. For this step we made use of an external ASR, selected among the ones easily available with a Python implementation. We used the Google API, based on a comparative study of different ASRs (Këpuska and Bohouta, 2017); we conducted some sample tests to make sure that the ASR ranking is reasonable for Italian as well, and we confirmed our choice.</p>
      <p>After this step, we split the dataset into a training set, a development set and a test set; in fact the SUGAR corpus is a single collection with no predefined train-dev-test split. The standard train-dev-test split consists of two rounds of 80-20 splits (80% of the dataset forms the training and development sets, which are split 80-20 again, and 20% forms the test set), but in the SUGAR task we split the dataset in a more elaborate way. The dataset is composed of only three different recipes (i.e. a small set of ingredients and similar sequences of operations), so with a classical 80-20 split the training, development and test sets would have been too similar to one another and too different from the final set (the one used to evaluate the system), which is composed of new recipes, with new ingredients and a new sequence of operations. To deal with this peculiarity, we decided to use the first recipe as test set and the other two as train-dev sets. The final split of the data resulted in 1142 utterance-command pairs for training, 291 pairs for development and 286 pairs for test.</p>
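      <p>The recipe-based split described above can be sketched as follows; the recipe identifiers and pair counts are illustrative, not the actual corpus content.</p>

```python
# Sketch of the recipe-based split: the first recipe is held out as the
# test set, the remaining two form the train-dev pool, which is then
# split 80-20. Recipe ids and toy pairs are hypothetical.
import random

def split_by_recipe(pairs):
    """pairs: list of (recipe_id, utterance, command) tuples."""
    test = [p for p in pairs if p[0] == "recipe1"]
    train_dev = [p for p in pairs if p[0] != "recipe1"]
    random.shuffle(train_dev)
    cut = int(0.8 * len(train_dev))        # 80% train, 20% dev
    return train_dev[:cut], train_dev[cut:], test

random.seed(0)
pairs = [("recipe%d" % (i % 3 + 1), "utt", "cmd") for i in range(30)]
train, dev, test = split_by_recipe(pairs)
```

      <p>With thirty toy pairs spread evenly over three recipes, this yields sixteen training, four development and ten test pairs.</p>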
      <p>Finally, we substituted all the elided forms with an apostrophe in the corpus (e.g. "d'", "l'", "un'") with their corresponding full forms (e.g. "di", "lo", "una"). This substitution helps the classifiers to correctly tokenize the utterances.</p>
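      <p>This normalization step can be sketched with a simple substitution table; the mapping below covers only the forms mentioned in the text and is an illustrative assumption, not the full list used by the authors.</p>

```python
# Sketch of the apostrophe normalization: elided forms are mapped back to
# full forms so that whitespace tokenization splits them correctly.
import re

FULL_FORMS = {"d'": "di ", "l'": "lo ", "un'": "una "}  # illustrative subset

def normalize_apostrophes(utterance: str) -> str:
    pattern = re.compile("|".join(re.escape(k) for k in FULL_FORMS))
    return pattern.sub(lambda m: FULL_FORMS[m.group(0)], utterance)

print(normalize_apostrophes("verso l'uovo nella padella"))
# -> verso lo uovo nella padella
```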
      <p>In order to take advantage of the dialogue structure in the dataset, we added to every line of the corpus up to three previous interactions. Such previous interactions are supposed to be useful for correctly labeling a sample, because an ingredient or a verb can appear in a previous utterance while being implied in the current one. The implication is formalized in the dataset: implied entities (actions or arguments) are explicitly marked. The decision to keep a "conversation history" of at most three utterances comes from the first formalization of the task, in which the maximum history for every utterance was set to three previous interactions. Even though this constraint was relaxed in the final version of the task, we kept it in our system. In addition, a sample test on the data confirms the intuition that a history of three utterances is usually enough to understand a new utterance. For the sake of clarity, we report below a line of the pre-processed dataset:</p>
      <p>un filo di olio nella padella # e poi verso lo uovo
nella padella # gira la frittata # togli la frittata dal
fuoco</p>
      <p>where the first three utterances are the history in reverse order, and the last one is the current utterance.</p>
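      <p>Building such a line can be sketched as below. The "#" separator and the three-turn window come from the example above; joining the window in the order shown there is an assumption about the exact concatenation used.</p>

```python
# Sketch of the history-augmentation step: each training line carries up
# to `max_history` previous utterances plus the current one, "#"-separated.
def with_history(utterances, index, max_history=3):
    """Prefix utterances[index] with up to max_history previous turns."""
    start = max(0, index - max_history)
    return " # ".join(utterances[start:index + 1])

turns = [
    "un filo di olio nella padella",
    "e poi verso lo uovo nella padella",
    "gira la frittata",
    "togli la frittata dal fuoco",
]
print(with_history(turns, 3))
```

      <p>For the four turns above this reproduces exactly the pre-processed line shown in the text.</p>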
    </sec>
    <sec id="sec-5">
      <title>System 1: Memory + Pointer Networks</title>
      <p>
        The first system presented by FBK is based on a
neural model similar to the architecture proposed
by
        <xref ref-type="bibr" rid="ref4">(Madotto et al., 2018)</xref>
        , which implements an encoder-decoder approach. The encoder consists of a Gated Recurrent Unit (GRU)
        <xref ref-type="bibr" rid="ref1">(Cho et al.,
2014)</xref>
        that encodes the user sentence into a latent representation. The decoder consists of a combination of i) a MemNN that generates tokens from the output vocabulary, and ii) a pointer network
        <xref ref-type="bibr" rid="ref11">(Vinyals et al., 2015)</xref>
        that chooses which tokens from the input are to be copied to the output.
      </p>
      <sec id="sec-5-1">
        <title>4.1 Encoder</title>
        <p>Each word in the input sentence x from the user is represented in a high-dimensional space by using an embedding matrix A. These representations are encoded by a Gated Recurrent Unit. The GRU takes in the current word at time t and the previous hidden state of the encoder to yield the representation at time t. Formally,</p>
        <p>h_t = GRU(h_{t-1}, x_t)</p>
        <p>where x_t is the current word at time t and h_{t-1} is the previous hidden state of the network. The final hidden state of the network is then passed on to the decoder.</p>
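      <p>A minimal numpy sketch of this recurrence is given below. The weights are random placeholders, the dimensions are arbitrary, and the gate equations follow one common GRU formulation; this is not the authors' trained model.</p>

```python
# Minimal sketch of the encoder recurrence h_t = GRU(h_{t-1}, x_t).
# Weights are random placeholders; a real system learns them.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(h_prev, x, W):
    """One GRU update with update gate z, reset gate r, candidate h~."""
    z = sigmoid(W["Wz"] @ x + W["Uz"] @ h_prev)
    r = sigmoid(W["Wr"] @ x + W["Ur"] @ h_prev)
    h_tilde = np.tanh(W["Wh"] @ x + W["Uh"] @ (r * h_prev))
    return (1 - z) * h_prev + z * h_tilde

def encode(embeddings, hidden_size, rng):
    """Run the GRU over the embedded sentence; return the final state."""
    emb_dim = embeddings.shape[1]
    W = {k: rng.standard_normal((hidden_size, d)) * 0.1
         for k, d in [("Wz", emb_dim), ("Wr", emb_dim), ("Wh", emb_dim),
                      ("Uz", hidden_size), ("Ur", hidden_size),
                      ("Uh", hidden_size)]}
    h = np.zeros(hidden_size)
    for x in embeddings:          # one embedded word per time step
        h = gru_step(h, x, W)
    return h                      # final hidden state, passed to the decoder

rng = np.random.default_rng(0)
sentence = rng.standard_normal((5, 8))    # 5 words, 8-dim embeddings
h_final = encode(sentence, hidden_size=4, rng=rng)
```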
      </sec>
      <sec id="sec-5-2">
        <title>Decoder</title>
        <p>
          The input sentences, denoted by x_1, x_2, …, x_n, are represented as memories r_1, r_2, …, r_n by using an embedding matrix R. A query h_t at time t is generated using a Gated Recurrent Unit (GRU)
          <xref ref-type="bibr" rid="ref1">(Cho
et al., 2014)</xref>
          , which takes as input the previously generated output word ŷ_{t-1} and the previous query h_{t-1}. Formally:
        </p>
        <p>h_t = GRU(ŷ_{t-1}, h_{t-1})</p>
        <p>The initial query h_0 is the final output vector o produced by the encoder. The query h is then used as the reading head over the memories. At each time step t, the model generates two probabilities, namely P_vocab and P_ptr. P_vocab denotes the probability over all the words in the vocabulary and is defined as follows:</p>
        <p>P_vocab(ŷ_t) = Softmax(W h_t)</p>
        <p>where W is a parameter learned during training. The probability over the input words is denoted by P_ptr and is calculated using the attention weights of the MemNN network. Formally:</p>
        <p>P_ptr(ŷ_t) = a_t, with a_{t,i} = Softmax(h_t^T r_i)</p>
        <p>
          By generating two probabilities, P_vocab and P_ptr, the model learns both how to generate words from the output vocabulary and how to copy words from the input sequence. Though it is possible to learn a gating function to combine the distributions, as in
          <xref ref-type="bibr" rid="ref7">(Merity et al., 2016)</xref>
          , this model uses a hard gate to combine them. A sentinel token $ is added to the input sequence during training, and the pointer network is trained to maximize the P_ptr probability of the sentinel for tokens that should be generated from the output vocabulary. If the sentinel token is chosen by P_ptr, the model switches to P_vocab to generate a token; otherwise, the input token specified by P_ptr is chosen as the output token. Though the MemNN can be modelled with n hops, the nature of the SUGAR task and several experiments we carried out showed that adding more hops is not useful. As a consequence, the model is implemented with a single hop, as explained above.
        </p>
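        <p>The hard gate can be sketched as follows. The attention scores, tokens and output vocabulary below are toy values chosen for illustration, not the trained model's distributions.</p>

```python
# Sketch of the hard gate: P_ptr attends over the input tokens plus a
# sentinel "$"; if the sentinel wins, decoding falls back to P_vocab.
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def decode_token(input_tokens, ptr_scores, vocab, vocab_scores):
    """ptr_scores: one attention score per input token, plus the sentinel."""
    p_ptr = softmax(np.asarray(ptr_scores, dtype=float))
    best = int(np.argmax(p_ptr))
    if best == len(input_tokens):          # sentinel "$" chosen
        p_vocab = softmax(np.asarray(vocab_scores, dtype=float))
        return vocab[int(np.argmax(p_vocab))]
    return input_tokens[best]              # copy from the input

tokens = ["aggiungi", "lo", "zucchero"]
vocab = ["ADD", "MIX", "TAKE"]
copied = decode_token(tokens, [0.1, 0.2, 3.0, 0.5], vocab, [2.0, 0.1, 0.1])
generated = decode_token(tokens, [0.1, 0.2, 0.3, 3.0], vocab, [2.0, 0.1, 0.1])
# copied -> "zucchero" (input position wins); generated -> "ADD" (sentinel wins)
```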
        <p>We use the pre-trained embeddings from (Bojanowski et al., 2016) to train the model.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 System 2: Fairseq</title>
      <p>The second system experimented with by FBK is based on the work in (Gehring et al., 2017). In particular, we make use of the Python implementation of the toolkit known as Fairseq(-py) (https://github.com/pytorch/fairseq). The toolkit is implemented using PyTorch, and provides reference implementations of various sequence-to-sequence models, with configurations for several tasks, including translation, language modeling and story generation. In our experiments we use the toolkit as a black box, since our goal is to obtain a dataset that could be used with this system; hence, we use the generic model (not designed for any specific task) without fine-tuning. Moreover, we do not add any specific feature or tuning for the implicit arguments (the marked ones), but let the system learn the rule by itself.</p>
      <p>
        A common approach in sequence learning is to encode the input sequence with a series of bi-directional recurrent neural networks (RNNs), e.g. Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks, and to generate a variable-length output with another set of decoder RNNs, not necessarily of the same type, the two interfacing via an attention mechanism
        <xref ref-type="bibr" rid="ref1 ref3">(Bahdanau et al., 2014; Luong et al., 2015)</xref>
        .
      </p>
      <p>Convolutional networks, on the other hand, create representations for fixed-size contexts, which can be seen as a disadvantage compared to RNNs. However, the context size of a convolutional network can be expanded by stacking layers on top of each other, which makes it possible to control the maximum length of the dependencies to be modeled. Furthermore, convolutional networks allow parallelization over the elements of a sequence, because they do not depend on the computations of the previous time step. This contrasts with RNNs, which maintain a hidden state of the entire past and therefore prevent parallel computation within a sequence. Parallelization can dramatically reduce the training time of the system without reducing performance, as shown in (Gehring et al., 2017).</p>
      <p>The weak point of the system is that it needs a substantial amount of training data to build reasonable models. In fact, Fairseq(-py) trained on the SUGAR dataset alone cannot converge: it gets stuck after some epochs, producing pseudo-random sequences. Given the small size of the SUGAR training set, combined with its low variability (the training data consist of possible variations of only two recipes), it is impossible for the system to learn the correct structure of the commands (e.g. balancing the parentheses) or to generalize over arguments. In order to use this system effectively, we expanded the SUGAR dataset with the data augmentation techniques presented in Section 6.</p>
    </sec>
    <sec id="sec-7">
      <title>6 Data augmentation</title>
      <p>
        Overfitting is still an open issue in neural
models, especially in situations of data sparsity. In the
realm of NLP, regularization methods are typically
applied to the network
        <xref ref-type="bibr" rid="ref10">(Srivastava et al., 2014; Le
et al., 2015)</xref>
        , rather than to the training data.
      </p>
      <p>However, in some application fields, data augmentation has proven to be fundamental in improving the performance of neural models when facing insufficient data. The first fields to explore data augmentation techniques were computer vision and speech recognition, where well-established techniques for synthesizing data now exist: in the former, rescaling or affine distortions (LeCun et al., 1998; Krizhevsky et al., 2012); in the latter, adding background noise or applying small time shifts (Deng et al., 2000; Hannun et al., 2014).</p>
      <p>
        In the realm of NLP tasks, data
augmentation has received little attention so far, some
notable exceptions being feature noising
        <xref ref-type="bibr" rid="ref12">(Wang et
al., 2013)</xref>
        or Kneser-Ney smoothing
        <xref ref-type="bibr" rid="ref13">(Xie et al.,
2017)</xref>
          . Additionally, negative example generation has been used in (Guerini et al., 2018).
      </p>
      <p>In this paper we build upon the ideas of the aforementioned papers by moving a step forward and taking advantage of the structured nature of the SUGAR task and of some domain/linguistic knowledge. In particular, we used the following substitution methods to expand the vocabulary and the size of the training data:</p>
      <p>Most-similar token substitution: based on a similarity mechanism (i.e. embeddings).</p>
      <p>Synonym token substitution: synonymy relations taken from an online dictionary and applied to specific tokens.</p>
      <p>Entity substitution: replace entities in the examples with random entities of the same type taken from available gazetteers.</p>
      <p>
        The first approach consists of substituting a token from a training example with one of its five most similar tokens (chosen at random), found through cosine similarity in the embedding space described in
        <xref ref-type="bibr" rid="ref8">(Pennington et al., 2014)</xref>
        . We use the top five candidates in order to add variability, since many tokens appear multiple times in the training data. If the token also appeared as an argument in the command, it was substituted there as well, while if it appeared as an action it was left unchanged. This approach was applied with a probability of 30% to each token of the utterances in the training data.
      </p>
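      <p>This first augmentation method can be sketched as below. The tiny embedding table is a toy stand-in for pre-trained vectors, and the vocabulary is illustrative; only the mechanism (cosine similarity, top-five neighbours, 30% substitution probability) follows the text.</p>

```python
# Sketch of most-similar token substitution: each in-vocabulary token is
# replaced, with 30% probability, by one of its five nearest neighbours
# in a toy embedding space standing in for pre-trained vectors.
import random
import numpy as np

EMB = {w: np.asarray(v, dtype=float) for w, v in {
    "padella": [1.0, 0.1], "pentola": [0.9, 0.2], "teglia": [0.8, 0.3],
    "uovo": [0.0, 1.0], "gira": [0.5, 0.5],
}.items()}

def nearest(word, k=5):
    """The k most cosine-similar words, excluding the word itself."""
    target = EMB[word] / np.linalg.norm(EMB[word])
    scored = sorted(
        ((float(np.dot(target, v / np.linalg.norm(v))), w)
         for w, v in EMB.items() if w != word),
        reverse=True)
    return [w for _, w in scored[:k]]

def augment(tokens, prob=0.3, rng=random):
    """Substitute each known token with a random near neighbour."""
    return [rng.choice(nearest(t)) if t in EMB and rng.random() < prob else t
            for t in tokens]

random.seed(0)
variant = augment("gira lo uovo nella padella".split())
```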
      <p>
        The second approach was applied to verbs recognized in the training utterances using the TextPro PoS tagger
        <xref ref-type="bibr" rid="ref9">(Pianta et al., 2008)</xref>
        . Such verbs were substituted with one possible synonym taken from an electronic dictionary (http://www.sinonimi-contrari.it/). Also in this case, the action in the command was kept the same (in fact the verbs present in the utterance are usually paired with the action in the command). The third approach was used to substitute ingredients in the text with other random ingredients from a list of foods
        <xref ref-type="bibr" rid="ref5">(Magnini et al., 2018)</xref>
        . In this case the ingredient was modified accordingly in the annotation of the sentence as well.
      </p>
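      <p>The third method, entity substitution, can be sketched as follows. The gazetteer and the command format are hypothetical placeholders; the point is that the substitution is applied consistently to both the utterance and its annotation.</p>

```python
# Sketch of entity substitution: an ingredient in the utterance is replaced
# by a random ingredient from a gazetteer, and the command annotation is
# updated consistently. Gazetteer and command format are illustrative.
import random

INGREDIENTS = ["zucchero", "farina", "burro", "latte"]   # toy gazetteer

def substitute_ingredient(utterance, command, old, rng=random):
    """Replace `old` with a different random ingredient in both fields."""
    new = rng.choice([i for i in INGREDIENTS if i != old])
    return utterance.replace(old, new), command.replace(old, new)

random.seed(1)
utt, cmd = substitute_ingredient(
    "aggiungi lo zucchero", "ADD(zucchero)", "zucchero")
```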
      <p>These methodologies make it possible to generate several variants starting from a single sentence. While the first approach was used in isolation, the second and the third were used together to generate additional artificial training data. In this way we obtained two different datasets: the first is composed of 45,680 pairs of utterances and commands (most-similar token substitution applied forty times per example, 1,142 × 40); the second contains 500,916 pairs (each original sentence had each verb replaced at least 3 times and, for each of these variants, ingredients were randomly substituted twice). The high number of variants is due to the inclusion of the history of three previous utterances in the process.</p>
    </sec>
    <sec id="sec-8">
      <title>7 Results</title>
      <p>Results of the two approaches are reported in Table 1. Both approaches obtain a higher accuracy in recognizing actions than in recognizing arguments. Fairseq trained with augmented data is the top performer of the task, outperforming the other approaches by more than 10% accuracy on arguments. The ablation test on Memory + Pointer Networks also shows the importance of data augmentation for low-resource tasks, and in particular of fine-tuning the classifier with the new data.</p>
    </sec>
    <sec id="sec-9">
      <title>8 Conclusion and Future Work</title>
      <p>We presented the FBK participation in the EVALITA 2018 Shared Task “SUGAR – Spoken Utterances Guiding Chef's Assistant Robots”. Given the characteristics of the task, we experimented with two different approaches: (i) a neural architecture based on memory and pointer networks, which can use as little training data as possible, and (ii) a state-of-the-art tagging system, Fairseq, trained with several augmentation techniques to expand the initial training set with synthetically generated data. The second approach seems promising, and in the future we want to investigate more deeply the effect of the different data augmentation techniques on performance.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the AdeptMind scholarship.</p>
      <p>Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.</p>
      <p>Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.</p>
      <p>Li Deng, Alex Acero, Mike Plumpe, and Xuedong Huang. 2000. Large-vocabulary speech recognition under adverse acoustic environments. In Sixth International Conference on Spoken Language Processing.</p>
      <p>Jonas Gehring, Michael Auli, David Grangier, Denis
Yarats, and Yann N Dauphin. 2017.
Convolutional sequence to sequence learning. arXiv preprint
arXiv:1705.03122.</p>
      <p>Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.</p>
      <p>Marco Guerini, Simone Magnolini, Vevake Balaraman, and Bernardo Magnini. 2018. Toward zero-shot entity recognition in task-oriented conversational agents. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 317–326, Melbourne, Australia, July.</p>
      <p>Awni Hannun, Carl Case, Jared Casper, Bryan
Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger,
Sanjeev Satheesh, Shubho Sengupta, Adam Coates,
et al. 2014. Deep speech: Scaling up
end-to-end speech recognition. arXiv preprint
arXiv:1412.5567.</p>
      <p>Veton Këpuska and Gamal Bohouta. 2017. Comparing
speech recognition systems (microsoft api, google
api and cmu sphinx). Journal of Engineering
Research and Application, 7(3):20–24.</p>
      <p>Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.</p>
      <p>Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.</p>
      <p>Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick
Haffner. 1998. Gradient-based learning applied to
document recognition. Proceedings of the IEEE,
86(11):2278–2324.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Minh-Thang</given-names>
            <surname>Luong</surname>
          </string-name>
          , Hieu Pham, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1508.04025</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chien-Sheng</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pascale</given-names>
            <surname>Fung</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems</article-title>
          .
          <source>arXiv preprint arXiv:1804.08217</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          , Vevake Balaraman, Mauro Dragoni, Marco Guerini, Simone Magnolini, and
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Piccioni</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Ch1: A conversational system to calculate carbohydrates in a meal</article-title>
          .
          <source>In Proceedings of the 17th International Conference of the Italian Association for Artificial Intelligence (AI*IA</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Maria</given-names>
            <surname>Di Maro</surname>
          </string-name>
          , Antonio Origlia, and
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Cutugno</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 Spoken Utterances Guiding Chef's Assistant Robots (SUGAR) Task</article-title>
          . In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18)</source>
          , Turin, Italy. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Merity</surname>
          </string-name>
          , Caiming Xiong, James Bradbury, and Richard Socher.
          <year>2016</year>
          .
          <article-title>Pointer sentinel mixture models</article-title>
          .
          <source>arXiv preprint arXiv:1609.07843</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Emanuele</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Girardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Zanoli</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The TextPro tool suite</article-title>
          .
          In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors,
          <source>Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)</source>
          , Marrakech, Morocco, May. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Meire Fortunato, and
          <string-name>
            <given-names>Navdeep</given-names>
            <surname>Jaitly</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Pointer networks</article-title>
          . In C. Cortes,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          , and R. Garnett, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          , pages
          <fpage>2692</fpage>
          -
          <lpage>2700</lpage>
          . Curran Associates, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Sida</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mengqiu</given-names>
            <surname>Wang</surname>
          </string-name>
          , Stefan Wager, Percy Liang, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Feature noising for log-linear structured prediction</article-title>
          .
          In
          <source>Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1170</fpage>
          -
          <lpage>1179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Ziang</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sida I</given-names>
            <surname>Wang</surname>
          </string-name>
          , Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng.
          <year>2017</year>
          .
          <article-title>Data noising as smoothing in neural network language models</article-title>
          .
          <source>arXiv preprint arXiv:1703.02573</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Neil</given-names>
            <surname>Zeghidour</surname>
          </string-name>
          , Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, and
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dupoux</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>End-to-end speech recognition from the raw waveform</article-title>
          .
          In
          <source>Interspeech 2018</source>
          , September.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>