The Perfect Recipe: Add SUGAR, Add Data

Simone Magnolini (1,2), Vevake Balaraman (1,3), Marco Guerini (1), Bernardo Magnini (1)
(1) Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
(2) AdeptMind Scholar
(3) University of Trento, Italy
{magnolini, balaraman, guerini, magnini}@fbk.eu

Abstract

English. We present the FBK participation in the EVALITA 2018 Shared Task "SUGAR – Spoken Utterances Guiding Chef's Assistant Robots". The task has two peculiar, and challenging, characteristics: first, the amount of available training data is very limited; second, training consists of pairs [audio-utterance, system-action], without any intermediate representation. Given these characteristics, we experimented with two different approaches: (i) designing and implementing a neural architecture that can use as little training data as possible, and (ii) using a state-of-the-art tagging system and augmenting the initial training set with synthetically generated data. In the paper we present the two approaches and show the results obtained by their respective runs.

Italiano. We present the FBK participation in the shared task "SUGAR – Spoken Utterances Guiding Chef's Assistant Robots" at EVALITA 2018. The task has two peculiar characteristics: first, the amount of training data is very limited; second, training consists of pairs [audio-utterance, system-action], without any intermediate representation. Given the characteristics of the task, we experimented with two different approaches: (i) the design and implementation of a neural architecture that can use as little training data as possible; (ii) the use of a state-of-the-art tagging system, augmented with synthetically generated data. In this contribution we present the two approaches and show the results obtained in their respective runs.

1 Introduction

In the last few years, voice-controlled systems have attracted great interest, both in research and in industrial projects, resulting in many applications such as Virtual Assistants and Conversational Agents. Voice control makes it possible to develop solutions for contexts where the user is busy and cannot operate a traditional graphical interface, for instance while driving a car or while cooking, as in the SUGAR task.

The traditional approach to Spoken Language Understanding (SLU) is based on a pipeline that combines several components:

• An automatic speech recognizer (ASR), which converts the spoken user utterance into text.

• A Natural Language Understanding (NLU) component, which takes the ASR output as input and produces a set of instructions to be used to operate on the system backend (e.g. a knowledge base).

• A Dialogue Manager (DM), which selects the appropriate state of the dialogue, based on the context of previous interactions.

• A domain Knowledge Base (KB), which is accessed in order to retrieve information relevant to the user request.

• An utterance generation component, which produces a text in natural language from the dialogue state and the KB response.

• Finally, a text-to-speech (TTS) component, which generates a spoken response to the user on the basis of the text produced by the utterance generation component.

While the pipeline approach has proven very effective in a large range of task-oriented applications, in the last years several deep learning architectures have been experimented with, resulting in a strong push toward so-called end-to-end approaches (Graves and Jaitly, 2014; Zeghidour et al., 2018). One of the main advantages of end-to-end approaches is that they avoid the independent training of the various components of the SLU pipeline, thus reducing both the need for human annotations and the risk of error propagation among components. However, despite their encouraging results, end-to-end approaches still need a significant amount of training data, which is often not available for the task at hand. This is also the case in the SUGAR task: since training data are rather limited, end-to-end approaches are not directly applicable.

Our contribution to the SUGAR task focuses mainly on the NLU component, since we make use of an off-the-shelf ASR component. In particular, we experimented with two approaches: (i) the implementation of a neural NLU architecture that can use as little training data as possible (described in Section 4), and (ii) the use of a state-of-the-art neural tagging system whose initial training data have been augmented with synthetically generated data (described in Sections 5 and 6).

2 Task and Data description

In the SUGAR task (Di Maro et al., 2018) the system's goal is to understand a set of commands in the context of a voice-controlled robotic agent that acts as a cooking assistant. In this scenario the user cannot interact through a "classical" interface, because he/she is supposed to be cooking. The training data set is a corpus of annotated utterances; spoken sentences are annotated only with the appropriate command for the robot. Transcriptions from speech to text are not available.

The corpus was collected in a 3D virtual environment, designed as a real kitchen, where users give commands to the robot assistant to accomplish some recipes. During data collection users are inspired by silent cooking videos, which should ensure a more natural spoken production. Videos are segmented into short portions (frames) that contain a single action and are sequentially shown to users, who have to utter a single sentence after each frame. The user's goal is to guide the robot to accomplish the same action seen in the frame. The resulting dataset is a list of utterances describing the actions needed to prepare three different recipes. While utterances are totally free, the commands are selected from a finite set of possible actions, which may refer either to ingredients or to tools. Audio files were recorded in a real acoustic environment, with a microphone placed at about 1 m from the different speakers. The final corpus contains audio files for the three recipes, grouped by speaker and segmented into sentences representing isolated commands (although a few audio files may contain multiple actions, e.g. "add while mixing").

3 Data Pre-processing

The SUGAR dataset consists of a collection of audio files that needs to be pre-processed in several ways. The first step is ASR, i.e., transcription from audio to text. For this step we made use of an external ASR, selected among the ones easily available with a Python implementation. We chose the Google API, based on a comparative study of different ASR systems (Këpuska and Bohouta, 2017); we conducted some sample tests to make sure that the ASR ranking is reasonable for Italian as well, and we confirmed our choice.

After this step, we split the dataset into training, development and test sets; in fact, the SUGAR corpus is a single collection and comes with no train-dev-test split. Although two rounds of 80-20 splits are quite standard (80% of the dataset forms the training and development set, which is split 80-20 again, while 20% forms the test set), in the SUGAR task we split the dataset in a more complex way. The dataset is composed of only three different recipes (i.e. a small set of ingredients and similar sequences of operations), and with a classical 80-20 split the training, development and test sets would have been too different from the final set (the one used to evaluate the system), which is composed of new recipes, with new ingredients and new sequences of operations. To deal with this peculiar characteristic, we decided to use the first recipe as test set and the other two as train-dev sets. The final split of the data resulted in 1142 utterance-command pairs for training, 291 pairs for development and 286 pairs for test.

Finally, we substituted all the prepositions with an apostrophe in the corpus (e.g. "d'", "l'", "un'") with their corresponding form without apostrophe (e.g. "di", "lo", "una"). This substitution helps the classifiers to correctly tokenize the utterances.

In order to take advantage of the structure of the dialogue in the dataset, we added up to three previous interactions to every line of the corpus. Such previous interactions are supposed to be useful to correctly label a sample, because either an ingredient or a verb can appear in a previous utterance while being implied in the current one. The implication is formalized in the dataset: implied entities (actions or arguments) are surrounded by ∗.
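The history construction just described can be sketched as follows (a minimal sketch: the "#" separator and the most-recent-first ordering follow the dataset format described in this section, while the function name and the list-based interface are illustrative):

```python
def add_history(utterances, max_history=3, sep=" # "):
    """For each utterance in a recipe, prepend up to `max_history`
    previous utterances (most recent first), joined by `sep`."""
    samples = []
    for i, current in enumerate(utterances):
        # take up to max_history previous utterances, most recent first
        history = utterances[max(0, i - max_history):i][::-1]
        samples.append(sep.join(history + [current]))
    return samples

# Toy example with four consecutive commands from one recipe
utts = ["un filo di olio nella padella",
        "e poi verso lo uovo nella padella",
        "gira la frittata",
        "togli la frittata dal fuoco"]
print(add_history(utts)[-1])
```

Utterances at the beginning of a recipe simply get a shorter history, so the first line of each recipe is left unchanged.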
The decision to keep a "conversation history" of at most three utterances derives from a first formalization of the task, in which the maximum history for every utterance was set to three previous interactions. Even though this constraint was relaxed in the final version of the task, we kept it in our system. In addition, a sample test on the data confirms the intuition that a history of three utterances is usually enough to understand a new utterance. For the sake of clarity, we report below a line of the pre-processed dataset:

un filo di olio nella padella # e poi verso lo uovo nella padella # gira la frittata # togli la frittata dal fuoco

where the first three utterances are the history in reverse order, and the last one is the current utterance.

4 System 1: Memory + Pointer Networks

The first system presented by FBK is based on a neural model similar to the architecture proposed by (Madotto et al., 2018), which implements an encoder-decoder approach. The encoder consists of a Gated Recurrent Unit (GRU) (Cho et al., 2014) that encodes the user sentence into a latent representation. The decoder consists of a combination of (i) a MemNN that generates tokens from the output vocabulary, and (ii) a Pointer network (Vinyals et al., 2015) that chooses which token from the input is to be copied to the output.

4.1 Encoder

Each word of the input sentence x from the user is represented in a high-dimensional space using an embedding matrix A. These representations are encoded by a Gated Recurrent Unit. The GRU takes in the current word at time t and the previous hidden state of the encoder to yield the representation at time t. Formally,

h_t = GRU(h_{t-1}, x_t)

where x_t is the current word at time t and h_{t-1} is the previous hidden state of the network. The final hidden state of the network is then passed on to the decoder.

4.2 Decoder

The input sentences, denoted by x_1, x_2, ..., x_n, are represented as memories r_1, r_2, ..., r_n using an embedding matrix R. A query h_t at time t is generated by a Gated Recurrent Unit (GRU) (Cho et al., 2014) that takes as input the previously generated output word y^_{t-1} and the previous query h_{t-1}. Formally:

h_t = GRU(y^_{t-1}, h_{t-1})

The initial query h_0 is the final output vector o produced by the encoder. The query h_t is then used as the reading head over the memories. At each time step t, the model generates two probabilities, namely P_vocab and P_ptr. P_vocab denotes the probability over all the words in the vocabulary and is defined as follows:

P_vocab(y^_t) = Softmax(W h_t)

where W is a parameter learned during training. The probability over the input words is denoted by P_ptr and is calculated using the attention weights of the MemNN network. Formally:

P_ptr(y^_t) = a_t        a_{t,i} = Softmax(h_t^T r_i)

By generating the two probabilities P_vocab and P_ptr, the model learns both how to generate words from the output vocabulary and how to copy words from the input sequence. Though it is possible to learn a gating function to combine the two distributions, as in (Merity et al., 2016), this model uses a hard gate. A sentinel token $ is appended to the input sequence during training, and the pointer network is trained to maximize the P_ptr probability of the sentinel for tokens that should be generated from the output vocabulary. If the sentinel token is chosen by P_ptr, the model switches to P_vocab to generate a token; otherwise the input token selected by P_ptr is emitted as the output token. Though the MemNN can be modelled with n hops, the nature of the SUGAR task and several experiments we carried out showed that adding more hops is not useful; as a consequence, the model is implemented with a single hop, as explained above. We use the pre-trained embeddings from (Bojanowski et al., 2016) to train the model.
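A single decoding step with the hard gate described above can be sketched as follows (a minimal numpy sketch: the sentinel convention follows the paper, while the greedy argmax decoding, the array shapes and the function names are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(h_t, memories, W, input_tokens, vocab, sentinel="$"):
    """One hard-gated decoding step.

    h_t:          query vector from the decoder GRU, shape (d,)
    memories:     embedded input tokens r_1..r_n (sentinel included), shape (n, d)
    W:            output projection, shape (|V|, d)
    input_tokens: the n input tokens aligned with `memories`
    vocab:        output vocabulary (list of words)
    """
    # P_ptr: attention of the query over the input memories
    p_ptr = softmax(memories @ h_t)
    ptr_choice = input_tokens[int(p_ptr.argmax())]
    if ptr_choice == sentinel:
        # sentinel selected: fall back to generating from the vocabulary
        p_vocab = softmax(W @ h_t)
        return vocab[int(p_vocab.argmax())]
    # otherwise copy the input token selected by the pointer
    return ptr_choice
```

At training time the pointer distribution is pushed toward the sentinel whenever the gold token is not present in the input, so that at inference the hard gate reproduces the copy-or-generate behaviour.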
5 System 2: Fairseq

The second system experimented by FBK is based on the work in (Gehring et al., 2017). In particular, we make use of the Python implementation of the toolkit known as Fairseq(-py), available at https://github.com/pytorch/fairseq. The toolkit is implemented in PyTorch and provides reference implementations of various sequence-to-sequence models, with configurations for several tasks, including translation, language modeling and story generation. In our experiments we use the toolkit as a black box, since our goal is to obtain a dataset that can be used with this system; hence we use the generic model (not designed for any specific task) without fine-tuning. Moreover, we do not add any specific feature or tuning for the implicit arguments (the ones surrounded by ∗), but let the system learn the rule by itself.

A common approach in sequence learning is to encode the input sequence with a series of bi-directional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks, and to generate a variable-length output with another set of decoder RNNs, not necessarily of the same type; encoder and decoder interface via an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015). Convolutional networks, on the other hand, create representations for fixed-size contexts, which can be seen as a disadvantage compared to RNNs. However, the context size of a convolutional network can be expanded by stacking layers on top of each other, which makes it possible to control the maximum length of the dependencies to be modeled. Furthermore, convolutional networks allow parallelization over the elements of a sequence, because they do not depend on the computations of the previous time step; this contrasts with RNNs, which maintain a hidden state of the entire past that prevents parallel computation within a sequence. This can dramatically reduce the training time of the system without reducing performance, as shown in (Gehring et al., 2017).

The weak point of the system is that it needs a considerable amount of training data to create reasonable models. In fact, Fairseq(-py) trained on the SUGAR dataset alone does not converge: it gets stuck after some epochs, producing pseudo-random sequences. Due to the small size of the SUGAR training set, combined with its low variability (training data consist of possible variations of only two recipes), it is impossible for the system to learn the correct structure of the commands (e.g. balancing the parentheses) or to generalize arguments. In order to use this system effectively, we expanded the SUGAR dataset with the data augmentation techniques presented in Section 6.

6 Data augmentation

Overfitting is still an open issue in neural models, especially in situations of data sparsity. In NLP, regularization methods are typically applied to the network (Srivastava et al., 2014; Le et al., 2015) rather than to the training data. However, in some application fields data augmentation has proven fundamental in improving the performance of neural models facing insufficient data. The first fields to explore data augmentation were computer vision and speech recognition, where well-established techniques for synthesizing data now exist: in the former, rescaling or affine distortions (LeCun et al., 1998; Krizhevsky et al., 2012); in the latter, adding background noise or applying small time shifts (Deng et al., 2000; Hannun et al., 2014). In NLP, data augmentation has received little attention so far, some notable exceptions being feature noising (Wang et al., 2013) and noising schemes inspired by Kneser-Ney smoothing (Xie et al., 2017). Additionally, negative example generation has been used in (Guerini et al., 2018).

In this paper we build upon the ideas of the aforementioned papers, moving a step forward by taking advantage of the structured nature of the SUGAR task and of some domain/linguistic knowledge.
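As a concrete illustration of the substitution strategies listed below, most-similar token substitution can be sketched as follows (a minimal sketch with toy vectors: in the actual experiments the neighbours come from a pre-trained embedding space and the substitution probability is 30%, while the function name and the brute-force neighbour search are illustrative):

```python
import random

import numpy as np

def most_similar_substitution(tokens, embeddings, p=0.3, k=5, seed=0):
    """Replace each token, with probability `p`, by one of its `k`
    nearest neighbours (cosine similarity) in the embedding space."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in embeddings and rng.random() < p:
            v = embeddings[tok]
            sims = {}
            for other, w in embeddings.items():
                if other == tok:
                    continue
                sims[other] = float(np.dot(v, w) /
                                    (np.linalg.norm(v) * np.linalg.norm(w)))
            # k nearest neighbours, then pick one at random for variability
            nearest = sorted(sims, key=sims.get, reverse=True)[:k]
            out.append(rng.choice(nearest))
        else:
            # out-of-vocabulary tokens and unsampled tokens are kept as-is
            out.append(tok)
    return out
```

Sampling among the top k neighbours, rather than always taking the single closest one, adds variability when the same token occurs many times in the training data.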
In particular, we used the following methods to expand the vocabulary and the size of the training data, applying substitution strategies to the original data:

• most-similar token substitution: based on a similarity mechanism (i.e. embeddings);

• synonym token substitution: synonymy relations taken from an online dictionary and applied to specific tokens;

• entity substitution: entities in the examples are replaced with random entities of the same type taken from available gazetteers.

The first approach substitutes a token of a training example with one of its five most similar tokens (chosen at random), found through cosine similarity in the embedding space described in (Pennington et al., 2014). We use the top five candidates in order to add variability, since many tokens appear multiple times in the training data. If the token also appeared as an argument in the command, it was substituted there as well, while if it appeared as an action it was left unchanged. This approach was applied with a probability of 30% to each token of the utterances in the training data.

The second approach was applied to verbs recognized in the training utterances with the TextPro PoS tagger (Pianta et al., 2008). Such verbs were substituted with one possible synonym taken from an electronic dictionary (http://www.sinonimi-contrari.it/). Also in this case, the action in the command was kept the same (in fact the verbs in the utterance are usually paired with the action in the command). The third approach substitutes ingredients in the text with other random ingredients from a list of foods (Magnini et al., 2018). In this case the ingredient was modified accordingly in the annotation of the sentence as well.

These methodologies make it possible to generate several variants starting from a single sentence. While the first approach was used in isolation, the second and the third were used together to generate additional artificial training data. Doing so, we obtained two different datasets: the first is composed of 45680 utterance-command pairs (most-similar token substitution applied forty times per example, 1142 x 40); the second contains 500916 pairs (each original sentence had each verb replaced at least 3 times, and for each of these variants ingredients were randomly substituted twice); the high number of variants is due to the inclusion of the history of three previous utterances in the process.

7 Results

                                 Actions   Arguments
Memory + Pointer Networks
  - Data Augmentation            65.091    30.856
  + Data Augmentation            65.396    35.786
  Fine Tuning                    66.158    36.102
Fairseq
  + Data Augmentation            66.361    46.221

Table 1: Accuracy of the two experimented approaches in recognizing actions and their arguments.

Results of the two approaches are reported in Table 1. Both approaches obtain higher accuracy in recognizing actions than in recognizing arguments. Fairseq trained with augmented data is the top performer of the task, outperforming the other approach by more than 10 accuracy points on arguments. The ablation test on Memory + Pointer Networks also shows the importance of data augmentation for tasks with low resources, and in particular of fine-tuning the classifier with the new data.

8 Conclusion and Future Work

We presented the FBK participation in the EVALITA 2018 Shared Task "SUGAR – Spoken Utterances Guiding Chef's Assistant Robots". Given the characteristics of the task, we experimented with two different approaches: (i) a neural architecture based on memory and pointer networks that can use as little training data as possible, and (ii) a state-of-the-art tagging system, Fairseq, trained with several augmentation techniques that expand the initial training set with synthetically generated data. The second approach seems promising, and in future work we want to investigate more deeply the effect of the different data augmentation techniques on performance.

Acknowledgments

This work has been partially supported by the AdeptMind scholarship.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Li Deng, Alex Acero, Mike Plumpe, and Xuedong Huang. 2000. Large-vocabulary speech recognition under adverse acoustic environments. In Sixth International Conference on Spoken Language Processing.

Maria Di Maro, Antonio Origlia, and Francesco Cutugno. 2018. Overview of the EVALITA 2018 Spoken Utterances Guiding Chef's Assistant Robots (SUGAR) task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.

Marco Guerini, Simone Magnolini, Vevake Balaraman, and Bernardo Magnini. 2018. Toward zero-shot entity recognition in task-oriented conversational agents. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 317–326, Melbourne, Australia, July.

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

Veton Këpuska and Gamal Bohouta. 2017. Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx). Journal of Engineering Research and Application, 7(3):20–24.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv preprint arXiv:1804.08217.

Bernardo Magnini, Vevake Balaraman, Mauro Dragoni, Marco Guerini, Simone Magnolini, and Valerio Piccioni. 2018. CH1: A conversational system to calculate carbohydrates in a meal. In Proceedings of the 17th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2018).

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro tool suite. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May. European Language Resources Association (ELRA).

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Sida Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D. Manning. 2013. Feature noising for log-linear structured prediction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1170–1179.

Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.

Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, and Emmanuel Dupoux. 2018. End-to-end speech recognition from the raw waveform. In Interspeech 2018, September.