=Paper= {{Paper |id=Vol-2263/paper012 |storemode=property |title=Overview of the EVALITA 2018 Spoken Utterances Guiding Chef’s Assistant Robots (SUGAR) Task |pdfUrl=https://ceur-ws.org/Vol-2263/paper012.pdf |volume=Vol-2263 |authors=Maria Di Maro,Antonio Origlia,Francesco Cutugno |dblpUrl=https://dblp.org/rec/conf/evalita/MaroOC18 }}
     Overview of the EVALITA 2018 Spoken Utterances Guiding Chef’s
                      Assistant Robots (SUGAR) Task
    Maria Di Maro
    Università degli Studi di Napoli ‘Federico II’
    Department of Humanities
    maria.dimaro2@unina.it

    Antonio Origlia
    Università degli Studi di Napoli ‘Federico II’
    URBAN/ECO Research Center
    antonio.origlia@unina.it

    Francesco Cutugno
    Università degli Studi di Napoli ‘Federico II’
    Department of Electrical Engineering and Information Technology
    cutugno@unina.it

Abstract

English. The SUGAR task is intended to develop a baseline to train a voice-controlled robotic agent to act as a cooking assistant. The starting point is therefore to provide authentic spoken data, collected in a simulated natural context, from which semantic predicates are extracted to classify the actions to perform. Three different approaches were used by the two SUGAR participants to solve the task. The results bring to light the critical issues underlying the task itself.

Italiano. Con il task SUGAR si intende sviluppare una baseline per addestrare un aiuto-cuoco robotico controllato da comandi vocali. Il punto di partenza sarà, pertanto, quello di fornire materiale vocale autentico raccolto in un contesto naturale simulato, da cui saranno estratti i predicati semantici al fine di classificare le azioni da eseguire. Tre diversi approcci sono stati utilizzati dai due partecipanti per risolvere il task. I risultati mostrano i veri livelli di criticità che soggiacciono al task stesso.

1 Introduction

In the last few years, Human-Machine interaction systems have been in the spotlight, as far as computer science and linguistics are concerned, resulting in many applications such as Virtual Assistants and Conversational Agents (Cassell et al., 2000; Cauell et al., 2000; Dzikovska et al., 2003; Allen et al., 2007). The possibility of using such Artificial Intelligence technologies in domestic environments is increasingly becoming a reality (Darby, 2018; Ziefle and Valdez, 2017). In order to ensure the future possibility of making such systems even more intelligent, further research is needed. As has been the case with Apple Siri and Google Assistant, recent approaches transformed former dialogue systems into direct action actuators, removing or reducing, as much as possible, the clarification requests that may arise in the presence of ambiguous commands. In this view, Spoken Language Understanding (SLU) is nowadays one of the major challenges of the field. Making a system able to truly understand the intention of the speaker in different contexts and to react correctly, even in the presence of Automatic Speech Recognition (ASR) errors, is the ultimate purpose to pursue in the field. In this context, the application of various semantic annotation schemata and criteria of knowledge modelling is of particular interest. Among the different techniques used to model the interpretation process we cite: (i) semantic-frame parsing, where frame classification together with the recognition of its attributes can improve the information retrieval process for a more precise domain-specific answer (Wang, 2010); (ii) semantic interpretation, for which semantic-syntactic trees can be used to extract basic semantic units and their relationships (Miller et al., 1996); (iii) intent classification, for which structures comprising generic predicates working as semantic primitives (Wierzbicka, 1972) and domain-dependent arguments can be used to represent a specific intent (Tur and Deng, 2011; Serban et al., 2018). With this particular task, we propose a possible framework for semantic classification to be tested, resorting to state-of-the-art SLU systems participating in the EVALITA-SUGAR challenge (Caselli et al., 2018).

2 Corpus Collection and Description

In the SUGAR challenge, the underlying task is to train a voice-controlled robotic agent to act as
a cooking assistant. For this purpose, a training corpus of annotated spoken commands was collected. To collect the corpus, we designed a 3D virtual environment reconstructing and simulating a real kitchen, where users could interact with a robot (named Bastian) which received commands to be performed in order to accomplish some recipes. Users' orders were inspired by silent cooking videos shown in the 3D scene, thus ensuring the naturalness of the spoken production. Videos were segmented into elementary portions (frames) and sequentially proposed to the speakers, who uttered a single sentence after each seen frame. In this view, speakers watched video portions and then gave instructions to the robot to emulate what was seen in the frame (Figure 1). The collected corpus then consists of a set of spoken commands, whose meaning derives from the various combinations of actions, items (i.e. ingredients), tools and different modifiers.

Figure 1: 3D reconstruction of Bastian in his kitchen. On the wall, the television showing frames of video recipes, from which users could extract actions to utter as commands.

Audio files were captured in a real acoustic environment, with a microphone placed at about 1 m from the speakers. The resulting corpus contains audio files for each speaker. These files were then segmented into sentences representing isolated commands. Orthographic transcriptions of the audio files were not provided. Consequently, participants could use whichever ASR they preferred, whose performance was not under assessment. Nevertheless, the developed systems were expected to remain efficient despite possible ASR deficiencies. Each resulting audio file was paired with a textual one containing the corresponding action annotation.

Training set  Actions are represented as a finite set of generic predicates accepting an open set of parameters. For example, the action of putting may refer to a pot being placed on the fire

    put(pot, fire)

or to an egg being put in a bowl

    put(egg, bowl)

The annotation process resulted in determining the optimal action predicate corresponding to each command.

The training set consists of audio file and predicate description pairs, where the predicate serves as an interpretation of the intention to be performed by the robot. For these scenarios, the audio files are always mapped onto a single interpretative predicate. The training set consists of 1721 utterances (and therefore 1721 audio files) produced by 36 different speakers and annotated by two linguistic experts. The action templates, which have been inferentially defined through the video collection, are shown in Table 1, where [ ] indicates a list of ingredients, / the alternative among possible arguments, quantity and modality are not mandatory arguments, and * is used when the argument is recoverable from the context (i.e. previously instantiated arguments, which are not uttered, not even by means of clitics or other pronouns) or from the semantics of the verb. For instance,

    friggere(fiori)1

is represented as

    aggiungere(fiori, *olio*)2

because olio (En. oil) is implicitly expressed in the semantics of the verb friggere (En. to fry) as an instrument to accomplish the action. Among other phenomena, it is worth mentioning the presence of actions paired with templates even when the syntactic structure needs a reconstruction, as in

    coprire(ciotola, pellicola)3

which is annotated with the generic template as

    mettere(pellicola, ciotola)4.

    1 fry(flowers)
    2 add(flowers, *oil*)
    3 cover(bowl, wrap)
    4 put(wrap, bowl)
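The predicate notation just introduced (an action name, a list of arguments, [ ] for ingredient lists, * for implicit arguments) can be made concrete with a small parser. The sketch below is purely illustrative: the helper names and the in-memory representation are our assumptions, not part of any official SUGAR tooling.

```python
import re
from dataclasses import dataclass

@dataclass
class Argument:
    value: str        # the argument text, without decorations
    implicit: bool    # True when written as *arg* (recoverable from context)
    is_list: bool     # True when written as [a, b] (an ingredient list)

def parse_predicate(text: str):
    """Parse a template instance such as 'aggiungere(fiori, *olio*)'
    into an action name and a list of Argument descriptors."""
    m = re.fullmatch(r"\s*(\w+)\s*\((.*)\)\s*", text)
    if m is None:
        raise ValueError(f"not a predicate: {text!r}")
    action, body = m.group(1), m.group(2)
    args = []
    # Split on commas that are not inside an ingredient list [...]
    for raw in re.split(r",\s*(?![^\[]*\])", body):
        raw = raw.strip()
        implicit = raw.startswith("*") and raw.endswith("*")
        core = raw.strip("*").strip()
        is_list = core.startswith("[") and core.endswith("]")
        args.append(Argument(core.strip("[]"), implicit, is_list))
    return action, args

# The fried-flowers example from the text: the second argument is implicit.
action, args = parse_predicate("aggiungere(fiori, *olio*)")
print(action, args)
```

A structure of this kind would also let a scorer compare system output against the reference annotation argument by argument, rather than as raw strings.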
 Predicate      Arguments
 prendere       quantità, [ingredienti]/recipiente
 aprire         quantità, [ingredienti], recipiente
 mettere        quantità, utensile/[ingredienti], elettrodomestico, modalità
 sbucciare      quantità, [ingredienti], utensile
 schiacciare    [ingredienti], utensile
 passare        [ingredienti], utensile
 grattare       [ingredienti], utensile
 girare         [ingredienti], utensile
 togliere       utensile/prodotto, elettrodomestico
 aggiungere     quantità, [ingredienti], utensile/recipiente/elettrodomestico/[ingredienti], modalità
 mescolare      [ingredienti], utensile, modalità
 impastare      [ingredienti]
 separare       parte/[ingredienti], ingrediente/utensile
 coprire        recipiente/[ingredienti], strumento
 scoprire       recipiente/[ingredienti]
 controllare    temperatura, ingrediente
 cuocere        quantità, [ingredienti], utensile, modalità

            Table 1: Italian action templates

In other cases, the uttered action represents the consequence of the action reported in the template, as in

    separare(parte, fiori)5

and

    pulire(fiori)6,

or

    mescolare([lievito, acqua])7

and

    sciogliere(lievito, acqua)8.

The argument order does not reflect the one in the audio files, but the following:

    azione(quantità9, oggetto, complemento, modalità)10

The modality arguments are of different types, and their order is adverb, cooking modality, temperature and time.

Test set  The test set consists of about 572 audio files containing uttered commands without annotations. Task participants were asked to provide, for each target command, the correct action predicate following the above-described format. Although single actions are of the same kind as the ones found in the training set and in the template file, the objects on which such actions may be applied vary (i.e. different recipes, ingredients, tools...). Participants have been evaluated on the basis of correctly interpreted commands, represented in the form of predicates.

The task could be carried out either by using only the linguistic information provided in the training set or by means of other external linguistic tools, such as ontologies, specialised lexicons, and external reasoners.

3 Evaluation Protocol

The evaluation protocol covered the following possibilities:

    • The proposed system correctly detects the requested action and all its parameters;

    • The proposed system asks for repetition;

    • The proposed system correctly detects the requested action but it assigns wrong parameters;

    • The proposed system misses the action.

The possibility of asking for repetitions is left to participants to avoid forcing them to provide an answer in uncertain conditions. In this case, the evaluation protocol assigns a weaker penalisation than the one considered for missing the arguments or the action. The collected corpus did not, however, contain situations in which the system asks for repetitions.

The designed evaluation procedure outputted the following pieces of information:

    1. an id comprising the listing number of the recognised predicate and the number of actions, in case of pluri-action predicates (1_1, 1_2, 2_1, etc.);

    2. a Boolean value (1: True, 0: False) indicating whether the predicate has been recognised; when a predicate was not recognised, the argument number is also set to 0;

    3. the number of expected arguments as indicated in the reference annotation files11;

    5 separate(part, flowers)
    6 clean(flowers)
    7 stir([yeast, water])
    8 melt(yeast, water)
    9 The quantity always precedes the noun it refers to. Therefore, it can also come before the complement.
    10 action(quantity, object, complement, modality)
    11 The reference annotation files were annotation files created for the test set, although they were not made available.
    4. the distance between the participating systems' output file and the reference file, computed by means of the Levenshtein distance (Levenshtein, 1966); the higher the computed distance in the output, the more mistakes the system had made;

    5. the number of arguments for which the system asked for repetition.

Suppose the action in the reference file is annotated as

    1; [prendere(500 g, latte), aggiungere(latte, pentola)]12

and the recognition procedure outputs

    1; prendere(500 g, panna)13

instead of returning the following result, indicating a correct recognition

    1_1                                     (first predicate)
    (1, 2, 0, 0)

    1_2                                 (second predicate)
    (1, 2, 0, 0)

the evaluation outputs

    1_1
    (1, 2, 1, 0)

    1_2
    (0, 0, 0, 0)14

where the first predicate is recognised despite one mistaken argument, whereas the second predicate is not recognised at all.

    12 1; [take(500 g, milk), add(milk, pot)]
    13 1; take(500 g, cream)
    14 The first action was recognised; two arguments were expected but one of them was wrong. The second action was not recognised at all.

The output format had to follow the one provided for the training data. For instance, asterisks indicating the implicitness of the arguments had to be included in the output file. As a matter of fact, retrieving the implicit function of a reconstructed argument serves to gauge the degree of understanding of the system, along with making this information usable for the improvement of fine-grained action detection tasks. On the other hand, the choice between alternative arguments (separated by a slash in the reference files) does not invalidate the results. In fact, to execute an action, only one of the uttered alternatives must be chosen. Therefore, when one of the alternatives was recognised, the resulting output did not contain recognition errors. On the contrary, when a system reported both alternatives in the output file, the Levenshtein distance increased. In the reference files, alternatives also occurred as implicit arguments, when an utterance can be completed by more than one possible argument.

4 Participating Systems

In this section, we report the results collected from testing the two participants' systems: the first (Section 4.1) was developed at Fondazione Bruno Kessler (FBK), while the second was developed by an Italian company which has decided to remain anonymous (Section 4.2). In Table 2, results are summarised, showing that FBK had better performances in terms of correct predicate and argument recognition for the intent classification, as far as its second system is concerned (Figure 2). On the other hand, its first system produced worse results, despite the introduction of the argument repetition request. In this phase, the argument repetition percentage was not weighted in the accuracy rate of the system, which would have resulted in a slight increase of the accuracy itself; we report it instead as an additional performance indicator of the participating system. For the anonymous system, action recognition is slightly beyond 50%, but argument recognition shows some issues (Figure 2) related to an over-fitting problem (see Section 4.2). For all three systems, recognition errors seemed to be random and not justifiable as semantically-related word selections.

4.1 FBK-HLT-NLP

To solve the proposed task, two different approaches were introduced. The first system was similar to the architecture proposed in (Madotto et al., 2018) and was based on an encoder-decoder approach. The encoder consisted of a MemNN network (Sukhbaatar et al., 2015) that stored each previous sentence in memory, from which relevant information was retrieved for the current sentence. The decoder was a combination of i) a MemNN to decode the input to an instruction containing tokens from the output vocabulary and ii) a Pointer network (Vinyals et al., 2015) that chose which token from the input was to be copied to
                      Correct Actions   Correct Arguments   Incorrect Actions   Incorrect Arguments   Argument Repetition
   FBK System 1 (a)       50.16               28.31               49.83                71.68                  4.11
   FBK System 2           66.36               46.22               33.64                53.78                     0
   Anonymous System       53.89               17.46               46.11                82.54                     0

   (a) One user is missing.

                  Table 2: Percentages of accuracy and error rate for each tested system


                                  Figure 2: Results of the FBK first system


the output instruction. This system was used to classify the SUGAR corpus intents after an ASR transcription (System 1).

The second approach consisted of modelling the task as a sequence-to-sequence problem. Rather than implementing a new system, Fairseq (Gehring et al., 2017) - a fully convolutional architecture for sequence-to-sequence modelling - was used. Instead of relying on Recurrent Neural Networks (RNNs) to compute the intermediate encoder states z and decoder states h, convolutional neural networks (CNNs) were adopted. Since the amount of training data was not big enough to train the model with such a system, written synthetic data were generated. To generate new data, two main methodologies were adopted: on the one hand, random words were substituted with similar words based on similarity mechanisms, such as word embeddings; on the other hand, training sentences were generated by replacing verbs and nouns with synonyms extracted from an online vocabulary (System 2).

4.2 Deep neural network for SUGAR

The anonymous participant built a deep neural network system to tackle this task15. First of all, to convert the spoken utterances into text, the Google Speech API was used. As features, the neural network used a word embedding lexicon trained on a corpus of recipes crawled from the web (4.5 million words). The word embeddings, with vectors of 100 dimensions, were trained with the skip-gram algorithm of fastText16 (Bojanowski et al., 2016).

As a preliminary step, an autoencoder to embed the predicates into a vector was built. The encoder was made of two Bi-LSTM layers. The first one was in charge of processing the token sequences for each predicate. The second layer processed the sequence of predicates and embedded them into a vector called the predicates embedding. This vector was then split into n parts, where n was the

   15 The following report is the result of a conversation with the involved participant, whose report was not officially submitted to EVALITA 2018 in order to remain anonymous.
   16 https://fasttext.cc/
maximum number of predicates. The decoder was made of two Bi-LSTM layers, where the first layer was in charge of decoding the sequence of predicates and the second layer was in charge of decoding the sequence of tokens for each predicate. To test the autoencoder, a development test set was extracted from the training set. The autoencoder was able to encode and decode without changes 96.73% of the predicates in the development test set.

The different possible actions were represented as classes in a one-hot encoded vector, and for each action a binary flag was used to represent whether the action was implicit or not. The predicates were encoded into a vector, using the aforementioned encoder, and for each predicate a flag was used to represent its alleged implicitness.

A multitask neural network was used to classify the actions, to detect whether they were implicit, and to predict the predicates. The network took as input a recipe as a list of commands, each of which was encoded by a Bi-LSTM layer. A second Bi-LSTM layer processed the command sequence and output a list of command embeddings. Each embedding was split into n parts which identified the actions included in the command. Each of these actions was passed to 4 dense layers that predicted the action class, the implicitness of the action, and the predicates embedding. Finally, the above-described decoder translated the predicates embedding into actual predicates.

5 Conclusions

With this task we proposed a field of application for spoken language understanding research concerned with intent classification for a domain-dependent system using a limited amount of training data. The results show that further analysis should be carried out to solve such semantic recognition problems, starting with an analysis of the errors that occurred in the participating systems and an enlargement of the reference corpus, up to finding a suitable pipeline for data processing, including a rule-based module to model issues such as argument implicitness, both in anaphoric- and semantic-dependent situations. This task is therefore intended to be a first reflection, whose next developments would include the creation of a corpus for the English language and the introduction of multimodality. As a matter of fact, pointing gestures or mimed actions and movements, which the interlocutor should be capable of re-performing with actual tools and ingredients, are multimodal activities that are of interest for this field of application, as for any other spoken understanding task where a shared context of interaction is expected.

Acknowledgments

We thank the EVALITA 2018 organisers and the SUGAR participants for the interest expressed. A special thanks also goes to Claudia Tortora, who helped us collect recipes and annotate our training set, and, last but not least, to the numerous testers who had fun talking with our dear Bastian.

This work is funded by the Italian PRIN project Cultural Heritage Resources Orienting Multimodal Experience (CHROME) #B52F15000450001.


References

James Allen, Mehdi Manshadi, Myroslava Dzikovska, and Mary Swift. 2007. Deep linguistic processing for spoken dialogue systems. In Proceedings of the Workshop on Deep Linguistic Processing, pages 49–56. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th evaluation campaign of natural language processing and speech tools for Italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Justine Cassell, Joseph Sullivan, Elizabeth Churchill, and Scott Prevost. 2000. Embodied Conversational Agents. MIT Press.

Justine Cauell, Tim Bickmore, Lee Campbell, and Hannes Vilhjálmsson. 2000. Designing embodied conversational agents. Embodied Conversational Agents, pages 29–63.

Sarah J Darby. 2018. Smart technology in the home: time for more clarity. Building Research & Information, 46(1):140–147.

Myroslava O Dzikovska, James F Allen, and Mary D Swift. 2003. Integrating linguistic and domain knowledge for spoken dialogue systems in multiple
  domains. In Proc. of IJCAI-03 Workshop on Knowl-
  edge and Reasoning in Practical Dialogue Systems.
Jonas Gehring, Michael Auli, David Grangier, Denis
  Yarats, and Yann N Dauphin. 2017. Convolu-
  tional sequence to sequence learning. arXiv preprint
  arXiv:1705.03122.
Vladimir I Levenshtein. 1966. Binary codes capable
  of correcting deletions, insertions, and reversals. In
  Soviet physics doklady, volume 10, pages 707–710.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung.
  2018. Mem2seq: Effectively incorporating knowl-
  edge bases into end-to-end task-oriented dialog sys-
  tems. arXiv preprint arXiv:1804.08217.
Scott Miller, David Stallard, Robert Bobrow, and
  Richard Schwartz. 1996. A fully statistical ap-
  proach to natural language interfaces. In Proceed-
  ings of the 34th annual meeting on Association for
  Computational Linguistics, pages 55–61. Associa-
  tion for Computational Linguistics.
Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Lau-
   rent Charlin, and Joelle Pineau. 2018. A survey of
   available corpora for building data-driven dialogue
   systems: The journal version. Dialogue & Dis-
   course, 9(1):1–49.
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al.
  2015. End-to-end memory networks. In Advances
  in neural information processing systems, pages
  2440–2448.
Gokhan Tur and Li Deng. 2011. Intent determination
  and spoken utterance classification. Spoken Lan-
  guage Understanding: Systems for Extracting Se-
  mantic Information from Speech, pages 93–118.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly.
  2015. Pointer networks. In Advances in Neural In-
  formation Processing Systems, pages 2692–2700.
Ye-Yi Wang. 2010. Strategies for statistical spoken
  language understanding with small amount of data-
  an empirical study. In Eleventh Annual Conference
  of the International Speech Communication Associ-
  ation.
Anna Wierzbicka. 1972. Semantic primitives.
Martina Ziefle and André Calero Valdez. 2017. Do-
 mestic robots for homecare: A technology accep-
 tance perspective. In International Conference on
 Human Aspects of IT for the Aged Population, pages
 57–74. Springer.