=Paper=
{{Paper
|id=Vol-2263/paper012
|storemode=property
|title=Overview of the EVALITA 2018 Spoken Utterances Guiding Chef’s Assistant Robots (SUGAR) Task
|pdfUrl=https://ceur-ws.org/Vol-2263/paper012.pdf
|volume=Vol-2263
|authors=Maria Di Maro,Antonio Origlia,Francesco Cutugno
|dblpUrl=https://dblp.org/rec/conf/evalita/MaroOC18
}}
==Overview of the EVALITA 2018 Spoken Utterances Guiding Chef’s Assistant Robots (SUGAR) Task==
Maria Di Maro, Antonio Origlia, Francesco Cutugno
Università degli Studi di Napoli ‘Federico II’
Maria Di Maro, Department of Humanities: maria.dimaro2@unina.it
Antonio Origlia, URBAN/ECO Research Center: antonio.origlia@unina.it
Francesco Cutugno, Department of Electrical Engineering and Information Technology: cutugno@unina.it
Abstract

English. The SUGAR task is intended to develop a baseline to train a voice-controlled robotic agent to act as a cooking assistant. The starting point is therefore to provide authentic spoken data, collected in a simulated natural context, from which semantic predicates are extracted to classify the actions to perform. Three different approaches were used by the two SUGAR participants to solve the task. The enlightening results show the different elements of criticality underlying the task itself.

Abstract

Italiano. Con il task SUGAR si intende sviluppare una baseline per addestrare un aiuto-cuoco robotico controllato da comandi vocali. Il punto di partenza sarà, pertanto, quello di fornire materiale vocale autentico raccolto in un contesto naturale simulato, da cui saranno estratti i predicati semantici al fine di classificare le azioni da eseguire. Tre diversi approcci sono stati utilizzati dai due partecipanti per risolvere il task. I risultati mostrano i vari livelli di criticità che soggiacciono al task stesso.

1 Introduction

In the last few years, Human-Machine Interaction systems have been in the spotlight, as far as computer science and linguistics are concerned, resulting in many applications such as Virtual Assistants and Conversational Agents (Cassell et al., 2000; Cauell et al., 2000; Dzikovska et al., 2003; Allen et al., 2007). The possibility of using such Artificial Intelligence technologies in domestic environments is increasingly becoming a reality (Darby, 2018; Ziefle and Valdez, 2017). In order to ensure that such systems can be made even more intelligent in the future, further research is needed. As has been the case with Apple Siri and Google Assistant, recent approaches transformed former dialogue systems into direct action actuators, removing or reducing, as much as possible, the clarification requests that may arise in the presence of ambiguous commands. In this view, Spoken Language Understanding (SLU) is nowadays one of the major challenges of the field. Making a system able to truly understand the intention of the speaker in different contexts and to react correctly, even in the presence of Automatic Speech Recognition (ASR) errors, is the ultimate goal to pursue. In this context, the application of various semantic annotation schemata and criteria of knowledge modelling is of particular interest. Among the different techniques used to model the interpretation process, we cite: (i) semantic-frame parsing, where frame classification together with the recognition of its attributes can improve the information retrieval process for a more precise domain-specific answer (Wang, 2010); (ii) semantic interpretation, for which semantic-syntactic trees can be used to extract basic semantic units and their relationships (Miller et al., 1996); (iii) intent classification, for which structures comprising generic predicates working as semantic primitives (Wierzbicka, 1972) and domain-dependent arguments can be used to represent a specific intent (Tur and Deng, 2011; Serban et al., 2018). With this task, we propose a possible framework for semantic classification to be tested, resorting to state-of-the-art SLU systems participating in the EVALITA-SUGAR challenge (Caselli et al., 2018).

2 Corpus Collection and Description

In the SUGAR challenge, the underlying task is to train a voice-controlled robotic agent to act as a cooking assistant.
Figure 1: 3D reconstruction of Bastian in his kitchen. On the wall, the television shows frames of video recipes, from which users could extract actions to utter as commands.

For this purpose, a training corpus of annotated spoken commands was collected. To collect the corpus, we designed a 3D virtual environment reconstructing and simulating a real kitchen, where users could interact with a robot (named Bastian) which received commands to be performed in order to accomplish some recipes. Users' orders were inspired by silent cooking videos shown in the 3D scene, thus ensuring the naturalness of the spoken production. Videos were segmented into elementary portions (frames) and sequentially proposed to the speakers, who uttered a single sentence after each seen frame. In this view, speakers watched video portions and then gave instructions to the robot to emulate what they had seen in the frame (Figure 1). The collected corpus then consists of a set of spoken commands, whose meaning derives from the various combinations of actions, items (i.e. ingredients), tools and different modifiers.

Audio files were captured in a real acoustic environment, with a microphone placed at about 1 m of distance from the speakers. The resulting corpus contains audio files for each speaker. These files were then segmented into sentences representing isolated commands. Orthographic transcriptions of the audio files were not provided. Consequently, participants could use whichever ASR they preferred, whose performance was not under assessment. Nevertheless, the developed systems were expected to remain robust despite possible ASR deficiencies. Each resulting audio file was paired with a textual one containing the corresponding action annotation.

Training set. Actions are represented as a finite set of generic predicates accepting an open set of parameters. For example, the action of putting may refer to a pot being placed on the fire

put(pot, fire)

or to an egg being put in a bowl

put(egg, bowl)

The annotation process resulted in determining the optimal action predicate corresponding to each command. The training set consists of audio file and predicate description pairs, where the predicate serves as an interpretation of the intention to be performed by the robot. For these scenarios, the audio files are always mapped on a single interpretative predicate. The training set consists of 1721 utterances (and therefore 1721 audio files) produced by 36 different speakers and annotated by two linguistic experts. The action templates, which have been inferentially defined through the video collection, are shown in Table 1, where [ ] indicates a list of ingredients, / the alternative among possible arguments, quantity and modality are not mandatory arguments, and * is used when the argument is recoverable from the context (i.e. previously instantiated arguments, which are not uttered, not even by means of clitics or other pronouns) or from the semantics of the verb. For instance,

friggere(fiori) (En. fry(flowers))

is represented as

aggiungere(fiori, *olio*) (En. add(flowers, *oil*))

because olio (En. oil) is implicitly expressed in the semantics of the verb friggere (En. to fry) as an instrument to accomplish the action. Among other phenomena, it is worth mentioning the presence of actions paired with templates even when the syntactic structure needs a reconstruction, as in

coprire(ciotola, pellicola) (En. cover(bowl, wrap))

which is annotated with the generic template as

mettere(pellicola, ciotola) (En. put(wrap, bowl)).
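The predicate notation above (asterisks marking implicit arguments, square brackets wrapping ingredient lists) is easy to manipulate programmatically. The following is a minimal sketch of a parser for it; the function name is ours, and the handling of the notation is inferred from the examples in this section.

```python
import re

def parse_predicate(pred):
    """Parse a SUGAR-style predicate such as 'aggiungere(fiori, *olio*)'
    into its action name and a list of (value, implicit) arguments.
    Implicit arguments are the ones wrapped in asterisks; square-bracketed
    ingredient lists are kept as single arguments."""
    m = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", pred)
    if m is None:
        raise ValueError("not a predicate: %r" % pred)
    action, body = m.group(1), m.group(2)
    args, depth, current = [], 0, ""
    for ch in body + ",":          # trailing comma flushes the last argument
        if ch == "," and depth == 0:
            token = current.strip()
            if token:
                implicit = token.startswith("*") and token.endswith("*")
                args.append((token.strip("*"), implicit))
            current = ""
            continue
        depth += ch == "["          # commas inside [ ] stay in the argument
        depth -= ch == "]"
        current += ch
    return action, args
```

For instance, `parse_predicate("aggiungere(fiori, *olio*)")` returns `("aggiungere", [("fiori", False), ("olio", True)])`, keeping the implicitness flag that the annotation expresses with asterisks.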
Predicate | Arguments
prendere | quantità, [ingredienti]/recipiente
aprire | quantità, [ingredienti], recipiente
mettere | quantità, utensile/[ingredienti], elettrodomestico, modalità
sbucciare | quantità, [ingredienti], utensile
schiacciare | [ingredienti], utensile
passare | [ingredienti], utensile
grattare | [ingredienti], utensile
girare | [ingredienti], utensile
togliere | utensile/prodotto, elettrodomestico
aggiungere | quantità, [ingredienti], utensile/recipiente/elettrodomestico/[ingredienti], modalità
mescolare | [ingredienti], utensile, modalità
impastare | [ingredienti]
separare | parte/[ingredienti], ingrediente/utensile
coprire | recipiente/[ingredienti], strumento
scoprire | recipiente/[ingredienti]
controllare | temperatura, ingrediente
cuocere | quantità, [ingredienti], utensile, modalità

Table 1: Italian action templates

In other cases, the uttered action represents the consequence of the action reported in the template, as in

separare(parte, fiori) (En. separate(part, flowers))

and

pulire(fiori) (En. clean(flowers)),

or

mescolare([lievito, acqua]) (En. stir([yeast, water]))

and

sciogliere(lievito, acqua) (En. melt(yeast, water)).

The argument order does not reflect the one in the audio files, but the following:

azione(quantità, oggetto, complemento, modalità) (En. action(quantity, object, complement, modality))

The quantity always precedes the noun it refers to; therefore, it can also come before the complement. The modality arguments are of different types, and their order is adverb, cooking modality, temperature and time.

Test set. The test set consists of about 572 audio files containing uttered commands without annotations. Task participants were asked to provide, for each target command, the correct action predicate following the above-described format. Although single actions are of the same kind as the ones found in the training set and in the template file, the objects to which such actions may be applied vary (i.e. different recipes, ingredients, tools, etc.). Participants have been evaluated on the basis of correctly interpreted commands, represented in the form of predicates.

The task could be carried out either by using only the linguistic information provided in the training set or by means of other external linguistic tools, such as ontologies, specialised lexicons, and external reasoners.

3 Evaluation Protocol

The evaluation protocol covered the following possibilities:

• The proposed system correctly detects the requested action and all its parameters;

• The proposed system asks for repetition;

• The proposed system correctly detects the requested action but assigns wrong parameters;

• The proposed system misses the action.

The possibility of asking for repetitions is left to participants to avoid forcing them to provide an answer in uncertain conditions. In this case, the evaluation protocol assigns a weaker penalisation than the one considered for missing the arguments or the action. The collected corpus did not, however, contain situations in which the system asks for repetitions.

The designed evaluation procedure outputs the following pieces of information:

1. an id comprising the listing number of the recognised predicate and the number of actions, in case of pluri-action predicates (1_1, 1_2, 2_1, etc.);

2. a Boolean value (1: True, 0: False) indicating whether the predicate has been recognised; when the predicate was not recognised, the argument number is also set to 0;

3. the number of expected arguments as indicated in the reference annotation files (annotation files created for the test set, although not made available);
4. the distance between the participating systems' output file and the reference file, computed by means of the Levenshtein distance (Levenshtein, 1966); the higher the computed distance in the output, the more mistakes the system had made;

5. the number of arguments for which the system asked for repetition.

Suppose the action in the reference file is annotated as

1; [prendere(500 g, latte), aggiungere(latte, pentola)] (En. 1; [take(500 g, milk), add(milk, pot)])

and the recognition procedure outputs

1; prendere(500 g, panna) (En. 1; take(500 g, cream))

Instead of returning the following result, indicating a correct recognition,

1_1 (first predicate)
(1, 2, 0, 0)

1_2 (second predicate)
(1, 2, 0, 0)

the evaluation outputs

1_1
(1, 2, 1, 0)

1_2
(0, 0, 0, 0)

where the first predicate is recognised despite one mistaken argument (two arguments were expected, but one of them was wrong), whereas the second predicate is not recognised at all.

The output format had to follow the one provided for the training data. For instance, the asterisks indicating the implicitness of the arguments had to be included in the output file. As a matter of fact, retrieving the implicit function of a reconstructed argument captures the degree of understanding of the system, and this information can also be exploited to improve fine-grained action detection tasks. On the other hand, the choice between alternative arguments (separated by a slash in the reference files) does not invalidate the results. In fact, to execute an action, only one of the uttered alternatives must be chosen. Therefore, when one of the alternatives was recognised, the resulting output did not contain recognition errors. On the contrary, when the system reported both alternatives in the output file, the Levenshtein distance increased. In the reference files, alternatives also occurred as implicit arguments, when an utterance can be completed by more than one possible argument.

4 Participating Systems

In this section, we report the results collected from testing the two participants' systems: the first (Section 4.1) was developed at Fondazione Bruno Kessler (FBK), while the second was developed by an Italian company which has decided to remain anonymous (Section 4.2). In Table 2, the results are summarised, showing that FBK had better performance in terms of correct predicate and argument recognition for the intent classification, as far as its second system is concerned (Figure 2). On the other hand, its first system output worse results, despite the introduction of the argument repetition request. In this phase, the argument repetition percentage was not weighted in the accuracy rate of the system, which would have resulted in a slight increase of the accuracy itself; we report it instead as an additional performance measure of the participating system. For the anonymous system, action recognition is slightly above 50%, but argument recognition shows some issues (Figure 2) related to an over-fitting problem (see Section 4.2). For all three systems, recognition errors seemed to be random and not explainable as semantically-related word selections.

4.1 FBK-HLT-NLP

To solve the proposed task, two different approaches were introduced. The first system was similar to the architecture proposed in (Madotto et al., 2018) and was based on an encoder-decoder approach. The encoder consisted of a MemNN network (Sukhbaatar et al., 2015) that stored each previous sentence in memory, from which relevant information was retrieved for the current sentence. The decoder was a combination of i) a MemNN to decode the input into an instruction containing tokens from the output vocabulary and ii) a Pointer network (Vinyals et al., 2015) that chose which token from the input was to be copied to the output instruction. This system was used to classify the SUGAR corpus intents after an ASR transcription (System 1).
System | Correct Actions | Correct Arguments | Incorrect Actions | Incorrect Arguments | Argument Repetition
FBK System 1* | 50.16 | 28.31 | 49.83 | 71.68 | 4.11
FBK System 2 | 66.36 | 46.22 | 33.64 | 53.78 | 0
Anonymous System | 53.89 | 17.46 | 46.11 | 82.54 | 0
* One user is missing.

Table 2: Percentages of accuracy and error rate for each tested system
Figure 2: Results of the FBK first system
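The per-predicate record defined in Section 3 (recognised flag, expected-argument count, Levenshtein distance, repetition count) can be reproduced with a short scoring function. This is our own sketch of the protocol, not the official evaluation script; in particular, applying the edit distance at the level of argument lists is inferred from the worked example above.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, here used on argument lists
    (Levenshtein, 1966). Single-row dynamic-programming formulation."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def score(reference, hypothesis, repetitions=0):
    """Build one evaluation record for an (action, args) reference/hypothesis
    pair: (recognised, expected arguments, distance, repetitions)."""
    ref_action, ref_args = reference
    hyp_action, hyp_args = hypothesis
    if hyp_action != ref_action:
        # Missed predicate: the argument number is also set to 0.
        return (0, 0, 0, repetitions)
    return (1, len(ref_args), levenshtein(ref_args, hyp_args), repetitions)
```

On the worked example, `score(("prendere", ["500 g", "latte"]), ("prendere", ["500 g", "panna"]))` yields (1, 2, 1, 0): the predicate is recognised, two arguments were expected, and one of them is wrong.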
The second approach consisted of modelling the task as a sequence-to-sequence problem. Rather than implementing a new system, Fairseq (Gehring et al., 2017), a fully convolutional architecture for sequence-to-sequence modelling, was used. Instead of relying on Recurrent Neural Networks (RNNs) to compute the intermediate encoder states z and decoder states h, convolutional neural networks (CNNs) were adopted. Since the amount of training data was not big enough to train the model with such a system, written synthetic data were generated. To generate new data, two main methodologies were adopted: on the one hand, random words were substituted with similar words based on similarity mechanisms, such as word embeddings; on the other hand, training sentences were generated by replacing verbs and nouns with synonyms extracted from an online vocabulary (System 2).

4.2 Deep neural network for SUGAR

The anonymous participant built a deep neural network system to tackle this task. (The following report is the result of a conversation with the involved participant, whose own report was not officially submitted to EVALITA 2018 in order to remain anonymous.) First of all, to convert the spoken utterances into text, the Google Speech API was used. As features, the neural network used a word embedding lexicon trained on a corpus of recipes crawled from the web (4.5 million words). The word embeddings, with vectors having 100 dimensions, were trained with the skip-gram algorithm of fastText (https://fasttext.cc/) (Bojanowski et al., 2016).

As a preliminary step, an autoencoder to embed the predicates into a vector was built. The encoder was made of two Bi-LSTM layers. The first one was in charge of processing the token sequences for each predicate. The second layer processed the sequence of predicates and embedded them into a vector called the predicates embedding. This vector was then split into n parts, where n was the maximum number of predicates.
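The splitting step just described, in which the predicates embedding is cut into n parts (one slot per potential predicate), amounts to simple slicing. A minimal illustration, with plain Python lists standing in for the actual embedding vectors:

```python
def split_embedding(vec, n):
    """Split a flat predicates embedding into n equal slices, one per
    potential predicate slot; assumes len(vec) is a multiple of n."""
    size = len(vec) // n
    return [vec[i * size:(i + 1) * size] for i in range(n)]
```

For example, a 100-dimensional embedding split with n = 4 yields four 25-dimensional slices, each decoded independently into one predicate.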
The decoder was made of two Bi-LSTM layers, where the first layer was in charge of decoding the sequence of predicates and the second layer was in charge of decoding the sequence of tokens for each predicate. To test the autoencoder, a development test set was extracted from the training set. The autoencoder was able to encode and decode without changes 96.73% of the predicates in the development test set.

The different possible actions were represented as classes in a one-hot vector, and for each action a binary flag was used to represent whether the action was implicit or not. The predicates were encoded into a vector, using the aforementioned encoder, and for each predicate a flag was used to represent its alleged implicitness.

A multitask neural network was used to classify the actions, to detect whether they were implicit, and to predict the predicates. The network took as input a recipe as a list of commands, each of which was encoded by a Bi-LSTM layer. A second Bi-LSTM layer processed the command sequence and output a list of command embeddings. Each embedding was split into n parts, which identified the actions included in the command. Each of these actions was passed to 4 dense layers that predicted the action class, the implicitness of the action, and the predicates embedding. Finally, the above-described decoder translated the predicates embedding into actual predicates.

5 Conclusions

With this task, we proposed a field of application for spoken language understanding research concerned with intent classification for a domain-dependent system using a limited amount of training data. The results show that further analysis should be carried out to solve such semantic recognition problems, starting with an analysis of the errors that occurred in the participating systems and an enlargement of the reference corpus, up to finding a suitable pipeline for data processing, including a rule-based module to model issues such as argument implicitness, both in anaphoric- and semantic-dependent situations. This task is therefore intended as a first reflection, whose next developments would include the creation of a corpus for the English language and the introduction of multimodality. As a matter of fact, pointing gestures or mimed actions and movements, on the basis of which the interlocutor should be capable of re-performing them with actual tools and ingredients, are multimodal activities that are of interest for this field of application, as for any other spoken understanding task where a shared context of interaction is expected.

Acknowledgments

We thank the EVALITA 2018 organisers and the SUGAR participants for the interest expressed. A special thank-you also goes to Claudia Tortora, who helped us collect recipes and annotate our training set, and, last but not least, to the numerous testers who had fun talking with our dear Bastian. This work is funded by the Italian PRIN project Cultural Heritage Resources Orienting Multimodal Experience (CHROME) #B52F15000450001.

References

James Allen, Mehdi Manshadi, Myroslava Dzikovska, and Mary Swift. 2007. Deep linguistic processing for spoken dialogue systems. In Proceedings of the Workshop on Deep Linguistic Processing, pages 49–56. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th evaluation campaign of natural language processing and speech tools for Italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Justine Cassell, Joseph Sullivan, Elizabeth Churchill, and Scott Prevost. 2000. Embodied conversational agents. MIT Press.

Justine Cauell, Tim Bickmore, Lee Campbell, and Hannes Vilhjálmsson. 2000. Designing embodied conversational agents. Embodied conversational agents, pages 29–63.

Sarah J. Darby. 2018. Smart technology in the home: time for more clarity. Building Research & Information, 46(1):140–147.

Myroslava O. Dzikovska, James F. Allen, and Mary D. Swift. 2003. Integrating linguistic and domain knowledge for spoken dialogue systems in multiple domains. In Proc. of IJCAI-03 Workshop on Knowledge and Reasoning in Practical Dialogue Systems.
Jonas Gehring, Michael Auli, David Grangier, Denis
Yarats, and Yann N Dauphin. 2017. Convolu-
tional sequence to sequence learning. arXiv preprint
arXiv:1705.03122.
Vladimir I Levenshtein. 1966. Binary codes capable
of correcting deletions, insertions, and reversals. In
Soviet physics doklady, volume 10, pages 707–710.
Andrea Madotto, Chien-Sheng Wu, and Pascale Fung.
2018. Mem2seq: Effectively incorporating knowl-
edge bases into end-to-end task-oriented dialog sys-
tems. arXiv preprint arXiv:1804.08217.
Scott Miller, David Stallard, Robert Bobrow, and
Richard Schwartz. 1996. A fully statistical ap-
proach to natural language interfaces. In Proceed-
ings of the 34th annual meeting on Association for
Computational Linguistics, pages 55–61. Associa-
tion for Computational Linguistics.
Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Lau-
rent Charlin, and Joelle Pineau. 2018. A survey of
available corpora for building data-driven dialogue
systems: The journal version. Dialogue & Dis-
course, 9(1):1–49.
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al.
2015. End-to-end memory networks. In Advances
in neural information processing systems, pages
2440–2448.
Gokhan Tur and Li Deng. 2011. Intent determination
and spoken utterance classification. Spoken Lan-
guage Understanding: Systems for Extracting Se-
mantic Information from Speech, pages 93–118.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly.
2015. Pointer networks. In Advances in Neural In-
formation Processing Systems, pages 2692–2700.
Ye-Yi Wang. 2010. Strategies for statistical spoken language understanding with small amount of data: an empirical study. In Eleventh Annual Conference of the International Speech Communication Association.
Anna Wierzbicka. 1972. Semantic primitives.
Martina Ziefle and André Calero Valdez. 2017. Do-
mestic robots for homecare: A technology accep-
tance perspective. In International Conference on
Human Aspects of IT for the Aged Population, pages
57–74. Springer.