=Paper=
{{Paper
|id=Vol-2481/paper58
|storemode=property
|title=Evaluating the MuMe Dialogue System with the IDIAL protocol
|pdfUrl=https://ceur-ws.org/Vol-2481/paper58.pdf
|volume=Vol-2481
|authors=Aureliano Porporato,Alessandro Mazzei,Daniele P. Radicioni,Rosa Meo
|dblpUrl=https://dblp.org/rec/conf/clic-it/PorporatoMRM19
}}
==Evaluating the MuMe Dialogue System with the IDIAL protocol==
Aureliano Porporato, Università degli Studi di Torino, aureliano.porporato@unito.it
Alessandro Mazzei, Università degli Studi di Torino, alessandro.mazzei@unito.it
Rosa Meo, Università degli Studi di Torino, rosa.meo@unito.it
Daniele P. Radicioni, Università degli Studi di Torino, daniele.radicioni@unito.it

Abstract

English. In this paper we describe the implementation of the MuMe dialogue system, a task-based dialogue system for a car sharing service, and its evaluation through the IDIAL protocol. Finally, we report some comments on this novel dialogue system evaluation method.1

Italiano. In questo lavoro descriviamo l'implementazione del sistema di dialogo MuMe, realizzato per un sistema di car sharing, e la sua valutazione attraverso il protocollo IDIAL. Infine, offriamo alcuni commenti su questo nuovo metodo per la valutazione di sistemi di dialogo.

1 Introduction

The interest in dialogue systems is on the rise in the NLP community (McTear et al., 2016), under the strong demand for the introduction of a natural and effective user interaction in applications, as in the customer care domain (Hu et al., 2018). A related and central issue is the evaluation of such systems. In this setting, it is well known that most evaluation metrics that come from machine translation and compare a model-generated response to a single target response exhibit a poor correlation with human judgement (Liu et al., 2016).

In this paper we briefly illustrate a task-oriented dialogue system called MuMe (from "MUoversi MEglio", "travelling better" in English), and examine how far the evaluation protocol IDIAL (Cutugno et al., 2018) is helpful in its assessment. IDIAL is composed of a usability evaluation (carried out by a group of users) and an evaluation of the robustness of the dialogue model based on linguistic variations of the successful interactions with the users. The application being tested is a prototype dialogue system that we developed for the reservation of electric vehicles in the context of a car sharing service. A user must be able to interact with the system to specify when and where s/he wants to leave and which sort of vehicle is needed. While there are some services and frameworks dedicated to the development of machine-learning-based dialogue systems, like Google Dialogflow2 or the open source Rasa3 framework, the lack of Italian dialogue corpora in the specific domain of car sharing reservations (see, e.g., Serban et al. (2018)) and the impossibility on our part to recruit enough people for the creation of such a corpus forced us to choose a different solution: we developed a simpler and less data-reliant rule-based system, based on slot-filling semantics. Moreover, the decisions made by this kind of system can be tracked throughout the computation, with the advantage of being quite explainable. This is a desirable feature, since it simplifies the debugging and maintenance of the routines, and allows an easier extension of the system to meet additional requirements.

This paper is mostly concerned with the evaluation of the MuMe system. The structure of the paper is as follows. After surveying related work (Section 2), we briefly introduce the overall architecture and the main components of the MuMe dialogue system (Section 3); we evaluate MuMe by using the IDIAL protocol, and employ the MuMe experimentation as a case study for giving feedback on the IDIAL protocol itself (Section 4); finally, in the last Section we briefly recap the main contributions of the paper and point to ongoing and future work.

1 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 https://dialogflow.com/
3 https://rasa.com/
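The slot-filling, rule-based design described above can be made concrete with a small sketch: a reservation is modelled as a frame of named slots, and the dialogue continues until every mandatory slot is filled. The slot names and the 'economy car' default are taken from the control strategy described later in the paper; the two helper functions and the data layout are our own illustration, not the actual MuMe code.

```python
# Sketch of the slot-filling semantics: a reservation is a frame of named
# slots, and the dialogue continues until all mandatory slots are filled.
# Slot names and the default value follow the paper's control strategy;
# the functions themselves are illustrative, not the actual MuMe code.

MANDATORY = ("start_date", "start_time", "start_stall")
DEFAULTS = {"vehicle_type": "economy car"}  # assigned when left unspecified

def missing_slots(frame):
    """Return the mandatory slots that the user has not filled yet."""
    return [slot for slot in MANDATORY if slot not in frame]

def finalize(frame):
    """Complete the reservation, filling non-mandatory slots with defaults."""
    return {**DEFAULTS, **frame}

# A user utterance may fill several slots at once; the system then asks
# only for what is still missing (here, the start time).
frame = {"start_date": "2019-06-12", "start_stall": "via Pessinetto"}
assert missing_slots(frame) == ["start_time"]
frame["start_time"] = "09:00"
assert finalize(frame)["vehicle_type"] == "economy car"
```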
2 Related Work

The pioneering work of Bobrow et al. (1977) proposed the frame-based architecture that most task-based dialogue systems implement. The basic idea is to abandon the demanding goal of a genuine logical representation of the dialogue meaning and to adopt a simpler slot-filling semantics. In some sense, the event-entity representation of modern neural-based dialogue system frameworks can be seen as an ultimate evolution of that simplification idea. Aust et al. (1995) presented a rule-based system to some extent similar to ours in purpose and structure, created for a train-seat reservation project. That system has to grasp the names of cities, train stations, dates and times, and is able to perform quite sophisticated temporal information processing. Further rule-based systems are reviewed in the survey by Abdul-Kader and Woods (2015).

A different class of dialogue systems is based on neural networks. A survey on this class of systems can be found in (Mathur and Singh, 2018).

Regarding the evaluation of dialogue systems, the work by Bohlin et al. (1999) proposes the Trindi Tick-list, a wish list of the desired dialogue behaviour and features, specified as a checklist of "yes-no" questions. As regards this approach, Braunger and Maier (2017) argue that standardised evaluation models do not enable a complete evaluation of a dialogue system. Rather, they suggest that such an evaluation must take into account the natural flow of the interaction between the user and the system itself; such a measure involves many language- and user-dependent factors, such as the length of the user utterances. These principles were tested in human-computer vocal interactions occurring on board of vehicles. Further information on dialogue system evaluation methods can be found in the survey by Deriu et al. (2019).

3 The MuMe system architecture

In Figure 1 we depict the basic architecture of the MuMe dialogue system. The information flow starts from a sentence typed by the user: this sentence is handled by the OpenDial system (see Section 3.1), which plays both the role of the dialogue manager and of the system orchestrator. The sentence is syntactically parsed and semantically analyzed by an IE module (see Section 3.2). At this point, the result of the processing is converted into a slot-filling form. When control returns to OpenDial, it generates an answer and returns it to the user on the basis of a dialogue control strategy (see Section 3.3).

3.1 The OpenDial Dialogue Manager

The main component of our software architecture is the OpenDial open source framework for dialogue management (Lison, 2015). The system, which was designed for speech interaction, adopts the information state approach for modelling the state of the dialogue (Traum and Larsson, 2003), that is, a collection of variables representing the actual state of the system. The transition between states, i.e. the change of the variable values, is governed by the activation of a set of "if-then-else" rules on input values as well as on the variation of some variables. Indeed, OpenDial uses these rules when it models the sub-tasks of user utterance understanding, dialogue management and response generation. Moreover, the integration of the system with external tools is simple. We exploited this capability in MuMe, since for language understanding we used a module based on an external parser (see below). Additionally, the OpenDial framework implements some statistical techniques to deal with uncertainty, a way to learn interaction models from existing dialogues. This feature is particularly important for speech-based dialogue systems, where uncertain information arises from automatic speech recognition. However, at this stage of the MuMe project we did not use this feature, since we were working on written texts only.

3.2 Parsing and Information Extraction

In order to assign semantic roles to the entities in the dialogues, we decided to use a syntactic parser on the text inserted by the user.

As our main parsing module we used Tint (The Italian NLP Tool) (Palmero Aprosio and Moretti, 2016), a framework modeled on Stanford CoreNLP (Manning et al., 2014). Tint performs some fundamental processing of user utterances, such as dependency parsing, Named Entity Recognition and the extraction of temporal expressions. In particular, these tasks are executed by interfacing with external tools.

For the recognition of temporal expressions (such as dates and times), Tint integrates the services provided by HeidelTime (Strötgen and Gertz, 2013). HeidelTime allows the extraction
Figure 1: The schematic architecture of the MuMe dialogue system. The diagram shows the OpenDial dialogue manager orchestrating the pipeline: preprocessing; the IE module (linguistic analysis with Tint, temporal expression extraction with HeidelTime, geographic expression extraction, geocoding and geolocation with the Google Maps API, domain-specific IE, space and time inference, slot filling); postprocessing; response generation; and the backend server.
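The information flow of Figure 1 can be sketched as a chain of processing steps. In the sketch below, stub functions stand in for the real components (Tint for parsing, HeidelTime for temporal expressions, custom rules and the Google Maps API for addresses); all function names and the toy extraction rules are illustrative assumptions, not the actual MuMe implementation.

```python
# Sketch of the Figure 1 pipeline. Every function here is a stub standing in
# for a real MuMe component; names and extraction rules are illustrative only.

def linguistic_analysis(utterance):
    # Stands in for Tint (dependency parsing, named entity recognition).
    return {"tokens": utterance.split()}

def extract_temporal(utterance):
    # Stands in for HeidelTime (TIMEX3 temporal expressions).
    return ["domani"] if "domani" in utterance else []

def extract_geographic(utterance):
    # Stands in for the custom address rules plus Google Maps geocoding.
    tokens = utterance.split()
    return [" ".join(tokens[tokens.index("via"):])] if "via" in tokens else []

def ie_module(utterance):
    # The IE module combines linguistic analysis with domain-specific
    # extraction, producing the material for the slot-filling form.
    return {
        "analysis": linguistic_analysis(utterance),
        "dates": extract_temporal(utterance),
        "addresses": extract_geographic(utterance),
    }

def dialogue_manager(utterance, frame):
    # OpenDial's role: call the IE module and update the slot-filling form.
    result = ie_module(utterance)
    if result["dates"]:
        frame["start_date"] = result["dates"][0]
    if result["addresses"]:
        frame["end_address"] = result["addresses"][0]
    return frame

frame = dialogue_manager(
    "Ho bisogno di un'auto domani per andare in via Pessinetto", {})
assert frame == {"start_date": "domani", "end_address": "via Pessinetto"}
```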
of various sorts of temporal expressions in various languages, including Italian, and represents them in the standard TIMEX3 format.

For the treatment of geographic expressions, Tint interfaces with the Nominatim wrapper.4 However, this free and open source service performs poorly in geocoding (i.e., in searching for the GPS coordinates of a given address). As a consequence, we decided to use the Google Maps API5, which provides better performance. Indeed, Maps offers an API for address autocomplete, once this piece of information has been isolated from the rest of the sentence, and also for geolocation (i.e., searching for the coordinates of the user).

4 http://nominatim.org/.
5 https://cloud.google.com/maps-platform/.

3.3 Dialogue Control Strategy

The simple control strategy implemented, which governs the moves of the dialogue, is based on the fulfillment of a number of mandatory slots in the domain-specific slot-filling semantics adopted for the car reservation domain.

In particular, the mandatory slots are the start date, the start time and the start stall (which encodes the start position). Indeed, the simplest reservation in MuMe needs only these pieces of information: a person reserves a standard car, starting at a specific time of a specific day from a specific stall, and will return the car to the same stall without the need to specify the return date and time.

However, more complex reservations need more information, encoded in the non-mandatory slots of end date, end time, end stall and vehicle type. For example, the user can choose between three types of vehicles, but if the kind of vehicle is not specified, the system assigns a default 'economy car' to the vehicle type slot.

The MuMe system adopts a mixed initiative for dialogue handling. Although the dialogue is overall system-driven, the user starts the conversation by possibly providing some initial information. Richer initial information is expected to result in a shorter dialogue interaction. Indeed, a design goal of the MuMe system is to produce a dialogue as short as possible. For this reason, also in the subsequent interactions, if the user gives various pieces of information in a single utterance, the system can extract all such information and assign each filler to the corresponding slot, thus avoiding further unnecessary questions.

When the user begins the interaction with the MuMe system, the system replies with a welcome message and with a general question aimed at encouraging the user to start the interaction in the most natural way.

In order to give more details on the control strategy, we now consider the following running example and its processing in MuMe (see Figure 1):

(it) "User: Ho bisogno di un'auto domani per andare in via Pessinetto"
(en) "User: I need a car tomorrow to go in Pessinetto street"6

6 The English versions of the user and system sentences are given for clarity. The system is available in Italian only.

The Information Extraction phase detects a date (through HeidelTime) and an address (extracted through a basic set of custom rules) in the user
sentence. By means of other rules that check the shape of the dependency tree (obtained through Tint), the date and the address are labelled as start date and end address. Particularly relevant in this case is the verb "andare" ("to go"), which signals that the following address is where the user wants to arrive, not a starting point. In the post-processing phase some additional information can be inferred, like the value of the start address, left unspecified by the user: it can be selected by retrieving the GPS coordinates of the address by means of the Google Maps API. Once the user's current location has been identified, the nearest stall is selected as the start stall.

At the end of this processing, the system has successfully filled the start address, start stall, end address, end stall and start date slots. Some mandatory slots are still left unfilled, such as the start time, so the system will ask the user to provide the missing information. As a consequence, the response of the system will be a question selected from a fixed list based on the unfilled slots: in this specific example, the system will continue by asking for the departure time.

At the end of the filling phase of the mandatory slots, the system gives the user the possibility to modify the request and to correct possible errors and misunderstandings. The slot-filling values are then sent to a dedicated server for the finalization of the reservation.

4 Evaluation

In order to have a first preliminary evaluation of the MuMe system, we applied the Trindi Tick-list protocol, that is, a set of "yes-no" questions concerning specific capabilities of the developed system (Bohlin et al., 1999). While this simple questionnaire is helpful in the development phase, since it is able to give a measure of the system limits, it is not suitable for completely evaluating the actual experience of the user. At this stage of development, the MuMe system has a Trindi score of six over twelve with respect to the (original) list. Among the six features not yet implemented there are complex tasks, such as the management of help and non-help sub-dialogues, dealing with negative information, and dealing with noisy input.

In the rest of the Section, we report the results obtained by applying the IDIAL evaluation protocol to the current version of the MuMe system, which is split into a questionnaire concerning the user experience (Section 4.1) and a number of stress tests concerning the linguistic robustness of the system (Section 4.2).

4.1 IDIAL User Evaluation

A group of 5 subjects (3 males, 2 females; 19, 22, 25, 26 and 61 years old) was recruited for the evaluation task by personal invitation and without rewards. After a brief oral description of the domain and of the basic mechanisms of interaction with the system, each user was asked to generate 7 complete dialogues with the system in a controlled environment. We asked the users to simulate the process of reserving a car without other specific constraints.

In Table 1 we report the ten questions of the IDIAL user test with the average scores, obtained by using a five-point Likert scale.7 Note that questions 3, 4, 7 and 10 have been designed to evaluate the effectiveness of the dialogue system, while questions 1 and 2 regard the system efficiency.8

7 We used the Italian version of the questionnaire, found in Appendix A of https://tinyurl.com/yxngqkx4, but for the sake of readability in Table 1 we report the English version.
8 The answers of each subject are available at https://tinyurl.com/y6nruwon

N | Sentence | Evaluation
1 | The system was efficient in accomplishing the task. | 3.2 (0.45)
2 | The system quickly provided all the information that the user needed. | 3.6 (0.55)
3 | The system is easy to use. | 3.6 (1.52)
4 | The system is awkward when the user interacts with a non-standard or unexpected input. | 2.8 (0.84)
5 | The user is satisfied by his/her experience. | 3.0 (0.00)
6 | The user would recommend the system. | 3.2 (0.84)
7 | The system has a fluent dialogue. | 2.8 (0.84)
8 | The system is charming. | 3.4 (0.90)
9 | The user enjoyed the time s/he spent using the software. | 3.8 (0.84)
10 | The system is flexible to the user's needs. | 3.6 (0.55)

Table 1: IDIAL user ratings of their experience: the average scores are provided on a 1-5 Likert scale, with standard deviation in parentheses.

4.2 IDIAL Stress Tests

The second evaluation stage in the IDIAL protocol consists of a set of linguistic stress tests. We selected 5 dialogues (one for each user) among those successfully completed9 during the user evaluation stage. Following the IDIAL protocol, we modified one sentence in each dialogue, once for each test, as illustrated in (Cutugno et al., 2018), and repeated the dialogue with the modified sentence. The results are reported in Table 2.

9 We considered an interaction as 'successfully completed' if the system recognized and processed correctly all the data given by the user.

Stress Test | Passed
Spelling Substitutions
ST-1 Confused words | 60%
ST-2 Misspelled words | 40%
ST-3 Character replacement | 80%
ST-4 Character swapping | 60%
Lexical Substitutions
ST-5 Less frequent synonyms | 60%
ST-6 Change of register | 40%
ST-7 Coreference | 100%
Syntactic Substitutions
ST-8 Active-Passive alternation | −
ST-9 Nouns-adjectives inversion | −
ST-10 Anaphora resolution | 0%
ST-11 Verbal-modifier inversion | 80%

Table 2: IDIAL stress test results.

Note that we could not perform three stress tests, for distinct reasons. We could not perform the ST-8 test, regarding active-passive alternation, because the users almost always used intransitive verbs (like "andare" ["to go"] and "partire" ["to depart"]). We could not perform the ST-9 test, concerning adjective-noun alternation, since the users used very few adjectives (like the vehicle-type modifier "lussuosa" ["luxurious"]), and no adjectives were used in a successful dialogue. Finally, we could not perform the ST-10 test, concerning anaphora resolution, since at the current stage of development the system never asks the user to pick an answer from a set of options.

4.3 Discussion

With respect to the user evaluation test, a number of considerations arise from the scores. The main issue pointed out by the users during the evaluation phase is the difficulty in grasping when and why the system misunderstood (or lost) some pieces of information, resulting in a relatively poor evaluation score for the fluency of the system (average score of 2.8). The lack of feedback, due to the overly simple way we generate system responses, has worsened this problem, leading the user to repeat the same mistake more than once. The standard deviation of the evaluations given to question 3 shows the high subjectivity of the user experiences with the system, and points out the necessity to equip the system with some form of user model to account for the expectations of different kinds of users. It is worth noting that 4 out of 5 users explicitly stated (in private conversations after the evaluation phase) that they expected longer interactions. Also, they expected to receive more questions from the system, challenging our assumption on the length of dialogues. However, two of the same users added that 7 interactions are enough to evaluate the system.

With respect to the evaluation of the stress tests, we can say that the sentences provided by the users during the interaction with the system were often very short and scarcely usable from the viewpoint of the IDIAL stress tests (especially those concerned with lexical and syntactic aspects). Another source of problems is typos, in particular in expressions regarding times and addresses. While our system seems quite robust to this kind of error (see the first 4 rows of Table 2), it is difficult to automatically deal with them without some domain-specific knowledge of their occurrence and some correction strategies.

As a final note, we want to report some comments given by the users about the questionnaire. Two users expressed some doubts on the interpretation of question 8 and, in general, all of them found it difficult to assign a meaningful evaluation to it. For example, some of the users interpreted the question as regarding the lack of a GUI, absent in our prototype. We think that the ambiguity of the sentence explains the slightly higher standard deviation for that question with respect to the others. Other comments include the lack of diversity between some sentences (like questions 1 and 5, often judged as redundant), and the inadequacy of this Likert scale for evaluating some questions, like 5 and 9: the users consider a more subjective scale ("poco" ["little"] to "molto" ["a lot"]) more appropriate, perceiving the whole process as a single experience.

While the linguistic stress tests can be a valuable tool for the improvement of the system, the questionnaire concerning the user experience should be revised to address some of the criticisms that we collected. In particular, the questionnaire should be augmented with more specific questions.

5 Conclusion and Future Work

We presented the MuMe system, a prototype rule-based dialogue system, and its evaluation through the IDIAL method.

Since the MuMe project is still in development, there is much room for improvement. The most pressing problem to be addressed in future development is the generation of responses more meaningful to the user. The application of a natural language generation pipeline for Italian (e.g. (Mazzei et al., 2016; Mazzei, 2016; Conte et al., 2017; Ghezzi et al., 2018)) could help to this end.

Acknowledgments

This project has been partially supported by the MuMe Project (Muoversi Meglio), funded by the Piedmont Region and the EU in the frame of F.E.S.R. 2014/2020.

References

[Abdul-Kader and Woods2015] Sameera A. Abdul-Kader and J.C. Woods. 2015. Survey on chatbot design techniques in speech conversation systems. International Journal of Advanced Computer Science and Applications, 6(7).

[Aust et al.1995] Harald Aust, Martin Oerder, Frank Seide, and Volker Steinbiss. 1995. The Philips automatic train timetable information system. Speech Communication, 17(3-4):249–262.

[Bobrow et al.1977] Daniel G. Bobrow, Ronald M. Kaplan, Martin Kay, Donald A. Norman, Henry Thompson, and Terry Winograd. 1977. GUS, a frame-driven dialog system. Artificial Intelligence, 8(2):155–173, April.

[Bohlin et al.1999] Peter Bohlin, Johan Bos, Staffan Larsson, Ian Lewin, Colin Matheson, and David Milward. 1999. Survey of existing interactive systems. Deliverable D1, 3:1–23.

[Braunger and Maier2017] Patricia Braunger and Wolfgang Maier. 2017. Natural language input for in-car spoken dialog systems: How natural is natural? In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 137–146.

[Conte et al.2017] Giorgia Conte, Cristina Bosco, and Alessandro Mazzei. 2017. Dealing with Italian adjectives in noun phrase: a study oriented to natural language generation. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017, December.

[Cutugno et al.2018] Francesco Cutugno, Maria Di Maro, Sara Falcone, Marco Guerini, Bernardo Magnini, and Antonio Origlia. 2018. Overview of the EVALITA 2018 evaluation of Italian dialogue systems (IDIAL) task. In EVALITA@CLiC-it.

[Deriu et al.2019] Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2019. Survey on evaluation methods for dialogue systems. arXiv preprint arXiv:1905.04071.

[Ghezzi et al.2018] Ilaria Ghezzi, Cristina Bosco, and Alessandro Mazzei. 2018. Auxiliary selection in Italian intransitive verbs: A computational investigation based on annotated corpora. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 1–6, Berlin. CEUR.

[Hu et al.2018] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touch your heart: A tone-aware chatbot for customer care on social media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 415. ACM.

[Lison2015] Pierre Lison. 2015. A hybrid approach to dialogue management based on probabilistic rules. Computer Speech & Language, 34(1):232–255.

[Liu et al.2016] Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

[Manning et al.2014] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

[Mathur and Singh2018] Vinayak Mathur and Arpit Singh. 2018. The rapidly changing landscape of conversational agents. arXiv preprint arXiv:1803.08419.
[Mazzei et al.2016] Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco. 2016. SimpleNLG-IT: adapting SimpleNLG to Italian. In Proceedings of the 9th International Natural Language Generation Conference, pages 184–192, Edinburgh, UK, September 5-8. Association for Computational Linguistics.

[Mazzei2016] Alessandro Mazzei. 2016. Building a computational lexicon by using SQL. In Pierpaolo Basile, Anna Corazza, Francesco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016, volume 1749, pages 1–5. CEUR-WS.org, December.

[McTear et al.2016] Michael McTear, Zoraida Callejas, and David Griol. 2016. The Conversational Interface: Talking to Smart Devices. Springer Publishing Company, Incorporated, 1st edition.

[Palmero Aprosio and Moretti2016] A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints, September.

[Serban et al.2018] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse, 9(1):1–49.

[Strötgen and Gertz2013] Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298.

[Traum and Larsson2003] David Traum and Staffan Larsson. 2003. The Information State Approach to Dialogue Management. In Current and New Directions in Discourse and Dialogue, pages 325–353. Springer.