Evaluating the MuMe Dialogue System with the IDIAL Protocol

Aureliano Porporato (aureliano.porporato@unito.it)
Alessandro Mazzei (alessandro.mazzei@unito.it)
Rosa Meo (rosa.meo@unito.it)
Daniele P. Radicioni (daniele.radicioni@unito.it)
Università degli Studi di Torino

Abstract

English. In this paper we describe the implementation of the MuMe dialogue system, a task-based dialogue system for a car sharing service, and its evaluation through the IDIAL protocol. Finally, we report some comments on this novel dialogue system evaluation method.[1]

Italiano. In questo lavoro descriviamo l'implementazione del sistema di dialogo MuMe, realizzato per un sistema di car sharing, e la sua valutazione attraverso il protocollo IDIAL. Infine, offriamo alcuni commenti su questo nuovo metodo per la valutazione di sistemi di dialogo.

1 Introduction

Interest in dialogue systems is on the rise in the NLP community (McTear et al., 2016), driven by the strong demand for natural and effective user interaction in applications such as customer care (Hu et al., 2018). A related and central issue is the evaluation of such systems. In this setting, it is well known that most evaluation metrics borrowed from machine translation, which compare a model-generated response to a single target response, exhibit a poor correlation with human judgement (Liu et al., 2016).

In this paper we briefly illustrate a task-oriented dialogue system called MuMe (from "MUoversi MEglio", "travelling better" in English), and examine how far the evaluation protocol IDIAL (Cutugno et al., 2018) is helpful in its assessment. IDIAL is composed of a usability evaluation (performed by a group of users) and an evaluation of the robustness of the dialogue model based on linguistic variations of the successful interactions with the users. The application being tested is a prototype dialogue system that we developed for the reservation of electric vehicles in the context of a car sharing service. A user must be able to interact with the system to specify when and where s/he wants to leave and which sort of vehicle is needed. While there are services and frameworks dedicated to the development of machine-learning-based dialogue systems, such as Google Dialogflow[2] or the open source Rasa[3] framework, the lack of Italian dialogue corpora in the specific domain of car sharing reservations (see, e.g., Serban et al. (2018)) and the impossibility on our part to recruit enough people for the creation of such a corpus forced us to choose a different solution: we developed a simpler and less data-reliant rule-based system, based on slot-filling semantics. Moreover, the decisions made by systems of this kind can be tracked throughout the computation, with the advantage of being quite explainable. This is a desirable feature, since it simplifies the debugging and maintenance of the routines, and allows an easier extension of the system to meet additional requirements.

This paper is mostly concerned with the evaluation of the MuMe system. The structure of the paper is as follows. After surveying related work (Section 2), we briefly introduce the overall architecture and the main components of the MuMe dialogue system (Section 3); we then evaluate MuMe by using the IDIAL protocol, and employ the MuMe experimentation as a case study for giving feedback on the IDIAL protocol itself (Section 4); finally, in the last Section we briefly recap the main contributions of the paper and point to ongoing and future work.

[1] Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
[2] https://dialogflow.com/
[3] https://rasa.com/
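To make the notion of slot-filling semantics concrete, a reservation can be represented as a frame of typed slots. The following sketch is only illustrative: the slot names, the set of mandatory slots and the default vehicle type are modelled on the description given later in Section 3.3, not taken from the actual MuMe implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical frame for a car-sharing reservation; the names are
# illustrative, not the actual MuMe code.
@dataclass
class Reservation:
    start_date: Optional[str] = None   # e.g. "2019-11-14"
    start_time: Optional[str] = None   # e.g. "09:30"
    start_stall: Optional[str] = None  # identifier of the pick-up stall
    end_date: Optional[str] = None
    end_time: Optional[str] = None
    end_stall: Optional[str] = None
    vehicle_type: str = "economy car"  # default when the user is silent

    MANDATORY = ("start_date", "start_time", "start_stall")

    def missing_slots(self):
        """Mandatory slots still to be filled before booking."""
        return [s for s in self.MANDATORY if getattr(self, s) is None]

r = Reservation(start_date="2019-11-14")
print(r.missing_slots())  # ['start_time', 'start_stall']
```

A dialogue built on such a frame only needs to keep asking questions until `missing_slots()` is empty.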
2 Related Work

The pioneering work of Bobrow et al. (1977) proposed the frame-based architecture that most task-based dialogue systems implement. The basic idea is to abandon the demanding goal of a genuine logical representation of the dialogue meaning and to adopt a simpler slot-filling semantics. In some sense, the event-entity representation of modern neural-based dialogue system frameworks can be seen as an ultimate evolution of that simplification idea. Aust et al. (1995) presented a rule-based system, to some extent similar to ours in purpose and structure, created for a train-seat reservation project. Their system has to grasp the names of cities, train stations, dates and times, and it is able to perform quite sophisticated temporal information processing. Further rule-based systems are reviewed in the survey by Abdul-Kader and Woods (2015).

A different class of dialogue systems is based on neural networks. A survey on this class of systems can be found in (Mathur and Singh, 2018).

Regarding the evaluation of dialogue systems, the work by Bohlin et al. (1999) proposes the Trindi Tick-list, a wish list of the desired dialogue behaviour and features, specified as a checklist of "yes-no" questions. As regards this approach, Braunger and Maier (2017) argue that standardised evaluation models do not enable a complete evaluation of a dialogue system. Rather, they suggest that such evaluation must take into account the natural flow of the interaction between the user and the system itself; such a measure involves many language- and user-dependent factors, such as the length of the user utterances. These principles were tested in human-computer vocal interactions occurring on board of vehicles. Further information on dialogue system evaluation methods can be found in the survey by Deriu et al. (2019).

3 The MuMe system architecture

In Figure 1 we depict the basic architecture of the MuMe dialogue system. The information flow starts from a sentence typed by the user: this sentence is handled by the OpenDial system (see Section 3.1), which plays both the role of dialogue manager and of system orchestrator. The sentence is syntactically parsed and semantically analyzed by an IE module (see Section 3.2). At this point, the result of the processing is converted into a slot-filling form. When control returns to OpenDial, it generates an answer and returns it to the user on the basis of a dialogue control strategy (see Section 3.3).

[Figure 1: The schematic architecture of the MuMe dialogue system. Components: OpenDial (dialogue manager); IE module with preprocessing (linguistic analysis with Tint, temporal expression extraction with HeidelTime, geographic expression extraction with geocoding and geolocation via the Google Maps API), domain-specific IE (space & time inference, slot filling) and postprocessing (response generation); backend server.]

3.1 The OpenDial Dialogue Manager

The main component of our software architecture is the OpenDial open source framework for dialogue management (Lison, 2015). The system, which was designed for speech interaction, adopts the information state approach for modelling the state of the dialogue (Traum and Larsson, 2003), that is, a collection of variables representing the actual state of the system. The transition between states, i.e. the change of the variable values, is governed by the activation of a set of "if-then-else" rules on input values as well as on the variation of some variables. Indeed, OpenDial uses these rules to model the sub-tasks of user utterance understanding, dialogue management and response generation.

Moreover, the integration of the system with external tools is simple. We exploited this capability in MuMe, since for language understanding we used a module based on an external parser (see below). Additionally, the OpenDial framework implements some statistical techniques to deal with uncertainty. This is a way to learn interaction models from existing dialogues. This feature is particularly important for speech-based dialogue systems, where uncertain information arises from automatic speech recognition. However, at this stage of the MuMe project we did not use this feature, since we were working on written texts only.

3.2 Parsing and Information Extraction

In order to assign semantic roles to the entities in the dialogues, we decided to use a syntactic parser on the text inserted by the user.

As our main parsing module we used Tint (The Italian NLP Tool) (Palmero Aprosio and Moretti, 2016), a framework modeled on Stanford CoreNLP (Manning et al., 2014). Tint performs some fundamental processing of user utterances, such as dependency parsing, Named Entity Recognition and the extraction of temporal expressions. In particular, these tasks are executed by interfacing with external tools.

For the recognition of temporal expressions (such as dates and times), Tint integrates the services provided by HeidelTime (Strötgen and Gertz, 2013). HeidelTime allows the extraction of various sorts of temporal expressions in various languages, including Italian, and represents them in the standard TIMEX3 format.

For the treatment of geographic expressions, Tint interfaces with the Nominatim wrapper.[4] However, this (free and open source) service performs poorly in geocoding (i.e., in searching the GPS coordinates of a given address). As a consequence, we decided to use the Google Maps API[5], which provides better performance. Indeed, Maps offers an API for address autocomplete, once this information piece has been isolated from the rest of the sentence, and for geolocation (i.e., searching the coordinates of the user), too.

[4] http://nominatim.org/
[5] https://cloud.google.com/maps-platform/
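As an illustration of the kind of normalisation HeidelTime performs, a relative expression such as "domani" ("tomorrow") is resolved against a reference date and rendered as a TIMEX3-style DATE value. The sketch below is our own toy reimplementation of this single case; the real HeidelTime is rule-file driven and covers far more expression types.

```python
from datetime import date, timedelta

# Toy normalisation in the spirit of HeidelTime: resolve the Italian
# expression "domani" ("tomorrow") to a TIMEX3-style DATE value
# (YYYY-MM-DD), relative to a given reference date. Illustrative only.
def normalize_tomorrow(reference: date) -> str:
    """Return the TIMEX3 value for "domani" w.r.t. a reference date."""
    return (reference + timedelta(days=1)).isoformat()

print(normalize_tomorrow(date(2019, 11, 13)))  # 2019-11-14
```

The resulting string is exactly what a slot such as the start date expects, which is why a standard format like TIMEX3 is convenient at the interface between the IE module and the slot-filling form.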
3.3 Dialogue Control Strategy

The simple control strategy implemented, which governs the moves of the dialogue, is based on the fulfillment of a number of mandatory slots in the domain-specific slot-filling semantics adopted for the car reservation domain.

In particular, the mandatory slots are the start date, the start time and the start stall (which encodes the start position). Indeed, the simplest reservation in MuMe needs only these pieces of information: a person reserves a standard car, starting at a specific time of a specific day from a specific stall, and will return the car to the same stall, without the need to specify the return date and time.

However, more complex reservations need more information, encoded in the non-mandatory slots of end date, end time, end stall and vehicle type. For example, the user can choose between three types of vehicles, but if the kind of vehicle is not specified, the system assigns a default 'economy car' to the vehicle type slot.

The MuMe system adopts a mixed initiative for dialogue handling. Although the dialogue is overall system-driven, the user starts the conversation by possibly providing some initial information. Richer initial information is expected to result in a shorter dialogue interaction. Indeed, a design goal of the MuMe system is to produce a dialogue as short as possible. For this reason, also in the subsequent interactions, if the user gives various pieces of information in a single utterance, the system can extract all such information and is able to assign each filler to the corresponding slot, thus avoiding further unnecessary questions.

When the user begins the interaction with the MuMe system, the system replies with a welcome message and with a general question aiming at encouraging the user to start the interaction in the most natural way.

In order to give more details on the control strategy, we now consider the following running example and its processing in MuMe (see Figure 1):

(it) "User: Ho bisogno di un'auto domani per andare in via Pessinetto"
(en) "User: I need a car tomorrow to go to Pessinetto street"[6]

The Information Extraction phase detects a date (through HeidelTime) and an address (extracted through a basic set of custom rules) in the user sentence. By means of other rules that check the shape of the dependency tree (obtained through Tint), the date and the address are labelled as start date and end address. Particularly relevant in this case is the verb "andare" ("to go"), which signals that the following address is where the user wants to arrive, and not a starting point. In the post-processing phase some additional information can be inferred, like the value of the start address, left unspecified by the user: it can be selected by retrieving the GPS coordinates of the address by means of the Google Maps API. Once the user's current location has been identified, the nearest stall is selected as the start stall.

At the end of this processing, the system has successfully filled the start address, start stall, end address, end stall and start date slots.

[6] The English version of the user and system sentences is given for clarity. The system is available in Italian only.
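A rule of the kind used to label the address in the running example can be sketched as follows. This is a toy reimplementation over a pre-extracted (address, governing verb) pair, not the actual Tint-based module; the verb list and the fallback to the start address are our own simplifications.

```python
# Toy version of the role-labelling rule from the running example:
# if the address depends on a motion verb such as "andare" ("to go"),
# it is the destination (end address). The fallback to "start_address"
# and the verb inventory are illustrative simplifications.
MOTION_VERBS = {"andare", "arrivare"}  # hypothetical list

def label_address(address: str, governing_verb: str) -> str:
    """Return the slot name for an address given its head verb."""
    if governing_verb in MOTION_VERBS:
        return "end_address"
    return "start_address"

# "... per andare in via Pessinetto" -> destination
print(label_address("via Pessinetto", "andare"))  # end_address
```

In the actual system the decision is taken on the shape of the dependency tree produced by Tint, but the underlying logic is the same: the governing verb determines the semantic role of the address.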
Some mandatory slots are still left unfilled, such as the start time, so the system will ask the user to provide the missing information. As a consequence, the response of the system will be a question selected from a fixed list based on the unfilled slots: in this specific example, the system will continue by asking for the departure time.

At the end of the filling phase of the mandatory slots, the system gives the user the possibility to modify the request and to correct possible errors and misunderstandings. The slot-filling values are then sent to a dedicated server for the finalization of the reservation.

4 Evaluation

In order to have a first preliminary evaluation of the MuMe system, we applied the Trindi Tick-list protocol, that is, a set of "yes-no" questions concerning specific capabilities of the developed system (Bohlin et al., 1999). While this simple questionnaire is helpful in the development phase, since it is able to give a measure of the system limits, it is not suitable to completely evaluate the actual experience of the user. At this stage of development, the MuMe system has a Trindi score of six over twelve with respect to the (original) list. Among the six features not yet implemented there are complex tasks, such as the management of help and non-help sub-dialogues, dealing with negative information, and dealing with noisy input.

In the rest of the Section, we report the results obtained by applying the IDIAL evaluation protocol to the current version of the MuMe system. The protocol is split into a questionnaire concerning the user experience (Section 4.1) and a number of stress tests concerning the linguistic robustness of the system (Section 4.2).

4.1 IDIAL User Evaluation

A group of 5 subjects (3 males, 2 females; aged 19, 22, 25, 26 and 61) was recruited for the evaluation task by personal invitation and without rewards. After a brief oral description of the domain and of the basic mechanisms of interaction with the system, each user was asked to generate 7 complete dialogues with the system in a controlled environment. We asked the users to simulate the process of reserving a car without other specific constraints.

In Table 1 we report the ten questions of the IDIAL user test with the average score, obtained by using a five-point Likert scale.[7] Note that questions 3, 4, 7 and 10 have been designed to evaluate the effectiveness of the dialogue system, while questions 1 and 2 regard the system efficiency.[8]

N   Sentence                                                                                Evaluation
1   The system was efficient in accomplishing the task.                                     3.2 (0.45)
2   The system quickly provided all the information that the user needed.                   3.6 (0.55)
3   The system is easy to use.                                                              3.6 (1.52)
4   The system is awkward when the user interacts with a non-standard or unexpected input.  2.8 (0.84)
5   The user is satisfied by his/her experience.                                            3.0 (0.00)
6   The user would recommend the system.                                                    3.2 (0.84)
7   The system has a fluent dialogue.                                                       2.8 (0.84)
8   The system is charming.                                                                 3.4 (0.90)
9   The user enjoyed the time s/he spent using the software.                                3.8 (0.84)
10  The system is flexible to the user's needs.                                             3.6 (0.55)

Table 1: IDIAL user ratings of their experience: the average scores are provided on a 1-5 Likert scale, with standard deviation in parentheses.

4.2 IDIAL Stress Tests

The second evaluation stage in the IDIAL protocol consists of a set of linguistic stress tests. We selected 5 dialogues (one for each user) among those successfully completed[9] during the user evaluation stage. Following the IDIAL protocol, we modified one sentence in each dialogue, once for each test, as illustrated in (Cutugno et al., 2018), and repeated the dialogue with the modified sentence. The results are reported in Table 2.

        Stress Test                    Passed
Spelling Substitutions
ST-1    Confused words                 60%
ST-2    Misspelled words               40%
ST-3    Character replacement          80%
ST-4    Character swapping             60%
Lexical Substitutions
ST-5    Less frequent synonyms         60%
ST-6    Change of register             40%
ST-7    Coreference                    100%
Syntactic Substitutions
ST-8    Active-Passive alternation     −
ST-9    Nouns-adjectives inversion     −
ST-10   Anaphora resolution            0%
ST-11   Verbal-modifier inversion      80%

Table 2: IDIAL stress test results.

Note that we could not perform three of the stress tests, for distinct reasons. We could not perform the ST-8 test, regarding active-passive alternation, because the users almost always used intransitive verbs (like "andare" ["to go"] and "partire" ["to depart"]). We could not perform the ST-9 test, concerning adjective-noun alternation, since the users used very few adjectives (like the vehicle-type modifier "lussuosa" ["luxurious"]), and no adjectives were used in a successful dialogue. Finally, we could not perform the ST-10 test, concerning anaphora resolution, since at the current stage of development the system never asks the user to pick an answer from a set of options.

[7] We used the Italian version of the questionnaire, found in Appendix A of https://tinyurl.com/yxngqkx4, but for the sake of readability in Table 1 we report the English version.
[8] The answers of each subject are available at https://tinyurl.com/y6nruwon
[9] We considered an interaction as 'successfully completed' if the system recognized and processed correctly all the data given by the user.
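Perturbations of the ST-3/ST-4 kind can also be generated automatically when preparing such robustness tests. The helpers below are our own sketch, not part of the IDIAL tooling; the alphabet and the fixed random seed are illustrative choices.

```python
import random

def swap_adjacent(word: str, i: int) -> str:
    """ST-4-style character swapping: exchange characters i and i+1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def replace_char(word: str, i: int,
                 alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> str:
    """ST-3-style character replacement at position i.

    A fixed seed keeps the perturbation reproducible across runs,
    which matters when the same modified dialogue must be repeated.
    """
    rng = random.Random(0)  # illustrative choice of seed
    wrong = rng.choice([c for c in alphabet if c != word[i]])
    return word[:i] + wrong + word[i + 1:]

print(swap_adjacent("domani", 2))  # doamni
```

Applying such perturbations to time and address expressions is exactly where, as discussed below, typos hurt the system the most.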
4.3 Discussion

With respect to the user evaluation test, a number of considerations arise from the scores. The main issue pointed out by the users during the evaluation phase is the difficulty in grasping when and why the system misunderstood (or lost) some pieces of information, resulting in a relatively poor evaluation score for the fluency of the system (average score of 2.8). The lack of feedback, due to the overly simple way we used to generate system responses, worsened this problem, leading users to repeat the same mistake more than once. The standard deviation of the evaluations given to question 3 shows the high subjectivity of the user experiences with the system, and points out the necessity to equip the system with some form of user model to account for the expectations of different kinds of users. It is worth noting that 4 out of 5 users explicitly stated (in private conversations after the evaluation phase) that they expected longer interactions. Also, they expected to receive more questions from the system, challenging our assumption on the length of dialogues. However, two of the same users added that 7 interactions are enough to evaluate the system.

With respect to the evaluation of the stress tests, we can say that the sentences provided by the users during the interaction with the system were often very short and scarcely usable from the viewpoint of the IDIAL stress tests (especially those concerned with lexical and syntactic aspects). Another source of problems are typos, in particular in expressions regarding times and addresses. While our system seems quite robust to this kind of error (see the first 4 rows of Table 2), it is difficult to deal with typos automatically without some domain-specific knowledge about their occurrence and some correction strategies.

As a final note, we want to report some comments given by the users about the questionnaire. Two users expressed doubts on the interpretation of question 8 and, in general, all of them found it difficult to assign a meaningful evaluation to it. For example, some of the users interpreted the question as regarding the lack of a GUI, absent in our prototype. We think that the ambiguity of the sentence explains the slightly higher standard deviation for that question with respect to the others. Other comments include the lack of diversity between some sentences (like questions 1 and 5, often judged as redundant), and the inadequacy of this Likert scale for evaluating some questions, like 5 and 9: the users consider a more subjective scale ("poco" ["few"] - "molto" ["a lot"]) more appropriate, perceiving the whole process as a single experience.

While the linguistic stress tests can be a valuable tool for the improvement of the system, the questionnaire concerning the user experience should be revised to address some criticisms that we collected. In particular, the questionnaire should be augmented with more specific questions.

5 Conclusion and Future Work

We presented the MuMe system, a prototype rule-based dialogue system, and its evaluation through the IDIAL method.

Since the MuMe project is still in development, there is much room for improvement. The most pressing problem to be addressed in future development is the generation of responses more meaningful to the user. The application of a natural language generation pipeline for Italian (e.g. (Mazzei et al., 2016; Mazzei, 2016; Conte et al., 2017; Ghezzi et al., 2018)) could help to this end.
Acknowledgments

This project has been partially supported by the MuMe Project (Muoversi Meglio), funded by the Piedmont Region and the EU in the frame of the F.E.S.R. 2014/2020 programme.

References

[Abdul-Kader and Woods 2015] Sameera A. Abdul-Kader and J.C. Woods. 2015. Survey on chatbot design techniques in speech conversation systems. International Journal of Advanced Computer Science and Applications, 6(7).

[Aust et al. 1995] Harald Aust, Martin Oerder, Frank Seide, and Volker Steinbiss. 1995. The Philips automatic train timetable information system. Speech Communication, 17(3-4):249–262.

[Bobrow et al. 1977] Daniel G. Bobrow, Ronald M. Kaplan, Martin Kay, Donald A. Norman, Henry Thompson, and Terry Winograd. 1977. GUS, a frame-driven dialog system. Artificial Intelligence, 8(2):155–173, April.

[Bohlin et al. 1999] Peter Bohlin, Johan Bos, Staffan Larsson, Ian Lewin, Colin Matheson, and David Milward. 1999. Survey of existing interactive systems. Deliverable D1, 3:1–23.

[Braunger and Maier 2017] Patricia Braunger and Wolfgang Maier. 2017. Natural language input for in-car spoken dialog systems: How natural is natural? In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 137–146.

[Conte et al. 2017] Giorgia Conte, Cristina Bosco, and Alessandro Mazzei. 2017. Dealing with Italian adjectives in noun phrase: a study oriented to natural language generation. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

[Cutugno et al. 2018] Francesco Cutugno, Maria Di Maro, Sara Falcone, Marco Guerini, Bernardo Magnini, and Antonio Origlia. 2018. Overview of the EVALITA 2018 evaluation of Italian dialogue systems (IDIAL) task. In EVALITA@CLiC-it.

[Deriu et al. 2019] Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2019. Survey on evaluation methods for dialogue systems. arXiv preprint arXiv:1905.04071.

[Ghezzi et al. 2018] Ilaria Ghezzi, Cristina Bosco, and Alessandro Mazzei. 2018. Auxiliary selection in Italian intransitive verbs: A computational investigation based on annotated corpora. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 1–6, Berlin. CEUR.

[Hu et al. 2018] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touch your heart: A tone-aware chatbot for customer care on social media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 415. ACM.

[Lison 2015] Pierre Lison. 2015. A hybrid approach to dialogue management based on probabilistic rules. Computer Speech & Language, 34(1):232–255.

[Liu et al. 2016] Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

[Manning et al. 2014] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

[Mathur and Singh 2018] Vinayak Mathur and Arpit Singh. 2018. The rapidly changing landscape of conversational agents. arXiv preprint arXiv:1803.08419.

[Mazzei et al. 2016] Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco. 2016. SimpleNLG-IT: adapting SimpleNLG to Italian. In Proceedings of the 9th International Natural Language Generation Conference, pages 184–192, Edinburgh, UK, September 5-8. Association for Computational Linguistics.

[Mazzei 2016] Alessandro Mazzei. 2016. Building a computational lexicon by using SQL. In Pierpaolo Basile, Anna Corazza, Francesco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016), Napoli, Italy, December 5-7, 2016, volume 1749, pages 1–5. CEUR-WS.org, December.

[McTear et al. 2016] Michael McTear, Zoraida Callejas, and David Griol. 2016. The Conversational Interface: Talking to Smart Devices. Springer Publishing Company, Incorporated, 1st edition.

[Palmero Aprosio and Moretti 2016] A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints, September.

[Serban et al. 2018] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse, 9(1):1–49.

[Strötgen and Gertz 2013] Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298.

[Traum and Larsson 2003] David Traum and Staffan Larsson. 2003. The Information State Approach to Dialogue Management. In Current and New Directions in Discourse and Dialogue, pages 325–353. Springer.