Evaluating the MuMe Dialogue System with the IDIAL Protocol

Aureliano Porporato (aureliano.porporato@unito.it)
Alessandro Mazzei (alessandro.mazzei@unito.it)
Rosa Meo (rosa.meo@unito.it)
Daniele P. Radicioni (daniele.radicioni@unito.it)
Università degli Studi di Torino

Abstract

English. In this paper we describe the implementation of the MuMe dialogue system, a task-based dialogue system for a car sharing service, and its evaluation through the IDIAL protocol. Finally, we report some comments on this novel dialogue system evaluation method.[1]

Italiano. In questo lavoro descriviamo l'implementazione del sistema di dialogo MuMe, realizzato per un sistema di car sharing, e la sua valutazione attraverso il protocollo IDIAL. Infine, offriamo alcuni commenti su questo nuovo metodo per la valutazione di sistemi di dialogo.

1 Introduction

Interest in dialogue systems is on the rise in the NLP community (McTear et al., 2016), driven by the strong demand for natural and effective user interaction in applications such as customer care (Hu et al., 2018). A related and central issue is the evaluation of such systems. In this setting, it is well known that most evaluation metrics borrowed from machine translation, which compare a model-generated response to a single target response, exhibit a poor correlation with human judgement (Liu et al., 2016).

In this paper we briefly illustrate a task-oriented dialogue system called MuMe (from "MUoversi MEglio", "travelling better" in English), and examine how far the evaluation protocol IDIAL (Cutugno et al., 2018) is helpful in its assessment. IDIAL is composed of a usability evaluation (performed by a group of users) and an evaluation of the robustness of the dialogue model based on linguistic variations of the successful interactions with the users. The application being tested is a prototype dialogue system that we developed for the reservation of electric vehicles in the context of a car sharing service. A user must be able to interact with the system to specify when and where s/he wants to leave and which sort of vehicle is needed. While there are services and frameworks dedicated to the development of machine-learning-based dialogue systems, such as Google Dialogflow[2] or the open source Rasa[3] framework, the lack of Italian dialogue corpora in the specific domain of car sharing reservations (see, e.g., Serban et al. (2018)) and the impossibility on our part to recruit enough people for the creation of such a corpus forced us to choose a different solution: we developed a simpler and less data-reliant rule-based system, based on slot-filling semantics. Moreover, the decisions made by systems of this kind can be tracked throughout the computation, with the advantage of being quite explainable. This is a desirable feature, since it simplifies the debugging and maintenance of the routines, and allows an easier extension of the system to meet additional requirements.

This paper is mostly concerned with the evaluation of the MuMe system. The structure of the paper is as follows. After surveying related work (Section 2), we briefly introduce the overall architecture and the main components of the MuMe dialogue system (Section 3); we then evaluate MuMe by using the IDIAL protocol, and employ the MuMe experimentation as a case study for giving feedback on the IDIAL protocol itself (Section 4); finally, in the last Section we briefly recap the main contributions of the paper and point to ongoing and future work.

[1] Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
[2] https://dialogflow.com/
[3] https://rasa.com/
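To make the notion of slot-filling semantics concrete, a reservation can be represented as a frame of typed slots. The following sketch is only illustrative: the slot names, the set of mandatory slots and the default vehicle type are modelled on the description given later in Section 3.3, not taken from the actual MuMe implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical frame for a car-sharing reservation; the names are
# illustrative, not the actual MuMe code.
@dataclass
class Reservation:
    start_date: Optional[str] = None   # e.g. "2019-11-14"
    start_time: Optional[str] = None   # e.g. "09:30"
    start_stall: Optional[str] = None  # identifier of the pick-up stall
    end_date: Optional[str] = None
    end_time: Optional[str] = None
    end_stall: Optional[str] = None
    vehicle_type: str = "economy car"  # default when the user is silent

    MANDATORY = ("start_date", "start_time", "start_stall")

    def missing_slots(self):
        """Mandatory slots still to be filled before booking."""
        return [s for s in self.MANDATORY if getattr(self, s) is None]

r = Reservation(start_date="2019-11-14")
print(r.missing_slots())  # ['start_time', 'start_stall']
```

A dialogue built on such a frame only needs to keep asking questions until `missing_slots()` is empty.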
2 Related Work

The pioneering work of Bobrow et al. (1977) proposed the frame-based architecture that most task-based dialogue systems implement. The basic idea is to abandon the demanding goal of a genuine logical representation of the dialogue meaning and to adopt a simpler slot-filling semantics. In some sense, the event-entity representation of modern neural-based dialogue system frameworks can be seen as an ultimate evolution of that simplification idea. Aust et al. (1995) presented a rule-based system, to some extent similar to ours in purpose and structure, created for a train-seat reservation project. Their system has to grasp the names of cities, train stations, dates and times, and it is able to perform quite sophisticated temporal information processing. Further rule-based systems are reviewed in the survey by Abdul-Kader and Woods (2015).

A different class of dialogue systems is based on neural networks. A survey on this class of systems can be found in (Mathur and Singh, 2018).

Regarding the evaluation of dialogue systems, the work by Bohlin et al. (1999) proposes the Trindi Tick-list, a wish list of the desired dialogue behaviour and features, specified as a checklist of "yes-no" questions. As regards this approach, Braunger and Maier (2017) argue that standardised evaluation models do not enable a complete evaluation of a dialogue system. Rather, they suggest that such evaluation must take into account the natural flow of the interaction between the user and the system itself; such a measure involves many language- and user-dependent factors, such as the length of the user utterances. These principles were tested in human-computer vocal interactions occurring on board of vehicles. Further information on dialogue system evaluation methods can be found in the survey by Deriu et al. (2019).

3 The MuMe system architecture

In Figure 1 we depict the basic architecture of the MuMe dialogue system. The information flow starts from a sentence typed by the user: this sentence is handled by the OpenDial system (see Section 3.1), which plays both the role of dialogue manager and of system orchestrator. The sentence is syntactically parsed and semantically analyzed by an IE module (see Section 3.2). At this point, the result of the processing is converted into a slot-filling form. When control returns to OpenDial, it generates an answer and returns it to the user on the basis of a dialogue control strategy (see Section 3.3).

[Figure 1: The schematic architecture of the MuMe dialogue system. Components: OpenDial (dialogue manager); IE module with preprocessing (linguistic analysis with Tint, temporal expression extraction with HeidelTime, geographic expression extraction with geocoding and geolocation via the Google Maps API), domain-specific IE (space & time inference, slot filling) and postprocessing (response generation); backend server.]

3.1 The OpenDial Dialogue Manager

The main component of our software architecture is the OpenDial open source framework for dialogue management (Lison, 2015). The system, which was designed for speech interaction, adopts the information state approach for modelling the state of the dialogue (Traum and Larsson, 2003), that is, a collection of variables representing the actual state of the system. The transition between states, i.e. the change of the variable values, is governed by the activation of a set of "if-then-else" rules on input values as well as on the variation of some variables. Indeed, OpenDial uses these rules to model the sub-tasks of user utterance understanding, dialogue management and response generation.

Moreover, the integration of the system with external tools is simple. We exploited this capability in MuMe, since for language understanding we used a module based on an external parser (see below). Additionally, the OpenDial framework implements some statistical techniques to deal with uncertainty. This is a way to learn interaction models from existing dialogues. This feature is particularly important for speech-based dialogue systems, where uncertain information arises from automatic speech recognition. However, at this stage of the MuMe project we did not use this feature, since we were working on written texts only.

3.2 Parsing and Information Extraction

In order to assign semantic roles to the entities in the dialogues, we decided to use a syntactic parser on the text inserted by the user.

As our main parsing module we used Tint (The Italian NLP Tool) (Palmero Aprosio and Moretti, 2016), a framework modeled on Stanford CoreNLP (Manning et al., 2014). Tint performs some fundamental processing of user utterances, such as dependency parsing, Named Entity Recognition and the extraction of temporal expressions. In particular, these tasks are executed by interfacing with external tools.

For the recognition of temporal expressions (such as dates and times), Tint integrates the services provided by HeidelTime (Strötgen and Gertz, 2013). HeidelTime allows the extraction of various sorts of temporal expressions in various languages, including Italian, and represents them in the standard TIMEX3 format.

For the treatment of geographic expressions, Tint interfaces with the Nominatim wrapper.[4] However, this (free and open source) service performs poorly in geocoding (i.e., in searching the GPS coordinates of a given address). As a consequence, we decided to use the Google Maps API[5], which provides better performance. Indeed, Maps offers an API for address autocomplete, once this information piece has been isolated from the rest of the sentence, and for geolocation (i.e., searching the coordinates of the user), too.

[4] http://nominatim.org/
[5] https://cloud.google.com/maps-platform/
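As an illustration of the kind of normalisation HeidelTime performs, a relative expression such as "domani" ("tomorrow") is resolved against a reference date and rendered as a TIMEX3-style DATE value. The sketch below is our own toy reimplementation of this single case; the real HeidelTime is rule-file driven and covers far more expression types.

```python
from datetime import date, timedelta

# Toy normalisation in the spirit of HeidelTime: resolve the Italian
# expression "domani" ("tomorrow") to a TIMEX3-style DATE value
# (YYYY-MM-DD), relative to a given reference date. Illustrative only.
def normalize_tomorrow(reference: date) -> str:
    """Return the TIMEX3 value for "domani" w.r.t. a reference date."""
    return (reference + timedelta(days=1)).isoformat()

print(normalize_tomorrow(date(2019, 11, 13)))  # 2019-11-14
```

The resulting string is exactly what a slot such as the start date expects, which is why a standard format like TIMEX3 is convenient at the interface between the IE module and the slot-filling form.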
3.3 Dialogue Control Strategy

The simple control strategy implemented, which governs the moves of the dialogue, is based on the fulfillment of a number of mandatory slots in the domain-specific slot-filling semantics adopted for the car reservation domain.

In particular, the mandatory slots are the start date, the start time and the start stall (which encodes the start position). Indeed, the simplest reservation in MuMe needs only these pieces of information: a person reserves a standard car, starting at a specific time of a specific day from a specific stall, and will return the car to the same stall, without the need to specify the return date and time.

However, more complex reservations need more information, encoded in the non-mandatory slots of end date, end time, end stall and vehicle type. For example, the user can choose between three types of vehicles, but if the kind of vehicle is not specified, the system assigns a default 'economy car' to the vehicle type slot.

The MuMe system adopts a mixed initiative for dialogue handling. Although the dialogue is overall system-driven, the user starts the conversation by possibly providing some initial information. Richer initial information is expected to result in a shorter dialogue interaction. Indeed, a design goal of the MuMe system is to produce a dialogue as short as possible. For this reason, also in the subsequent interactions, if the user gives various pieces of information in a single utterance, the system can extract all such information and is able to assign each filler to the corresponding slot, thus avoiding further unnecessary questions.

When the user begins the interaction with the MuMe system, the system replies with a welcome message and with a general question aiming at encouraging the user to start the interaction in the most natural way.

In order to give more details on the control strategy, we now consider the following running example and its processing in MuMe (see Figure 1):

(it) "User: Ho bisogno di un'auto domani per andare in via Pessinetto"
(en) "User: I need a car tomorrow to go to Pessinetto street"[6]

The Information Extraction phase detects a date (through HeidelTime) and an address (extracted through a basic set of custom rules) in the user sentence. By means of other rules that check the shape of the dependency tree (obtained through Tint), the date and the address are labelled as start date and end address. Particularly relevant in this case is the verb "andare" ("to go"), which signals that the following address is where the user wants to arrive, and not a starting point. In the post-processing phase some additional information can be inferred, like the value of the start address, left unspecified by the user: it can be selected by retrieving the GPS coordinates of the address by means of the Google Maps API. Once the user's current location has been identified, the nearest stall is selected as the start stall.

At the end of this processing, the system has successfully filled the start address, start stall, end address, end stall and start date slots.

[6] The English version of the user and system sentences is given for clarity. The system is available in Italian only.
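A rule of the kind used to label the address in the running example can be sketched as follows. This is a toy reimplementation over a pre-extracted (address, governing verb) pair, not the actual Tint-based module; the verb list and the fallback to the start address are our own simplifications.

```python
# Toy version of the role-labelling rule from the running example:
# if the address depends on a motion verb such as "andare" ("to go"),
# it is the destination (end address). The fallback to "start_address"
# and the verb inventory are illustrative simplifications.
MOTION_VERBS = {"andare", "arrivare"}  # hypothetical list

def label_address(address: str, governing_verb: str) -> str:
    """Return the slot name for an address given its head verb."""
    if governing_verb in MOTION_VERBS:
        return "end_address"
    return "start_address"

# "... per andare in via Pessinetto" -> destination
print(label_address("via Pessinetto", "andare"))  # end_address
```

In the actual system the decision is taken on the shape of the dependency tree produced by Tint, but the underlying logic is the same: the governing verb determines the semantic role of the address.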
Some mandatory slots are still left unfilled, such as the start time, so the system will ask the user to provide the missing information. As a consequence, the response of the system will be a question selected from a fixed list based on the unfilled slots: in this specific example, the system will continue by asking for the departure time.

At the end of the filling phase of the mandatory slots, the system gives the user the possibility to modify the request and to correct possible errors and misunderstandings. The slot-filling values are then sent to a dedicated server for the finalization of the reservation.

4 Evaluation

In order to have a first preliminary evaluation of the MuMe system, we applied the Trindi Tick-list protocol, that is, a set of "yes-no" questions concerning specific capabilities of the developed system (Bohlin et al., 1999). While this simple questionnaire is helpful in the development phase, since it is able to give a measure of the system limits, it is not suitable to completely evaluate the actual experience of the user. At this stage of development, the MuMe system has a Trindi score of six over twelve with respect to the (original) list. Among the six features not yet implemented there are complex tasks, such as the management of help and non-help sub-dialogues, dealing with negative information, and dealing with noisy input.

In the rest of the Section, we report the results obtained by applying the IDIAL evaluation protocol to the current version of the MuMe system. The protocol is split into a questionnaire concerning the user experience (Section 4.1) and a number of stress tests concerning the linguistic robustness of the system (Section 4.2).

4.1 IDIAL User Evaluation

A group of 5 subjects (3 males, 2 females; aged 19, 22, 25, 26 and 61) was recruited for the evaluation task by personal invitation and without rewards. After a brief oral description of the domain and of the basic mechanisms of interaction with the system, each user was asked to generate 7 complete dialogues with the system in a controlled environment. We asked the users to simulate the process of reserving a car without other specific constraints.

In Table 1 we report the ten questions of the IDIAL user test with the average score, obtained by using a five-point Likert scale.[7] Note that questions 3, 4, 7 and 10 have been designed to evaluate the effectiveness of the dialogue system, while questions 1 and 2 regard the system efficiency.[8]

N   Sentence                                                                                Evaluation
1   The system was efficient in accomplishing the task.                                     3.2 (0.45)
2   The system quickly provided all the information that the user needed.                   3.6 (0.55)
3   The system is easy to use.                                                              3.6 (1.52)
4   The system is awkward when the user interacts with a non-standard or unexpected input.  2.8 (0.84)
5   The user is satisfied by his/her experience.                                            3.0 (0.00)
6   The user would recommend the system.                                                    3.2 (0.84)
7   The system has a fluent dialogue.                                                       2.8 (0.84)
8   The system is charming.                                                                 3.4 (0.90)
9   The user enjoyed the time s/he spent using the software.                                3.8 (0.84)
10  The system is flexible to the user's needs.                                             3.6 (0.55)

Table 1: IDIAL user ratings of their experience: the average scores are provided on a 1-5 Likert scale, with standard deviation in parentheses.

4.2 IDIAL Stress Tests

The second evaluation stage in the IDIAL protocol consists of a set of linguistic stress tests. We selected 5 dialogues (one for each user) among those successfully completed[9] during the user evaluation stage. Following the IDIAL protocol, we modified one sentence in each dialogue, once for each test, as illustrated in (Cutugno et al., 2018), and repeated the dialogue with the modified sentence. The results are reported in Table 2.

        Stress Test                    Passed
Spelling Substitutions
ST-1    Confused words                 60%
ST-2    Misspelled words               40%
ST-3    Character replacement          80%
ST-4    Character swapping             60%
Lexical Substitutions
ST-5    Less frequent synonyms         60%
ST-6    Change of register             40%
ST-7    Coreference                    100%
Syntactic Substitutions
ST-8    Active-Passive alternation     −
ST-9    Nouns-adjectives inversion     −
ST-10   Anaphora resolution            0%
ST-11   Verbal-modifier inversion      80%

Table 2: IDIAL stress test results.

Note that we could not perform three of the stress tests, for distinct reasons. We could not perform the ST-8 test, regarding active-passive alternation, because the users almost always used intransitive verbs (like "andare" ["to go"] and "partire" ["to depart"]). We could not perform the ST-9 test, concerning adjective-noun alternation, since the users used very few adjectives (like the vehicle-type modifier "lussuosa" ["luxurious"]), and no adjectives were used in a successful dialogue. Finally, we could not perform the ST-10 test, concerning anaphora resolution, since at the current stage of development the system never asks the user to pick an answer from a set of options.

[7] We used the Italian version of the questionnaire, found in Appendix A of https://tinyurl.com/yxngqkx4, but for the sake of readability in Table 1 we report the English version.
[8] The answers of each subject are available at https://tinyurl.com/y6nruwon
[9] We considered an interaction as 'successfully completed' if the system recognized and processed correctly all the data given by the user.
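Perturbations of the ST-3/ST-4 kind can also be generated automatically when preparing such robustness tests. The helpers below are our own sketch, not part of the IDIAL tooling; the alphabet and the fixed random seed are illustrative choices.

```python
import random

def swap_adjacent(word: str, i: int) -> str:
    """ST-4-style character swapping: exchange characters i and i+1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def replace_char(word: str, i: int,
                 alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> str:
    """ST-3-style character replacement at position i.

    A fixed seed keeps the perturbation reproducible across runs,
    which matters when the same modified dialogue must be repeated.
    """
    rng = random.Random(0)  # illustrative choice of seed
    wrong = rng.choice([c for c in alphabet if c != word[i]])
    return word[:i] + wrong + word[i + 1:]

print(swap_adjacent("domani", 2))  # doamni
```

Applying such perturbations to time and address expressions is exactly where, as discussed below, typos hurt the system the most.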
4.3 Discussion

With respect to the user evaluation test, a number of considerations arise from the scores. The main issue pointed out by the users during the evaluation phase is the difficulty in grasping when and why the system misunderstood (or lost) some pieces of information, resulting in a relatively poor evaluation score for the fluency of the system (average score of 2.8). The lack of feedback, due to the overly simple way we used to generate system responses, worsened this problem, leading users to repeat the same mistake more than once. The standard deviation of the evaluations given to question 3 shows the high subjectivity of the user experiences with the system, and points out the necessity to equip the system with some form of user model to account for the expectations of different kinds of users. It is worth noting that 4 out of 5 users explicitly stated (in private conversations after the evaluation phase) that they expected longer interactions. Also, they expected to receive more questions from the system, challenging our assumption on the length of dialogues. However, two of the same users added that 7 interactions are enough to evaluate the system.

With respect to the evaluation of the stress tests, we can say that the sentences provided by the users during the interaction with the system were often very short and scarcely usable from the viewpoint of the IDIAL stress tests (especially those concerned with lexical and syntactic aspects). Another source of problems are typos, in particular in expressions regarding times and addresses. While our system seems quite robust to this kind of error (see the first 4 rows of Table 2), it is difficult to deal with typos automatically without some domain-specific knowledge about their occurrence and some correction strategies.

As a final note, we want to report some comments given by the users about the questionnaire. Two users expressed doubts on the interpretation of question 8 and, in general, all of them found it difficult to assign a meaningful evaluation to it. For example, some of the users interpreted the question as regarding the lack of a GUI, absent in our prototype. We think that the ambiguity of the sentence explains the slightly higher standard deviation for that question with respect to the others. Other comments include the lack of diversity between some sentences (like questions 1 and 5, often judged as redundant), and the inadequacy of this Likert scale for evaluating some questions, like 5 and 9: the users consider a more subjective scale ("poco" ["few"] - "molto" ["a lot"]) more appropriate, perceiving the whole process as a single experience.

While the linguistic stress tests can be a valuable tool for the improvement of the system, the questionnaire concerning the user experience should be revised to address some criticisms that we collected. In particular, the questionnaire should be augmented with more specific questions.

5 Conclusion and Future Work

We presented the MuMe system, a prototype rule-based dialogue system, and its evaluation through the IDIAL method.

Since the MuMe project is still in development, there is much room for improvement. The most pressing problem to be addressed in future development is the generation of responses more meaningful to the user. The application of a natural language generation pipeline for Italian (e.g. (Mazzei et al., 2016; Mazzei, 2016; Conte et al., 2017; Ghezzi et al., 2018)) could help to this end.
Acknowledgments

This project has been partially supported by the MuMe Project (Muoversi Meglio), funded by the Piedmont Region and the EU in the frame of the F.E.S.R. 2014/2020 programme.

References

[Abdul-Kader and Woods 2015] Sameera A. Abdul-Kader and J.C. Woods. 2015. Survey on chatbot design techniques in speech conversation systems. International Journal of Advanced Computer Science and Applications, 6(7).

[Aust et al. 1995] Harald Aust, Martin Oerder, Frank Seide, and Volker Steinbiss. 1995. The Philips automatic train timetable information system. Speech Communication, 17(3-4):249–262.

[Bobrow et al. 1977] Daniel G. Bobrow, Ronald M. Kaplan, Martin Kay, Donald A. Norman, Henry Thompson, and Terry Winograd. 1977. GUS, a frame-driven dialog system. Artificial Intelligence, 8(2):155–173, April.

[Bohlin et al. 1999] Peter Bohlin, Johan Bos, Staffan Larsson, Ian Lewin, Colin Matheson, and David Milward. 1999. Survey of existing interactive systems. Deliverable D1, 3:1–23.

[Braunger and Maier 2017] Patricia Braunger and Wolfgang Maier. 2017. Natural language input for in-car spoken dialog systems: How natural is natural? In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 137–146.

[Conte et al. 2017] Giorgia Conte, Cristina Bosco, and Alessandro Mazzei. 2017. Dealing with Italian adjectives in noun phrase: a study oriented to natural language generation. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

[Cutugno et al. 2018] Francesco Cutugno, Maria Di Maro, Sara Falcone, Marco Guerini, Bernardo Magnini, and Antonio Origlia. 2018. Overview of the EVALITA 2018 evaluation of Italian dialogue systems (IDIAL) task. In EVALITA@CLiC-it.

[Deriu et al. 2019] Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2019. Survey on evaluation methods for dialogue systems. arXiv preprint arXiv:1905.04071.

[Ghezzi et al. 2018] Ilaria Ghezzi, Cristina Bosco, and Alessandro Mazzei. 2018. Auxiliary selection in Italian intransitive verbs: A computational investigation based on annotated corpora. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 1–6, Berlin. CEUR.

[Hu et al. 2018] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touch your heart: A tone-aware chatbot for customer care on social media. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 415. ACM.

[Lison 2015] Pierre Lison. 2015. A hybrid approach to dialogue management based on probabilistic rules. Computer Speech & Language, 34(1):232–255.

[Liu et al. 2016] Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

[Manning et al. 2014] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

[Mathur and Singh 2018] Vinayak Mathur and Arpit Singh. 2018. The rapidly changing landscape of conversational agents. arXiv preprint arXiv:1803.08419.

[Mazzei et al. 2016] Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco. 2016. SimpleNLG-IT: adapting SimpleNLG to Italian. In Proceedings of the 9th International Natural Language Generation Conference, pages 184–192, Edinburgh, UK, September 5-8. Association for Computational Linguistics.

[Mazzei 2016] Alessandro Mazzei. 2016. Building a computational lexicon by using SQL. In Pierpaolo Basile, Anna Corazza, Francesco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016), Napoli, Italy, December 5-7, 2016, volume 1749, pages 1–5. CEUR-WS.org, December.

[McTear et al. 2016] Michael McTear, Zoraida Callejas, and David Griol. 2016. The Conversational Interface: Talking to Smart Devices. Springer Publishing Company, Incorporated, 1st edition.

[Palmero Aprosio and Moretti 2016] A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints, September.

[Serban et al. 2018] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse, 9(1):1–49.

[Strötgen and Gertz 2013] Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298.

[Traum and Larsson 2003] David Traum and Staffan Larsson. 2003. The Information State Approach to Dialogue Management. In Current and New Directions in Discourse and Dialogue, pages 325–353. Springer.