Overview of the Evalita 2018 itaLIan Speech acT labEliNg (iLISTEN) Task

Pierpaolo Basile and Nicole Novielli
Università degli Studi di Bari Aldo Moro, Dipartimento di Informatica
Via E. Orabona, 4 - 70125 Bari (ITALY)
{pierpaolo.basile|nicole.novielli}@uniba.it

Abstract

English. We describe the first edition of the "itaLIan Speech acT labEliNg" (iLISTEN) task at the EVALITA 2018 campaign (Caselli et al., 2018). The task consists in automatically annotating dialogue turns with speech act labels, i.e. with the communicative intention of the speaker, such as statement, request for information, agreement, opinion expression, or general answer. The task is motivated by the large number of applications that could benefit from automatic speech act annotation of natural language interactions, such as tools for intelligent information access relying on natural dialogue. We received two runs from two teams, one from academia and the other one from industry. In spite of the inherent complexity of the task, both systems largely outperformed the baseline.

Italiano. We describe the first edition of the "itaLIan Speech acT labEliNg" (iLISTEN) task, organized within the EVALITA 2018 evaluation campaign. The task consists in automatically annotating dialogue turns with the corresponding speech act label. Each speech act category denotes the communicative intention of the speaker, i.e. the intention to make an objective statement, express an opinion, request information, provide an answer, or express agreement. We believe the task is relevant to computational linguistics and beyond, in light of the recent interest of the scientific community in dialogue-based paradigms for interaction and intelligent access to information. Two teams took part in the task, one from academia and one from industry. Despite the complexity of the proposed task, both teams largely outperformed the baseline.

1 Introduction

Speech acts have long been investigated in linguistics (Austin, 1962; Searle, 1969) and computational linguistics (Traum, 2000; Stolcke et al., 2000). Specifically, the task of automatic speech act recognition has been addressed leveraging both supervised (Stolcke et al., 2000; Vosoughi and Roy, 2016) and unsupervised approaches (Novielli and Strapparava, 2011). This interest is justified by the large number of applications that could benefit from automatic speech act annotation of natural language interactions.

In particular, a recent research trend has emerged investigating methodologies to enable intelligent access to information by relying on natural dialogue as an interaction metaphor. In this perspective, chat-oriented dialogue systems are attracting increasing attention from both researchers and practitioners interested in the simulation of natural dialogues with embodied conversational agents (Klüwer, 2011), conversational interfaces for smart devices (McTear et al., 2016), and the Internet of Things (Kar and Haldar, 2016). As a consequence, we are witnessing the flourishing of dedicated research venues on chat-oriented interaction, such as WOCHAT, the Special Session on Chatbots and Conversational Agents (http://workshop.colips.org/wochat/@sigdial2017/), now at its second edition, and the Natural Language Generation for Dialogue Systems special session (https://sites.google.com/view/nlg4ds2017), both co-located with the Annual SIGdial Meeting on Discourse and Dialogue.
While not representing any deep understanding of the interaction dynamics, speech acts can be successfully employed as a coding standard for natural dialogue tasks. In this report, we describe the first edition of the "itaLIan Speech acT labEliNg" (iLISTEN) task at the EVALITA 2018 campaign (Caselli et al., 2018). Among the various challenges posed by the problem of enabling conversational access to information, this shared task tackles the recognition of the illocutionary force, i.e. the speech act, of a dialogue turn, that is, the communicative goal of the speaker.

The remainder of the paper is organized as follows. We start by explaining the task in Section 2. In Section 3, we provide a detailed description of the dataset of dialogues, the annotation schema, and the data format and distribution protocol. Then, we report on the evaluation methodology (Section 4) and describe the participating systems and their performance (Section 5). We provide final remarks in Section 6.

2 Task Description

The task consists in automatically annotating dialogue turns with speech act labels, i.e. with the communicative intention of the speaker, such as statement, request for information, agreement, opinion expression, general answer, etc. Table 1 reports the full set of speech act labels used for the classification task, with definitions, examples, and distribution in our corpus. Regarding the evaluation procedure, we assess the ability of each system to issue the correct speech act label among those included in the taxonomy used for annotation, described in Section 3. Please note that the participating systems are requested to issue labels only for the speech acts used for labeling the user's dialogue turns, as further detailed in the following.

3 Development and Test Data

3.1 A Dataset of Dialogues

We leverage the corpus of natural language dialogues collected in the scope of previous research on interaction with Embodied Conversational Agents (ECAs) (Clarizio et al., 2006), in order to speed up the process of building a gold standard. The corpus contains transcripts of 60 dialogues overall, with 1,576 user dialogue turns, 1,611 system turns, and about 22,000 words.

The dialogues were collected using a Wizard of Oz tool as dialogue manager. Sixty subjects (aged between 21 and 28) were involved in the study, under two interaction mode conditions: thirty of them interacted with the system in a written-input setting, using keyboard and mouse; the remaining thirty dialogues were collected with users interacting with the ECA in a spoken-input condition. The dialogues collected in the spoken interaction mode were manually transcribed based on audio recordings of the dialogue sessions.

During the interaction, the ECA played the role of an artificial therapist and the users were free to interact with it in natural language, without any particular constraint: they could simply answer the questions of the agent or take the initiative and ask questions in their turn, make comments about the agent's behavior or competence, or argue for or against the agent's suggestions or persuasion attempts. The Wizard, in turn, had to choose among a set of about 80 predefined possible system moves. As such, the system moves (see Table 2) are provided only as context information; they are not subject to evaluation and do not contribute to the final ranking of the participating systems. Conversely, the participating systems are evaluated on the basis of the performance observed on the user dialogue turns (see Table 1).

3.2 Annotation Schema

Speech acts can be identified with the communicative goal of a given utterance, i.e. they represent its meaning at the level of its illocutionary force (Austin, 1962). In defining dialogue act taxonomies, researchers have been trying to solve the trade-off between the need for formal semantics and the need for computational feasibility, also taking into account the specificity of the many application domains that have been investigated (see (Traum, 2000) for an exhaustive overview). The Dialogue Act Markup in Several Layers (DAMSL) scheme represents an attempt by (Core and Allen, 1997) to define a domain-independent framework for speech act annotation.

Defining a speech act markup language is out of the scope of the present study. Therefore, we adopt the original annotation of the Italian advice-giving dialogues. Table 1 shows the set of nine labels employed for the purpose of this study, with definitions and examples. These labels are used for the annotation of the users' dialogue turns and are the object of classification for this task. In addition, in Table 2 we report the speech act labels used for the dialogue moves of the system, i.e. the conversational agent playing the role of the artificial therapist. The speech act taxonomy refines the DAMSL categories to allow appropriate tagging of the communicative intention with respect to the application domain, i.e. persuasion dialogues in the healthy eating domain.

In Table 3 we provide an excerpt from one of the dialogues in our gold standard. The system moves (dialogue moves and corresponding speech act labels) are chosen from a set of predefined dialogue moves that can be played by the ECA. As such, they are not relevant for the evaluation and ranking of the participating systems and are provided only as contextual information. Conversely, the final ranking of the participating systems is based only on the performance observed on the prediction of speech acts for the users' moves, with respect to the set of labels provided in Table 1.

Table 1: The set of user speech act labels employed in our annotation schema. The participating systems are required to issue a label for the user moves only.

Speech Act | Description | Example | Freq.
OPENING | Dialogue opening or self-introduction | 'Ciao, io sono Antonella' | 2%
CLOSING | Dialogue closing, e.g. farewell, wishes, intention to close the conversation | 'Va bene, ci vediamo prossimamente' | 2%
INFO-REQUEST | Utterances that are pragmatically, semantically, and syntactically questions | 'E cosa mi dici delle vitamine?' | 25%
SOLICIT-REQ-CLARIF | Request for clarification (please explain) or solicitation of system reaction | 'Mmm, si ma in che senso?' | 7%
STATEMENT | Descriptive, narrative, personal statements | 'Penso che dovrei controllare maggiormente il consumo di dolciumi.' | 33%
GENERIC-ANSWER | Generic answer | 'Si', 'No', 'Non so.' | 10%
AGREE-ACCEPT | Expression of agreement, e.g. acceptance of a proposal, plan or opinion | 'Si, so che è importante.' | 5%
REJECT | Expression of disagreement, e.g. rejection of a proposal, plan, or opinion | 'Ho sentito tesi contrastanti al proposito.' | 5%
KIND-ATT-SMALLTALK | Expression of kind attitude through politeness, e.g. thanking, apologizing, or smalltalk | 'Thank you.', 'Sei per caso offesa per qualcosa che ho detto?' | 11%

3.3 Data Format and Distribution

We provide both the training and test dialogues in XML format, following the structure proposed in Figure 1. Each participating team initially had access to the training data only; the unlabeled test data were released later, during the evaluation period. The development and test data sets contain 40 and 20 dialogues, respectively, equally distributed with respect to the interaction mode (text- vs. speech-based interaction).
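As an illustration only, the following sketch shows how dialogues in a format of this kind might be parsed. The element and attribute names (`dialogue`, `turn`, `actor`, `speechAct`) are assumptions modeled on the columns of Table 3, not the actual iLISTEN schema shown in Figure 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure mirroring the columns of Table 3; the real
# iLISTEN schema (Figure 1) may use different element/attribute names.
sample = """<dialogue id="5">
  <turn actor="SYSTEM" id="5-S2" speechAct="INFO-REQUEST">Quali sono le tue abitudini alimentari?</turn>
  <turn actor="USER" id="5-U2" speechAct="STATEMENT">Ho delle abitudini disordinate, mangio anche fuori orario.</turn>
</dialogue>"""

root = ET.fromstring(sample)
# Only USER turns are subject to evaluation; SYSTEM turns serve as context.
user_turns = [(t.get("id"), t.get("speechAct"), t.text)
              for t in root.iter("turn") if t.get("actor") == "USER"]
print(user_turns)
```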
Please note that the two sets of speech act labels for the user and the system moves, in Table 1 and Table 2 respectively, only partially overlap. This is due to the fact that the set of agent's moves also includes speech acts (such as persuasion attempts) that are observed only for the agent, given its caregiver role in the dialogue system. Vice versa, some speech act labels (such as clarification requests) are relevant only for the user moves.

Table 2: The set of system speech act labels in our annotation schema. These labels are provided as context information, i.e. the participating systems are not required to issue a label for the system moves.

Speech Act | Description | Example | Freq.
OPENING | Initial self-introduction by the ECA | 'Ciao, il mio nome è Valentina e sono qui per darti suggerimenti su come migliorare la tua dieta.' | 4%
CLOSING | Dialogue closing, e.g. farewell, wishes, intention to close the conversation | 'Grazie e arrivederci!' | 4%
QUESTION | Question about the user's eating habits or information interests | '...' | 32%
TALK-ABOUT-SELF | Statement describing own abilities, role and skills | 'Non sono in grado di mangiare, e quindi non seguo diete particolari.' | 3%
ANSWER | Provision of generic information after a user request | 'Purtroppo sono poco esperta in questo campo.' | 17%
OFFER-GIVE-INFO | Generic offer of help or specific information about the diet domain | 'Idealmente un pasto dovrebbe contenere da 75 a 100 grammi di carne, un frutto di media grandezza, un piccolo piatto di pasta o 50 grammi di gelato. I cibi pronti permettono di controllare le porzioni riportando sulla confezione il valore energetico in calorie.' | 8%
PERSUASION-SUGGEST | Persuasion attempt performed by the conversational agent about dieting | 'Lo so: mangiare, e forse vivere, in modo disordinato, sembra a molti piacevole, e forse hanno ragione. Ma a lungo andare gli effetti negativi si fanno sentire.' | 24%
ENCOURAGE-SORRY | Statements aimed at enhancing the user motivation or at apologizing | 'Questa è una buona idea!' | 9%

Table 3: An excerpt from one of the dialogues in our dataset.

Actor | ID | Speech Act | Dialogue Move
SYSTEM | 5-S1 | OPENING | Ciao, il mio nome è Valentina. Sono qui per darti dei suggerimenti su come migliorare la tua dieta
USER | 5-U1 | OPENING | Ciao, il mio nome è Oriana.
SYSTEM | 5-S2 | INFO-REQUEST | Quali sono le tue abitudini alimentari?
USER | 5-U2 | STATEMENT | Ho delle abitudini disordinate, mangio anche fuori orario.
SYSTEM | 5-S3 | INFO-REQUEST | Ti piace mangiare?
USER | 5-U3 | STATEMENT | Si, ma dipende dal periodo: se sono sotto stress mangio meno, quando sono più rilassata mangio molto
SYSTEM | 5-S4 | INFO-REQUEST | Segui una dieta variata?
USER | 5-U3 | GENERIC-ANSWER | No.

4 Evaluation

Regarding the evaluation procedure, we assess the ability of each system to issue the correct speech act label for the user moves. The speech act labels used for the annotation of the user moves are reported in Table 1. Specifically, we compute precision, recall, and F1-score (macro-averaged) with respect to our gold standard. This approach, while more verbose than a simple accuracy test, arises from the need to correctly address the unbalanced distribution of labels in the dataset. Furthermore, by providing detailed performance metrics, we intend to foster an interesting discussion on the nature of the problem and the data, as it might emerge from the participants' final reports. As a baseline, we use the most frequent label for the user speech acts (i.e., STATEMENT).
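The macro-averaged scoring described above can be sketched for the majority-class baseline. This is not the official scorer, just an illustrative check: assuming STATEMENT covers 0.3403 of the test turns (the baseline's micro-averaged value in Table 4), the per-class formulas below reproduce the baseline's macro-averaged row.

```python
# Macro-averaged P/R/F for a majority-class baseline over the nine
# user speech act labels of Table 1 (illustrative sketch only).
n_classes = 9
statement_share = 0.3403  # fraction of test turns whose gold label is STATEMENT

# The baseline predicts STATEMENT for every turn, so per class:
#   STATEMENT: precision = its share of the data, recall = 1.0
#   every other class: precision = recall = F = 0.0
prec = [statement_share] + [0.0] * (n_classes - 1)
rec = [1.0] + [0.0] * (n_classes - 1)
f1 = [2 * p * r / (p + r) if p + r > 0 else 0.0 for p, r in zip(prec, rec)]

macro_p = sum(prec) / n_classes
macro_r = sum(rec) / n_classes
macro_f = sum(f1) / n_classes

# Reproduces the baseline row of Table 4: 0.0378 0.1111 0.0564
print(round(macro_p, 4), round(macro_r, 4), round(macro_f, 4))
```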
Figure 1: Data format

5 Participants and Results

The task was open to everyone from industry and academia. Sixteen participants registered, but only two teams actually submitted results for the evaluation. A short description of each system follows:

UNITOR - The system described in (Croce and Basili, 2018) is a supervised system which relies on a Structured Kernel-based Support Vector Machine to make the classification of dialogue turns sensitive to the syntactic and semantic information of each utterance. The structured kernel is a Smoothed Partial Tree Kernel (Croce et al., 2011) that exploits both the parse tree and the cosine similarity between the word vectors in a distributional semantics model. The authors use the parser provided by SpaCy (https://spacy.io/) and the KeLP framework, a Java Kernel-based Learning Platform (http://www.kelp-ml.org/), for the SVM.

X2Check - The team did not submit a report.

The performance of the participating systems is evaluated based on the macro- (and micro-) averaged precision and recall (Sebastiani, 2002). However, the official task measure used to rank the systems is the macro-F. Results are reported in Table 4.

Table 4: Overall micro- and macro-averaged Precision, Recall, and F-score for the participating systems.

System | Micro Prec | Micro Rec | Micro F | Macro Prec | Macro Rec | Macro F
UNITOR.kelp | 0.7328 | 0.7328 | 0.7328 | 0.6810 | 0.6274 | 0.6531
X2Check.c2c | 0.6848 | 0.6848 | 0.6848 | 0.6076 | 0.5844 | 0.5957
Baseline | 0.3403 | 0.3403 | 0.3403 | 0.0378 | 0.1111 | 0.0564

The best performance (0.6531 macro-F) is achieved by the UNITOR system. Both systems also outperform the baseline in terms of micro-F. The baseline has a low macro-F since it always predicts the same class (STATEMENT), so the F-measure for all the other classes is zero. As expected, the micro-F exceeds the macro-F, since some classes are hard to predict due to the low number of examples in the training data, such as AGREE-ACCEPT, SOLICIT-REQ-CLARIF, and REJECT. Precision, Recall, and F-score values by speech act label are shown in Table 5.

Table 5: Precision, Recall, and F-score values by speech act label.

Class | Unitor Prec | Unitor Rec | Unitor F | X2Check Prec | X2Check Rec | X2Check F
OPENING | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.7273 | 0.8421
CLOSING | 0.7778 | 0.7000 | 0.7368 | 0.8182 | 0.9000 | 0.8571
INFO-REQUEST | 0.7750 | 0.8304 | 0.8017 | 0.7355 | 0.7946 | 0.7639
SOLICIT-REQ-CLARIF | 0.4000 | 0.3333 | 0.3636 | 0.4444 | 0.3333 | 0.3810
STATEMENT | 0.7500 | 0.9444 | 0.8361 | 0.6667 | 0.8957 | 0.7644
GENERIC-ANSWER | 0.8571 | 0.9231 | 0.8889 | 0.7581 | 0.9038 | 0.8246
AGREE-ACCEPT | 0.6471 | 0.4583 | 0.5366 | 0.5714 | 0.5000 | 0.5333
REJECT | 0.4286 | 0.0769 | 0.1304 | 0.0000 | 0.0000 | 0.0000
KIND-ATT-SMALLTALK | 0.5000 | 0.3864 | 0.4359 | 0.4737 | 0.2045 | 0.2857

We also provide the confusion matrix for each system: Table 6 for UNITOR and Table 7 for X2Check. We observe that, for both systems, the class REJECT is the most difficult to classify. This evidence is consistent with the findings from previous research on the same corpus of dialogues (Novielli and Strapparava, 2011). In particular, we observe that dialogue moves belonging to the REJECT class are often misclassified as STATEMENT. More generally, the main cause of error is misclassification as STATEMENT. One possible reason is that statements represent the majority class, thus inducing a bias in the classifiers. Another possible explanation is that dialogue moves that appear linguistically consistent with the typical structure of statements have been annotated differently, according to the actual communicative role they play.

Table 6: Confusion matrix of the UNITOR system w.r.t. the gold standard. Columns correspond to the gold standard classes, rows to the system decisions; diagonal entries are correct classifications.

Decision \ Gold | STATEMENT | KIND-ATT. | GEN.-ANSW. | REJECT | CLOSING | SOL.-CLAR. | OPENING | AGREE | INFO-REQ.
STATEMENT | 153 | 6 | 3 | 24 | 0 | 3 | 0 | 2 | 13
KIND-ATT. | 4 | 17 | 0 | 5 | 1 | 2 | 0 | 3 | 2
GEN.-ANSW. | 1 | 0 | 48 | 0 | 0 | 1 | 0 | 6 | 0
REJECT | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 1
CLOSING | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 1 | 0
SOL.-CLAR. | 0 | 6 | 0 | 2 | 1 | 8 | 0 | 1 | 2
OPENING | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 0
AGREE | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 11 | 1
INFO-REQ. | 4 | 9 | 0 | 4 | 1 | 9 | 0 | 0 | 93

Table 7: Confusion matrix of the X2Check system w.r.t. the gold standard. Columns correspond to the gold standard classes, rows to the system decisions; diagonal entries are correct classifications.

Decision \ Gold | STATEMENT | KIND-ATT. | GEN.-ANSW. | REJECT | CLOSING | SOL.-CLAR. | OPENING | AGREE | INFO-REQ.
STATEMENT | 146 | 15 | 3 | 30 | 1 | 2 | 1 | 2 | 19
KIND-ATT. | 2 | 9 | 0 | 0 | 0 | 1 | 0 | 5 | 2
GEN.-ANSW. | 5 | 3 | 47 | 2 | 0 | 3 | 0 | 2 | 0
REJECT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
CLOSING | 0 | 0 | 0 | 1 | 9 | 0 | 0 | 1 | 0
SOL.-CLAR. | 1 | 4 | 0 | 2 | 0 | 8 | 1 | 0 | 2
OPENING | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0
AGREE | 2 | 5 | 1 | 0 | 0 | 1 | 0 | 12 | 0
INFO-REQ. | 7 | 8 | 1 | 4 | 0 | 9 | 1 | 2 | 89
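The error analysis above relies on confusion matrices such as Tables 6 and 7. A minimal sketch of how such a matrix can be derived from gold and predicted label sequences; the two short label sequences below are made up purely for the example.

```python
from collections import Counter

# Confusion matrix stored as a Counter over (system decision, gold label)
# pairs, matching the layout of Tables 6 and 7 (rows = system decision,
# columns = gold standard). Toy label sequences, not real task data.
gold = ["STATEMENT", "REJECT", "STATEMENT", "GENERIC-ANSWER", "REJECT"]
pred = ["STATEMENT", "STATEMENT", "STATEMENT", "GENERIC-ANSWER", "REJECT"]

confusion = Counter(zip(pred, gold))

# Off-diagonal mass on the STATEMENT row captures the dominant error type,
# e.g. REJECT turns misclassified as STATEMENT:
print(confusion[("STATEMENT", "REJECT")])  # -> 1
```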
6 Final Remarks and Conclusions

We presented the first edition of the new shared task on itaLIan Speech acT labEliNg (iLISTEN) at EVALITA 2018. The task fits in the fast-growing research trend focusing on conversational access to information, e.g. using chatbots or conversational agents. The task consists in automatically annotating dialogue turns with speech act labels representing the communicative intention of the speaker. The corpus of dialogues was collected in the scope of previous research on natural language interaction with embodied conversational agents. Specifically, the participating systems had to annotate the speech acts associated with the user dialogue moves, while the agent's dialogue turns were provided as context.

We received two runs from two teams, one from academia and the other one from industry. In spite of the inherent complexity of the task, both systems largely outperformed the baseline, represented by the trivial classifier that always predicts the majority class for users' moves. The best performing system leverages syntactic features and relies on a Structured Kernel-based Support Vector Machine. Follow-up editions might involve extending the benchmark with dialogues from different domains. Similarly, dialogues in different languages might also be included in the gold standard, as done for the Automatic Misogyny Identification task at EVALITA 2018 (Fersini et al., 2018). This would enable assessing to what extent the task is inherently dependent on the language and how well the proposed approaches are able to generalize.

References

John L. Austin. 1962. How to Do Things with Words. William James Lectures. Oxford University Press.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Giuseppe Clarizio, Irene Mazzotta, Nicole Novielli, and Fiorella De Rosis. 2006. Social Attitude Towards a Conversational Character. pages 2–7.

Mark G. Core and James F. Allen. 1997. Coding Dialogs with the DAMSL Annotation Scheme.

Danilo Croce and Roberto Basili. 2018. A Markovian Kernel-based Approach for itaLIan Speech acT labEliNg. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2011. Structured Lexical Similarity via Convolution Kernels on Dependency Trees. In Proceedings of EMNLP.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018. Overview of the Evalita 2018 Task on Automatic Misogyny Identification (AMI). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Rohan Kar and Rishin Haldar. 2016. Applying Chatbots to the Internet of Things: Opportunities and Architectural Elements. CoRR, abs/1611.03799.

Tina Klüwer. 2011. "I Like Your Shirt" - Dialogue Acts for Enabling Social Talk in Conversational Agents. In Intelligent Virtual Agents, pages 14–27.

Michael McTear, Zoraida Callejas, and David Griol Barres. 2016. The Conversational Interface: Talking to Smart Devices. Springer International Publishing.

Nicole Novielli and Carlo Strapparava. 2011. Dialogue Act Classification Exploiting Lexical Semantics. In Conversational Agents and Natural Language Interaction: Techniques and Effective Practices, chapter 4, pages 80–106. IGI Global.

John R. Searle. 1969. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, London.

Fabrizio Sebastiani. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Andreas Stolcke, Noah Coccaro, Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema, Klaus Ries, Elizabeth Shriberg, Daniel Jurafsky, Rachel Martin, and Marie Meteer. 2000. Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics, 26(3):339–373, September.

David R. Traum. 2000. 20 Questions for Dialogue Act Taxonomies. Journal of Semantics, 17(1):7–30.

Soroush Vosoughi and Deb Roy. 2016. A Semi-automatic Method for Efficient Detection of Stories on Social Media. In Proceedings of the 10th AAAI Conference on Weblogs and Social Media, ICWSM 2016, pages 711–714.