Natural Language Generation in Dialogue Systems for Customer Care

Mirko Di Lascio♥, Manuela Sanguinetti♥♦, Luca Anselma♥, Dario Mana♣, Alessandro Mazzei♥, Viviana Patti♥, Rossana Simeoni♣
♥ Dipartimento di Informatica, Università degli Studi di Torino, Italy
♦ Dipartimento di Matematica e Informatica, Università degli Studi di Cagliari, Italy
♣ TIM, Torino, Italy
♥ {first.last}@unito.it, ♦ {first.last}@unica.it, ♣ {first.last}@telecomitalia.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. In this paper we discuss the role of natural language generation (NLG) in modern dialogue systems (DSs). In particular, we study the role that a linguistically sound NLG architecture can have in a DS. Using real examples from a new corpus of dialogues in the customer-care domain, we study how non-linguistic contextual data can be exploited by using NLG.

1 Introduction

In this paper we present the first results of an ongoing project on the design of a dialogue system for customer care in the telco field. In most dialogue systems (DSs), the generation side of the communication is largely limited to the use of templates (Van Deemter et al., 2005). Templates are pre-compiled sentences with empty slots that can be filled with appropriate fillers. Most commercial DSs, following the classical cascade architecture NLUnderstanding ↔ DialogueManager ↔ NLGeneration (McTear et al., 2016), use machine learning-based Natural Language Understanding (NLU) techniques to identify important concepts (e.g., intent and entities in (Google, 2020)) that will be used by the dialogue manager (i) to update the state of the system and (ii) to produce the next dialogue act (Bobrow et al., 1977; Traum and Larsson, 2003), possibly filling the slots in the generation templates.

This classical, and quite common, information flow/architecture for dialogue processing has, as a working hypothesis, the assumption that most of the necessary information is provided by the user's utterance: we call this information the linguistic channel (L-channel). However, especially in the customer-care domain, this assumption is only partially true. For instance, in the sentence "Scusami ma vorrei sapere come mai mi vengono fatti certi addebiti?" ("Excuse me, I'd like to know why I'm charged certain fees?"), even a very advanced NLU module can give the DialogueManager only vague information about the user's request. Indeed, in order to provide good enough responses, the DialogueManager resorts to two other sources of information: the domain context channel (DC-channel) and the user model channel (UM-channel). The DC-channel is fundamental to produce the content of the answer, while the UM-channel is necessary to give the answer the correct form as well.

It is worth noting that both channels, which are often neglected in the design of commercial DSs for the customer-care domain, have central roles in the design of (linguistically sound) natural language generation (NLG) systems (Reiter and Dale, 2000). In particular, considering the standard architecture for data-to-text NLG systems (Reiter, 2007; Gatt and Krahmer, 2018), the analysis of the DC-channel exactly corresponds to the content selection task, and the UM-channel influences both the sentence planning and sentence realization phases. In other words, the central claims of this paper are that in commercial DSs for customer care: (1) the L-channel is often not informative enough and one needs to use the DC-channel and the UM-channel for producing a sufficiently good answer; (2) the DC-channel and the UM-channel can be exploited by using standard symbolic [1] NLG techniques and methods.

[1] The well-known problem of hallucinations in neural networks deters their use in real-world NLG (Rohrbach et al., 2018).

The remainder of the paper supports both of these claims while presenting our ongoing project on the development of a rule-based NLG prototype to be used in a customer-care domain. Section 2 presents the corpus developed in the first stage of this project, consisting of real dialogues containing explanation requests in the telco customer-care domain. Section 3 presents an NLG architecture for managing the L-DC-UM channels that can be adopted in a DS for customer care. Finally, Section 4 concludes the paper with a few remarks on the current state of the project and on future work.

2 A Dialogue Corpus for Customer-care Domain

This study builds upon the analysis of a corpus of dialogues between customers and a DS for customer service developed by an Italian telecommunications company. The dialogues, which take place by means of a textual chat, mainly deal with requests for commercial assistance, both on landline and mobile phones. For the purpose of this study, the corpus was extracted by selecting, from a sample of dialogues held over 24 hours, a reduced subset that included requests for explanations from customers. The selection criteria were conceived so as to include all the dialogues where at least one message from the user contained a clearly stated request for explanation. The kinds of requests identified in this collection basically reflect the problems typically encountered with a telecom service provider, such as undue or unfamiliar charges in the bill or in the phone credit (about 52% of the overall number of requests in this dataset).

The resulting corpus consists of 142 dialogues, with an average of 11 turns per dialogue, and an average length of 9 tokens in customer messages and 38 tokens in the bot messages. This difference in message length is due to the way the assistant's responses are currently structured: they usually include detailed information on invoice items or available options, while customers' messages are most often quite concise. Also, the relatively high number of turns per dialogue might be explained by the high occurrence in the corpus of repeated or rephrased messages, both by the chatbot and by the customer, due to recurring misunderstandings on both sides.

As a matter of fact, the presence of such phenomena in the corpus, along with the overall goals set forth for the development of the NLG module in this project, led us to design an annotation process that involved different dimensions, such as errors in conversation and emotions. By error, in this context, we mean any event that might have a negative impact on the flow of the interaction, and more generally on its quality, potentially resulting in breakdowns (i.e. whenever one party leaves the conversation without completing the given task (Martinovsky and Traum, 2003)). The error tagset used in this corpus is partially inspired by three of the popular Gricean maxims, i.e. those of Quantity, Relation and Manner (Grice, 1989) (each one including further sub-types, not described here), and it has been conceived so as to include error categories that may apply to both conversation parties. The second dimension, instead, is meant to include, on the one hand, customers' emotions (as perceived in their messages) and, on the other hand, the chatbot's empathic responses (if any). In particular, as regards customers' emotions, besides two generic labels for neutral and positive messages, we mostly focused on negative emotions, especially anger and frustration, also introducing for these two finer-grained labels that define their lower or higher intensity. While a full description of the annotation scheme is beyond the scope of this paper, Figure 1 shows two brief examples of how we applied this scheme to the sample dataset [2]. An overview of the scheme with a discussion of the main findings and annotation issues can be found in Sanguinetti et al. (2020).

[2] For further details on the scheme and the definition of all tags, the annotation guidelines are available in this document: https://cutt.ly/cdMcnyM

(a)
Chatbot: Vuoi vedere il dettaglio della fattura?
  (Do you want to see the invoice details?)
Customer: La fattura di gennaio 2020 non è arrivata
  (I haven't received the invoice of January 2020 yet)
  Annotation: Ignoring question/feedback - Neutral
Chatbot: Ecco il dettaglio delle voci di spesa presenti nella fattura InvoiceNumber del mese di gennaio per la linea PhoneNumber:InvoiceDetails. Vuoi avere il dettaglio di una specifica voce di spesa presente in fattura?
  (Here is the detail of the items on the invoice InvoiceNumber of January for the line PhoneNumber:InvoiceDetails. Do you want the detail of a specific item in the invoice?)
  Annotation: Straight wrong response
Customer: Non mi interessa questa fattura, mi serve gennaio 2020
  (I don't need this invoice, I need the one of January 2020)
  Annotation: Repetition - Somewhat frustrated
Chatbot: Vuoi cambiare argomento?
  (Do you want to change topic?)
  Annotation: Topic change
Customer: Ciao
  (Whatever)
  Annotation: Non-cooperativity - Somewhat frustrated

(b)
Chatbot: Scusami, non ho capito bene la tua richiesta... Puoi ripeterla in modo più semplice?
  (I'm sorry, I didn't get your question... Could you rephrase it in a simpler way?)
Customer: No non hai capito niente
  (No you got it all wrong)
  Annotation: Non-cooperativity - Somewhat angry
Chatbot: Mi spiace non aver compreso correttamente. Ho la possibilità di farti chattare con un operatore umano più esperto. Vuoi?
  (I'm sorry I haven't understood that correctly. I can put you in contact with a human representative. Is this what you want?)
  Annotation: Empathy

Figure 1: Excerpts from the annotated dataset. Annotation of errors is highlighted in red, that of customer's emotions in blue and the agent's empathy in orange.

Due to privacy concerns and the related anonymization issues that may arise (as further discussed in Section 4), the corpus cannot yet be publicly released. However, in an attempt to provide a qualitative analysis of the annotated data, we collected some basic statistics on the distribution of errors and emotions labeled in this sample set. Overall, we report 326 errors (about 21% of the total number of turns) from both parties; among them, the error class that includes violations of the maxim of Relation (i.e. relevance) is by far the most frequent one (65% of the errors). Such violations may take different forms, also depending on whether they come from the customer or the chatbot. As regards the customer, errors of this kind typically take place when the user does not take into account the previous message from the chatbot, thus providing irrelevant responses that prevent the conversation from moving forward; these cases cover approximately 21% of customers' errors. On the chatbot side, the most frequent error type is represented by those cases in which the agent misinterprets a previous customer's message and proposes to move on to another topic rather than providing a proper response (30% of cases). As for the second annotation dimension, i.e. the one regarding customers' emotions, most of the messages have a neutral tone (about 86% of user turns), but, among non-neutral messages, the two main negative emotions defined in this scheme, namely anger and frustration, are the ones most frequently encountered in user messages (both with a frequency of 41%), while messages with a positive emotion constitute less than 1%, and usually translate into some form of gratitude, appreciation, or simple politeness.

All these dimensions are functional to a further development of the NLG module, in that they provide, through different perspectives, useful signals of how, and at which point in the conversation, the template response currently used by the chatbot might be improved using the NLG module. Broadly speaking, framing the error taxonomy within Grice's cooperative principle provides useful support for the generation module to understand, in case an error is reported, how to structure the chatbot response so as to improve the interaction quality in terms of informativeness and relevance (as also discussed in Section 3).

3 Balancing information sources in NLG for DS

In this Section, we illustrate a DS architecture that explicitly accounts for the L-DC-UM information channels. In particular, we point out that the DC and UM channels can be managed by using standard NLG methods.

A commonly adopted architecture for NLG in data-to-text systems is a pipeline composed of four modules: data analyzer, text planner, sentence planner and surface realizer (Reiter, 2007; Pauws et al., 2019). Each module tackles a specific issue: (1) the data analyzer determines what can be said, i.e. a domain-specific analysis of input data; (2) the text planner determines what to say, i.e. which information will be communicated; (3) the sentence planner determines how to communicate, with particular attention to the design of the features related to the given content and language (e.g.
lexical choices, verb tense, etc.); (4) the surface realizer produces the sentences by using the results of the previous modules and considering language-specific constraints as well. Note that, although by definition NLG does not account for linguistic input (that is, the L-channel), all the modules account for the context of the communication. In other words, data analysis and text planning explicitly process the information about the input data (the DC-channel), and text planning and sentence planning process the information about the audience (the UM-channel). Moreover, by using the nomenclature defined in (Reiter and Dale, 2000), the specific task of content selection decides what to say, that is, the atomic nucleus of information that will be communicated.

In our project, we adopt a complete NLG architecture in the design of the DS (Figure 2). In Figure 2, we show the contributions of the L-DC-UM channels in the interaction flow. It is worth noting that we assigned the content selection task to the DM module rather than to the text planning of the NLG module. Indeed, the content selection task is crucially the point where all three information channels need to be merged in order to decide the content of the DS answer to the user question.

Figure 2: A dialogue system architecture accounting for L-DC-UM channels. (Diagram labels: USER, DS, NLU, DM, NLG, L-channel, DC-channel, UM-channel, Content Selection, Text Planning, Sentence Planning, Realization.)

In order to understand the contribution of the three information channels to the final message construction, we describe below the main steps of the module design using the following customer's message, retrieved from the corpus, as an example:

Scusami ma vorrei sapere come mai mi vengono fatti alcuni addebiti? ("Excuse me, I'd like to know why I'm charged certain fees?")

Here, the customer asks for an explanation about some (unspecified) charges on her/his bill, making the whole message not informative enough. In this case, the DS can deduce from the L-channel only a generic request for information on transactions. However, using the architecture shown in Figure 2, a more informative answer can be produced by considering the UM-channel and the DC-channel.

As a working hypothesis, we assume that the user model consists solely of the age of the user. By assuming that the user is 18 years old, we can say that the DS should use an informal register, i.e. the Italian second person singular (tu) rather than the more formal third person singular (lei). It is worth noting that the current accounting of the user model is too simple and there is room for improvement both in the formalization of the model and in the effect of the user model on the generated text. Taking into account the classification of user model acquisition given by (Reiter et al., 2003), it is interesting to note that the dialogic nature of the system allows for the possibility to explicitly ask users about their knowledge and preferences on the specific domain.

Moreover, we assume that the DC-channel consists of all the transactions of the last 7 months, for example: T1, with an amount of €9.99 (M1-M7); T2, with an amount of €2 (M5-M7, appearing twice in M7); and T3, with an amount of €1.59 (M7) (see Table 1).

      M1     M2     M3     M4     M5     M6     M7
T1    9.99   9.99   9.99   9.99   9.99   9.99   9.99
T2    0      0      0      0      2      2      2, 2
T3    0      0      0      0      0      0      1.59

Table 1: A possible transactions history.

Looking at the data in Table 1, different forms of automatic reasoning could be applied in order to evaluate the relevance of each single transaction of the user. At this stage of the project, we aim to adapt the theory of importance-effect from (Biran and McKeown, 2017) to our specific domain, where the relevant information is in the form of relational database entries.
The idea of this adaptation is to consider the time evolution of a specific transaction category, giving more emphasis to information contents that can be classified as exceptional evidence. Informally, we can say that the transactions T2 and T3 have a more irregular evolution in time with respect to T1; therefore, they should be mentioned with more emphasis in the final message.

The current implementation of the DS is based on a trivial NLU (regular expressions) and on a symbolic sentence planner and realizer for Italian (Anselma and Mazzei, 2018; Mazzei et al., 2016). By considering all three L-UM-DC channels, the answer generated by the DS is:

Il totale degli addebiti è €15,58. Hai pagato €4,00 (2×€2,00) per l'Offerta Base Mobile e €1,59 per l'Opzione ChiChiama e RiChiama. Infine, hai pagato il rinnovo dell'offerta 20 GB mobile. ("The total charge is €15.58. You have been charged €4.00 (2×€2.00) for the Mobile Base Offer and €1.59 for the Who'sCalling and CallNow options. Finally, you have been charged for the renewal of the 20 GB mobile offer.")

4 Conclusion and Future Work

In this paper we have discussed the main features of the design of a DS for telco customer care. In particular, we outlined the peculiarities of this domain, describing the construction of a specifically designed dialogue corpus and discussing a possible integration of standard DS and NLG architectures in order to manage these peculiarities. This is an ongoing project and we are considering various enhancements: (1) we will integrate emoji prediction capabilities into the proposed architecture in order to allow the DS to automatically attach an appropriate emoji at the end of the generated response, relying on previous work for Italian (Ronzano et al., 2018); we would also take into account the current user emotions while generating an appropriate emoji, since an emoji that is adequate when the conversation is characterized by a neutral tone may suddenly become inappropriate if the user is frustrated or angry (Pamungkas, 2019; Cercas Curry and Rieser, 2019); (2) we would like to enhance the system so as to adapt the generated responses to other aspects of the users, such as their mental models, levels of domain expertise, and personality traits; (3) we want to evaluate the DS following the user-based comparative schema adopted in (Demberg et al., 2011).

Finally, we add some closing remarks on the corpus availability and its anonymization. The publication of a dataset of conversations between customers and a company virtual assistant is a great opportunity for the company and for its surrounding communities of academics, designers, and developers. However, it entails a number of obstacles to overcome. Rules and laws issued by regulating bodies must be strictly followed (see, for example, the GDPR regulation [3]). This means, first of all, including within the to-be-published dataset only those conversations made by customers who have given their consent to this type of treatment of their data. Moreover, it is mandatory to obscure both personal and sensitive customer data. Such obfuscation activities are particularly difficult in the world of chatbots, where customers are free to input unrestricted text in the conversations. Regular expressions can be used in order to recognize the pieces of data to be obscured, such as email addresses, telephone numbers, social security numbers, bank account identifiers, dates of birth, etc. More sophisticated techniques need to be adopted to identify and obscure, within the text entered by customers, names, surnames, home and work addresses. Even more complex and open is the problem of anonymizing sensitive customer data. For example, consider the case of a disabled customer who reveals his/her health condition to the virtual assistant in order to obtain a legitimate better treatment from the company: the text revealing the health condition of the customer must be obscured. Other relevant sensitive data include racial or ethnic origins, religious or philosophical beliefs, political opinions, etc. Some of these techniques, used for identifying certain types of data to be obscured, have a degree of precision that, given the current state of the art, may still be far from what a trained human analyst could achieve. Therefore, it is also necessary to consider the need for the dataset being published to be reviewed and edited by specialized personnel before the actual publication. With this in mind, the data recognition techniques mentioned above (regular expressions, Named Entity Recognition, etc.) could also be exploited to develop tools that can speed up the task of completing and verifying the accurate anonymization of the dataset.

[3] https://eur-lex.europa.eu/eli/reg/2016/679/oj

Acknowledgements

The work of Mirko Di Lascio, Alessandro Mazzei, Manuela Sanguinetti and Viviana Patti has been partially funded by TIM s.p.a. (Studi e Ricerche su Sistemi Conversazionali Intelligenti, CENF CT RIC 19 01).

References

Luca Anselma and Alessandro Mazzei. 2018. Designing and testing the messages produced by a virtual dietitian. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg University, The Netherlands, November 5-8, 2018, pages 244–253.

Or Biran and Kathleen McKeown. 2017. Human-centric justification of machine learning predictions. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1461–1467.

Daniel G. Bobrow, Ronald M. Kaplan, Martin Kay, Donald A. Norman, Henry Thompson, and Terry Winograd. 1977. Gus, a frame-driven dialog system. Artif. Intell., 8(2):155–173, April.

Amanda Cercas Curry and Verena Rieser. 2019. A crowd-based evaluation of abuse response strategies in conversational agents. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 361–366, Stockholm, Sweden, September. Association for Computational Linguistics.

Vera Demberg, Andi Winterboer, and Johanna D. Moore. 2011. A strategy for information presentation in spoken dialog systems. Computational Linguistics, 37(3):489–539.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Intell. Res., 61:65–170.

Google. 2020. Dialogflow documentation. https://dialogflow.com. Online; accessed 2020-08-10 11:24:07 +0200.

Paul Grice. 1989. Studies in the Way of Words. Harvard University Press, Cambridge, Massachusetts.

Bilyana Martinovsky and David Traum. 2003. The error is the clue: Breakdown in human-machine interaction. In Proceedings of the ISCA Workshop on Error Handling in Dialogue Systems.

Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco. 2016. SimpleNLG-IT: adapting SimpleNLG to Italian. In Proceedings of the 9th International Natural Language Generation conference, pages 184–192, Edinburgh, UK, September 5-8. Association for Computational Linguistics.

Michael McTear, Zoraida Callejas, and David Griol. 2016. The Conversational Interface: Talking to Smart Devices. Springer Publishing Company, Incorporated, 1st edition.

Endang Wahyu Pamungkas. 2019. Emotionally-aware chatbots: A survey. CoRR, abs/1906.09774.

Steffen Pauws, Albert Gatt, Emiel Krahmer, and Ehud Reiter. 2019. Making effective use of healthcare data using data-to-text technology. In Data Science for Healthcare, pages 119–145. Springer.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

Ehud Reiter, Somayajulu Sripada, and Sandra Williams. 2003. Acquiring and using limited user models in NLG. In Proceedings of the 9th European Workshop on Natural Language Generation (ENLG-2003) at EACL 2003.

Ehud Reiter. 2007. An architecture for data-to-text systems. In Proc. of the 11th European Workshop on Natural Language Generation, ENLG '07, pages 97–104, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium, Nov. Association for Computational Linguistics.

Francesco Ronzano, Francesco Barbieri, Endang Wahyu Pamungkas, Viviana Patti, and Francesca Chiusaroli. 2018. Overview of the EVALITA 2018 Italian Emoji Prediction (ITAMoji) Task. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

Manuela Sanguinetti, Alessandro Mazzei, Viviana Patti, Marco Scalerandi, Dario Mana, and Rossana Simeoni. 2020. Annotating Errors and Emotions in Human-Chatbot Interactions in Italian. In Proceedings of the 14th Linguistic Annotation Workshop (LAW@COLING 2020). Association for Computational Linguistics.

David Traum and Staffan Larsson. 2003. The Information State Approach to Dialogue Management. In Current and New Directions in Discourse and Dialogue, pages 325–353. Springer.

Kees Van Deemter, Emiel Krahmer, and Mariët Theune. 2005. Real versus template-based natural language generation: A false opposition? Comput. Linguist., 31(1):15–24, March.