Simulating Domain Changes in Conversational Agents Through Dialogue Adaptation

Tiziano Labruna1,2, Bernardo Magnini1
1 Fondazione Bruno Kessler, Via Sommarive 18, Povo - Trento, Italy
2 Free University of Bozen-Bolzano, Piazza Università 1, Bozen-Bolzano, Italy

Abstract
A major bottleneck for the wide diffusion of data-driven conversational agents is that conversational domains are subject to continuous changes, which soon make initial dialogue models inadequate to manage new situations. In the current context, updating training data is usually carried out manually, and, in addition, there are no tools for simulating the impact of a certain domain change on the performance of the dialogue system. This position paper advocates that substantial progress in the capacity to simulate domain changes is based on the ability to automatically adapt training and test dialogues to those changes. We discuss the potential of a simulation framework for task-oriented dialogues, as well as the research challenges that need to be addressed.

Keywords
Dialogue Systems, Conversational Agents, Domain Adaptation

1. Introduction

Task-oriented dialogue systems [2, 3, 4] allow users to achieve specific tasks (e.g., booking a restaurant, buying a train ticket, ordering some food) through dialogues in natural language. While in recent years there has been a wide diffusion of such conversational systems, a major bottleneck for their development, even for more complex tasks, is that conversational domains are very dynamic and are subject to continuous changes, which soon make initial dialogue models inadequate to manage new situations. As an example, a chatbot for giving information about COVID-19 needs to be frequently updated, as new regulations are introduced and others are changed. A similar issue arises in the case of booking restaurants in a region, where new restaurants open and others introduce new food.
In such situations, initial dialogue models (e.g., intent and slot-filling) soon become obsolete and the system performance rapidly decreases. The current practice in case of domain changes consists of manually updating the training dialogues, typically adding sentences with new intents and entities that reflect the changes. However, this practice is extremely expensive and requires specialized competencies. In addition, there are no tools for simulating the impact that a certain domain change might have on the performance of the dialogue system and its components. Being able to approximate the impact of, for instance, adding or removing a certain slot in the system knowledge base, would allow a more precise estimation of the re-training costs, with a significant saving of time and money. Although dialogue simulators have been proposed (e.g., SimDial [5]), to the best of our knowledge, none of them is designed to simulate domain changes.

Figure 1: Representation of a typical data-driven conversational system flow. The user sends a message, the message is parsed by a dialogue state tracking component, the output is passed to a dialogue policy component, which decides the best next action of the system, and finally, a natural language generation component generates the utterance to be returned to the user. Each component is based on a model, which is, in turn, trained on some dialogues, typically created by hand and linked to a knowledge base. The dialogues will vary when there are domain changes.

NL4AI 2022: Sixth Workshop on Natural Language for Artificial Intelligence, November 30, 2022, Udine, Italy [1]
tlabruna@fbk.eu (T. Labruna); magnini@fbk.eu (B. Magnini)
ORCID: 0000-0001-7713-7679 (T. Labruna); 0000-0002-0740-5778 (B. Magnini)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
In this position paper, we raise a number of research challenges that need to be considered when designing a dialogue simulator able to account for domain changes. First, we need to fix a reference architecture for the dialogue system, including the main dialogue components (e.g., intent detection and slot filling, dialogue manager, response generation). Second, we need to define the experimental parameters of the simulator, i.e., which data can be manipulated, such as the kind and amount of changes and the models for the dialogue components. Finally, a relevant challenge for a dialogue simulator able to manage domain changes concerns performance evaluation. More specifically, in order to evaluate a certain change (e.g., a new type of food for a restaurant is introduced, which was not present before), we need a gold standard (i.e., test dialogues) reflecting the changes we intend to simulate. In this paper, we suggest that recent dialogue adaptation techniques [6, 7] can be applied for the automatic creation of test dialogues to be used in a dialogue simulator. We also suggest that the generative power of recent pre-trained language models may offer encouraging opportunities in the direction of automatic dialogue adaptation.

2. Dialogue System’s Architecture

Figure 1 depicts a general architecture of a data-driven conversational agent, showing three main components: Natural Language Understanding (NLU), Dialogue Manager (DM) and Natural Language Generation (NLG). The user sends a message to the agent; the NLU component is responsible for extracting relevant information from the message and passing it to the DM component, which, based on that information, decides which action to take; finally, the NLG component takes the action as input and returns a natural language message to be sent back to the user.

• Natural Language Understanding. The goal of the NLU component [8] is to extract relevant information from the user message. This information typically consists of an intent (the communicative goal of the user’s utterance) and a certain number of entities that can be contained in the message. The prediction of the former is known as Intent Recognition, while the prediction of the latter is called Entity Extraction or Slot Filling. The prediction of intents and entities is usually evaluated in terms of accuracy and F1-score.
• Dialogue Manager. The DM component takes an intent and a (possibly empty) set of entities as input and returns the best next action to take as output, which typically consists of an intent and some slot-values. While taking this decision, the dialogue manager also considers some state variables, such as the conversation history up to a certain point in the past. The selection of the best action is usually evaluated in terms of accuracy.
• Natural Language Generation. As the last component of this process, NLG is responsible for converting the output of the DM into words. This means that it needs to take a structured representation of information and produce a natural language utterance that will be returned to the user. The correct generation of the utterance is evaluated using string comparison metrics (a common one is BLEU [9]).

3. A Simulator for Domain Changes

We propose a methodology to investigate the impact of domain changes in dialogue systems based on a Domain Changes Simulator (DCS), an architecture that simulates different types and different amounts of domain changes, chooses a model for every dialogue component and produces a report on the performance of the models given a certain configuration of the simulator.

Domain changes. Domain knowledge in a task-oriented dialogue is typically represented in a Knowledge Base (KB), where instances of concepts (e.g., restaurant) are described through slots that can assume a range of values. We consider the following domain changes:

• Concept changes. Concepts (also referred to as domains in the literature) delimit the topics that can be discussed in a conversation. A concept can be removed (e.g., an agent does not cover information on restaurants anymore), or added (e.g., an agent starts giving information about hotels).
• Slot changes. Slots are associated with concepts and describe the characteristics of the concept instances. Slots can be added (e.g., there is new interest in whether a restaurant has parking), or removed (e.g., we are no longer interested in whether a hotel has an internet_connection).
• Instance changes. Instances are individual entities in the conversational domain (e.g., a specific restaurant). Instances can be added (e.g., a new restaurant opens), or removed (e.g., a restaurant closes down).
• Slot-value changes. Slot-values are used to describe properties of instances (e.g., Mario’s restaurant offers Italian food). A new slot-value can be introduced in the KB (e.g., Caribbean food starts to be served by some restaurants), a slot-value can disappear from the KB (e.g., no more restaurants serve Indian food), or an instance can change its slot-value (e.g., when a restaurant changes its menu).

We are interested in simulating and assessing the impact of all the changes described above through configurations of the DCS dialogue simulator. As a first attempt to simulate domain changes in a task-oriented dialogue, we are experimenting on the RASA platform [10], simulating changes over the MultiWOZ dataset [11].

Dialogue Models. In addition to domain changes, the DCS simulator should be able to consider different models for the dialogue components described in Section 2.
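As an illustration of the domain changes listed above, the following minimal sketch models a toy KB as nested Python dictionaries and expresses each kind of change as a function that returns a modified copy. The KB layout, the function names and the example restaurants are assumptions made for this sketch; they are not the DCS implementation and are not taken from MultiWOZ.

```python
from copy import deepcopy

# Toy KB (hypothetical layout): concept -> {"slots": set of slot names,
#                                           "instances": {name: {slot: value}}}
KB0 = {
    "restaurant": {
        "slots": {"food", "area"},
        "instances": {
            "Mario's": {"food": "italian", "area": "centre"},
            "Taj": {"food": "indian", "area": "north"},
        },
    },
}

def add_concept(kb, concept, slots):
    """Concept change: the agent starts covering a new concept."""
    kb = deepcopy(kb)
    kb[concept] = {"slots": set(slots), "instances": {}}
    return kb

def remove_concept(kb, concept):
    """Concept change: the agent no longer covers a concept."""
    kb = deepcopy(kb)
    kb.pop(concept, None)
    return kb

def add_slot(kb, concept, slot):
    """Slot change: add a slot; existing instances get an unknown (None) value."""
    kb = deepcopy(kb)
    kb[concept]["slots"].add(slot)
    for values in kb[concept]["instances"].values():
        values.setdefault(slot, None)
    return kb

def remove_slot(kb, concept, slot):
    """Slot change: drop a slot from the concept and from all its instances."""
    kb = deepcopy(kb)
    kb[concept]["slots"].discard(slot)
    for values in kb[concept]["instances"].values():
        values.pop(slot, None)
    return kb

def add_instance(kb, concept, name, values):
    """Instance change: add an instance, rejecting slots the concept lacks."""
    kb = deepcopy(kb)
    unknown = set(values) - kb[concept]["slots"]
    if unknown:
        raise ValueError(f"unknown slots for {concept}: {sorted(unknown)}")
    kb[concept]["instances"][name] = dict(values)
    return kb

def remove_instance(kb, concept, name):
    """Instance change: remove an instance (e.g., a restaurant closes down)."""
    kb = deepcopy(kb)
    kb[concept]["instances"].pop(name, None)
    return kb

def set_slot_value(kb, concept, name, slot, value):
    """Slot-value change: an existing instance changes one of its values."""
    kb = deepcopy(kb)
    kb[concept]["instances"][name][slot] = value
    return kb

def slot_values(kb, concept, slot):
    """Values currently used for a slot across all instances (None excluded)."""
    return {v[slot] for v in kb[concept]["instances"].values()
            if v.get(slot) is not None}

# Combine several changes into a new KB; KB0 itself is left untouched.
kb1 = add_slot(KB0, "restaurant", "parking")
kb1 = add_instance(kb1, "restaurant", "Island Grill",
                   {"food": "caribbean", "area": "south", "parking": "yes"})
kb1 = remove_instance(kb1, "restaurant", "Taj")

print(sorted(slot_values(KB0, "restaurant", "food")))  # ['indian', 'italian']
print(sorted(slot_values(kb1, "restaurant", "food")))  # ['caribbean', 'italian']
```

In this toy run, removing the only Indian restaurant makes the slot-value "indian" disappear from the KB, while adding a new instance introduces "caribbean". This shows how instance changes and slot-value changes can interact in a single simulator configuration.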
The NLU component requires an annotated collection of user utterances in natural language, in order to be able to recognise and extract intents and entities from a message; the DM component requires a set of dialogues with annotations of intents and entities from the user and related intents of the answers from the system; the NLG component requires a list of natural language utterances for every system intent. Each model can be more or less robust to domain changes, thus having different degrees of generalization and requiring more or less exhaustive training data. Some models, for example, are able to perform few-shot or even zero-shot learning, by leveraging techniques such as schema-guided algorithms [12, 13].

4. Evaluating the Impact of Domain Changes

The main purpose of the DCS simulator is to assess how the performance of the dialogue components evolves when domain changes occur, so that it is possible to estimate their impact on the system. A crucial issue here is to develop test data for each component and for each configuration of domain change we are interested in evaluating. Test data vary according to the dialogue component: we need dialogues annotated with intents and slot-value pairs for NLU, actions to be performed by the system at each dialogue turn for the Dialogue Manager, and reference system responses for the NLG component. While in principle such test data should be collected through human intervention (e.g., Wizard of Oz), this is practically impossible given the high number of potential configurations we want to simulate.

To overcome this issue, we propose a dialogue adaptation strategy for the automatic creation of the test data to be used by the DCS simulator. The idea behind dialogue adaptation is that the domain knowledge described in the KB is reflected in training and test dialogues, and that, when a domain change occurs, it is possible to adapt the initial dialogues so that the change is adequately reflected.
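To make the idea concrete, the sketch below adapts an annotated dialogue to a single slot-value change by substituting the old value in both the turn text and its slot annotations. The dialogue format, the function name adapt_dialogue and the example utterances are assumptions made for this illustration; it is a naive rule-based substitution, not the adaptation method evaluated in the cited works.

```python
import re
from copy import deepcopy

def adapt_dialogue(dialogue, slot, old_value, new_value):
    """Return a copy of `dialogue` where `old_value` of `slot` becomes `new_value`.

    Only turns whose annotation carries `slot == old_value` are rewritten, so
    unrelated occurrences of the same word in other turns are left alone.
    """
    # Whole-word, case-insensitive match of the old value in the turn text.
    pattern = re.compile(r"\b" + re.escape(old_value) + r"\b", re.IGNORECASE)
    adapted = deepcopy(dialogue)
    for turn in adapted:
        if turn["slots"].get(slot, "").lower() == old_value.lower():
            turn["slots"][slot] = new_value
            # A callable replacement avoids backslash interpretation in new_value.
            turn["text"] = pattern.sub(lambda _match: new_value, turn["text"])
    return adapted

# Hypothetical source dialogue, annotated turn by turn.
D0 = [
    {"speaker": "user",
     "text": "I am looking for Indian food in the centre.",
     "slots": {"food": "indian", "area": "centre"}},
    {"speaker": "system",
     "text": "There are three indian restaurants in the centre.",
     "slots": {"food": "indian"}},
]

D1 = adapt_dialogue(D0, "food", "indian", "caribbean")
for turn in D1:
    print(turn["speaker"], "|", turn["text"], "|", turn["slots"])
```

The substitution keeps the dialogue structure intact but does not check the adapted text against the new KB: for instance, the system turn still claims "three" restaurants, which may no longer hold. Handling such consistency issues is one reason why more than string replacement is needed.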
More formally, we define the problem of Dialogue Adaptation as follows: starting from a dialogue D_0 collected for a certain knowledge base KB_0, the goal is to modify D_0 such that it reflects a knowledge base KB_1, where KB_0 and KB_1 share the same domain ontology O (i.e., they share domain entities and slots).

Dialogue adaptation has different complexity depending on the changes introduced in Section 3. Concept changes require that all dialogues referring to a certain concept (e.g., restaurant) are removed, or substituted with dialogues about a different concept. This is highly complex, as it implies the ability to automatically regenerate a full dialogue. All dialogue components are affected by concept changes. Slot changes require that full portions of a dialogue, e.g., a turn referring to a certain slot, are changed to reflect a newly introduced slot. Instance changes mainly affect DM decisions about the next action. Finally, slot-value changes have reduced complexity and can be addressed through local substitutions within single turns.

As regards possible approaches to dialogue adaptation, different strategies can be employed, spanning from rule-based to generative approaches. We have experimented with both of them, on a subset of domain changes and for the NLU component only, in our previous works [14, 15, 7], showing that the use of a pre-trained language model, fine-tuned on the target domain KB, achieves promising results. However, a simulation environment imposes stricter constraints on dialogue adaptation, not only regarding the capacity to manage different types of domain changes, but also the capacity to simulate fine-grained amounts of such changes (e.g., add 20% of new restaurant instances serving poke food, and remove all dialogues mentioning parking).

5. Conclusion

This position paper suggests a long-term research direction aimed at simulating the impact that domain changes might have on the performance of a conversational system.
Such a simulator allows one to manipulate and set a number of experimental parameters, including several types and different amounts of changes, and various algorithms for training dialogue components, making it easier and less expensive to develop and maintain a conversational system. A major research challenge for a domain change simulator is the capacity to automatically generate test dialogues that approximate the domain changes. This capacity is crucial for evaluation purposes, and can be achieved through incremental substitutions in the initial training dialogues, exploiting the generative power (e.g., masked tokens, prompting) of pre-trained language models.

References

[1] D. Nozza, L. Passaro, M. Polignano, Preface to the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI), in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with 21st International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), November 30, 2022, CEUR-WS.org, 2022.
[2] M. McTear, Conversational AI: Dialogue systems, conversational agents, and chatbots, Synthesis Lectures on Human Language Technologies 13 (2020) 1–251.
[3] S. Young, M. Gašić, B. Thomson, J. D. Williams, POMDP-based statistical spoken dialog systems: A review, Proceedings of the IEEE 101 (2013) 1160–1179. doi:10.1109/JPROC.2012.2225812.
[4] M. Henderson, B. Thomson, J. D. Williams, The second dialog state tracking challenge, in: Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), Association for Computational Linguistics, Philadelphia, PA, U.S.A., 2014, pp. 263–272. URL: https://www.aclweb.org/anthology/W14-4337. doi:10.3115/v1/W14-4337.
[5] T. Zhao, M. Eskenazi, Zero-shot dialog generation with cross-domain latent actions, in: Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 1–10. URL: https://www.aclweb.org/anthology/W18-5001. doi:10.18653/v1/W18-5001.
[6] T. Labruna, B. Magnini, Addressing slot-value changes in task-oriented dialogue systems through dialogue domain adaptation, in: Proceedings of RANLP 2021, 2021. URL: https://aclanthology.org/2021.ranlp-1.89.pdf.
[7] T. Labruna, B. Magnini, Fine-tuning BERT for generative dialogue domain adaptation, in: Text, Speech, and Dialogue, 2022, pp. 490–501.
[8] S. Louvan, B. Magnini, Recent neural methods on slot filling and intent classification for task-oriented dialogue systems: A survey, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 480–496. URL: https://www.aclweb.org/anthology/2020.coling-main.42. doi:10.18653/v1/2020.coling-main.42.
[9] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[10] T. Bocklisch, J. Faulkner, N. Pawlowski, A. Nichol, Rasa: Open source language understanding and dialogue management, arXiv preprint arXiv:1712.05181 (2017).
[11] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, M. Gašić, MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling, arXiv preprint arXiv:1810.00278 (2018).
[12] J. Cao, Y. Zhang, A comparative study on schema-guided dialogue state tracking, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 782–796.
[13] Z. Lin, B. Liu, S. Moon, P. Crook, Z. Zhou, Z. Wang, Z. Yu, A. Madotto, E. Cho, R. Subba, Leveraging slot descriptions for zero-shot cross-domain dialogue state tracking, arXiv preprint arXiv:2105.04222 (2021).
[14] T. Labruna, B. Magnini, From Cambridge to Pisa: A journey into cross-lingual dialogue domain adaptation for conversational agents (2021).
[15] T. Labruna, B. Magnini, Addressing slot-value changes in task-oriented dialogue systems through dialogue domain adaptation, International Conference Recent Advances in Natural Language Processing (2021).