<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Are you a Good Assistant? Assessing LLM Trustability in Task-oriented Dialogues</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tiziano</forename><surname>Labruna</surname></persName>
							<email>tlabruna@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
								<address>
									<addrLine>Dominikanerplatz 3 - Piazza Domenicani 3</addrLine>
									<postCode>39100</postCode>
									<settlement>Bozen-Bolzano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Povo, Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sofia</forename><surname>Brenna</surname></persName>
							<email>sbrenna@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
								<address>
									<addrLine>Dominikanerplatz 3 - Piazza Domenicani 3</addrLine>
									<postCode>39100</postCode>
									<settlement>Bozen-Bolzano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Povo, Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><surname>Bonetta</surname></persName>
							<email>gbonetta@fbk.eu</email>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Povo, Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bernardo</forename><surname>Magnini</surname></persName>
							<email>magnini@fbk.eu</email>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Povo, Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Are you a Good Assistant? Assessing LLM Trustability in Task-oriented Dialogues</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">FCC9EECC5E0E17BAC8B311B6A940C877</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>task-oriented dialogues</term>
					<term>constraint satisfaction</term>
					<term>knowledge base coherence</term>
					<term>Llama3 8B</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Despite the impressive capabilities of recent Large Language Models (LLMs) to generate human-like text, their ability to produce contextually appropriate content for specific communicative situations is still a matter of debate. This issue is particularly crucial when LLMs are employed as assistants to help solve tasks or achieve goals within a given conversational domain. In such scenarios, the assistant is expected to access specific knowledge (e.g., a database of restaurants, a calendar of appointments) that is not directly accessible to the user and must be consistently utilised to accomplish the task. In this paper, we conduct experiments to evaluate the trustworthiness of automatic assistants in task-oriented dialogues. Our findings indicate that state-of-the-art open-source LLMs still face significant challenges in maintaining logical consistency with a knowledge base of facts, highlighting the need for further advancements in this area.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Conversational assistants <ref type="bibr" target="#b0">[1]</ref> are widely used to help human users achieve specific goals through dialogue. In a typical scenario (e.g., booking a restaurant, scheduling an appointment, selecting a song in a playlist, etc.), the assistant interprets the user's goals, searches a database for relevant options, and provides the user with responses (e.g., a restaurant reservation, a new appointment in a calendar, a song playing on a smartphone). A key ability for an assistant is to maintain consistency between user requests and domain knowledge <ref type="bibr" target="#b1">[2]</ref>. This is crucial because, in a typical setting, the user does not know the actual content of the database (e.g., all the restaurants in a city) and, as a consequence, cannot verify whether the assistant's response is correct.</p><p>While in traditional approaches <ref type="bibr" target="#b2">[3]</ref>, this consistency was ensured by a dedicated component responsible for retrieving information from a domain database, recent end-to-end approaches <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref> rely on a single LLM-based model for utterance understanding, domain knowledge retrieval, and response generation. In this setting, the LLM must generate responses that are as aligned with the database as possible. However, the ability of current end-to-end assistants to maintain consistency between the generated responses and the actual content of the domain knowledge is questionable (e.g., due to LLM confabulations), and there is a clear lack of empirical evidence on this crucial issue.</p><p>To be more concrete, Figure <ref type="figure" target="#fig_0">1</ref> shows an example of an inconsistent dialogue with respect to the conversational knowledge base. 
Here, although there are two Spanish restaurants in the knowledge base, the system (turn S1) informs the user that there are three Spanish restaurants, providing incorrect information. This is an example of inconsistency generated by an LLM, which is the focus of this research.</p><p>Our aim is to shed new light on the trustworthiness of an LLM playing the role of an assistant in a task-oriented conversational domain while interacting with a user. We aim to answer the following research questions: (i) How can we operationally define the consistency between a task-oriented dialogue and the domain database behind the dialogue? (ii) How can we quantify the degree of trustworthiness of an assistant-LLM? (iii) Can we collect empirical evidence on a sufficiently large amount of task-oriented dialogues?</p><p>To address these research questions, we set up an experimental framework allowing large-scale analysis, where task-oriented dialogues are first automatically generated by two instances of a state-of-the-art LLM, Llama-3 8B <ref type="bibr" target="#b5">[6]</ref>, and then a more powerful LLM, GPT-4o <ref type="bibr" target="#b6">[7]</ref>, is used to detect potential inconsistencies between a dialogue and a corresponding domain knowledge base. We hope that new large-scale experimental data can be used to develop more reliable and effective task-oriented dialogue systems, ultimately enhancing the capabilities of conversational agents in various applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology and Experimental Setting</head><p>Our experimental setting consists of two phases. In the preliminary phase, referred to as the Human-Llama Interaction phase (cf. Section 3), we test the capabilities of an open-source LLM (i.e., Llama-3) to generate adequate task-oriented dialogues through interactive conversations with humans.</p><p>In the second phase, referred to as the Llama-Llama Interaction phase (cf. Section 4), we automate both the generation and evaluation of task-oriented dialogues, creating a Llama-Llama generated MultiWOZ dialogue corpus, The Dining Llamas of Oz<ref type="foot">1</ref>. The remainder of this section describes the MultiWOZ dataset and the metrics used to check and quantify the reliability of the generated dialogues in both phases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">The MultiWOZ 2.3 Dataset</head><p>Since the primary focus of this work is on task-oriented dialogues, we used the MultiWOZ (Multi-Domain Wizard-Of-Oz) dataset <ref type="bibr" target="#b7">[8]</ref>, one of the most prominent datasets in this area. MultiWOZ has been extensively employed to develop and test models for natural language understanding, dialogue management, and natural language generation. <note place="foot" n="1">The generated dataset is publicly available at: https://github.com/tLabruna/The-Dining-Llamas-of-Oz</note> MultiWOZ is a widely known task-oriented dialogue dataset collected via the Wizard of Oz approach. The dataset comprises over 10,000 dialogues between a customer and the Cambridge InfoTown assistant, designed to help customers navigate Cambridge's amenities. The conversations span seven different domain concepts, including train ticket reservations, tourist attraction searches, and restaurant reservations. For our experiments, we selected data related to the restaurant domain (version 2.3 <ref type="bibr" target="#b8">[9]</ref>).</p><p>The MultiWOZ dialogues were collected with a system that provides information to the user relying on a specific database, known as the Knowledge Base (KB), describing properties of the Cambridge domain. Each domain concept has its own KB; for our experiments, we consider only the restaurant KB. The restaurant KB holds information about 110 different instances (i.e., restaurants), where each instance comprises a series of properties (e.g., Name, Food, Area) and corresponding values (e.g., The Old Cambridge, british, north).</p><p>All system turns in the dialogues are expected to consistently rely on the information contained in the KB to provide accurate information to the user.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Consistency Metrics</head><p>To assess the consistency of a generated turn against its Knowledge Base, we analysed each system-generated conversational turn that refers to any piece of information provided in the KB. Each turn was assessed on two separate binary metrics:</p><p>• KB-Alignment: Assesses whether the system turn is consistent with the KB, meaning that it does not contradict any information provided in the KB. • KB-Grounding: Assesses whether the system turn refrains from hallucinating and introducing information not present in the KB, ensuring all mentioned details are grounded in the existing KB.</p><p>For instance, the assessments for the system turns in Figure <ref type="figure" target="#fig_0">1</ref> would be as follows: T4 (KB-Alignment = 0, KB-Grounding = 1), T6 (KB-Alignment = 0, KB-Grounding = 0). In addition, we used two evaluation metrics to assess the overall quality of each turn and provide a global evaluation of the whole corpus:</p><p>• Correct Turns: Indicates the percentage of turns that have both KB-Alignment and KB-Grounding annotated as 1. • Correct Dialogues: Indicates the percentage of dialogues in which all turns have both KB-Alignment and KB-Grounding annotated as 1.</p><p>Together, these metrics describe the dialogue system's ability to maintain consistency and accuracy throughout the conversation.</p></div>
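<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how the two corpus-level metrics follow from the binary turn labels, the following minimal Python sketch aggregates per-turn (KB-Alignment, KB-Grounding) pairs into Correct Turns and Correct Dialogues. The data layout, the function names, and the toy labels are assumptions made for this example and are not taken from the released corpus; in particular, counting a dialogue with no judged turns as correct is a choice of this sketch.</p>
```python
from typing import Dict, List, Tuple

# Each judged system turn is a pair (kb_alignment, kb_grounding) of 0/1 labels.
# This layout is an assumption of the sketch, not the corpus format.
Turn = Tuple[int, int]


def correct_turns(dialogues: Dict[str, List[Turn]]) -> float:
    """Percentage of turns with both KB-Alignment and KB-Grounding equal to 1."""
    turns = [t for d in dialogues.values() for t in d]
    if not turns:
        return 0.0
    return 100.0 * sum(1 for a, g in turns if a == 1 and g == 1) / len(turns)


def correct_dialogues(dialogues: Dict[str, List[Turn]]) -> float:
    """Percentage of dialogues in which every turn has both labels equal to 1.

    Note: a dialogue with an empty turn list counts as correct here,
    which is an assumption of this sketch.
    """
    if not dialogues:
        return 0.0
    ok = sum(1 for d in dialogues.values() if all(a == 1 and g == 1 for a, g in d))
    return 100.0 * ok / len(dialogues)


# Toy labels, invented for illustration only.
toy = {
    "d1": [(1, 1), (1, 1)],
    "d2": [(0, 1), (1, 1)],
    "d3": [(0, 0)],
}
print(correct_turns(toy), correct_dialogues(toy))
```
<p>Because a single faulty turn invalidates a whole dialogue, Correct Dialogues can never exceed Correct Turns by construction only when dialogues are long; in general it is the stricter of the two measures.</p></div>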
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Human-Llama Interaction Phase</head><p>In this phase, we simulated the dialogue collection approach of the MultiWOZ dataset through the human-Llama interactive generation of novel dialogues. Although this phase required substantial human effort, it was crucial for obtaining an initial high-quality set of dialogues.</p><p>We aimed to generate dialogues where a human interacts with a system played by Llama-3 8B in two languages: English and Italian. The model was prompted to play the role of the Cambridge InfoTown system. The system's goal was to guide the user towards reserving a restaurant in Cambridge. For each dialogue, we utilised 10 restaurant instances taken from the MultiWOZ KB. We selected 6 distinct sets of instances, which had the following characteristics:</p><p>1. All with the same Food; 2. All with different Food (or as different as possible); 3. All with the same Price; 4. All with different Price (or as different as possible); 5. All with the same Area; 6. All with different Area (or as different as possible).</p><p>We chose the slots Food, Price, and Area to differentiate the sets since they are the informable slots within the Restaurant concept.</p><p>The human users were instructed to follow a scenario that involved reserving a restaurant, providing a realistic context for the dialogues. Five distinct instructions were employed for the interactive generation of a human-LLM dialogue, each paired with the 6 sets of KB instances, resulting in a total of 30 dialogue scenarios. The process was repeated in both English and Italian, leading to the creation of 30 dialogues in each language, for a total of 60 dialogues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Manual Evaluation</head><p>The manual evaluations were conducted by three annotators who assessed the dialogues based on the binary metrics KB-Alignment and KB-Grounding. Each of the 60 dialogues was annotated by at least two different annotators to ensure reliability. The inter-annotator agreement between human evaluators was measured using Cohen's Kappa (𝜅) to provide a measure of the inter-rater reliability (IRR) level. As per Table <ref type="table" target="#tab_0">1</ref>, we obtained an average 𝜅 in both metrics and languages that indicates substantial agreement on Landis and Koch's agreement scale <ref type="bibr" target="#b9">[10]</ref>. </p></div>
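<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who want to reproduce this kind of agreement figure, the sketch below computes Cohen's 𝜅 for two annotators over binary turn labels and maps the value onto the Landis and Koch scale. The function names and the toy label vectors are assumptions of this example; the authors' actual tooling is not described in the paper.</p>
```python
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label.
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    labels = set(a) | set(b)
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa: float) -> str:
    """Verbal label on the Landis and Koch agreement scale."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa >= 0.0:
        return "slight"
    return "poor"


# Toy annotations, invented for illustration only.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]
k = cohen_kappa(annotator_1, annotator_2)
print(round(k, 2), landis_koch(k))
```
</div>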
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Automated Evaluation</head><p>We instructed GPT-4o<ref type="foot" target="#foot_0">2</ref> to perform the same evaluations as the human annotators. This consisted of feeding the model a given KB/dialogue pair and asking it to output two lists of turn assessments: one for KB-Grounding and another for KB-Alignment. We then computed the agreement between GPT-4o's evaluations and the human evaluations. The precise prompt used to instruct GPT-4o can be found in Appendix B. Although the agreement with GPT-4o (see Table <ref type="table" target="#tab_0">1</ref>) was lower than the substantial agreement observed between human annotators, on average it was still classified as moderate on Landis and Koch's agreement scale <ref type="bibr" target="#b9">[10]</ref>. Given these results, we considered GPT-4o a valuable automatic judge and deployed it in the same way for the Llama-Llama evaluation phase (cf. Section 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">The Dining Llamas of Oz</head><p>After recognising the ability of Llama-3 to generate dialogues and the evaluation skills of GPT-4o (cf. Section 3.2), we conducted further experiments by generating 1,311 dialogues using Llama-3 8B and following the MultiWOZ dataset. For each dialogue of the original dataset, we utilised the instructions provided to the human user in the Wizard-of-Oz setting to guide a Llama acting as the user, interacting with a Llama acting as the system.</p><p>During the dialogue generation phase, we randomly selected 70 instances from the entire Knowledge Base for each simulated dialogue, ensuring that each dialogue was staged in a varied KB scenario. This approach, referred to as the Llama-Llama phase, allowed us to create a large set of automatically generated dialogues, each based on a different subset of the KB. We call this generated dataset "The Dining Llamas of Oz," which comprises 1,049 training instances, with 131 instances each for the validation and test sets.</p><p>Table <ref type="table" target="#tab_1">2</ref> presents statistics for the dataset, including the average number of turns per dialogue, the average length in number of tokens for user and system turns, and the Standardized Type-Token Ratio (STTR) <ref type="bibr" target="#b10">[11]</ref> for user and system turns. The STTR is calculated by merging all turns, segmenting them into chunks (we used a segmentation size of 1000), and computing the average TTR for all chunks.</p></div>
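<div xmlns="http://www.tei-c.org/ns/1.0"><p>The STTR computation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: tokens are taken as given (the paper does not specify the tokeniser), a trailing chunk shorter than the chunk size is discarded, and a text shorter than one chunk falls back to the plain type-token ratio. The function name sttr is hypothetical.</p>
```python
from typing import List


def sttr(tokens: List[str], chunk_size: int = 1000) -> float:
    """Standardized Type-Token Ratio: mean TTR over fixed-size token chunks."""
    # Only full chunks are kept; a shorter trailing chunk is dropped (assumption).
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:
        # Fewer tokens than one chunk: fall back to plain TTR (assumption).
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    return sum(len(set(c)) / len(c) for c in chunks) / len(chunks)


# Toy token stream, invented for illustration; a small chunk size keeps it readable.
tokens = "a b a b c d c d e".split()
print(sttr(tokens, chunk_size=4))
```
<p>Averaging over equal-sized chunks makes the ratio comparable between the short user turns and the much longer system turns, which a plain type-token ratio would not be.</p></div>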
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Turn-by-Turn Evaluation</head><p>To assess the quality of the Dining Llamas of Oz dataset, we employed GPT-4o, as in our previous experiments.</p><p>Using the same approach as in Section 3.2, we obtained a KB-Alignment score of 49.73% and a KB-Grounding score of 38.59% for the entire dataset. To verify the annotation quality of these new dialogues, we manually annotated 30 dialogues from the validation split and compared these annotations with GPT-4o's evaluations on the same dialogues. This initial comparison resulted in a low 𝜅 of 0.15 for KB-Alignment and 0.06 for KB-Grounding (slight agreement). To improve agreement and establish a reliable evaluation pipeline, we revised our approach: instead of passing the entire dialogue to GPT-4o, we evaluated one turn at a time. The detailed methodology was as follows:</p><p>1. Provide GPT-4o with a user utterance and the corresponding system response, and prompt it to determine if the system's response references the KB. 2. If GPT-4o indicates a reference to the KB: a) Prompt GPT-4o with the same user-system turn and the KB to determine if the system's turn shows KB-Alignment. b) Prompt GPT-4o with the same user-system turn and the KB to determine if the system's turn shows KB-Grounding.</p><p>The full prompt is available in Appendix B. This method allows for a more precise scoring of each turn, though it increases API usage and associated costs. We found that this turn-by-turn evaluation approach substantially improved the agreement: we obtained a 𝜅 of 0.68 for KB-Alignment and 0.49 for KB-Grounding (substantial and moderate agreement, respectively). Consequently, we decided to use this technique for automated evaluation.</p><p>Using this approach, we assessed 262 dialogues (from the validation and test splits) using GPT-4o. This provided a broader understanding of the KB consistency of Llama-generated dialogues across a larger dataset. 
The KB consistency evaluation is summarised in Table <ref type="table" target="#tab_2">3</ref>. The turns were filtered by removing those that were judged to have no reference to the KB. In addition to evaluating the metrics for all 262 dialogues, we further analysed the dataset by dividing it based on two criteria: the success of the dialogues and the dialogue length. For the success criterion, we distinguished between dialogues with a user instruction that, in the original MultiWOZ dataset, led to a successful restaurant booking (successful dialogues) and those that did not lead to any restaurant reservation (unsuccessful dialogues). For the dialogue length criterion, we distinguished between dialogues that had three or fewer turns (a maximum of three user utterances and three system utterances) and those that had four or more turns.</p></div>
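<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-step procedure of this section can be sketched as follows. This is not the authors' implementation: the judge is passed in as a plain callable that maps a prompt string to a "0" or "1" answer, so that any LLM client could be plugged in, and the three prompt constants are shortened placeholders standing in for the full prompts of Appendix B. The stub judge and the toy KB string exist only to make the example runnable.</p>
```python
from typing import Callable, Optional, Tuple

# Hypothetical judge interface: takes a prompt, returns "0" or "1".
Judge = Callable[[str], str]

# Shortened placeholders; the full prompts are given in Appendix B.
REFERENCE_PROMPT = "Does the system turn need checking against the KB? Answer 1 or 0."
ALIGNMENT_PROMPT = "Return 0 if the system turn contradicts the KB, 1 otherwise."
GROUNDING_PROMPT = "Return 1 if the system turn stays within the KB, 0 otherwise."


def evaluate_turn(user: str, system: str, kb: str, judge: Judge) -> Optional[Tuple[int, int]]:
    """Return (kb_alignment, kb_grounding), or None if the turn does not reference the KB."""
    turn = f"User: {user}\nSystem: {system}"
    # Step 1: filter out turns that make no reference to the KB (no KB is passed here).
    if judge(f"{REFERENCE_PROMPT}\n{turn}").strip() != "1":
        return None
    # Step 2a and 2b: two separate judgements, each given the turn and the KB.
    alignment = int(judge(f"{ALIGNMENT_PROMPT}\n{turn}\nKB: {kb}").strip() == "1")
    grounding = int(judge(f"{GROUNDING_PROMPT}\n{turn}\nKB: {kb}").strip() == "1")
    return alignment, grounding


def stub_judge(prompt: str) -> str:
    """Keyword-based stand-in for an LLM judge, for demonstration only."""
    if prompt.startswith(REFERENCE_PROMPT):
        return "1" if "restaurant" in prompt.lower() else "0"
    if prompt.startswith(ALIGNMENT_PROMPT):
        return "0" if "three Spanish" in prompt else "1"
    return "1"


KB = "Place A (spanish); Place B (spanish)"  # invented toy KB
print(evaluate_turn("I'd like Spanish food.", "There are three Spanish restaurants.", KB, stub_judge))
print(evaluate_turn("Thanks, bye.", "Goodbye!", KB, stub_judge))
```
<p>Each KB-referencing turn costs three judge calls in this sketch, which is consistent with the higher API usage reported for the turn-by-turn approach; turns for which the function returns None correspond to those filtered out before computing the figures in Table 3.</p></div>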
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>Our investigation into the performance of state-of-the-art Large Language Models (LLMs) like Llama-3 in task-oriented dialogue systems reveals several critical insights about their current limitations. The central finding is that while these models exhibit advanced capabilities in generating text, their quality in managing task-oriented dialogues remains unsatisfactory.</p><p>Initially, we compared human evaluations with GPT-4o's evaluations to assess its effectiveness in evaluating dialogue quality. This comparison was instrumental in determining that GPT-4o could be useful for dialogue evaluation, but it highlighted that the model's performance degrades significantly when scaled from a smaller to a larger Knowledge Base. The annotation agreement dropped notably as the number of KB instances increased from 10 to 70, indicating that GPT-4o struggles with larger, more complex datasets.</p><p>To address this, we shifted our approach to a turn-by-turn evaluation method. After extensive experimentation and prompt engineering, this method yielded improved results in terms of annotation agreement. However, this approach proved to be highly resource-intensive, pushing up costs significantly due to increased OpenAI API usage.</p><p>Our automated evaluations on 262 dialogues provided some revealing observations, as shown in Table <ref type="table" target="#tab_2">3</ref>. Notably, only around 40% of system turns demonstrated KB-Alignment and KB-Grounding. When considering both metrics together for Correct Turns and Correct Dialogues, the results were even more concerning: just 26% of turns and less than 9% of dialogues met the criteria for both metrics. These numbers underscore the inadequacy of current systems, indicating that a system producing such a low percentage of correct dialogues is not practical for real-world applications. 
Further analysis showed that dialogues with successful bookings performed better than those with failed bookings. Specifically, dialogues with successful bookings had 28.59% of correct turns and 11.29% of correct dialogues, compared to dialogues with failed bookings, which had 9 percentage points fewer correct turns and only 0.5% correct dialogues. This discrepancy likely arises because when no suitable restaurants are available, the Llama model tends to hallucinate, providing restaurants not present in the KB. While these restaurants may exist in Cambridge, they are absent from the provided dataset, highlighting the model's failure to adhere to the instructions given in the prompt.</p><p>We also explored the impact of dialogue length on performance. Shorter dialogues achieved nearly 30% correct turns and 11.23% correct dialogues, while longer dialogues showed a clear drop: roughly 6 percentage points fewer correct turns (22.80%) and only 3.17% correct dialogues. This suggests that as the conversation progresses, the likelihood of errors increases, possibly due to the model's difficulty in managing and integrating information from previous turns.</p><p>Overall, our findings highlight that current state-of-the-art open-source LLMs, such as Llama-3, are still unable to effectively serve as task-oriented dialogue systems while maintaining consistency with a provided KB. This underscores the need for further advancements in LLM capabilities and evaluation methodologies before such systems can be reliably used in practical applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Limitations</head><p>While our study contributes to understanding the capabilities of state-of-the-art LLMs in performing task-oriented dialogue tasks, it is important to acknowledge certain limitations that may affect the generalizability and scalability of our findings. The turn-by-turn evaluation approach, while effective in enhancing evaluation accuracy, proved to be computationally expensive. The quality of GPT-4o's evaluations was highly dependent on effective prompt engineering. Crafting the right prompts to ensure accurate evaluation results was challenging and time-consuming. Additionally, employing a diverse set of models for generating and evaluating dialogues could provide more comprehensive findings. Using multiple models might help in understanding the strengths and limitations of different approaches, potentially offering a more robust analysis of dialogue quality and consistency. This could also help in mitigating the limitations inherent in any single model or evaluation approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and Future Work</head><p>In this study, we explored the capabilities of state-of-the-art LLMs in generating task-oriented dialogues, focusing on maintaining consistency with a provided KB and avoiding hallucinations. Our experiments demonstrated that Llama-3, despite its advancements, struggles to perform reliably in these settings. The model showed significant limitations, especially in dialogues that led to failed outcomes (where the desired restaurant was not in the KB) and longer interactions. As a side contribution, we release The Dining Llamas of Oz, a corpus of 1,311 dialogues generated through user-Llama and system-Llama interactions, to aid future research. Our findings highlight the need for further development to improve LLM reliability and accuracy in task-oriented dialogue applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Llama Prompts</head><p>The following prompt has been used to instruct a Llama to play the role of a Cambridge InfoTown system, in English:</p><p>"You are the Cambridge TownInfo Centre, a system designed to help users maximize their experience in the city of Cambridge. Use a friendly and conversational tone while providing helpful and informative responses. All the information you provide must strictly rely on the Knowledge Base that you have been provided with. Ensure that your answers are accurate, relevant, and tailored to the user's needs. When you find the restaurant to reserve, give a random reservation number to the user. Be brief."</p><p>The following prompt has been used to instruct a Llama to play the role of a Cambridge InfoTown system, in Italian:</p><p>"Sei l'assistente Cambridge InfoCittà, un sistema progettato per aiutare gli utenti a trarre il meglio dalla loro esperienza nella città di Cambridge. Usa un tono amichevole e onversazionale, fornendo risposte informative e utili. Tutte le informazioni che fornisci devono basarsi strettamente sulla Knowledge Base che ti è stata data. Assicurati che le tue risposte siano accurate, pertinenti, e mirate ai bisogni dell'utente. Sii breve."</p><p>The following prompt has been used to instruct a Llama to play the role of a user looking for a restaurant in Cambridge, in English:</p><p>"You are a turist in the city of Cambridge and you are looking for a restaurant to dine in. Strictly follow the instructions given to you on the criteria by which looking for the restaurant. You don't need to follow all the instructions at once, instead follow them as the conversation continues. Be very brief, and go straight to the point. At the end, thank the system and say goodbye. 
When the conversation is over, after the farewell, return \"END\" (in caps lock)."</p><p>The following prompt has been used to instruct a Llama to play the role of a user looking for a restaurant in Cambridge, in Italian:</p><p>"Sei un turista nella città di Cambridge e stai cercando un ristorante dove cenare. Basati strettamente sulle istruzioni che ti vengono fornite riguardo i criteri in base ai quali cercare il ristorante. Non seguire tutte le istruzioni subito, invece seguile passo passo durante la conversazione. Sii molto breve e vai subito al punto."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. GPT Prompts</head><p>The following system prompt has been used as a general instruction telling GPT to behave like a dialogue evaluator:</p><p>"You are a dialogue evaluator. Given a dialogue you have to return a list of symbols separated by commas, where each symbol is an evaluation of each turn in the dialogue. Only system turns must be considered."</p><p>The following prompt has been used to instruct GPT to determine if a system turn talks about information contained in a KB:</p><p>"Given the following user and system turns, return 1 if the system turn contains information that requires verification from an external source to ensure its accuracy, 0 otherwise."</p><p>The following prompt has been used to instruct GPT to determine if a system turn constitutes a KB-Alignment error:</p><p>"Given the following user turn, system turn, and Knowledge Base (KB), return 0 if the system contradicts the KB (e.g. says that a restaurant is at north, but it's actually at south), 1 otherwise."</p><p>The following prompt has been used to instruct GPT to determine if a system turn constitutes a KB-Grounding error:</p><p>"Given the following user turn, system turn, and Knowledge Base, return 1 if the system doesn't mention properties outside of the Knowledge Base, 0 otherwise (e.g. says that the restaurant serves british and indian, but only indian is present in the KB)."</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An inconsistent dialogue with respect to a Knowledge Base (KB). Red values indicate inconsistencies between the system-generated text and the KB, whereas the green elements in bold indicate correct information.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Cohen's 𝜅 values for inter-annotator agreement on human-Llama generated dialogues.</figDesc><table><row><cell>Annotators</cell><cell>Metric</cell><cell>ITA</cell><cell>ENG</cell></row><row><cell>human-human</cell><cell>KB-Alignment</cell><cell>0.71</cell><cell>0.65</cell></row><row><cell>human-human</cell><cell>KB-Grounding</cell><cell>0.79</cell><cell>0.59</cell></row><row><cell>human-GPT-4o</cell><cell>KB-Alignment</cell><cell>0.60</cell><cell>0.58</cell></row><row><cell>human-GPT-4o</cell><cell>KB-Grounding</cell><cell>0.58</cell><cell>0.39</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Statistics of the Llama-Llama dialogues dataset.</figDesc><table><row><cell>Statistic</cell><cell>Value</cell></row><row><cell>Number of Dialogues</cell><cell>1311</cell></row><row><cell>Average Dialogue Length</cell><cell>6.21</cell></row><row><cell>Average User Turns Length</cell><cell>25.69</cell></row><row><cell>Average System Turns Length</cell><cell>124.52</cell></row><row><cell>User Turns STTR</cell><cell>0.29</cell></row><row><cell>System Turns STTR</cell><cell>0.41</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Turn-by-turn GPT-4o evaluation of KB consistency in The Dining Llamas of Oz validation and test splits.</figDesc><table><row><cell>Dialogues</cell><cell># Dialogues</cell><cell># Turns</cell><cell>KB-Alignment</cell><cell>KB-Grounding</cell><cell>Correct Turns</cell><cell>Correct Dialogues</cell></row><row><cell>All</cell><cell>262</cell><cell>656</cell><cell>41.46%</cell><cell>38.26%</cell><cell>26.35%</cell><cell>8.78%</cell></row><row><cell>Successful Bookings</cell><cell>196</cell><cell>494</cell><cell>42.51%</cell><cell>41.50%</cell><cell>28.59%</cell><cell>11.29%</cell></row><row><cell>Failing Bookings</cell><cell>66</cell><cell>162</cell><cell>38.27%</cell><cell>28.40%</cell><cell>19.62%</cell><cell>0.5%</cell></row><row><cell>Short dialogues</cell><cell>187</cell><cell>411</cell><cell>42.09%</cell><cell>38.44%</cell><cell>29.02%</cell><cell>11.23%</cell></row><row><cell>Long dialogues</cell><cell>75</cell><cell>245</cell><cell>40.41%</cell><cell>37.96%</cell><cell>22.80%</cell><cell>3.17%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">GPT-4o was used via the Microsoft Azure APIs. The API version was 2024-02-01. The cost for the API interactions was about $400.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been partially supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Conversational AI: Dialogue systems, conversational agents, and chatbots</title>
		<author>
			<persName><forename type="first">M</forename><surname>McTear</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Human Language Technologies</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="1" to="251" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Addressing domain changes in task-oriented conversational agents through dialogue adaptation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Labruna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop</title>
				<meeting>the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="149" to="158" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">POMDP-based statistical spoken dialog systems: A review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gašić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thomson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">101</biblScope>
			<biblScope unit="page" from="1160" to="1179" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Recent neural methods on slot filling and intent classification for task-oriented dialogue systems: A survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Louvan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.coling-main.42</idno>
		<ptr target="https://www.aclweb.org/anthology/2020.coling-main.42" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Computational Linguistics</title>
				<meeting>the 28th International Conference on Computational Linguistics<address><addrLine>Barcelona, Spain (Online)</addrLine></address></meeting>
		<imprint>
			<publisher>International Committee on Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="480" to="496" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Recent neural methods on dialogue state tracking for task-oriented dialogue systems: A survey</title>
		<author>
			<persName><forename type="first">V</forename><surname>Balaraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sheikhalishahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue</title>
				<meeting>the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="239" to="251" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Blecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Ferrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cucurull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Esiobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartshorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kerkez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khabsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kloumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korenev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liskovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Molybog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poulton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reizenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rungta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schelten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Kuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojnic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">GPT-4 technical report</title>
		<author>
			<persName><forename type="first">J</forename><surname>Openai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Achiam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Adler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">L</forename><surname>Akkaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Aleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Altenschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Altman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Anadkat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Avila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Babuschkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Balaji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Balcom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Baltescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bavarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Belgum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Berdine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bernadett-Shapiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bogdonoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Boiko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-L</forename><surname>Boyd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brakman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brooks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Brundage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Button</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campbell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Carey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Carlson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Carmichael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chantzis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cummings</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Currier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Decareaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Degry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Deutsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Deville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dowling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dunning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ecoffet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Eleti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eloundou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Farhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Felix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fishman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Forte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fulford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Georges</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Gibson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Goel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gogineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gontijo-Lopes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Grafstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Harris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heaton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Heidecke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hickey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hickey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hoeschele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Houghton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huizinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Jomoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jonn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Jun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Łukasz</forename><surname>Kaftan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kamali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">S</forename><surname>Kanitscheider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Keskar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kilpatrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kirchner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Knight</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Łukasz</forename><surname>Kokotajlo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kondraciuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kondrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Konstantinidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kosic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lampe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leike</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Makanju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Malfacini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Markovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mayer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mayne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>McGrew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>McKinney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>McLeavey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>McMillan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>McNeil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Medina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Menick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Metz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Monaco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Morikawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mossing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Murati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Murk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mély</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nakano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nayak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Noh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>O'Keefe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pachocki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Paino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Palermo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pantuliano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Parascandolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Parish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Parparita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pavlov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Perelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Avila Belbute Peres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Petrov</surname></persName>
		</author>
		<author>
			<persName><surname>De Oliveira Pinto</surname></persName>
		</author>
		<author>
			<persName><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pokorny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">H</forename><surname>Pokrass</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Pong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Powell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Power</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Power</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Proehl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Puri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Raymond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Real</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rimbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rotsted</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Roussez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Saltarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sanders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schnurr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Selsam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sheppard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sherbakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shieh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shoker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sidor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Simens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sitkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Slama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sohl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sokolowsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">P</forename><surname>Staudacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Such</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Summers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Tezak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Thompson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tillet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tootoonchian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tseng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tuggle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Turley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F C</forename><surname>Tworek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Uribe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vallone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vijayvergiya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Wainwright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Weinmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Welihinda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Welinder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wiethoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Willner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wolrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Workman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zaremba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zellers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhuk</surname></persName>
		</author>
		<author>
			<persName><surname>Zoph</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2303.08774" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling</title>
		<author>
			<persName><forename type="first">P</forename><surname>Budzianowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-H</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-H</forename><surname>Tseng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Casanueva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ultes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ramadan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gašić</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D18-1547</idno>
		<ptr target="https://www.aclweb.org/anthology/D18-1547" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="5016" to="5026" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">MultiWOZ 2.3: A multi-domain task-oriented dialogue dataset enhanced with annotation corrections and co-reference annotation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Takanobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Natural Language Processing and Chinese Computing: 10th CCF International Conference, NLPCC 2021</title>
				<meeting><address><addrLine>Qingdao, China</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">October 13-17, 2021</date>
			<biblScope unit="page" from="206" to="218" />
		</imprint>
	</monogr>
	<note>Proceedings, Part II 10</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The measurement of observer agreement for categorical data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Landis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">G</forename><surname>Koch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Type/token ratios: What do they really tell us?</title>
		<author>
			<persName><forename type="first">B</forename><surname>Richards</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Child Language</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="201" to="209" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
