<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Last Utterance Proactivity Prediction in Task-oriented Dialogues</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sofia</forename><surname>Brenna</surname></persName>
							<email>sbrenna@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
								<address>
									<addrLine>3 Dominikanerplatz 3 -Piazza Domenicani 3</addrLine>
									<postCode>39100</postCode>
									<settlement>Bozen-Bolzano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Povo, Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bernardo</forename><surname>Magnini</surname></persName>
							<email>magnini@fbk.eu</email>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Povo, Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Last Utterance Proactivity Prediction in Task-oriented Dialogues</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D6D053728DE699B4BE5316237C07474E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>task-oriented dialogues</term>
					<term>pragmatics</term>
					<term>proactivity</term>
					<term>automated annotation</term>
					<term>large language models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>While current LLMs achieve excellent performance in information seeking tasks, their conversational abilities when participants need to collaborate to jointly achieve a communicative goal (e.g., booking a restaurant, fixing an appointment, etc.) are still far from those exhibited by humans. Among various collaborative strategies, in this paper we focus on proactivity, i.e., when a participant offers useful information that was not explicitly requested. We propose a new task, called last utterance proactivity prediction, aimed at assessing the capacity of an LLM to detect proactive utterances in a dialogue. In the task, a model is given a small portion of a dialogue (that is, a dialogue snippet) and asked to determine whether the last utterance of the snippet is proactive or not. There are several benefits in using dialogue snippets: (i) they are more manageable than full dialogues, allowing us to reduce complexity; (ii) several phenomena in dialogue, including proactivity, depend on a short context, which allows a model to learn from snippets rather than full dialogues; and (iii) dialogue snippets make it easier to experiment on balanced datasets, overcoming the skewed distribution of proactivity in whole dialogues. In the paper, we first introduce a dataset for the last utterance proactivity prediction task. The dataset has then been used to instruct an LLM to classify proactivity. We run a series of experiments showing that predicting proactive utterances in a dialogue is feasible in a few-shot configuration, opening the road towards models that are able to generate proactive utterances as humans do.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>While current Large Language Models (LLMs) achieve excellent performance in information seeking tasks, their conversational abilities when participants need to collaborate to jointly achieve a communicative goal (e.g., booking a restaurant, fixing an appointment, etc.) are still far from those exhibited by humans. In this paper, we specifically focus on proactivity <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>, a collaborative behaviour investigated in the context of dialogue pragmatics <ref type="bibr" target="#b5">[6]</ref>. Proactivity refers to the act of taking initiative, anticipating potential problems, and actively providing information and contributing to the conversation with ideas, suggestions, or solutions. It involves participants engaging actively in the dialogue, addressing concerns, and promoting a collaborative environment. The following is an example of a dialogue in which the proactive utterance is underlined.</p><p>b: After completing my Bachelor's degree in Rome, I would like to move back towards home, to Florence.</p><p>a: We currently do not have any offers that fit your needs in the Florence area, however, there are job opportunities in Rome.</p><p>In order to model proactivity, we follow a similar approach to that of Shaikh et al., which focuses on grounding acts, a class of collaborative behaviours investigated in dialogue pragmatics. The main idea is that grounding acts can be: (i) identified and annotated by a Large Language Model; (ii) modeled through appropriate fine-tuning of the model itself. In addition, our work is related to recent approaches that use large language models as annotators <ref type="bibr" target="#b8">[9]</ref>, <ref type="bibr" target="#b9">[10]</ref>, <ref type="bibr" target="#b10">[11]</ref>. 
In the long term, our research goal is to instruct LLMs to be as proactive as humans are.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Last Utterance Proactivity Prediction</head><p>The goal of the paper is to show the feasibility of automatic detection of proactive utterances in task-oriented dialogues. We propose a task, Last Utterance Proactivity Prediction, where a portion of a dialogue (i.e., a dialogue snippet) is given to a model, which has to predict whether the last utterance of the snippet is proactive or not proactive. Using dialogue snippets, instead of full dialogues, brings several benefits: (i) dialogue snippets are much more manageable than full dialogues, allowing us to reduce the complexity of understanding and annotation; (ii) several phenomena in dialogue, including proactivity, depend on a short context, which allows a model to learn from snippets rather than full dialogues; and (iii) dialogue snippets make it easier to experiment on balanced datasets, overcoming the skewed distribution of proactivity in whole dialogues (it has been estimated that about 85% of the utterances in a task-oriented dialogue are not proactive).</p><p>We started with D-PRO<ref type="foot" target="#foot_0">2</ref>, a corpus of manually annotated task-oriented dialogues, which includes 151 dialogues from different sources, amounting to 2,855 turns and over 6,000 utterances, and carried out the following steps:</p><p>• we transformed the whole-dialogue annotation task into a one-utterance annotation task: given a short dialogue context, the model needs to establish whether the final utterance is either 'proactive' or 'not_proactive'; • in order to shorten the provided dialogue context, we collected excerpts (snippets) of 4 conversational turns from each dialogue. We believe 4 turns to be a convenient context for proactivity annotation, since statistics in the D-PRO corpus on turn adjacency between proactive utterances and the turn that triggers them revealed that, on average, 77.7% of proactive utterances are a direct response to the previous turn's utterances; • to restore balance among labels, we chose the same number of snippets ending without proactivity as snippets ending with proactivity.</p><p>A relevant consequence of the reduction of the provided dialogue context is a significant reduction of the input prompt length for an LLM, and therefore a reduction of the computational cost.</p></div>
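The snippet-construction and balancing steps above can be sketched as follows. This is a minimal illustration under assumed data structures (a dialogue as a list of turns, each turn a list of utterance/label pairs), not the D-PRO release format:

```python
import random

def build_snippets(dialogues, context_turns=4, seed=0):
    """Extract a snippet for every utterance (the utterance plus its
    preceding context, `context_turns` turns in total), then downsample
    the non-proactive snippets to restore label balance.

    `dialogues`: list of dialogues; a dialogue is a list of turns;
    a turn is a list of (utterance_text, is_proactive) pairs."""
    proactive, not_proactive = [], []
    for dialogue in dialogues:
        for t, turn in enumerate(dialogue):
            # earlier turns inside the 4-turn window
            context = dialogue[max(0, t - context_turns + 1):t]
            for u, (text, is_proactive) in enumerate(turn):
                # the snippet ends exactly at the target utterance
                snippet = [utt for tr in context for utt, _ in tr]
                snippet += [utt for utt, _ in turn[:u + 1]]
                (proactive if is_proactive else not_proactive).append(snippet)
    # restore balance by random downsampling of the majority class
    random.seed(seed)
    k = min(len(proactive), len(not_proactive))
    return proactive, random.sample(not_proactive, k)
```

Each proactive snippet ends with a distinct proactive utterance, and the non-proactive snippets are sampled down to the same count, mirroring the balancing step described above.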
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experimental Setting</head><p>This section reports the main features of the setting we used to experiment with the last utterance proactivity task introduced in Section 2.</p><p>Dataset for the experiments. The dataset for the experiments has been derived from D-PRO, a corpus equipped with manually curated proactivity-oriented annotations. D-PRO comprises 151 dialogues from 5 task-oriented dialogue sub-corpora, namely the Italian Whatsapp Corpus <ref type="bibr" target="#b11">[12]</ref>, the Italian Nespole! Corpus <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, Jilda <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b14">15]</ref>, the Italian Ubuntu Chat Corpus <ref type="bibr" target="#b15">[16]</ref>, and Multiwoz 2.2 <ref type="bibr" target="#b16">[17]</ref>. Most of the dialogues are in Italian, with the exception of the Multiwoz 2.2 dialogues and some dialogues from the Italian Whatsapp Corpus, due to code mixing and code switching employed by the speakers. D-PRO proactivity annotations are performed at the utterance level. The composition of our experimental dataset is as follows: from D-PRO we gathered as many 4-turn proactive dialogue snippets as there were proactive utterances, so that each snippet ended with a different proactive utterance. Then, we extracted as many non-proactive snippets as there were non-proactive utterances, so that each snippet ended with a different non-proactive utterance. Finally, we randomly selected an equal number of non-proactive dialogue snippets as proactive ones, to restore balance between the two types of snippets.</p><p>Data splitting. 
From each of the 5 corpora in D-PRO we randomly selected 30 dialogue snippets as a train set (to be used as few-shot examples), 50 snippets as a validation set (to be used for parameter optimization), and 100 snippets as a test set <ref type="foot" target="#foot_1">3</ref> .</p></div>
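The per-corpus split described above can be sketched as follows (a simple sketch assuming a per-corpus pool of at least 180 snippets; D-PRO provides no official splitting code):

```python
import random

def split_corpus(snippets, seed=0):
    """Per-corpus split: 30 few-shot (train) snippets, 50 validation
    snippets, and 100 test snippets, drawn at random without overlap."""
    rng = random.Random(seed)
    pool = list(snippets)
    assert len(pool) >= 180, "need at least 30 + 50 + 100 snippets"
    rng.shuffle(pool)
    return pool[:30], pool[30:80], pool[80:180]
```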
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model.</head><p>For the choice of the best model for proactivity prediction, we carried out a number of experiments, reported in Section 4.1. The model selected is OpenAI's GPT-4o-2024-08-06, used with temperature = 0.</p><p>Prompt optimisation. A prompt engineering phase took place, in which various prompt proposals were tested in the same setting (the same train snippets used as few-shot examples, the same validation snippets used to evaluate the model). Figure <ref type="figure" target="#fig_1">1</ref> shows the final prompt used in our experiments. The prompt consists of two main parts: (i) the system prompt, which contains the general task instructions given to the model; (ii) the messages prompt, which is further divided into alternating user messages and assistant messages: this is the part of the prompt where the model receives few-shot examples (user messages) with answers (assistant messages). The final user/assistant pair contains the target dialogue being evaluated by the model at the current time.</p><p>Baseline. A random chance baseline (accuracy = 0.5, see Table <ref type="table" target="#tab_9">8</ref>) is created by eliminating any system and message prompt except for "Output exclusively either "proactive" or "not_proactive"." and providing the target dialogue snippet.</p></div>
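The two-part prompt structure described above can be assembled as a standard chat-format message list. A sketch; the exact wording of the instructions is the one shown in Figure 1, and the function name is illustrative:

```python
def build_prompt(system_prompt, few_shot, target_snippet):
    """Assemble the prompt: the system prompt with the task
    instructions, then alternating user/assistant messages carrying the
    few-shot snippets and their gold labels, and finally the target
    snippet as the last user message, for which the model is expected
    to output 'proactive' or 'not_proactive'."""
    messages = [{"role": "system", "content": system_prompt}]
    for snippet, label in few_shot:
        messages.append({"role": "user", "content": snippet})
        messages.append({"role": "assistant", "content": label})
    # the final user message holds the dialogue being evaluated
    messages.append({"role": "user", "content": target_snippet})
    return messages
```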
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Parameter Setting</head><p>We first present several optimization trials performed on a single corpus (MultiWoz) in order to select the best LLM (4.1) and to assess the impact of the number (4.2) and of the order (4.3) of few-shot example snippets. Secondly, we run tests on the DEV set of each of the five corpora in order to select the best few-shot snippet order for each corpus (4.4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Setting the Large Language Model</head><p>Once the best prompt had been established, we tested the APIs of various models to pick the best cost/performance trade-off, as reported in Table <ref type="table" target="#tab_0">1</ref>. GPT-4o-2024-08-06 was selected as the best performing model (Accuracy: 0.74, F1: 0.68), at a lower cost than GPT-4o-2024-05-13 and with better scores than both GPT-4o-mini and GPT-4o-mini-2024-07-18.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Setting the Number of Few-shot Dialogue Snippets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Assessing the Impact of Few-shot Dialogue Snippets Order</head><p>This experiment assesses the stability of the model while changing the order of the few-shot examples.</p><p>As the literature points out <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20]</ref>, LLMs suffer from position bias when handling a longer context: we found that this is also the case in our experiments, and that there are differences of up to 10 points in accuracy when testing with different orders of the same set of examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Setting Few-Shot Snippets Order</head><p>In this experiment we select the best order of the few-shot snippets for each of the five dialogue datasets (i.e., Whatsapp, Nespole, Ubuntu, Jilda and MultiWoz) for the last utterance proactivity prediction task. We tested 5 different orders of the dialogue snippets by randomly shuffling the same set of snippets selected in Section 4.2. We used the following configuration for the experiments: 12 random few-shot snippets; 50 validation (DEV) snippets; 5 random shuffles of the few-shot snippets; the average of the performances over the 5 shuffles for each corpus. The optimal order is selected as the one with the highest accuracy and F1.</p></div>
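The order-selection procedure above can be sketched as a small search loop. This is an illustration only: `evaluate` stands in for a run of the model over the DEV set and is assumed to return an (accuracy, F1) pair.

```python
import random

def select_best_order(few_shot, evaluate, n_shuffles=5, seed=0):
    """Try `n_shuffles` random orders of the same few-shot snippets and
    keep the one with the best validation score. `evaluate(order)` is a
    hypothetical callback returning (accuracy, f1) for that order."""
    rng = random.Random(seed)
    best_order, best_score = None, (-1.0, -1.0)
    for _ in range(n_shuffles):
        order = list(few_shot)
        rng.shuffle(order)
        score = evaluate(order)
        if score > best_score:  # tuple comparison: accuracy first, then F1
            best_order, best_score = order, score
    return best_order, best_score
```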
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>In this section we present the results of the last utterance proactivity prediction task in two different configurations: using few-shot examples from individual corpora (5.1), and mixing few-shot examples from all corpora (5.2 and 5.3), introducing transfer learning as well. Lastly, in Section 5.4 we describe an experiment that attempts to evaluate the model's stability in corrupted context scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Few-Shots from Individual Corpus and Testing on Individual Corpus</head><p>We tested the model by using the prompt and few-shot configuration that gave the best results on the development set for each corpus individually. As Table <ref type="table" target="#tab_6">5</ref> shows, while IAA with the ground truth labels is still not optimal, and scores rather low on both Whatsapp and Ubuntu (fair agreement<ref type="foot" target="#foot_2">4</ref>), we reach a moderate agreement in Jilda and Multiwoz, and a substantial agreement in Nespole (0.72) that is just below the IAA score between human annotators (0.77). Results are consistent with the outcomes that we obtained on the development set in Table <ref type="table" target="#tab_5">4</ref>. On average, Nespole achieved the best accuracy (0.86), followed by MultiWoz (0.77), Jilda (0.75), and Whatsapp and Ubuntu (both 0.64). For all datasets the results are largely above the baseline (i.e., 0.50 accuracy, equivalent to chance; see also Table <ref type="table" target="#tab_9">8</ref>, Baselines -Full Context column), showing that the model has correctly learned our definition of proactivity. The fact that Nespole obtained the best results is somewhat surprising, given that this corpus is quite complex: utterances are longer than in the other corpora, and so are the dependencies between a proactive utterance and its trigger utterance. Longer utterances, on the other hand, mean that the model is given a slightly richer context on which to base its judgments, which may help with the annotation process. As for the lowest scores, the poor results for Whatsapp and especially Ubuntu were expected, since these are less structured (both syntactically and grammatically), more chaotic, multi-party dialogue corpora, where proactivity is much more difficult to detect unanimously, even for humans, and where the human-human IAA scores the lowest (0.63 and 0.41 respectively).</p></div>
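Cohen's kappa, the chance-corrected agreement measure reported throughout the tables, can be computed as follows (a standard textbook implementation, not code from the paper):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotations of the same items:
    (observed agreement - expected agreement) / (1 - expected)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement if the two annotators labelled independently
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
```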
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Few-Shots from All Corpora and Testing on Individual Corpus</head><p>Secondly, we ran some in-context learning <ref type="bibr" target="#b21">[22]</ref> experiments to check whether few-shot examples from different corpora could improve the performance on one individual target corpus. The idea is drawn from work on multi-task learning <ref type="bibr" target="#b22">[23]</ref>, where more than one task is learned simultaneously by the model, and transfer learning, where improvement is obtained on a new task through the transfer of knowledge from a related task that has already been learned <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25]</ref>. Our intuition is that example variety on very similar tasks may be the key to improvement on a single target task. Following this line, we combine dialogue snippets from all five corpora as few-shot examples, so that we have 5 sets of 12 snippets each. Given the position bias exhibited by the model, we decided to keep the intra-corpus example order the same as the one used in 5.1, and to randomly shuffle the inter-corpus order 5 times to select the optimal few-shot prompt (same methodology as in 4.4 and 5.1). The outcomes of the tests with the optimal prompt are reported in Table <ref type="table" target="#tab_7">6</ref>, which is directly comparable to Table <ref type="table" target="#tab_6">5</ref>. We found that the only corpus in our experiments that suffered from the mixed few-shot prompting is Jilda, with a significant drop in performance, while every other corpus has very similar or slightly higher scores. Ablation tests on the Jilda corpus led to an accuracy of 0.71 when removing the most chaotic corpora (Ubuntu and Whatsapp), showing that the individual-corpus few-shot approach still works best for this corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Few-Shot from All Corpora, Testing on All Corpora</head><p>We finally tested on the cumulative test set of all corpora, with mixed few-shot examples. We experimented with two configurations of few-shot snippets: (i) 60 snippets in the optimal order as in Table <ref type="table" target="#tab_7">6</ref>; (ii) 15 snippets in total, with 3 random snippets per corpus. Outcomes in Table <ref type="table" target="#tab_8">7</ref> show the best and average results over 5 runs for both the 60 and the 15 few-shot example settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Testing with Corrupted Context: Masking the Trigger Utterance</head><p>We investigated the performance of the LLM in a corrupted context situation. By definition, the two key characteristics of proactivity are not being solicited and being beneficial to the dialogue goals; hence proactivity can only be defined in terms of the previous context. When we corrupt the context before the snippet's final utterance, we may be compromising the data required for the proactivity annotation task. We implemented two corrupted context situations: triggering utterance removed, where the text of the utterance that triggers the final utterance in the dialogue snippet is removed from the snippet, and triggering utterance masked, where the text of the trigger utterance is masked by a placeholder. In both circumstances, the presence of a corrupted utterance is indicated by the utterance number, whereas only the content is erased or masked. Since we need to test the effect of the context corruption, the model still learns the original full-context task from the few-shot examples. We anticipate that the LLM's performance will suffer, since a critical component of the dialogue (the triggering utterance) has been compromised. We also expect the number of false positives to increase significantly. This is because, by eliminating the trigger utterance, we may also eliminate the element that renders the final utterance either proactive or non-proactive. Specifically, we are deleting the element of the context that allows us to determine whether the content of the last utterance is novel (i.e., proactive) or unrequested by the trigger. Since the request is missing from the corrupted setting, the triggered response appears proactive rather than solicited, resulting in an increase in "proactive" labels. 
Results, presented in Table <ref type="table" target="#tab_9">8</ref>, confirm our intuition, showing that the model accuracy drops from 0.80 with the full context to 0.66 and 0.64 when the triggering utterance is removed and masked, respectively. The majority of the performance drop is attributable to an increase in false positives, from 2 in the full context to 8 in the corrupted context, as well as a drop in true negatives, which supports our hypothesis. On the other hand, our experiments show that even with insufficiently task-specific few-shot examples, the model can perform significantly better than the random chance baselines (see also Section 3) thanks to a solid instruction prompt.</p></div>
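The two corruption settings above (removal and masking of the trigger utterance, while keeping its utterance number) can be sketched as follows. The snippet representation as (utterance_number, text) pairs is a hypothetical illustration:

```python
def corrupt_snippet(snippet, trigger_index, mode="mask", placeholder="[MASKED]"):
    """Corrupt the context before the final utterance: the trigger
    utterance's text is masked by a placeholder (mode='mask') or
    removed entirely (mode='remove'), while its utterance number is
    kept, so the model still sees that an utterance occurred there.
    `snippet`: list of (utterance_number, text) pairs;
    `trigger_index`: position of the trigger utterance."""
    corrupted = []
    for i, (number, text) in enumerate(snippet):
        if i == trigger_index:
            # erase or mask only the content, never the utterance number
            corrupted.append((number, placeholder if mode == "mask" else ""))
        else:
            corrupted.append((number, text))
    return corrupted
```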
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>We introduced a new task, namely, last utterance proactivity prediction, aiming at assessing the capacity of Large Language Models to detect and annotate proactive behaviours in task-oriented dialogues.</p><p>The task allows us to shorten the context from a whole dialogue to a dialogue snippet, simplify the annotation process, and balance the dataset for positive and negative labels. We showed that a few-shot approach with GPT-4o achieves encouraging performance on a test set composed of dialogue snippets collected from five different corpora, and that in particular for the Nespole corpus the agreement between the model labels and the human-annotated gold labels is nearly equivalent to the agreement between humans. As for future work, there are several ongoing activities. First, we are still investigating techniques to further improve the performance on the task, especially in testing on the combined dialogues from all corpora. Then, we plan to use the GPT-4o model to automatically annotate a large amount (i.e., about 100K) of dialogue snippets, in order to create a training corpus, which, in turn, will be used to instruct an open source model (e.g., Llama 3 8B) to detect proactivity.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>a: Do you have any preference about where to work? b:</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Prompt given to the LLM.</figDesc><graphic coords="3,111.54,65.60,372.20,174.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Testing across various GPT models through OpenAI APIs.</figDesc><table><row><cell>Metric</cell><cell cols="4">gpt-4o-2024-05-13 gpt-4o-2024-08-06 gpt-4o-mini gpt-4o-mini-2024-07-18</cell><cell>St.Dev.</cell></row><row><cell>Total Accuracy</cell><cell>0.64</cell><cell>0.74</cell><cell>0.70</cell><cell>0.68</cell><cell>0.04</cell></row><row><cell>Total Precision</cell><cell>0.67</cell><cell>0.88</cell><cell>0.73</cell><cell>0.71</cell><cell>0.09</cell></row><row><cell>Total Recall</cell><cell>0.56</cell><cell>0.56</cell><cell>0.64</cell><cell>0.60</cell><cell>0.04</cell></row><row><cell>Total F1 Score</cell><cell>0.61</cell><cell>0.68</cell><cell>0.68</cell><cell>0.65</cell><cell>0.03</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Testing proactivity prediction with different numbers of few-shot dialogue snippets on the DEV set.</figDesc><table><row><cell>Few-shot examples</cell><cell>0</cell><cell>10</cell><cell>12</cell><cell>15</cell><cell>20</cell><cell>25</cell><cell>30</cell><cell>St.Dev.</cell></row><row><cell>Total Accuracy</cell><cell cols="7">0.66 0.72 0.76 0.74 0.72 0.72 0.70</cell><cell>0.02</cell></row><row><cell>Total Precision</cell><cell cols="7">0.63 0.79 0.81 0.88 0.82 0.82 0.78</cell><cell>0.04</cell></row><row><cell>Total Recall</cell><cell cols="7">0.76 0.60 0.68 0.56 0.56 0.56 0.56</cell><cell>0.06</cell></row><row><cell>Total F1 Score</cell><cell cols="7">0.69 0.68 0.74 0.68 0.67 0.67 0.65</cell><cell>0.03</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Testing proactivity prediction with different orders of few-shot dialogue snippets on DEV set. The variability of the results is measured with standard deviation (St.Dev.).</figDesc><table><row><cell>Shuffles</cell><cell>0</cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>St.Dev.</cell></row><row><cell>Total Accuracy</cell><cell cols="6">0.76 0.66 0.68 0.72 0.74 0.66</cell><cell>0.04</cell></row><row><cell>Total Precision</cell><cell cols="6">0.84 0.79 0.71 0.76 0.80 0.70</cell><cell>0.05</cell></row><row><cell>Total Recall</cell><cell cols="6">0.64 0.44 0.60 0.64 0.64 0.56</cell><cell>0.10</cell></row><row><cell>Total F1 Score</cell><cell cols="6">0.73 0.56 0.65 0.70 0.71 0.62</cell><cell>0.07</cell></row><row><cell cols="7">Cohen's Kappa 0.55 0.36 0.39 0.47 0.51 0.35</cell><cell>0.09</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Table 2 reports the performance of the LLM on the DEV set while increasing (from 0 to 30) the number of few-shot dialogue snippets. It can be noted that increasing the few-shot examples up to 15 increases the precision of the model, while more examples result in worse precision. On the other hand, the highest recall is obtained with 0 examples (resulting in more false positive cases). The best accuracy (0.76) and F1 Score (0.74) are obtained with 12 examples.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc></figDesc><table /><note>reports the results of the experiments under six random orderings of the same 12 examples.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Optimising proactivity prediction with different few-shot dialogue snippets order on each dataset DEV set. Configuration: 12 few-shot snippets and 50 validation snippets. s-n indicates re-shuffled sets of 12 few-shot snippets; highest scores across re-shuffles are marked in bold. Results that show a statistically significant increase compared to the average (p &lt; 0.05) are highlighted in green.</figDesc><table><row><cell>Dataset</cell><cell>Metric</cell><cell>s-0</cell><cell>s-1</cell><cell>s-2</cell><cell>s-3</cell><cell>s-4</cell><cell cols="2">Average St.Dev.</cell></row><row><cell>Whatsapp</cell><cell>Accuracy</cell><cell>0.66</cell><cell>0.72</cell><cell>0.62</cell><cell>0.72</cell><cell>0.68</cell><cell>0.68</cell><cell>0.05</cell></row><row><cell></cell><cell>Precision</cell><cell>0.70</cell><cell>0.87</cell><cell>0.69</cell><cell>0.82</cell><cell>0.76</cell><cell>0.77</cell><cell>0.09</cell></row><row><cell></cell><cell>Recall</cell><cell>0.56</cell><cell>0.52</cell><cell>0.44</cell><cell>0.56</cell><cell>0.52</cell><cell>0.52</cell><cell>0.06</cell></row><row><cell></cell><cell>F1 Score</cell><cell>0.62</cell><cell>0.65</cell><cell>0.54</cell><cell>0.67</cell><cell>0.62</cell><cell>0.62</cell><cell>0.06</cell></row><row><cell></cell><cell>Cohen's 
Kappa</cell><cell>0.31</cell><cell>0.43</cell><cell>0.23</cell><cell>0.43</cell><cell>0.35</cell><cell>0.35</cell><cell>0.10</cell></row><row><cell>Nespole</cell><cell>Accuracy</cell><cell>0.84</cell><cell>0.86</cell><cell>0.80</cell><cell>0.80</cell><cell>0.82</cell><cell>0.82</cell><cell>0.03</cell></row><row><cell></cell><cell>Precision</cell><cell>0.84</cell><cell>0.80</cell><cell>0.78</cell><cell>0.76</cell><cell>0.81</cell><cell>0.80</cell><cell>0.03</cell></row><row><cell></cell><cell>Recall</cell><cell>0.84</cell><cell>0.96</cell><cell>0.84</cell><cell>0.88</cell><cell>0.84</cell><cell>0.87</cell><cell>0.06</cell></row><row><cell></cell><cell>F1 Score</cell><cell>0.84</cell><cell>0.87</cell><cell>0.81</cell><cell>0.81</cell><cell>0.82</cell><cell>0.83</cell><cell>0.03</cell></row><row><cell></cell><cell>Cohen's Kappa</cell><cell>0.67</cell><cell>0.72</cell><cell>0.59</cell><cell>0.59</cell><cell>0.63</cell><cell>0.64</cell><cell>0.06</cell></row><row><cell>Ubuntu</cell><cell>Accuracy</cell><cell>0.68</cell><cell>0.64</cell><cell>0.68</cell><cell>0.64</cell><cell>0.68</cell><cell>0.66</cell><cell>0.02</cell></row><row><cell></cell><cell>Precision</cell><cell>0.68</cell><cell>0.65</cell><cell>0.67</cell><cell>0.65</cell><cell>0.71</cell><cell>0.67</cell><cell>0.02</cell></row><row><cell></cell><cell>Recall</cell><cell>0.68</cell><cell>0.60</cell><cell>0.72</cell><cell>0.60</cell><cell>0.60</cell><cell>0.64</cell><cell>0.06</cell></row><row><cell></cell><cell>F1 Score</cell><cell>0.68</cell><cell>0.62</cell><cell>0.69</cell><cell>0.62</cell><cell>0.65</cell><cell>0.65</cell><cell>0.04</cell></row><row><cell></cell><cell cols="2">Cohen's Kappa 
0.35</cell><cell>0.27</cell><cell>0.35</cell><cell>0.27</cell><cell>0.35</cell><cell>0.32</cell><cell>0.05</cell></row><row><cell>Jilda</cell><cell>Accuracy</cell><cell>0.70</cell><cell>0.74</cell><cell>0.74</cell><cell>0.70</cell><cell>0.68</cell><cell>0.71</cell><cell>0.02</cell></row><row><cell></cell><cell>Precision</cell><cell>0.69</cell><cell>0.75</cell><cell>0.73</cell><cell>0.78</cell><cell>0.74</cell><cell>0.74</cell><cell>0.04</cell></row><row><cell></cell><cell>Recall</cell><cell>0.72</cell><cell>0.72</cell><cell>0.76</cell><cell>0.56</cell><cell>0.56</cell><cell>0.66</cell><cell>0.09</cell></row><row><cell></cell><cell>F1 Score</cell><cell>0.71</cell><cell>0.73</cell><cell>0.75</cell><cell>0.65</cell><cell>0.64</cell><cell>0.70</cell><cell>0.04</cell></row><row><cell></cell><cell>Cohen's Kappa</cell><cell>0.39</cell><cell>0.47</cell><cell>0.47</cell><cell>0.39</cell><cell>0.35</cell><cell>0.41</cell><cell>0.05</cell></row><row><cell>Multiwoz</cell><cell>Accuracy</cell><cell>0.82</cell><cell>0.76</cell><cell>0.76</cell><cell>0.76</cell><cell>0.66</cell><cell>0.75</cell><cell>0.03</cell></row><row><cell></cell><cell>Precision</cell><cell>0.94</cell><cell>0.88</cell><cell>0.81</cell><cell>0.81</cell><cell>0.79</cell><cell>0.85</cell><cell>0.06</cell></row><row><cell></cell><cell>Recall</cell><cell>0.68</cell><cell>0.60</cell><cell cols="2">0.68 0.68</cell><cell>0.44</cell><cell>0.62</cell><cell>0.04</cell></row><row><cell></cell><cell>F1 Score</cell><cell>0.79</cell><cell>0.71</cell><cell>0.74</cell><cell>0.74</cell><cell>0.56</cell><cell>0.71</cell><cell>0.03</cell></row><row><cell></cell><cell cols="2">Cohen's Kappa 0.63</cell><cell>0.51</cell><cell>0.51</cell><cell>0.51</cell><cell>0.30</cell><cell>0.49</cell><cell>0.06</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5</head><label>5</label><figDesc>Individual corpus testing. For comparison, the DEV Average column reports averaged results over the five corpora on the development set.</figDesc><table><row><cell>Metric</cell><cell cols="5">Whatsapp Nespole Ubuntu Jilda Multiwoz</cell><cell>Average</cell><cell>DEV Average</cell></row><row><cell>Accuracy</cell><cell>0.64</cell><cell>0.86</cell><cell>0.64</cell><cell>0.75</cell><cell>0.77</cell><cell>0.73</cell><cell>0.73</cell></row><row><cell>Precision</cell><cell>0.69</cell><cell>0.85</cell><cell>0.68</cell><cell>0.74</cell><cell>0.81</cell><cell>0.75</cell><cell>0.76</cell></row><row><cell>Recall</cell><cell>0.50</cell><cell>0.88</cell><cell>0.52</cell><cell>0.78</cell><cell>0.69</cell><cell>0.67</cell><cell>0.66</cell></row><row><cell>F1 Score</cell><cell>0.58</cell><cell>0.86</cell><cell>0.59</cell><cell>0.76</cell><cell>0.75</cell><cell>0.71</cell><cell>0.70</cell></row><row><cell>Cohen's Kappa</cell><cell>0.27</cell><cell>0.72</cell><cell>0.27</cell><cell>0.50</cell><cell>0.54</cell><cell>0.46</cell><cell>0.44</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6</head><label>6</label><figDesc>Testing the model with few-shot examples taken from all five corpora. Results are given for the best inter-corpus order over five runs.</figDesc><table><row><cell>Metric</cell><cell cols="5">Whatsapp Nespole Ubuntu Jilda Multiwoz</cell><cell>Average</cell></row><row><cell>Accuracy</cell><cell>0.63</cell><cell>0.87</cell><cell>0.66</cell><cell>0.69</cell><cell>0.77</cell><cell>0.72</cell></row><row><cell>Precision</cell><cell>0.74</cell><cell>0.88</cell><cell>0.74</cell><cell>0.69</cell><cell>0.78</cell><cell>0.77</cell></row><row><cell>Recall</cell><cell>0.40</cell><cell>0.86</cell><cell>0.50</cell><cell>0.60</cell><cell>0.68</cell><cell>0.62</cell></row><row><cell>F1 Score</cell><cell>0.52</cell><cell>0.87</cell><cell>0.60</cell><cell>0.69</cell><cell>0.76</cell><cell>0.68</cell></row><row><cell>Cohen's Kappa</cell><cell>0.27</cell><cell>0.74</cell><cell>0.32</cell><cell>0.37</cell><cell>0.54</cell><cell>0.44</cell></row></table><note>moderate agreement in Jilda and Multiwoz, and substantial agreement in Nespole (0.72), just below the IAA score between human annotators (0.77). Results are consistent with the outcomes that we obtained on the development set in Table</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 7</head><label>7</label><figDesc>Best and average results over five runs with different inter-corpus order.</figDesc><table><row><cell cols="3">Few-shot examples 60-best 15-best</cell><cell cols="2">60-average 15-average</cell></row><row><cell>Accuracy</cell><cell>0.71</cell><cell>0.69</cell><cell>0.69</cell><cell>0.68</cell></row><row><cell>Precision</cell><cell>0.75</cell><cell>0.72</cell><cell>0.74</cell><cell>0.73</cell></row><row><cell>Recall</cell><cell>0.63</cell><cell>0.60</cell><cell>0.59</cell><cell>0.56</cell></row><row><cell>F1 Score</cell><cell>0.69</cell><cell>0.66</cell><cell>0.66</cell><cell>0.63</cell></row><row><cell>Cohen's Kappa</cell><cell>0.42</cell><cell>0.38</cell><cell>0.38</cell><cell>0.36</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 8</head><label>8</label><figDesc>Proactivity prediction with corrupted dialogue snippets on MultiWoz. Highlighted TN and FP are statistically different from the test with full context (p-value = 0.04123); results with Trigger Utterance both Empty and Masked are statistically lower (p &lt; 0.01) than in Full Context setting.</figDesc><table><row><cell></cell><cell></cell><cell>TESTS</cell><cell></cell><cell></cell><cell>BASELINES</cell><cell></cell></row><row><cell cols="4">Trigger Utterance Full Context Empty Masked</cell><cell>Full Context</cell><cell>Empty</cell><cell>Masked</cell></row><row><cell>True Positives</cell><cell>17</cell><cell>16</cell><cell>15</cell><cell>20</cell><cell>22</cell><cell>23</cell></row><row><cell>True Negatives</cell><cell>23</cell><cell>17</cell><cell>17</cell><cell>5</cell><cell>5</cell><cell>5</cell></row><row><cell>False Positives</cell><cell>2</cell><cell>8</cell><cell>8</cell><cell>20</cell><cell>20</cell><cell>20</cell></row><row><cell>False Negatives</cell><cell>8</cell><cell>9</cell><cell>10</cell><cell>5</cell><cell>3</cell><cell>2</cell></row><row><cell>Accuracy</cell><cell>0.80</cell><cell>0.66</cell><cell>0.64</cell><cell>0.50</cell><cell>0.54</cell><cell>0.56</cell></row><row><cell>Precision</cell><cell>0.89</cell><cell>0.67</cell><cell>0.65</cell><cell>0.50</cell><cell>0.52</cell><cell>0.53</cell></row><row><cell>Recall</cell><cell>0.68</cell><cell>0.64</cell><cell>0.60</cell><cell>0.80</cell><cell>0.88</cell><cell>0.92</cell></row><row><cell>F1 Score</cell><cell>0.77</cell><cell>0.65</cell><cell>0.62</cell><cell>0.62</cell><cell>0.66</cell><cell>0.68</cell></row><row><cell>Cohen's Kappa</cell><cell>0.59</cell><cell>0.31</cell><cell>0.26</cell><cell>-0.01</cell><cell>0.07</cell><cell>0.11</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://github.com/sofiabrenna/dpro</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">The set sizes were determined by the capacity of the least proactive sub-corpus, MultiWOZ, which featured 90 proactive utterances; hence 90 proactive snippets and 90 non-proactive snippets.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">According to Landis and Koch's scale<ref type="bibr" target="#b20">[21]</ref>.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bonetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Hromei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Stranisci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</title>
				<meeting>the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Proactive systems and influenceable users: Simulating proactivity in task-oriented dialogues</title>
		<author>
			<persName><forename type="first">V</forename><surname>Balaraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th Workshop on the Semantics and Pragmatics of Dialogue-Full Papers</title>
				<meeting>the 24th Workshop on the Semantics and Pragmatics of Dialogue-Full Papers<address><addrLine>Waltham, New Jersey</addrLine></address></meeting>
		<imprint>
			<publisher>SEMDIAL</publisher>
			<date type="published" when="2020-07">July 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Pro-active systems and influenceable users: Simulating pro-activity in task-oriented dialogues</title>
		<author>
			<persName><forename type="first">V</forename><surname>Balaraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th Workshop on the Semantics and Pragmatics of Dialogue</title>
				<meeting>the 24th Workshop on the Semantics and Pragmatics of Dialogue</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">P.-M</forename><surname>Strauss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Minker</surname></persName>
		</author>
		<title level="m">Proactive spoken dialogue interaction in multi-party environments</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-S</forename><surname>Chua</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.02750</idno>
		<title level="m">A survey on proactive dialogue systems: Problems, methods, and prospects</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Levinson</surname></persName>
		</author>
		<title level="m">Pragmatics</title>
				<meeting><address><addrLine>Cambridge, United Kingdom</addrLine></address></meeting>
		<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Becoming JILDA</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sucameli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Speranza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Italian Conference on Computational Linguistics CLIC-it 2020</title>
				<meeting>the Seventh Italian Conference on Computational Linguistics CLIC-it 2020<address><addrLine>Bologna</addrLine></address></meeting>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Grounding gaps in language model generations</title>
		<author>
			<persName><forename type="first">O</forename><surname>Shaikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gligorić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khetan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gerstgrasser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<meeting>the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="6279" to="6296" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Labruna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brenna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zaninello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14556</idno>
		<title level="m">Unraveling ChatGPT: A critical analysis of AI-generated goal-oriented dialogues and annotations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Is GPT-3 a good data annotator?</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2212.10450</idno>
		<ptr target="https://arxiv.org/abs/2212.10450" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kwak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>An</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.07736</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Sequential Organisation in WhatsApp Conversations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Hewett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Unpublished bachelor's thesis</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
		<respStmt>
			<orgName>Freie Universität Berlin; summer semester</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The NESPOLE! VoIP dialogue database</title>
		<author>
			<persName><forename type="first">S</forename><surname>Burger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Besacier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Coletti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Morel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Seventh European Conference on Speech Communication and Technology</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The NESPOLE! VoIP multilingual corpora in tourism and medical domains</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Burger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cattoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Besacier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Maclaren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mcdonough</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Metze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Eighth European Conference on Speech Communication and Technology</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Toward data-driven collaborative dialogue systems: The JILDA dataset</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sucameli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Speranza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Italian Journal of Computational Linguistics</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems</title>
		<author>
			<persName><forename type="first">R</forename><surname>Lowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Pow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">V</forename><surname>Serban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pineau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SIGDIAL 2015 Conference</title>
				<meeting>the SIGDIAL 2015 Conference</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="285" to="294" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Zang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rastogi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sunkara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2007.12720</idno>
		<title level="m">MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Large language models are not robust multiple choice selectors</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Twelfth International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.08939</idno>
		<title level="m">Premise order matters in reasoning with large language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Lost in the middle: How language models use long contexts</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paranjape</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bevilacqua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="157" to="173" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The measurement of observer agreement for categorical data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Landis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">G</forename><surname>Koch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.00234</idno>
		<title level="m">A survey on in-context learning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">An overview of multi-task learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">National Science Review</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="30" to="43" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A survey of transfer learning</title>
		<author>
			<persName><forename type="first">K</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Khoshgoftaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Big data</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1" to="40" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Transfer learning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Torrey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shavlik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of research on machine learning applications and trends: algorithms, methods, and techniques</title>
				<imprint>
			<publisher>IGI global</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="242" to="264" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
