<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Pisa, Italy
* Corresponding author.
$ tlabruna@fbk.eu (T. Labruna); sbrenna@fbk.eu (S. Brenna);
gbonetta@fbk.eu (G. Bonetta); magnini@fbk.eu (B. Magnini)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Are you a Good Assistant? Assessing LLM Trustability in Task-oriented Dialogues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tiziano Labruna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Brenna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Bonetta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, Povo, Trento, 38123</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>3 Dominikanerplatz 3 - Piazza Domenicani 3, Bozen-Bolzano, 39100</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Despite the impressive capabilities of recent Large Language Models (LLMs) to generate human-like text, their ability to produce contextually appropriate content for specific communicative situations is still a matter of debate. This issue is particularly crucial when LLMs are employed as assistants to help solve tasks or achieve goals within a given conversational domain. In such scenarios, the assistant is expected to access specific knowledge (e.g., a database of restaurants, a calendar of appointments) that is not directly accessible to the user and must be consistently utilised to accomplish the task. In this paper, we conduct experiments to evaluate the trustworthiness of automatic assistants in task-oriented dialogues. Our findings indicate that state-of-the-art open-source LLMs still face significant challenges in maintaining logical consistency with a knowledge base of facts, highlighting the need for further advancements in this area.</p>
      </abstract>
      <kwd-group>
        <kwd>task-oriented dialogues</kwd>
        <kwd>constraint satisfaction</kwd>
        <kwd>knowledge base coherence</kwd>
        <kwd>Llama3 8B</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Conversational assistants [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are widely used to help
human users achieve specific goals through dialogue. In a
typical scenario (e.g., booking a restaurant, scheduling an
appointment, selecting a song in a playlist, etc.), the
assistant interprets the user’s goals, searches a database for
relevant options, and provides the user with responses
(e.g., a restaurant reservation, a new appointment in a
calendar, a song playing on a smartphone). A key
ability for an assistant is to maintain consistency between
user requests and domain knowledge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This is crucial
because, in a typical setting, the user does not know the
actual content of the database (e.g., all the restaurants in
a city) and, as a consequence, cannot verify whether the
assistant’s response is correct.
      </p>
      <p>
        While in traditional approaches [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], this consistency
was ensured by a dedicated component responsible for
retrieving information from a domain database, recent
end-to-end approaches [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] rely on a single LLM-based
model for utterance understanding, domain knowledge
retrieval, and response generation. In this setting, the
LLM must generate responses that are as aligned with the
database as possible. However, the ability of current
end-to-end assistants to maintain consistency between the
generated responses and the actual content of the domain
database remains largely untested. Figure 1 shows an example of an
inconsistency generated by an LLM, which is the focus of
this research.
      </p>
      <p>Our aim is to shed new light on the trustworthiness of
an LLM playing the role of an assistant in a task-oriented
conversational domain while interacting with a user. We
aim to answer the following research questions: (i) How
can we operationally define the consistency between a
task-oriented dialogue and the domain database behind
the dialogue? (ii) How can we quantify the degree of
trustworthiness of an assistant-LLM? (iii) Can we collect
empirical evidence on a sufficiently large amount of
task-oriented dialogues?</p>
      <p>
        To address these research questions, we set up an
experimental framework allowing large-scale analysis,
where task-oriented dialogues are first automatically
generated by two instances of a state-of-the-art LLM,
Llama-3 8B [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and then a more powerful LLM, GPT-4o [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], is
used to detect potential inconsistencies between a
dialogue and a corresponding domain knowledge base. We
hope that new large-scale experimental data can be used
to develop more reliable and effective task-oriented
dialogue systems, ultimately enhancing the capabilities of
conversational agents in various applications.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology and Experimental Setting</title>
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <p>Our experimental setting consists of two phases. In the
preliminary phase, referred to as the Human-Llama
Interaction phase (cfr. Section 3), we test the capabilities
of an open-source LLM (i.e., Llama-3) to generate
adequate task-oriented dialogues through interactive
conversations with humans.</p>
        <p>In the second phase, referred to as the Llama-Llama
Interaction phase (cfr. Section 4), we automate both the
generation and evaluation of task-oriented dialogues,
creating a Llama-Llama generated MultiWOZ dialogue
corpus, The Dining Llamas of Oz.1 In the remainder of this
section, we describe the MultiWOZ dataset and the
metrics used to check and quantify the reliability of the
generated dialogues in both phases.</p>
        <p>MultiWOZ is a widely known task-oriented dialogue
dataset collected via the Wizard of Oz approach. The
dataset comprises over 10,000 dialogues between a
customer and the Cambridge InfoTown assistant, designed to
help customers navigate Cambridge’s amenities. The
conversations span seven different domain
concepts, including train ticket reservations, tourist
attraction searches, and restaurant reservations. For our
experiments, we selected data related to the restaurant domain
(version 2.3 [9]).</p>
        <p>The MultiWOZ dialogues were collected with a system
that provides information to the user relying on a specific
database, known as the Knowledge Base (KB), describing
properties of the Cambridge domain. Each domain
concept has its own KB; for our experiments, we consider
only the restaurant KB. The restaurant KB holds
information about 110 different instances (i.e., restaurants),
where each instance comprises a series of properties (e.g.,
Name, Food, Area) and corresponding values (e.g., The
Old Cambridge, british, north).</p>
        <p>All system turns in the dialogues are expected to
consistently rely on the information contained in the KB to
provide accurate information to the user.</p>
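<p>The KB lookup behaviour expected of the system can be sketched as follows. This is a minimal, illustrative sketch: the slot names follow the description above, "The Old Cambridge" mirrors the example values in the text, and the second record and the lookup helper are hypothetical, not actual MultiWOZ code:</p>

```python
# Minimal sketch of the restaurant KB described above. Each instance is a
# record of informable properties (Name, Food, Area, Price). Values are
# illustrative examples, not the full 110-instance MultiWOZ KB.
KB = [
    {"name": "The Old Cambridge", "food": "british", "area": "north", "price": "expensive"},
    {"name": "Two Two", "food": "french", "area": "north", "price": "expensive"},
]

def lookup(kb, **constraints):
    """Return the KB instances matching all user-given slot values."""
    return [r for r in kb if all(r.get(slot) == value for slot, value in constraints.items())]

print(lookup(KB, area="north", food="british"))
```

<p>A consistent system turn may only describe restaurants returned by such a lookup; anything else is either a contradiction of the KB or a hallucination.</p>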
        <sec id="sec-3-1-1">
          <title>2.2. Consistency Metrics</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>To assess the consistency of a generated turn against its
Knowledge Base, we analysed each system-generated
conversational turn referring to any piece of information
provided in the KB. Each turn was assessed based on two
separate binary metrics:
• KB-Alignment: Assesses whether the system
turn is consistent with the KB, meaning that it does
not contradict any information provided in the
KB.
• KB-Grounding: Assesses whether the system
turn refrains from hallucinating and introducing
information not present in the KB, ensuring all
mentioned details are grounded in the existing
KB.</p>
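<p>The per-turn binary assessments can be aggregated into corpus-level percentages as sketched below; the data layout (one pair of 0/1 labels per system turn) is an assumption made for illustration:</p>

```python
# Sketch of aggregating per-turn binary assessments. Each dialogue is
# modelled as a list of (kb_alignment, kb_grounding) pairs, one pair per
# system turn; this layout is an assumption, not the authors' format.
def correct_turns(dialogues):
    """Percentage of system turns with both metrics annotated as 1."""
    turns = [t for d in dialogues for t in d]
    ok = sum(1 for a, g in turns if a == 1 and g == 1)
    return 100.0 * ok / len(turns)

def correct_dialogues(dialogues):
    """Percentage of dialogues whose turns all have both metrics as 1."""
    ok = sum(1 for d in dialogues if all(a == 1 and g == 1 for a, g in d))
    return 100.0 * ok / len(dialogues)

sample = [[(1, 1), (1, 1)], [(1, 1), (0, 1)]]
print(correct_turns(sample))      # 75.0
print(correct_dialogues(sample))  # 50.0
```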
      </sec>
      <sec id="sec-3-3">
        <p>1. The generated dataset is publicly available at: https://github.com/tLabruna/The-Dining-Llamas-of-Oz</p>
      </sec>
      <sec id="sec-3-4">
        <p>For instance, the assessments for the system turns in
Figure 1 would be as follows: T4 (KB-Alignment = 0,
KB-Grounding = 1), T6 (KB-Alignment = 0, KB-Grounding
= 0). In addition to this, we used two evaluation metrics
to assess the overall quality of each turn and provide a
global evaluation of the whole corpus:
• Correct Turns: Indicates the percentage of
turns that have both KB-Alignment and
KB-Grounding annotated as 1.
• Correct Dialogues: Indicates the percentage
of dialogues that have all turns with both
KB-Alignment and KB-Grounding annotated as 1.</p>
        <sec id="sec-3-4-1">
          <title>2.1. The MultiWOZ 2.3 Dataset</title>
          <p>Since the primary focus of this work is about
task-oriented dialogues, we used the MultiWOZ
(Multi-Domain Wizard-Of-Oz) dataset [8], one of the most
prominent datasets in this area. MultiWOZ has been
extensively employed to develop and test models for
natural language understanding, dialogue management, and
natural language generation.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <p>These metrics offer a comprehensive understanding of the dialogue system’s ability to maintain consistency and accuracy throughout the conversation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Human-Llama Interaction Phase</title>
      <sec id="sec-4-1">
        <p>In this phase, we simulated the dialogue collection
approach of the MultiWOZ dataset through the
human-Llama interactive generation of novel dialogues.
Although this phase required substantial human effort, it
was crucial for obtaining an initial high-quality set of
dialogues.</p>
        <p>We aimed to generate dialogues where a human
interacts with a system played by Llama-3 8B in two
languages: English and Italian. The model was prompted
to play the role of the Cambridge InfoTown system. The
system’s goal was to guide the user towards reserving a
restaurant in Cambridge. For each dialogue, we utilised
10 restaurant instances taken from the MultiWOZ KB.
We selected 6 distinct sets of instances, which had the
following characteristics:
1. All with the same Food;
2. All with different Food (or as different as
possible);
3. All with the same Price;
4. All with different Price (or as different as
possible);
5. All with the same Area;
6. All with different Area (or as different as
possible).</p>
      </sec>
      <sec id="sec-4-2">
        <p>We chose the slots Food, Price, and Area to differentiate
the sets, since they are the informable slots within
the Restaurant concept.</p>
        <p>The human users were instructed to follow a scenario
that involved reserving a restaurant, providing a realistic
context for the dialogues. Five distinct instructions were
employed for the interactive generation of a human-LLM
dialogue, each paired with the 6 sets of KB instances,
resulting in a total of 30 dialogue scenarios. The process
was repeated in both English and Italian, leading to the
creation of 30 dialogues in each language, for a total of
60 dialogues.</p>
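<p>The 30 scenarios per language are obtained by pairing every user instruction with every KB-instance set, as sketched below; the labels are illustrative placeholders, not the actual instructions:</p>

```python
# Sketch of how the 30 dialogue scenarios per language are formed: every
# pairing of the 5 user instructions with the 6 KB-instance sets described
# above. The string labels are hypothetical placeholders.
from itertools import product

instructions = [f"instruction-{i}" for i in range(1, 6)]
kb_sets = ["same-food", "diff-food", "same-price",
           "diff-price", "same-area", "diff-area"]

scenarios = list(product(instructions, kb_sets))
print(len(scenarios))  # 30
```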
        <sec id="sec-4-2-1">
          <title>3.1. Manual Evaluation</title>
          <p>The manual evaluations were conducted by three
annotators who assessed the dialogues based on the binary
metrics KB-Alignment and KB-Grounding. Each of the 60
dialogues was annotated by at least two diferent
annotators to ensure reliability. The inter-annotator agreement
between human evaluators was measured using Cohen’s
Kappa (κ) to provide a measure of the inter-rater
reliability (IRR) level. As per Table 1, we obtained an
average κ in both metrics and languages that indicates substantial
agreement on Landis and Koch’s agreement scale [10].
We instructed GPT-4o2 to perform the same evaluations
as the human annotators. This consisted of feeding the
model with a given KB/dialogue pair, asking it to output
two lists of turn assessments: one for the KB-Grounding
and another for the KB-Alignment. Then we computed
the agreement between GPT-4o’s evaluations and the
human evaluations. The precise prompt used to instruct
GPT-4o can be found in Appendix B. Although the
agreement with GPT-4o (see Table 1) was slightly lower than
the substantial agreement observed between human
annotators, it was still classified as moderate on Landis and
Koch’s agreement scale [10]. Based on these results, we
assumed GPT-4o to be a valuable automatic judge and
deployed it in the same way for the Llama-Llama evaluation
phase (cfr. Section 4).</p>
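<p>The agreement statistic used above can be computed as sketched below: Cohen’s kappa for two annotators’ binary turn labels, in pure Python. The label lists are illustrative, not the actual annotations:</p>

```python
# Sketch of the inter-annotator agreement computation: Cohen's kappa
# compares observed agreement with the agreement expected by chance from
# each annotator's marginal label frequencies.
def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(1 for x, y in zip(labels_a, labels_b) if x == y) / n
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

human = [1, 1, 0, 1, 0, 1, 1, 0]  # hypothetical turn labels
model = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohen_kappa(human, model), 2))  # 0.47
```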
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. The Dining Llamas of Oz</title>
      <p>After recognising the ability of Llama-3 to generate
dialogues and the evaluation skills of GPT-4o (cfr. Section
3.2), we conducted further experiments by generating
1,311 dialogues using Llama-3 8B and following the
MultiWOZ dataset. For each dialogue of the original dataset,
we utilised the instructions provided to the human user
in the Wizard-of-Oz setting to guide a Llama acting as
the user, interacting with a Llama acting as the system.</p>
      <p>During the dialogue generation phase, we randomly
selected 70 instances from the entire Knowledge Base for
each simulated dialogue, ensuring that each dialogue
was staged in a varied KB scenario. This approach, known as the
Llama-Llama phase, allowed us to create a large set of
automatically generated dialogues, each based on a
different subset of the KB. We call this generated dataset "The
Dining Llamas of Oz," which comprises 1,049 training
instances, with 131 instances each for the validation and
test sets.</p>
      <p>2 GPT-4o was used via the Microsoft Azure APIs. The API version was 2024-02-01. The cost for the API interactions was about $400.</p>
      <p>Table 2 presents statistics for the dataset, including
the average number of turns per dialogue, the average
length in number of tokens for user and system turns,
and the Standardized Type-Token Ratio (STTR) [11] for
user and system turns. The STTR is calculated by merging
all turns, segmenting them into chunks (we used a
segmentation size of 1000), and computing the average
TTR for all chunks.</p>
      <p>Table 2: Statistics of the Llama-Llama dialogues dataset.
Number of Dialogues: 1311.
Average Dialogue Length: 6.21.
Average User Turns Length: 25.69.
Average System Turns Length: 124.52.
User Turns STTR: 0.29.
System Turns STTR: 0.41.</p>
      <p>4.1. Turn-by-Turn Evaluation</p>
      <p>To assess the quality of the Dining Llamas of Oz dataset,
we employed GPT-4o, as in our previous experiments.
Using this approach, we assessed 262 dialogues (from
the evaluation and test splits) using GPT-4o. This
provided a broader understanding of the KB consistency of
Llama-generated dialogues across a larger dataset. The
KB consistency evaluation is summarised in Table 3. The
turns were filtered by removing those that were judged
to have no reference to the KB. In addition to evaluating
the metrics for all 262 dialogues, we further analysed the
dataset by dividing it based on two criteria: the success
of the dialogues and the dialogue length. For the success
criterion, we distinguished between dialogues with a user
instruction that, in the original MultiWOZ dataset, led
to a successful restaurant booking (successful dialogues)
and those that did not lead to any restaurant reservation
(unsuccessful dialogues). For the dialogue length
criterion, we distinguished between dialogues that had three
or fewer turns (a maximum of three user utterances and
three system utterances) and those that had four or more
turns.</p>
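<p>The STTR reported in Table 2 can be computed as sketched below: merge all turns, split the token stream into fixed-size chunks (1000 in our experiments), and average the type/token ratio over the full chunks. Whitespace tokenisation is an assumption of this sketch:</p>

```python
# Sketch of the Standardized Type-Token Ratio: the TTR of each full
# fixed-size chunk is len(unique tokens) / chunk size, averaged over chunks.
def sttr(turns, chunk_size=1000):
    tokens = [tok for turn in turns for tok in turn.split()]
    n_chunks = len(tokens) // chunk_size
    ratios = [len(set(tokens[i * chunk_size:(i + 1) * chunk_size])) / chunk_size
              for i in range(n_chunks)]
    return sum(ratios) / len(ratios)

# Toy corpus: highly repetitive turns give a very low STTR.
turns = ["book a table for two", "a table for two please"] * 500
print(round(sttr(turns), 3))  # 0.006
```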
      <p>Using the same approach as in Section 3.2, we obtained a
KB-Alignment score of 49.73% and a KB-Grounding score
of 38.59% for the entire dataset. To verify the annotation
quality of these new dialogues, we manually annotated 30
dialogues from the evaluation split and compared these
annotations with GPT-4o’s evaluations on the same
dialogues. This initial comparison resulted in a not ideal
κ of 0.15 for KB-Alignment and 0.06 for KB-Grounding
(slight agreement). To enhance these performance metrics
and establish a reliable evaluation pipeline, we revised
our approach: instead of passing the entire dialogue to
GPT-4o, we evaluated one turn at a time. The detailed
methodology was as follows:
1. Provide GPT-4o with a user utterance and the
corresponding system response, and prompt it to
determine if the system’s response references the KB.
2. If GPT-4o indicates a reference to the KB:
a) Prompt GPT-4o with the same user-system
turn and the KB to determine if the
system’s turn shows KB-Alignment.
b) Prompt GPT-4o with the same user-system
turn and the KB to determine if the
system’s turn shows KB-Grounding.</p>
      <p>The full prompt is available in Appendix B. This
method allows for a more precise scoring of each turn,
though it increases OpenAI API usage and associated
costs. We discovered that this turn-by-turn evaluation
approach significantly improved the agreement: we
obtained a κ of 0.68 for KB-Alignment and 0.49 for
KB-Grounding (moderate/substantial agreement).
Consequently, we decided to use this technique for
automated evaluation.</p>
      <sec id="sec-5-1">
        <title>5. Discussion</title>
        <p>Our investigation into the performance of
state-of-the-art Large Language Models (LLMs) like Llama-3 in
task-oriented dialogue systems reveals several critical insights
about their current limitations. The central finding is
that while these models exhibit advanced capabilities in
generating text, their quality in managing task-oriented
dialogues remains unsatisfactory.</p>
        <p>Initially, we compared human evaluations with
GPT-4o’s evaluations to assess its effectiveness in evaluating
dialogue quality. This comparison was instrumental in
determining that GPT-4o could be useful for dialogue
evaluation, but it highlighted that the model’s
performance degrades significantly when scaled from a smaller
to a larger Knowledge Base. The annotation agreement
dropped notably as the number of KB instances increased
from 10 to 70, indicating that GPT-4o struggles with
larger, more complex datasets.</p>
        <p>To address this, we shifted our approach to a
turn-by-turn evaluation method. After extensive experimentation
and prompt engineering, this method yielded improved
results in terms of annotation agreement. However, this
approach proved to be highly resource-intensive, pushing
up costs significantly due to increased OpenAI API usage.</p>
        <p>Our automated evaluations on 262 dialogues provided
some revealing observations, as shown in Table 3.
Notably, only around 40% of system turns demonstrated
KB-Alignment and KB-Grounding. When considering
both metrics together for Correct Turns and Correct
Dialogues, the results were even more concerning: just 26%
of turns and less than 9% of dialogues met the criteria for
both metrics. These numbers underscore the inadequacy
of current systems, indicating that a system producing
such a low percentage of correct dialogues is not practical
for real-world applications.</p>
        <p>Further analysis showed that dialogues with successful
bookings performed better than those with failed
bookings. Specifically, dialogues with successful bookings had
28.59% of correct turns and 11.29% of correct dialogues,
compared to dialogues with failed bookings, which had
9 percentage points fewer correct turns and only 0.5%
correct dialogues. This discrepancy likely arises because
when no suitable restaurants are available, the Llama
model tends to hallucinate, providing restaurants not
present in the KB. While these restaurants may exist in
Cambridge, they are absent from the provided dataset,
highlighting the model’s failure to adhere to the
instructions given in the prompt.</p>
        <p>We also explored the impact of dialogue length on
performance. Shorter dialogues achieved nearly 30%
correct turns and 11.23% correct dialogues, while longer
dialogues showed a significant drop: 7 percentage points
fewer correct turns and only 3.17% correct dialogues.
This suggests that as the conversation progresses, the
likelihood of errors increases, possibly due to the model’s
difficulty in managing and integrating information from
previous turns.</p>
        <p>Overall, our findings highlight that current
state-of-the-art open-source LLMs, such as Llama-3, are still
unable to effectively serve as task-oriented dialogue systems
while maintaining consistency with a provided KB. This
underscores the need for further advancements in LLM
capabilities and evaluation methodologies before such
systems can be reliably used in practical applications.</p>
        <p>The turn-by-turn evaluation approach, while effective
in enhancing evaluation accuracy, proved to be
computationally expensive. The quality of GPT-4o’s evaluations
was highly dependent on effective prompt engineering.
Crafting the right prompts to ensure accurate evaluation
results was challenging and time-consuming. Additionally,
employing a diverse set of models for generating and
evaluating dialogues could provide more comprehensive
findings. Using multiple models might help in
understanding the strengths and limitations of different
approaches, potentially offering a more robust analysis of
dialogue quality and consistency. This could also help in
mitigating the limitations inherent in any single model or
evaluation approach.</p>
      </sec>
      <sec id="sec-5-2">
        <title>7. Conclusions and Future Work</title>
        <p>In this study, we explored the capabilities of
state-of-the-art LLMs in generating task-oriented dialogues,
focusing on maintaining consistency with a provided KB
and avoiding hallucinations. Our experiments
demonstrated that Llama-3, despite its advancements, struggles
to perform reliably in these settings. The model showed
significant limitations, especially in dialogues that led
to failed outcomes (where the desired restaurant was
not in the KB) and longer interactions. As a side
contribution, we release The Dining Llamas of Oz, a corpus
of 1,311 dialogues generated through user-Llama and
system-Llama interactions, to aid future research. Our
findings highlight the need for further development to
improve LLM reliability and accuracy in task-oriented
dialogue applications.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations</title>
      <p>While our study makes significant contributions to understanding the capabilities of state-of-the-art LLMs in performing task-oriented dialogue tasks, it is important to acknowledge certain limitations that may affect the generalizability and scalability of our findings.</p>
    </sec>
    <sec id="sec-7">
      <title>Aknowledgments</title>
      <sec id="sec-7-1">
        <title>This work has been partially supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU.</title>
      </sec>
      <sec id="sec-7-2">
        <title>K. Button, T. Cai, R. Campbell, A. Cann, B. Carey,</title>
        <p>C. Carlson, R. Carmichael, B. Chan, C. Chang,
F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen,
M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung,
D. Cummings, J. Currier, Y. Dai, C. Decareaux,
T. Degry, N. Deutsch, D. Deville, A. Dhar, D.
Dohan, S. Dowling, S. Dunning, A. Ecofet, A. Eleti,
T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. P.</p>
        <p>Fishman, J. Forte, I. Fulford, L. Gao, E. Georges,
C. Gibson, V. Goel, T. Gogineni, G. Goh, R.
GontijoLopes, J. Gordon, M. Grafstein, S. Gray, R. Greene,
J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris,
Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey,
W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu,
X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang,
R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun,
T. Kaftan, Łukasz Kaiser, A. Kamali, I.
Kanitscheider, N. S. Keskar, T. Khan, L. Kilpatrick, J. W. Kim,
C. Kim, Y. Kim, J. H. Kirchner, J. Kiros, M. Knight,
D. Kokotajlo, Łukasz Kondraciuk, A. Kondrich,
A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo,
M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung,
D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin, M. Litwin,
T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini,
S. Manning, T. Markov, Y. Markovski, B. Martin,
K. Mayer, A. Mayne, B. McGrew, S. M.
McKinney, C. McLeavey, P. McMillan, J. McNeil, D.
Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko,
P. Mishkin, V. Monaco, E. Morikawa, D.
Mossing, T. Mu, M. Murati, O. Murk, D. Mély, A. Nair,
R. Nakano, R. Nayak, A. Neelakantan, R. Ngo,
H. Noh, L. Ouyang, C. O’Keefe, J. Pachocki, A. Paino,
J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish,
E. Parparita, A. Passos, M. Pavlov, A. Peng, A.
Perelman, F. de Avila Belbute Peres, M. Petrov, H. P.
de Oliveira Pinto, Michael, Pokorny, M. Pokrass,
V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl,
R. Puri, A. Radford, J. Rae, A. Ramesh, C.
Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted,
H. Roussez, N. Ryder, M. Saltarelli, T. Sanders,
S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr,
J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov,
J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler,
M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky,
Y. Song, N. Staudacher, F. P. Such, N. Summers,
I. Sutskever, J. Tang, N. Tezak, M. B.
Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P.
Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A.
Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J.</p>
        <p>Wang, A. Wang, B. Wang, J. Ward, J. Wei, C.
Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng,
M. Wiethof, D. Willner, C. Winter, S. Wolrich,
H. Wong, L. Workman, S. Wu, J. Wu, M. Wu,
K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba,
R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng,
J. Zhuang, W. Zhuk, B. Zoph, Gpt-4 technical re- città di Cambridge. Usa un tono amichevole e
port, 2024. URL: https://arxiv.org/abs/2303.08774. onversazionale, fornendo risposte
arXiv:2303.08774. informative e utili. Tutte le informazioni
[8] P. Budzianowski, T.-H. Wen, B.-H. Tseng, che fornisci devono basarsi strettamente
I. Casanueva, S. Ultes, O. Ramadan, M. Gašić, sulla Knowledge Base che ti è stata data.
MultiWOZ - a large-scale multi-domain wizard-of- Assicurati che le tue risposte siano accurate,
Oz dataset for task-oriented dialogue modelling, pertinenti, e mirate ai bisogni dell’utente.
in: Proceedings of the 2018 Conference on Sii breve."
Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, The following prompt has been used to instruct a Llama
Brussels, Belgium, 2018, pp. 5016–5026. URL: to play the role of a user looking for a restaurant in
https://www.aclweb.org/anthology/D18-1547. Cambridge, in English:
[9] dTo.iH:1a0n.,1X8.65L3iu/,vR1./TDa1k8a-n1a5b4u7,.Y. Lian, C. Huang, "You are a turist in the city of Cambridge
D. Wan, W. Peng, M. Huang, Multiwoz 2.3: A multi- and you are looking for a restaurant to dine
domain task-oriented dialogue dataset enhanced in. Strictly follow the instructions given to
with annotation corrections and co-reference an- you on the criteria by which looking for the
notation, in: Natural Language Processing and restaurant. You don’t need to follow all the
Chinese Computing: 10th CCF International Con- instructions at once, instead follow them as
ference, NLPCC 2021, Qingdao, China, October 13– the conversation continues. Be very brief,
17, 2021, Proceedings, Part II 10, Springer, 2021, pp. and go straight to the point. At the end,
thank the system and say goodbye. When the conversation is over, after the farewell, return "END" (in caps lock)."</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>A. Llama Prompts</title>
      <sec id="sec-8-1">
        <title>The following prompt has been used to instruct a Llama to play the role of a user looking for a restaurant in Cambridge, in Italian:</title>
        <p>"Sei un turista nella città di Cambridge e stai cercando un ristorante dove cenare. Basati strettamente sulle istruzioni che ti vengono fornite riguardo i criteri in base ai quali cercare il ristorante. Non seguire tutte le istruzioni subito, invece seguile passo passo durante la conversazione. Sii molto breve e vai subito al punto."</p>
      </sec>
      <sec id="sec-8-2">
        <title>The following prompt has been used to instruct a Llama to play the role of a Cambridge InfoTown system, in English:</title>
        <p>"You are the Cambridge TownInfo Centre, a system designed to help users maximize their experience in the city of Cambridge. Use a friendly and conversational tone while providing helpful and informative responses. All the information you provide must strictly rely on the Knowledge Base that you have been provided with. Ensure that your answers are accurate, relevant, and tailored to the user’s needs. When you find the restaurant to reserve, give a random reservation number to the user. Be brief."</p>
      </sec>
      <sec id="sec-8-3">
        <title>The following prompt has been used to instruct a Llama to play the role of a Cambridge InfoTown system, in Italian:</title>
        <p>"Sei l’assistente Cambridge InfoCittà, un sistema progettato per aiutare gli utenti a trarre il meglio dalla loro esperienza nella</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. GPT Prompts</title>
      <sec id="sec-9-1">
        <title>The following system prompt has been used as a general instruction for telling GPT to behave like a dialogue evaluator:</title>
        <p>"You are a dialogue evaluator. Given a dialogue you have to return a list of symbols separated by commas, where each symbol is an evaluation of each turn in the dialogue. Only system turns must be considered."</p>
      </sec>
      <sec id="sec-9-2">
        <title>The following prompt has been used to instruct GPT to determine if a system turn talks about information contained in a KB:</title>
        <p>"Given the following user and system turns, return 1 if the system turn contains information that requires verification from an external source to ensure its accuracy, 0 otherwise."</p>
      </sec>
      <sec id="sec-9-3">
        <title>The following prompt has been used to instruct GPT to determine if a system turn constitutes a KB-Error:</title>
        <p>"Given the following user turn, system turn, and Knowledge Base (KB), return 0 if the system contradicts the KB (e.g. says that a restaurant is at north, but it’s actually at south), 1 otherwise."</p>
      </sec>
      <sec id="sec-9-4">
        <title>The following prompt has been used to instruct GPT to determine if a system turn constitutes a KB-Grounding error:</title>
        <p>"Given the following user turn, system turn, and Knowledge Base, return 1 if the system doesn’t mention properties outside of the Knowledge Base, 0 otherwise (e.g. says that the restaurant serves british and indian, but only indian is present in the KB)."</p>
      </sec>
    </sec>
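<p>As an illustration, the judge prompts above lend themselves to a simple automatic evaluation loop: assemble one query per exchange, collect the model’s replies (a single 0/1 for the KB prompts, a comma-separated list of per-turn symbols for the evaluator prompt), and aggregate. The Python sketch below is a minimal, hypothetical rendering of that loop: the prompt wording is quoted from this appendix, but the helper names (build_kb_error_query, parse_turn_labels) are ours and the actual GPT API call is omitted, since the paper does not publish its evaluation code.</p>

```python
# Hypothetical sketch of the evaluation loop; the GPT call itself is omitted.

# Judge instruction quoted verbatim from Appendix B (KB-Error prompt).
KB_ERROR_PROMPT = (
    "Given the following user turn, system turn, and Knowledge Base (KB), "
    "return 0 if the system contradicts the KB (e.g. says that a restaurant "
    "is at north, but it's actually at south), 1 otherwise."
)

def build_kb_error_query(user_turn: str, system_turn: str, kb: str) -> str:
    # One judge query = the instruction plus the exchange under evaluation.
    return f"{KB_ERROR_PROMPT}\n\nUser: {user_turn}\nSystem: {system_turn}\nKB: {kb}"

def parse_turn_labels(reply: str) -> list[str]:
    # The evaluator prompt asks for symbols separated by commas, one per
    # system turn; tolerate surrounding whitespace in the model's reply.
    return [tok.strip() for tok in reply.split(",") if tok.strip()]

# Example aggregation over a (mock) evaluator reply for a 4-turn dialogue.
labels = parse_turn_labels("1, 0, 1, 1")
kb_error_rate = labels.count("0") / len(labels)  # one contradicting turn out of four
```

<p>In this scheme, the per-dialogue error rate is simply the fraction of system turns the judge labels 0; any other per-turn symbol scheme parses the same way.</p>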
  </body>
  <back>
  </back>
</article>