Are you a Good Assistant? Assessing LLM Trustability in Task-oriented Dialogues

Tiziano Labruna1,2,*, Sofia Brenna1,2, Giovanni Bonetta2 and Bernardo Magnini2
1 Free University of Bozen-Bolzano, Dominikanerplatz 3 - Piazza Domenicani 3, Bozen-Bolzano, 39100, Italy
2 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, 38123, Italy

Abstract
Despite the impressive capabilities of recent Large Language Models (LLMs) to generate human-like text, their ability to produce contextually appropriate content for specific communicative situations is still a matter of debate. This issue is particularly crucial when LLMs are employed as assistants to help solve tasks or achieve goals within a given conversational domain. In such scenarios, the assistant is expected to access specific knowledge (e.g., a database of restaurants, a calendar of appointments) that is not directly accessible to the user and must be consistently utilised to accomplish the task. In this paper, we conduct experiments to evaluate the trustworthiness of automatic assistants in task-oriented dialogues. Our findings indicate that state-of-the-art open-source LLMs still face significant challenges in maintaining logical consistency with a knowledge base of facts, highlighting the need for further advancements in this area.

Keywords
task-oriented dialogues, constraint satisfaction, knowledge base coherence, Llama3 8B

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
tlabruna@fbk.eu (T. Labruna); sbrenna@fbk.eu (S. Brenna); gbonetta@fbk.eu (G. Bonetta); magnini@fbk.eu (B. Magnini)
ORCID: 0000-0001-7713-7679 (T. Labruna); 0009-0001-3748-1448 (S. Brenna); 0000-0003-4498-1026 (G. Bonetta); 0000-0002-0740-5778 (B. Magnini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Conversational assistants [1] are widely used to help human users achieve specific goals through dialogue. In a typical scenario (e.g., booking a restaurant, scheduling an appointment, selecting a song in a playlist), the assistant interprets the user's goals, searches a database for relevant options, and provides the user with responses (e.g., a restaurant reservation, a new appointment in a calendar, a song playing on a smartphone). A key ability for an assistant is to maintain consistency between user requests and domain knowledge [2]. This is crucial because, in a typical setting, the user does not know the actual content of the database (e.g., all the restaurants in a city) and, as a consequence, cannot verify whether the assistant's response is correct.
While in traditional approaches [3] this consistency was ensured by a dedicated component responsible for retrieving information from a domain database, recent end-to-end approaches [4, 5] rely on a single LLM-based model for utterance understanding, domain knowledge retrieval, and response generation. In this setting, the LLM must generate responses that are as aligned with the database as possible. However, the ability of current end-to-end assistants to maintain consistency between the generated responses and the actual content of the domain knowledge is questionable (e.g., due to LLM confabulations), and there is a clear lack of empirical evidence on this crucial issue.

Figure 1: An inconsistent dialogue with respect to a Knowledge Base (KB). Red values indicate inconsistencies between the system-generated text and the KB, whereas the green elements in bold indicate correct information.

R1: N=Grafton Hotel Restaurant  A=east  F=british  P=expensive
R2: N=The River Bar Steakhouse and Grill  A=centre  F=modern european  P=cheap
R3: N=Two Two  A=north  F=french  P=expensive

T3 User: I would like to have british cuisine.
T4 Llama: I've got just the thing for you! I've found a few options that serve British cuisine. [...] One option is the Grafton Hotel Restaurant [...]. Another option is the River Bar Steakhouse and Grill, located in the centre of the city. They have a mix of British and modern European dishes [...].
T5 User: Nothing in the north?
T6 Llama: [...] I found one option that serves British cuisine in the north area of the city. Restaurant Two Two, located in Chesterton, has a menu that features a range of British dishes, including classic comfort food and seasonal specials. [...]
To be more concrete, Figure 1 shows an example of a dialogue that is inconsistent with respect to the conversational knowledge base. Here, although only one restaurant in the knowledge base serves British food, the system (turn T4) tells the user that it has found several options serving British cuisine, and in turn T6 it attributes British cuisine to a French restaurant, providing incorrect information. This is an example of inconsistency generated by an LLM, which is the focus of this research.
Our aim is to shed new light on the trustworthiness of an LLM playing the role of an assistant in a task-oriented conversational domain while interacting with a user. We aim to answer the following research questions: (i) How can we operationally define the consistency between a task-oriented dialogue and the domain database behind the dialogue? (ii) How can we quantify the degree of trustworthiness of an assistant-LLM? (iii) Can we collect empirical evidence on a sufficiently large amount of task-oriented dialogues?
To address these research questions, we set up an experimental framework allowing large-scale analysis, where task-oriented dialogues are first automatically generated by two instances of a state-of-the-art LLM, Llama-3 8B [6], and then a more powerful LLM, GPT-4o [7], is used to detect potential inconsistencies between a dialogue and a corresponding domain knowledge base. We hope that this new large-scale experimental data can be used to develop more reliable and effective task-oriented dialogue systems, ultimately enhancing the capabilities of conversational agents in various applications.

2. Methodology and Experimental Setting

Our experimental setting consists of two phases. In the preliminary phase, referred to as the Human-Llama Interaction phase (cfr. Section 3), we test the capabilities of an open-source LLM (i.e., Llama-3) to generate adequate task-oriented dialogues through interactive conversations with humans. In the second phase, referred to as the Llama-Llama Interaction phase (cfr. Section 4), we automate both the generation and evaluation of task-oriented dialogues, creating a Llama-Llama generated MultiWOZ dialogue corpus, The Dining Llamas of Oz (the generated dataset is publicly available at https://github.com/tLabruna/The-Dining-Llamas-of-Oz). The remainder of this section describes the MultiWOZ dataset and the metrics used to check and quantify the reliability of the generated dialogues in both phases.

2.1. The MultiWOZ 2.3 Dataset

Since the primary focus of this work is task-oriented dialogues, we used the MultiWOZ (Multi-Domain Wizard-Of-Oz) dataset [8], one of the most prominent datasets in this area, extensively employed to develop and test models for natural language understanding, dialogue management, and natural language generation. MultiWOZ was collected via the Wizard-of-Oz approach and comprises over 10,000 dialogues between a customer and the Cambridge InfoTown assistant, designed to help customers navigate Cambridge's amenities. The conversations span seven different domain concepts, including train ticket reservations, tourist attraction searches, and restaurant reservations. For our experiments, we selected data related to the restaurant domain (version 2.3 [9]).
The MultiWOZ dialogues were collected with a system that provides information to the user relying on a specific database, known as the Knowledge Base (KB), which describes properties of the Cambridge domain. Each domain concept has its own KB; for our experiments, we consider only the restaurant KB. The restaurant KB holds information about 110 different instances (i.e., restaurants), where each instance comprises a series of properties (e.g., Name, Food, Area) and corresponding values (e.g., The Old Cambridge, british, north). All system turns in the dialogues are expected to rely consistently on the information contained in the KB in order to provide accurate information to the user.
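For illustration, a restaurant KB instance of the kind described above can be represented as a simple record. The sketch below is an assumption about the representation, restricted to the slots discussed in this paper and populated with the Figure 1 entries (the actual MultiWOZ restaurant database also stores further properties such as address and phone number, and its field names may differ):

```python
# A minimal, illustrative restaurant KB restricted to the slots discussed here.
# Field names are an assumption; the real MultiWOZ database has more properties.
restaurant_kb = [
    {"name": "Grafton Hotel Restaurant", "area": "east", "food": "british", "pricerange": "expensive"},
    {"name": "The River Bar Steakhouse and Grill", "area": "centre", "food": "modern european", "pricerange": "cheap"},
    {"name": "Two Two", "area": "north", "food": "french", "pricerange": "expensive"},
]

def matching_restaurants(kb, **constraints):
    """Return the KB instances satisfying all user constraints (e.g. food='british')."""
    return [r for r in kb if all(r.get(slot) == value for slot, value in constraints.items())]

# Only the Grafton Hotel Restaurant satisfies the user constraint of turn T3.
print(matching_restaurants(restaurant_kb, food="british"))
```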
2.2. Consistency Metrics

To assess the consistency of a generated turn against its Knowledge Base, we analysed each system-generated conversational turn that refers to any piece of information provided in the KB. Each turn was assessed with two separate binary metrics:

• KB-Alignment: assesses whether the system turn is consistent with the KB, i.e., it does not contradict any information provided in the KB.
• KB-Grounding: assesses whether the system turn refrains from hallucinating and introducing information not present in the KB, ensuring that all mentioned details are grounded in the existing KB.

For instance, the assessments for the system turns in Figure 1 would be as follows: T4 (KB-Alignment = 0, KB-Grounding = 1), T6 (KB-Alignment = 0, KB-Grounding = 0). In addition, we used two evaluation metrics to assess the overall quality of each turn and provide a global evaluation of the whole corpus:

• Correct Turns: the percentage of turns that have both KB-Alignment and KB-Grounding annotated as 1.
• Correct Dialogues: the percentage of dialogues in which all turns have both KB-Alignment and KB-Grounding annotated as 1.

These metrics offer a comprehensive picture of the dialogue system's ability to maintain consistency and accuracy throughout the conversation.
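As a concrete illustration of how the four figures relate to the per-turn binary annotations, the following sketch computes them for a corpus represented as a list of dialogues, each given as a list of (KB-Alignment, KB-Grounding) labels for the KB-related system turns (the data structure and function name are illustrative, not the code used in our experiments):

```python
def corpus_consistency_metrics(dialogues):
    """dialogues: list of dialogues; each dialogue is a list of
    (kb_alignment, kb_grounding) binary labels, one per KB-related system turn."""
    turns = [t for d in dialogues for t in d]
    n_turns, n_dialogues = len(turns), len(dialogues)
    return {
        "KB-Alignment": sum(a for a, _ in turns) / n_turns,
        "KB-Grounding": sum(g for _, g in turns) / n_turns,
        # A turn is correct only if both labels are 1.
        "Correct Turns": sum(a and g for a, g in turns) / n_turns,
        # A dialogue is correct only if all of its turns are correct.
        "Correct Dialogues": sum(all(a and g for a, g in d) for d in dialogues) / n_dialogues,
    }

# Example: the two annotated turns of Figure 1 form one (incorrect) dialogue.
print(corpus_consistency_metrics([[(0, 1), (0, 0)]]))
```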
3. Human-Llama Interaction Phase

In this phase, we simulated the dialogue collection approach of the MultiWOZ dataset through the human-Llama interactive generation of novel dialogues. Although this phase required substantial human effort, it was crucial for obtaining an initial high-quality set of dialogues.
We aimed to generate dialogues in which a human interacts with a system played by Llama-3 8B, in two languages: English and Italian. The model was prompted to play the role of the Cambridge InfoTown system, whose goal is to guide the user towards reserving a restaurant in Cambridge. For each dialogue, we utilised 10 restaurant instances taken from the MultiWOZ KB. We selected 6 distinct sets of instances with the following characteristics:

1. all with the same Food;
2. all with different Food (or as different as possible);
3. all with the same Price;
4. all with different Price (or as different as possible);
5. all with the same Area;
6. all with different Area (or as different as possible).

We chose the slots Food, Price, and Area to differentiate the sets, since they are the informable slots within the Restaurant concept.
The human users were instructed to follow a scenario that involved reserving a restaurant, providing a realistic context for the dialogues. Five distinct instructions were employed for the interactive generation of a human-LLM dialogue, each paired with the 6 sets of KB instances, resulting in a total of 30 dialogue scenarios. The process was repeated in both English and Italian, leading to the creation of 30 dialogues in each language, for a total of 60 dialogues.

3.1. Manual Evaluation

The manual evaluations were conducted by three annotators who assessed the dialogues based on the binary metrics KB-Alignment and KB-Grounding. Each of the 60 dialogues was annotated by at least two different annotators to ensure reliability. The inter-annotator agreement between human evaluators was measured using Cohen's Kappa (κ) as a measure of inter-rater reliability (IRR). As per Table 1, we obtained an average κ, for both metrics and both languages, that indicates substantial agreement on Landis and Koch's agreement scale [10].

Table 1: Cohen's κ values for inter-annotator agreement on human-Llama generated dialogues.

Annotators      Metric          ITA     ENG
human-human     KB-Alignment    0.71    0.65
human-human     KB-Grounding    0.79    0.59
human-GPT-4o    KB-Alignment    0.60    0.58
human-GPT-4o    KB-Grounding    0.58    0.39

3.2. Automated Evaluation

We instructed GPT-4o (used via the Microsoft Azure APIs, API version 2024-02-01; the cost of the API interactions was about $400) to perform the same evaluations as the human annotators. This consisted of feeding the model a given KB/dialogue pair and asking it to output two lists of turn assessments: one for KB-Grounding and another for KB-Alignment. We then computed the agreement between GPT-4o's evaluations and the human evaluations. The precise prompt used to instruct GPT-4o can be found in Appendix B. Although the agreement with GPT-4o (see Table 1) was slightly lower than the substantial agreement observed between human annotators, it was still classified as moderate on Landis and Koch's agreement scale [10]. Given these results, we considered GPT-4o a valuable automatic judge and deployed it in the same way for the Llama-Llama evaluation phase (cfr. Section 4).
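The agreement values reported in Table 1 can be computed from two annotators' turn-level labels with a standard Cohen's κ implementation; a minimal sketch using scikit-learn is given below (the label lists are purely illustrative, not our actual annotations):

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative per-turn binary labels for one metric (e.g. KB-Alignment),
# one entry per evaluated system turn, from two independent annotators.
human_labels = [1, 0, 1, 1, 0, 1, 0, 1]
gpt4o_labels = [1, 0, 1, 0, 0, 1, 1, 1]

kappa = cohen_kappa_score(human_labels, gpt4o_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```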
4. The Dining Llamas of Oz

Having verified the ability of Llama-3 to generate dialogues and the evaluation skills of GPT-4o (cfr. Section 3.2), we conducted further experiments, generating 1,311 dialogues with Llama-3 8B by following the MultiWOZ dataset. For each dialogue of the original dataset, we used the instructions provided to the human user in the Wizard-of-Oz setting to guide a Llama acting as the user, interacting with a Llama acting as the system. During dialogue generation, we randomly selected 70 instances from the entire Knowledge Base for each simulated dialogue, ensuring that each dialogue was staged in a varied KB scenario. This approach, which we refer to as the Llama-Llama phase, allowed us to create a large set of automatically generated dialogues, each based on a different subset of the KB. We call this generated dataset "The Dining Llamas of Oz"; it comprises 1,049 training instances, with 131 instances each for the validation and test sets.
Table 2 presents statistics for the dataset, including the average number of turns per dialogue, the average length in number of tokens of user and system turns, and the Standardized Type-Token Ratio (STTR) [11] of user and system turns. The STTR is calculated by merging all turns, segmenting them into chunks (we used a segmentation size of 1000), and computing the average TTR over all chunks.

Table 2: Statistics of the Llama-Llama dialogues dataset.

Statistic                       Value
Number of Dialogues             1311
Average Dialogue Length         6.21
Average User Turns Length       25.69
Average System Turns Length     124.52
User Turns STTR                 0.29
System Turns STTR               0.41
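A minimal sketch of the STTR computation described above is shown below. It assumes whitespace tokenisation and that incomplete final chunks are discarded; the exact tokeniser and handling of the last chunk in our statistics may differ:

```python
def sttr(turns, chunk_size=1000):
    """Standardized Type-Token Ratio: merge all turns, split the token stream into
    fixed-size chunks, and average the type/token ratio over the chunks."""
    tokens = " ".join(turns).lower().split()  # assumption: whitespace tokenisation
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    chunks = [c for c in chunks if len(c) == chunk_size]  # assumption: drop the incomplete tail
    if not chunks:
        return None
    return sum(len(set(c)) / len(c) for c in chunks) / len(chunks)
```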
4.1. Turn-by-Turn Evaluation

To assess the quality of The Dining Llamas of Oz dataset, we employed GPT-4o, as in our previous experiments. Using the same approach as in Section 3.2, we obtained a KB-Alignment score of 49.73% and a KB-Grounding score of 38.59% for the entire dataset. To verify the annotation quality of these new dialogues, we manually annotated 30 dialogues from the evaluation split and compared these annotations with GPT-4o's evaluations on the same dialogues. This initial comparison resulted in a poor κ of 0.15 for KB-Alignment and 0.06 for KB-Grounding (slight agreement). To improve agreement and establish a reliable evaluation pipeline, we revised our approach: instead of passing the entire dialogue to GPT-4o, we evaluated one turn at a time. The detailed methodology was as follows:

1. Provide GPT-4o with a user utterance and the corresponding system response, and prompt it to determine whether the system's response references the KB.
2. If GPT-4o indicates a reference to the KB:
   a) prompt GPT-4o with the same user-system turn and the KB to determine whether the system's turn shows KB-Alignment;
   b) prompt GPT-4o with the same user-system turn and the KB to determine whether the system's turn shows KB-Grounding.

The full prompts are available in Appendix B. This method allows for a more precise scoring of each turn, although it increases GPT-4o API usage and the associated costs. We found that this turn-by-turn evaluation approach significantly improved the agreement: we obtained a κ of 0.68 for KB-Alignment and 0.49 for KB-Grounding (substantial and moderate agreement, respectively). Consequently, we adopted this technique for the automated evaluation; a sketch of the resulting pipeline is given below.
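The sketch below illustrates the turn-by-turn pipeline just described. Here `ask_gpt4o` is a hypothetical helper (its implementation, e.g. a call to the Azure chat completions API, is omitted) that sends a single prompt and returns the model's 0/1 answer; the prompt texts are those reported in Appendix B:

```python
def ask_gpt4o(prompt: str) -> int:
    """Hypothetical helper: send one prompt to GPT-4o and parse its 0/1 answer."""
    raise NotImplementedError

def evaluate_turn(user_turn: str, system_turn: str, kb: str):
    """Turn-by-turn evaluation: first check whether the system turn refers to the KB;
    only if it does, query KB-Alignment and KB-Grounding separately."""
    refers_to_kb = ask_gpt4o(
        "Given the following user and system turns, return 1 if the system turn contains "
        "information that requires verification from an external source to ensure its "
        f"accuracy, 0 otherwise.\nUser: {user_turn}\nSystem: {system_turn}")
    if not refers_to_kb:
        return None  # turns with no KB reference are excluded from the metrics
    alignment = ask_gpt4o(
        "Given the following user turn, system turn, and Knowledge Base (KB), return 0 if "
        f"the system contradicts the KB, 1 otherwise.\nUser: {user_turn}\nSystem: {system_turn}\nKB: {kb}")
    grounding = ask_gpt4o(
        "Given the following user turn, system turn, and Knowledge Base, return 1 if the "
        "system doesn't mention properties outside of the Knowledge Base, 0 otherwise."
        f"\nUser: {user_turn}\nSystem: {system_turn}\nKB: {kb}")
    return {"KB-Alignment": alignment, "KB-Grounding": grounding}
```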
Using this approach, we assessed 262 dialogues (from the validation and test splits) with GPT-4o, obtaining a broader picture of the KB consistency of Llama-generated dialogues on a larger dataset. The KB consistency evaluation is summarised in Table 3; turns judged to have no reference to the KB were filtered out. In addition to evaluating the metrics over all 262 dialogues, we further analysed the dataset by dividing it according to two criteria: dialogue success and dialogue length. For the success criterion, we distinguished between dialogues whose user instruction, in the original MultiWOZ dataset, led to a successful restaurant booking (successful dialogues) and those that did not lead to any restaurant reservation (unsuccessful dialogues). For the length criterion, we distinguished between dialogues with three or fewer turns (a maximum of three user utterances and three system utterances) and those with four or more turns.

Table 3: Turn-by-turn GPT-4o evaluation of KB consistency in The Dining Llamas of Oz validation and test splits.

Dialogues            # Dialogues  # Turns  KB-Alignment  KB-Grounding  Correct Turns  Correct Dialogues
All                  262          656      41.46%        38.26%        26.35%         8.78%
Successful Bookings  196          494      42.51%        41.50%        28.59%         11.29%
Failing Bookings     66           162      38.27%        28.40%        19.62%         0.5%
Short dialogues      187          411      42.09%        38.44%        29.02%         11.23%
Long dialogues       75           245      40.41%        37.96%        22.80%         3.17%

5. Discussion

Our investigation into the performance of state-of-the-art Large Language Models (LLMs) such as Llama-3 in task-oriented dialogue systems reveals several critical insights into their current limitations. The central finding is that, while these models exhibit advanced text generation capabilities, their quality in managing task-oriented dialogues remains unsatisfactory.
Initially, we compared human evaluations with GPT-4o's evaluations to assess its effectiveness in evaluating dialogue quality. This comparison was instrumental in determining that GPT-4o can be useful for dialogue evaluation, but it also showed that the model's performance degrades significantly when scaled from a smaller to a larger Knowledge Base: the annotation agreement dropped notably as the number of KB instances increased from 10 to 70, indicating that GPT-4o struggles with larger, more complex KBs.
To address this, we shifted to a turn-by-turn evaluation method. After extensive experimentation and prompt engineering, this method yielded improved annotation agreement. However, it proved to be highly resource-intensive, pushing up costs significantly due to increased GPT-4o API usage.
Our automated evaluations on 262 dialogues provide some revealing observations, as shown in Table 3. Notably, only around 40% of system turns demonstrated KB-Alignment and KB-Grounding. When considering both metrics together, through Correct Turns and Correct Dialogues, the results were even more concerning: just 26% of turns and less than 9% of dialogues met the criteria for both metrics. These numbers underscore the inadequacy of current systems: a system producing such a low percentage of correct dialogues is not practical for real-world applications.
Further analysis showed that dialogues with successful bookings performed better than those with failed bookings. Specifically, dialogues with successful bookings had 28.59% correct turns and 11.29% correct dialogues, whereas dialogues with failed bookings had about 9 percentage points fewer correct turns and only 0.5% correct dialogues. This discrepancy likely arises because, when no suitable restaurants are available, the Llama model tends to hallucinate, providing restaurants not present in the KB. While these restaurants may exist in Cambridge, they are absent from the provided dataset, highlighting the model's failure to adhere to the instructions given in the prompt.
We also explored the impact of dialogue length on performance. Shorter dialogues achieved nearly 30% correct turns and 11.23% correct dialogues, while longer dialogues showed a significant drop: about 6 percentage points fewer correct turns and only 3.17% correct dialogues. This suggests that as the conversation progresses the likelihood of errors increases, possibly due to the model's difficulty in managing and integrating information from previous turns.
Overall, our findings highlight that current state-of-the-art open-source LLMs, such as Llama-3, are still unable to serve effectively as task-oriented dialogue systems while maintaining consistency with a provided KB. This underscores the need for further advancements in LLM capabilities and evaluation methodologies before such systems can be reliably used in practical applications.

6. Limitations

While our study makes significant contributions to understanding the capabilities of state-of-the-art LLMs in task-oriented dialogue, it is important to acknowledge certain limitations that may affect the generalizability and scalability of our findings. The turn-by-turn evaluation approach, while effective in enhancing evaluation accuracy, proved to be computationally expensive. The quality of GPT-4o's evaluations was highly dependent on effective prompt engineering, and crafting the right prompts to ensure accurate evaluation results was challenging and time-consuming. Additionally, employing a diverse set of models for generating and evaluating dialogues could provide more comprehensive findings: using multiple models might help in understanding the strengths and limitations of different approaches, offering a more robust analysis of dialogue quality and consistency, and mitigating the limitations inherent in any single model or evaluation approach.

7. Conclusions and Future Work

In this study, we explored the capabilities of state-of-the-art LLMs in generating task-oriented dialogues, focusing on maintaining consistency with a provided KB and avoiding hallucinations. Our experiments demonstrated that Llama-3, despite its advancements, struggles to perform reliably in these settings. The model showed significant limitations, especially in dialogues that led to failed outcomes (where the desired restaurant was not in the KB) and in longer interactions. As a side contribution, we release The Dining Llamas of Oz, a corpus of 1,311 dialogues generated through user-Llama and system-Llama interactions, to support future research. Our findings highlight the need for further development to improve LLM reliability and accuracy in task-oriented dialogue applications.

Acknowledgments

This work has been partially supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU.

References

[1] M. McTear, Conversational AI: Dialogue systems, conversational agents, and chatbots, Synthesis Lectures on Human Language Technologies 13 (2020) 1-251.
[2] T. Labruna, B. Magnini, Addressing domain changes in task-oriented conversational agents through dialogue adaptation, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 2023, pp. 149-158.
[3] S. Young, M. Gašić, B. Thomson, J. D. Williams, POMDP-based statistical spoken dialog systems: A review, Proceedings of the IEEE 101 (2013) 1160-1179.
[4] S. Louvan, B. Magnini, Recent neural methods on slot filling and intent classification for task-oriented dialogue systems: A survey, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 480-496. URL: https://www.aclweb.org/anthology/2020.coling-main.42. doi:10.18653/v1/2020.coling-main.42.
[5] V. Balaraman, S. Sheikhalishahi, B. Magnini, Recent neural methods on dialogue state tracking for task-oriented dialogue systems: A survey, in: Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2021, pp. 239-251.
[6] H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[7] OpenAI, J. Achiam, S. Adler, et al., GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
[8] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, M. Gašić, MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 5016-5026. URL: https://www.aclweb.org/anthology/D18-1547. doi:10.18653/v1/D18-1547.
[9] T. Han, X. Liu, R. Takanabu, Y. Lian, C. Huang, D. Wan, W. Peng, M. Huang, MultiWOZ 2.3: A multi-domain task-oriented dialogue dataset enhanced with annotation corrections and co-reference annotation, in: Natural Language Processing and Chinese Computing: 10th CCF International Conference, NLPCC 2021, Qingdao, China, October 13-17, 2021, Proceedings, Part II, Springer, 2021, pp. 206-218.
[10] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977).
[11] B. Richards, Type/token ratios: What do they really tell us?, Journal of Child Language 14 (1987) 201-209.

A. Llama Prompts

The following prompt was used to instruct a Llama to play the role of the Cambridge InfoTown system, in English:

"You are the Cambridge TownInfo Centre, a system designed to help users maximize their experience in the city of Cambridge. Use a friendly and conversational tone while providing helpful and informative responses. All the information you provide must strictly rely on the Knowledge Base that you have been provided with. Ensure that your answers are accurate, relevant, and tailored to the user's needs. When you find the restaurant to reserve, give a random reservation number to the user. Be brief."

The following prompt was used to instruct a Llama to play the role of the Cambridge InfoTown system, in Italian:

"Sei l'assistente Cambridge InfoCittà, un sistema progettato per aiutare gli utenti a trarre il meglio dalla loro esperienza nella città di Cambridge. Usa un tono amichevole e conversazionale, fornendo risposte informative e utili. Tutte le informazioni che fornisci devono basarsi strettamente sulla Knowledge Base che ti è stata data. Assicurati che le tue risposte siano accurate, pertinenti, e mirate ai bisogni dell'utente. Sii breve."

The following prompt was used to instruct a Llama to play the role of a user looking for a restaurant in Cambridge, in English:

"You are a turist in the city of Cambridge and you are looking for a restaurant to dine in. Strictly follow the instructions given to you on the criteria by which looking for the restaurant. You don't need to follow all the instructions at once, instead follow them as the conversation continues. Be very brief, and go straight to the point. At the end, thank the system and say goodbye. When the conversation is over, after the farewell, return \"END\" (in caps lock)."

The following prompt was used to instruct a Llama to play the role of a user looking for a restaurant in Cambridge, in Italian:

"Sei un turista nella città di Cambridge e stai cercando un ristorante dove cenare. Basati strettamente sulle istruzioni che ti vengono fornite riguardo i criteri in base ai quali cercare il ristorante. Non seguire tutte le istruzioni subito, invece seguile passo passo durante la conversazione. Sii molto breve e vai subito al punto."
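For completeness, the sketch below shows one way the prompts above could be assembled into a user-Llama / system-Llama exchange of the kind described in Section 4. It is an assumption about the loop structure rather than our exact generation code, and `chat` is a hypothetical helper wrapping a Llama-3 8B chat endpoint:

```python
def chat(messages):
    """Hypothetical helper: send a list of chat messages to Llama-3 8B and return the reply."""
    raise NotImplementedError

def generate_dialogue(system_prompt, user_prompt, kb_text, max_turns=10):
    # Each agent keeps its own view of the conversation; only the system agent sees the KB.
    sys_msgs = [{"role": "system", "content": system_prompt + "\nKnowledge Base:\n" + kb_text}]
    user_msgs = [{"role": "system", "content": user_prompt}]
    dialogue = []
    for _ in range(max_turns):
        user_turn = chat(user_msgs)                      # user-Llama speaks
        if "END" in user_turn:                           # termination token from the user prompt
            break
        system_turn = chat(sys_msgs + [{"role": "user", "content": user_turn}])
        sys_msgs += [{"role": "user", "content": user_turn},
                     {"role": "assistant", "content": system_turn}]
        user_msgs += [{"role": "assistant", "content": user_turn},
                      {"role": "user", "content": system_turn}]
        dialogue.append((user_turn, system_turn))
    return dialogue
```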
B. GPT Prompts

The following system prompt was used as a general instruction telling GPT to behave as a dialogue evaluator:

"You are a dialogue evaluator. Given a dialogue you have to return a list of symbols separated by commas, where each symbol is an evaluation of each turn in the dialogue. Only system turns must be considered."

The following prompt was used to instruct GPT to determine whether a system turn refers to information contained in a KB:

"Given the following user and system turns, return 1 if the system turn contains information that requires verification from an external source to ensure its accuracy, 0 otherwise."

The following prompt was used to instruct GPT to determine whether a system turn constitutes a KB-Alignment error:

"Given the following user turn, system turn, and Knowledge Base (KB), return 0 if the system contradicts the KB (e.g. says that a restaurant is at north, but it's actually at south), 1 otherwise."

The following prompt was used to instruct GPT to determine whether a system turn constitutes a KB-Grounding error:

"Given the following user turn, system turn, and Knowledge Base, return 1 if the system doesn't mention properties outside of the Knowledge Base, 0 otherwise (e.g. says that the restaurant serves british and indian, but only indian is present in the KB)."