                                Dynamic Prompting: Large Language Models for Task
                                Oriented Dialog
                                Jan Nehring1 , Akhil Juneja1 , Adnan Ahmad2 , Roland Roller1 and Dietrich Klakow3
                                1
                                  German Research Center for Artificial Intelligence (DFKI), Alt-Moabit 91c, 10559 Berlin, Germany
                                2
                                  TU Berlin, Straße des 17. Juni 135, 10623 Berlin, Germany
                                3
                                  Saarland University, Campus, 66123 Saarbrücken, Germany


Abstract
Large Language Models show impressive results in many different applications, most notably in the context of question answering and open dialog. However, it is still an open question how to use those models for task-oriented dialogs, such as booking or customer information systems. In this work, we propose Dynamic Prompting, an architecture for task-oriented dialog that integrates the benefits of Large Language Models, and we showcase the approach on the MultiWOZ 2.2 dataset. Our architecture leads to a high task success rate, provides sensible and specific answers, and is resistant to hallucinations. Further, we show that Dynamic Prompting is able to answer questions that were not anticipated by the dialog system designer and that it can correct several types of errors and other shortcomings of the system.

Keywords
Dialog Systems, Large Language Models, Task-Oriented Dialog, Dynamic Prompting



1. Introduction

Task-Oriented Dialog Systems (TODS) assist users in completing a task within a conversation [1], for instance in the context of customer information and bookings (train/restaurant). In an applied setting with real users, it is important that those systems provide correct answers, solve tasks quickly, and ideally lead to high user satisfaction. To ensure this, TODS often give system developers a high level of control over dialog management and answer behavior. Existing solutions normally either implement a dialog manager manually to control the complete interaction, or train it on large amounts of dialog interactions [2, 3, 4, 5].

In contrast, Large Language Models (LLMs) are very good at open-domain dialog and provide fluent and convincing messages in different styles. However, those answers might be misleading or even false (hallucination) [6, 7, 8]. In task-oriented dialog, the model could possibly 'break out' of the given dialog task.

Using LLMs for task-oriented dialog is still in its infancy. Madotto et al. [9] used LLMs for the whole pipeline of Natural Language Understanding, Dialog State Tracking, Dialog Policy and Natural Language Generation. Hudeček and Dusek [10] expand on this idea by evaluating the abilities of LLMs to generate complete task-oriented multi-turn dialogs. They also used LLMs for NLU and DST but, unlike our work, they used a static prompt. Other approaches to LLMs for task-oriented dialog are presented by Cao [11], Hu et al. [12], Wei et al. [13], and Li et al. [14].

To address those limitations and concerns, we propose Dynamic Prompting, a technique that combines a traditional task-oriented dialog system pipeline with the benefits of LLMs. Showcased and tested in the context of restaurant booking, we present the advantages and limitations of our approach.

2. Experimental Setup

2.1. Dataset

For our experiments, we use the Multi-domain Wizard-of-Oz 2.2 (MultiWOZ) dataset [15, 16]. The dataset contains conversations between users and task-oriented chatbots that were collected using the Wizard-of-Oz method [17]. For our setup, we only use conversations from the restaurant domain. In the conversations, users ask chatbots about restaurants in Cambridge, England. The systems can recommend various restaurants based on attributes such as the type of food, location, or price range. The system can also perform bookings.

Table 1 shows an example dialog from the dataset. The restaurant domain encompasses two intents (find, book) and ten slots. For each user utterance, the dataset contains annotations for intent, slots in the utterance, and dialog state (the detected slots over the whole dialog).

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
jan.nehring@dfki.de (J. Nehring); juneja.akhil@dfki.de (A. Juneja); adnan.ahmad@tu-berlin.de (A. Ahmad); roland.roller@dfki.de (R. Roller); dietrich.klakow@lsv.uni-saarland.de (D. Klakow)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




User: Can you find me an expensive place serves panasian food?
Bot: i haven't found any expensive panasain restaurants. should i try another price range?
User: Are there restaurants that serve Mediterranean food?
Bot: I have a couple. Both are in the centre. la mimosa, shiraz restaurant
User: I am interested in La Mimosa. Please book a table for 7 at 15:15 on Wednesday. Thanks for your help!
Bot: Booking was successful. The table will be reserved for 15 minutes. Reference number is : ER5DFE50 . Is there anything else I can help you with?
User: No, that's it. Thank you so much. Goodbye.
Bot: Have a great day!

Table 1
MultiWOZ 2.2 example dialog.

2.2. Dialog System Architecture

Task-Oriented Dialog Systems (TODS) traditionally consist of the following main components [1]: Natural Language Understanding (NLU), Dialog State Tracking (DST), Dialog Manager (DM) and Response Generation (RG). The NLU performs intent detection and slot filling on the current user utterance, whereas the DST keeps track of the slots over the whole dialog. The DM selects the next action of the bot based on the results of NLU and DST. Lastly, the RG constructs the response to the user.

Dynamic Prompting. In the following, we introduce Dynamic Prompting, a TODS architecture extended by the capabilities of an LLM. Figure 1 shows the architecture. We chose a trained model for the NLU component to handle intent recognition and entity extraction, as Hudeček and Dusek [10] highlighted the limited performance of LLMs in these tasks. For NLU, we use the RASA NLU component, powered by the DIET classifier [18], while for DST, we use a simple hashmap that stores the most recent NLU results. We trained the NLU component only on user utterances from the restaurant domain of the train split of the dataset, focusing on the find and book intents. During training, we also provided use case-specific entities, both categorical (pricerange, area, bookday, bookpeople) and non-categorical (food, name, booktime, address, phone, postcode, and reference).

We replace/extend the Dialog Manager and Response Generation with a Prompt Generator and an LLM. The prompt generator uses input from the DST and NLU and applies a series of rules, similar to a dialog manager. It also fetches data from a database and generates a prompt. It uses prompting templates that consist of three parts: 1) a general task description, 2) content from the database and the NLU state in JSON format, and 3) the previous conversation in theater-script style. Table 2 shows an example prompt of our system in the restaurant domain with the detected intent 'book restaurant'. The prompt is then sent to an LLM to generate a reply to the user. In our experiments, we use GPT-3.5-Turbo (ChatGPT), accessed via API calls, as it demonstrated leading performance in the results presented by Hudeček and Dusek [10].

Appendix A shows the prompting templates and the rules that we developed for our prompt generator. Our prompt construction approach involved multiple iterations of a 'trial and error' process on the training set, evaluating the prompts' effectiveness based on the system's task completion and relevance to the conversation. Initially, we used single instructions in the prompts. However, scenarios such as having no available restaurants, multiple options, or booking a restaurant required more specific instructions. This led us to implement dynamic prompts with tailored rules for each situation.

Assist the user in booking a restaurant. Always assume the restaurant is available to confirm a successful booking. Provide a reference number when the restaurant name, bookday, bookpeople, and booktime are given. Prompt if these details are missing. Omit information about fictional bookings.

Dialog State: {"food": "mediterranean", "pricerange": "expensive", "bookday": "wednesday", "bookpeople": "7", "booktime": "15:15"}

Conversation History
User: Can you find me an expensive place serves panasian food?
Bot: i haven't found any expensive panasian restaurants. should i try another price range?
User: Are there restaurants that serve Mediterranean food?
Bot: I have a couple. Both are in the centre. la mimosa, shiraz restaurant
User: I am interested in La Mimosa. Please book a table for 7 at 15:15 on Wednesday. Thanks for your help!

Table 2
Example prompt of Dynamic Prompting, which is sent to ChatGPT.
Figure 1: Processing pipeline of Dynamic Prompting
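To make this concrete, the following Python sketch outlines one possible implementation of the pipeline in Figure 1, under stated assumptions: run_nlu stands in for the trained RASA/DIET component, db.query and select_template are placeholders for the database lookup and the rule-based template selection of Appendix A, and the OpenAI client call reflects the current public API rather than the authors' actual code.

    import json
    from openai import OpenAI  # assumes the openai Python package (>= 1.0)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def run_nlu(utterance: str) -> dict:
        """Stand-in for the trained RASA NLU component (DIET classifier).
        Expected to return e.g. {"intent": "book_restaurant", "slots": {...}}."""
        raise NotImplementedError

    def build_prompt(task: str, state: dict, db_result: dict, history: list) -> str:
        """Assemble the three prompt parts: task description, state and
        database content in JSON, and the theater-script conversation history."""
        return "\n\n".join([
            task,
            "Dialog State: " + json.dumps(state),
            "Database: " + json.dumps(db_result),
            "Conversation History\n" + "\n".join(history),
        ])

    dialog_state = {}  # DST: hashmap holding the most recent NLU slot values
    history = []       # conversation so far, in theater-script style

    def dialog_turn(user_utterance: str, db, select_template) -> str:
        """One turn of the pipeline: NLU -> DST -> prompt generation -> LLM."""
        nlu = run_nlu(user_utterance)
        dialog_state.update(nlu["slots"])        # DST keeps the latest slot values
        db_result = db.query(dialog_state)       # matching restaurants as a dict
        task = select_template(nlu["intent"], db_result)  # rules, see Appendix A
        history.append("User: " + user_utterance)
        prompt = build_prompt(task, dialog_state, db_result, history)
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        reply = response.choices[0].message.content
        history.append("Bot: " + reply)
        return reply

Keeping the dialog state as a flat hashmap means later slot values simply overwrite earlier ones, which matches the DST behavior described above.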



2.3. Evaluation

In our experiment, the chatbot generates a response with our Dynamic Prompting system for each dialog turn. We evaluated its performance on the test split of the dataset's restaurant domain. To evaluate the responses on different levels, we had them labeled by two human annotators according to the following criteria. The annotation guidelines can be found in the supplementary materials.

• Task Success Rate describes the ratio of successful dialogs to the total number of dialogs. Following the definition of Wen et al. [19] and Nekvinda and Dušek [20], we mark a dialog as successful if 1) the system's recommendation aligns with the user's requests (such as price range, location, and cuisine) and 2) the system adequately addressed subsequent requests by the user, such as providing the telephone number or confirming a successful booking.
• Prompt Instruction Performance, a binary scale assessing whether responses aligned with the prompt instructions.
• Information Extraction Performance, a binary scale assessing whether the system is able to fetch the relevant address from the JSON information.
• Response Slot Accuracy, the ratio of correctly predicted slot values to the number of slot values in the response. It measures if our system is able to return all desired slots to the user. We compute ratios across all annotated turns for these metrics; a sketch of the computation follows this list.
• Sensibleness describes whether the utterance makes sense given the context [21, 22].
• Specificity describes whether the utterance is specific regarding the context [21, 22]. LMs tend to generate unspecific answers such as "this is great", which are sensible but not desired.
• Interestingness describes whether the utterance captures someone's attention, arouses curiosity, or exhibits traits such as unexpectedness, wit, or insightfulness [22]. Interestingness contributes to a compelling and engaging user experience.
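To illustrate how Response Slot Accuracy can be computed, here is a minimal Python sketch that micro-averages the ratio over all annotated turns; the per-turn data layout (gold and predicted slot-value dictionaries) is an assumption for illustration, not the paper's actual annotation format.

    def response_slot_accuracy(turns):
        """Micro-averaged ratio of correctly returned slot values.
        `turns` is assumed to be a list of (gold, predicted) pairs, each
        mapping slot names to the value the response should contain (gold)
        or actually contains (predicted)."""
        correct = total = 0
        for gold, predicted in turns:
            total += len(gold)
            correct += sum(1 for slot, value in gold.items()
                           if predicted.get(slot) == value)
        return correct / total if total else 0.0

    # Example: one turn where the response returns 2 of 3 requested slots.
    turns = [({"food": "italian", "area": "centre", "phone": "01223356666"},
              {"food": "italian", "area": "centre"})]
    print(response_slot_accuracy(turns))  # 0.666...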
3. Results

Table 3 shows the task success rate of our system compared to other TODS on the MultiWOZ 2.2 dataset. Although the other systems use the whole dataset and are thus not perfectly comparable to ours, the comparison still shows that Dynamic Prompting performs on par with SOTA systems. This is remarkable, particularly as we use a relatively simple NLU component, which by itself might produce errors. However, if we do not use the NLU system of our pipeline but instead use the entity annotations from the dataset, we get a 'perfect' NLU without any errors. In this case, our Dynamic Prompting achieves a Task Success Rate of 0.94, which highlights the efficiency of the LLM solution.

System                               Task Success
Yang et al. [23]                     0.83
Lee [3]                              0.80
Su et al. [24]                       0.85
Dynamic Prompting                    0.81
perfect NLU + Dynamic Prompting      0.94

Table 3
Comparison of Task Success Rates on MultiWOZ 2.2 data, with an inter-annotator agreement of 1 for Dynamic Prompting.

Table 5 shows further performance metrics. The dialog success rate is supported by the high sensibility and specificity scores, which indicate that the system answers on point and does not deviate from the dialog's goal. However, the Response Slot Accuracy is only 80% and needs to be improved, but this is not the focus of this work. Extracting information from the database works almost perfectly (Information Extraction Performance = 0.98). Although the system does not always follow all instructions from the prompt (Prompt Instruction Performance = 0.82), the task success is still quite high, so we assume that only minor errors cause the relatively low Prompt Instruction Performance.

3.1. Qualitative Analysis

In the following, we analyze the conversations and, particularly, the generated responses of our Dynamic Prompting in more detail.

3.1.1. Handling Unusual Requests

In one situation, the user asked to send the information via email, which the designers of the original dataset did not anticipate. In such situations, traditional dialog systems can only answer with "I did not understand". Our approach instead was able to produce a sensible response, although it had never been trained for this case (see Table 4).

3.1.2. Politeness and Engagement

Similar to our findings in Section 3.1.4, the responses of our system are not only longer but also more engaging than the ground truth. For example, in one situation, our system produced an answer such as "You're welcome! If you have any more questions or need further assistance, feel free to ask. Have a great day too!" while the crowd worker wrote only "Thank you. Goodbye". Overall, we counted 'polite' phrases in the responses and found that Dynamic Prompting uses them more often than the ground truth, such as "enjoy your meal" (15.5 times more often), "have a great day" (2.2), "you're welcome" (4.8), "certainly!" (61.0), and "great!" (20.0). Table 9 in the appendix shows more detailed examples.

3.1.3. Formatting Addresses and Names

The database entries are formatted inconsistently. Names are often lowercase, and the crowd workers did not correct this issue when they wrote the system responses. Also, postcodes are stored in the database in the format "cb17aa", although the correct format in the Cambridge area would be "CB1 7AA". Our approach consistently fixes these errors out of the box.

3.1.4. Diverse Responses

Dynamic Prompting produces responses that are, on average, 2.41 times longer and more diverse than the responses of the crowd workers in the WOZ dataset, with lexical diversity measured by an MTLD score [25] of 80.41 compared to 72.26 for the WOZ dataset. We assume that the crowd workers were interested in providing fast and minimalistic answers. However, while diverse replies might be considered positive, as they make the interaction more enjoyable, it remains to be determined whether long answers are preferable to short answers containing all necessary information. Table 7 in the appendix shows examples of this behavior.
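As background for the lexical diversity figures above, the following is a simplified sketch of the MTLD measure of McCarthy and Jarvis [25]: it counts how many segments ('factors') it takes for the running type-token ratio to repeatedly fall to the conventional 0.72 threshold and divides the token count by that factor count. A validated library implementation should be preferred for reported numbers.

    def mtld_one_pass(tokens, threshold=0.72):
        """One-directional MTLD: count complete factors plus a partial
        factor for the remainder, then divide token count by factors."""
        factors = 0.0
        types, count = set(), 0
        for token in tokens:
            types.add(token.lower())
            count += 1
            if len(types) / count <= threshold:
                factors += 1           # the running TTR fell to the threshold
                types, count = set(), 0
        if count:                      # partial factor at the end of the text
            ttr = len(types) / count
            factors += (1 - ttr) / (1 - threshold)
        return len(tokens) / factors if factors else float("nan")

    def mtld(tokens, threshold=0.72):
        # The published measure averages a forward and a backward pass.
        return (mtld_one_pass(tokens, threshold)
                + mtld_one_pass(list(reversed(tokens)), threshold)) / 2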
3.1.5. Adaptive NLU Error Recovery

We detected a user dialog in which the NLU failed to detect the correct area. The user asked for a restaurant in the center of the city, but due to the NLU error, the prompt generator wrote two restaurants into the prompt, one in the center and one in the south. The LLM was able to detect that the user wanted a restaurant in the center and presented only the correct restaurant in its response (see Table 8 in the appendix).

3.1.6. Error Analysis

Our system produced two types of problems. First, it was not able to display the total count of available restaurants correctly. Although the total count of available restaurants is encoded in the JSON prompt, the LLM opted to count the number of elements within the JSON options array instead.

The second common problem was the generation of fake booking reference numbers. At the end of the dialog, users often ask for a booking reference number. We do not include booking reference numbers in the prompt. Therefore, the system hallucinated booking reference numbers. We evaluated including reference numbers in JSON format in the database section of the prompt on ten samples; this eliminated all occurrences of fake reference number generation in our small sample set. Still, we did not apply this change to the whole dataset, so our results serve only as an indicator that this approach could solve the problem of fake booking reference number generation.
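A minimal sketch of the evaluated remedy: pre-generate a booking reference and expose it in the database section of the prompt so that the LLM copies a real value instead of inventing one. The reference format and the field name are illustrative assumptions modeled on the dataset.

    import json
    import random
    import string

    def make_reference() -> str:
        """Pre-generate a booking reference (format is an assumption,
        modeled on the 8-character codes seen in MultiWOZ)."""
        return "".join(random.choices(string.ascii_uppercase + string.digits, k=8))

    def booking_db_section(restaurant: dict) -> str:
        """Add the reference to the database section of the booking prompt,
        so the LLM can quote a real value instead of hallucinating one."""
        entry = dict(restaurant, reference=make_reference())
        return "Database: restaurant " + json.dumps(entry)

    print(booking_db_section({"name": "la mimosa", "bookday": "wednesday",
                              "bookpeople": "7", "booktime": "15:15"}))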
    Prompt:     Assist the user in finding a restaurant. A single restaurant matches the user’s criteria. Provide requested
                details of the restaurant. If no specific request is made, offer general information like food type, area,
                and price range. Respond in a coherent, non-numbered, and fluent text format based only on the
                conversation history, state, and database information.
                State: restaurant {"area": "centre", "food": "italian", "name": "stazione restaurant and coffee bar", "bookday": "saturday", "bookpeople": "1", "booktime": "12:45"}
                Database: restaurant {Count:1, Options: {"address": "Market Hill City Centre", "area": "centre", "food": "italian", "id": "19240", "name": "stazione restaurant and coffee bar", "phone": "01223352607", "postcode": "cb23nj", "pricerange": "expensive", "type": "restaurant"}}
    User:       Terrific! You are great, please send me all this information to my e-mail, thanks.
    TOD:        I apologize, but as a language model AI, I don’t have the capability to send emails. However, you can
                easily save or take a screenshot of this conversation for future reference. Remember to have a great
                time at stazione restaurant and coffee bar on Saturday at 12:45!
    Truth:      You’re welcome, was there anything else you need?
Table 4
Showcasing the ability of the system to handle unusual requests.


Metric                                     Score    IA
NLU
Intent Detection Accuracy                  0.89     na
Entity Recognition Joint State Accuracy    0.76     na
LLM metrics
Prompt Instruction Performance             0.82     1
Information Extraction Performance         0.98     0.65
Response Slot Accuracy                     0.80     na
Sensibility                                0.94     1
Specificity                                0.94     1
Interestingness                            0.89     0.84

Table 5
The table shows the scores and the inter-annotator agreement (IA, Cohen's kappa) of the quantitative analysis.
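For completeness, a minimal sketch of Cohen's kappa as used for the IA column, assuming two parallel lists of labels, one per annotator.

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa: observed agreement corrected by the agreement
        expected from each annotator's label distribution."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        count_a, count_b = Counter(labels_a), Counter(labels_b)
        expected = sum(count_a[l] * count_b[l]
                       for l in set(labels_a) | set(labels_b)) / n ** 2
        return (observed - expected) / (1 - expected) if expected != 1 else 1.0

    print(cohen_kappa([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.5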
                                                                eration module to accommodate new intents and state
                                                                values, ensuring smooth integration with these additional
4. Conclusion                                                   domains.

We presented Dynamic Prompting, a technique integrat-
ing LLMs for task-oriented dialog. The results show high Acknowledgements
sensibility and specificity values, which indicate that the
system answers on point and does not deviate from the This work has been supported by the Federal Joint Com-
dialog’s goal. The relatively low Prompt Extraction Per- mittee of Germany (Gemeinsamer Bundesausschuss) as
formance and Response Slot Accuracy values still result part of the project smartNTX (01NVF21116).
in excellent task success. The high values in the per-
formance metrics Prompt Instruction Performance and References
Information Extraction Performance indicate that the
LLM follows the task-oriented guidance of the dynamic       [1] D. Jurafsky, J. H. Martin, Speech and Language Pro-
prompts. The Information Extraction Performance of              cessing (Third Edition draft), https://web.stanford.
0.98 shows that the system could very well reuse the            edu/~jurafsky/slp3/ed3bookfeb3_2024.pdf, 2024.
database information embedded in the prompt in the              Accessed: 2024-3-10.
JSON format.                                                [2] W. He, Y. Dai, Y. Zheng, Y. Wu, Z. Cao, D. Liu,
   In addition, our system shows various ways to correct        P. Jiang, M. Yang, F. Huang, L. Si, et al., Galaxy:
errors, such as NLU errors, user requests not anticipated       A generative pre-trained model for task-oriented
by the designer of DS, and errors in the format of the          dialog with semi-supervised learning and explicit
policy injection, Proceedings of the AAAI Conference on Artificial Intelligence (2022).
[3] Y. Lee, Improving end-to-end task-oriented dialog system with a simple auxiliary task, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 1296–1303. URL: https://aclanthology.org/2021.findings-emnlp.112. doi:10.18653/v1/2021.findings-emnlp.112.
[4] H. Sun, J. Bao, Y. Wu, X. He, Mars: Modeling context & state representations with contrastive learning for end-to-end task-oriented dialog, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 11139–11160. URL: https://aclanthology.org/2023.findings-acl.708. doi:10.18653/v1/2023.findings-acl.708.
[5] Q. Wu, D. Alnuhait, D. Chen, Z. Yu, Using textual interface to align external knowledge for end-to-end task-oriented dialogue systems, 2023. arXiv:2305.13710.
[6] W. Sun, Z. Shi, S. Gao, P. Ren, M. de Rijke, Z. Ren, Contrastive learning reduces hallucination in conversations, in: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'23/IAAI'23/EAAI'23, AAAI Press, 2023. URL: https://doi.org/10.1609/aaai.v37i11.26596. doi:10.1609/aaai.v37i11.26596.
[7] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, P. Fung, A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity, in: J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, A. A. Krisnadhi (Eds.), Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Nusa Dua, Bali, 2023, pp. 675–718. URL: https://aclanthology.org/2023.ijcnlp-main.45. doi:10.18653/v1/2023.ijcnlp-main.45.
[8] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
[9] A. Madotto, Z. Liu, Z. Lin, P. Fung, Language models as few-shot learner for task-oriented dialogue systems, 2020. arXiv:2008.06239.
[10] V. Hudeček, O. Dusek, Are large language models all you need for task-oriented dialogue?, in: S. Stoyanchev, S. Joty, D. Schlangen, O. Dusek, C. Kennington, M. Alikhani (Eds.), Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, Prague, Czechia, 2023, pp. 216–228. URL: https://aclanthology.org/2023.sigdial-1.21. doi:10.18653/v1/2023.sigdial-1.21.
[11] L. Cao, DiagGPT: An LLM-based and multi-agent dialogue system with automatic topic management for flexible task-oriented dialogue, 2024. arXiv:2308.08043.
[12] Z. Hu, Y. Feng, Y. Deng, Z. Li, S.-K. Ng, A. T. Luu, B. Hooi, Enhancing large language model induced task-oriented dialogue systems through look-forward motivated goals, 2023. arXiv:2309.08949.
[13] J. Wei, S. Kim, H. Jung, Y.-H. Kim, Leveraging large language models to power chatbots for collecting user self-reported data, 2023. arXiv:2301.05843.
[14] Z. Li, B. Peng, P. He, M. Galley, J. Gao, X. Yan, Guiding large language models via directional stimulus prompting, in: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 62630–62656. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/c5601d99ed028448f29d1dae2e4a926d-Paper-Conference.pdf.
[15] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, M. Gašić, MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 5016–5026. URL: https://aclanthology.org/D18-1547. doi:10.18653/v1/D18-1547.
[16] X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, J. Chen, MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines, in: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Association for Computational Linguistics, Online, 2020, pp. 109–117. URL: https://aclanthology.org/2020.nlp4convai-1.13. doi:10.18653/v1/2020.nlp4convai-1.13.
[17] J. F. Kelley, An iterative design methodology for user-friendly natural language office information applications, ACM Trans. Inf. Syst. 2 (1984) 26–41. URL: https://doi.org/10.1145/357417.357420. doi:10.1145/357417.357420.
[18] T. Bunk, D. Varshneya, V. Vlasov, A. Nichol, DIET: Lightweight language understanding for dialogue systems, 2020. arXiv:2004.09936.
[19] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. Rojas-Barahona, P.-H. Su, S. Ultes, S. Young, A network-based end-to-end trainable task-oriented dialogue system, in: M. Lapata, P. Blunsom, A. Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 438–449. URL: https://aclanthology.org/E17-1042.
[20] T. Nekvinda, O. Dušek, Shades of BLEU, flavours of success: The case of MultiWOZ, in: A. Bosselut, E. Durmus, V. P. Gangal, S. Gehrmann, Y. Jernite, L. Perez-Beltrachini, S. Shaikh, W. Xu (Eds.), Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), Association for Computational Linguistics, Online, 2021, pp. 34–46. URL: https://aclanthology.org/2021.gem-1.4. doi:10.18653/v1/2021.gem-1.4.
[21] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, Q. V. Le, Towards a human-like open-domain chatbot, 2020. arXiv:2001.09977.
[22] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al., LaMDA: Language models for dialog applications, arXiv preprint arXiv:2201.08239 (2022).
[23] Y. Yang, Y. Li, X. Quan, UBAR: Towards fully end-to-end task-oriented dialog system with GPT-2, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, AAAI Press, 2021, pp. 14230–14238. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17674.
[24] Y. Su, L. Shu, E. Mansimov, A. Gupta, D. Cai, Y.-A. Lai, Y. Zhang, Multi-task pre-training for plug-and-play task-oriented dialogue system, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4661–4676. URL: https://aclanthology.org/2022.acl-long.319. doi:10.18653/v1/2022.acl-long.319.
[25] P. McCarthy, S. Jarvis, MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment, Behavior Research Methods 42 (2010) 381–392. doi:10.3758/BRM.42.2.381.

Appendix

A. Prompting templates

Table 6 shows the prompting templates that we developed for our system.

B. Example Conversations

Tables 7-9 show different example conversations.
     intent == find_restaurant:
         len(database) ≥ 5:
             Assist the user in finding a restaurant. Numerous restaurants match the user’s criteria. Ask for
             missing details of food type, area, or price range to narrow down the search before providing the
             restaurant options. Respond in a coherent, non-numbered, and fluent text format based only on the
             conversation history, state, and database information.
             {State + Database + Conversation History}

          len(database) ≥ 1:
              Assist the user in finding a restaurant. Multiple restaurants match the user’s criteria. Provide
              requested details from the available restaurant options. If no specific request is made, offer general
              information like food type, area, and price range. Respond in a coherent, non-numbered, and fluent
              text format based only on the conversation history, state, and database information.
              {State + Database + Conversation History}

          len(database) == 1:
              Assist the user in finding a restaurant. A single restaurant matches the user’s criteria. Provide
              requested details of the restaurant. If no specific request is made, offer general information like food
              type, area, and price range. Respond in a coherent, non-numbered, and fluent text format based only
              on the conversation history, state, and database information.
              {State + Database + Conversation History}

          len(database) == 0:
              Assist the user in finding a restaurant. No restaurant matches the user’s criteria. Suggest trying
              different food types, areas, or prices. Respond in a coherent, non-numbered, and fluent text format
              based only on the conversation history, state, and database information.
              {State + Database + Conversation History}

     intent == book_restaurant:
         Assist the user in booking a restaurant. Always assume the restaurant is available to confirm a
         successful booking. Provide a reference number when the restaurant name, bookday, bookpeople, and
         booktime are given. Prompt if these details are missing. Omit information about fictional bookings.
         {State + Conversation history}
     intent == none:
         You’re assisting the user in finding and booking a restaurant. Provide helpful information and
         guidance based on their requests.
         {Conversation history}


Table 6
Prompting templates that we developed for our system.
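The rules of Table 6 translate directly into a small dispatch function. The sketch below mirrors the table's thresholds, reading the 'len(database) ≥ 1' branch as 'more than one' because the single-restaurant rule follows it; the TEMPLATES dictionary is assumed to hold the instruction texts shown above (abbreviated here).

    # Hypothetical mapping from rule names to the instruction texts of Table 6.
    TEMPLATES = {
        "find_many": "Assist the user in finding a restaurant. Numerous restaurants match ...",
        "find_some": "Assist the user in finding a restaurant. Multiple restaurants match ...",
        "find_one":  "Assist the user in finding a restaurant. A single restaurant matches ...",
        "find_none": "Assist the user in finding a restaurant. No restaurant matches ...",
        "book":      "Assist the user in booking a restaurant. ...",
        "fallback":  "You're assisting the user in finding and booking a restaurant. ...",
    }

    def select_template(intent: str, database: list) -> str:
        """Mirror the rules of Table 6: branch on the detected intent and on
        how many database entries match the current dialog state."""
        if intent == "find_restaurant":
            if len(database) >= 5:
                return TEMPLATES["find_many"]
            if len(database) > 1:
                return TEMPLATES["find_some"]
            if len(database) == 1:
                return TEMPLATES["find_one"]
            return TEMPLATES["find_none"]
        if intent == "book_restaurant":
            return TEMPLATES["book"]
        return TEMPLATES["fallback"]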
   Example 1:
   Prompt:         Assist the user in finding a restaurant. No restaurant matches the user’s criteria. Suggest trying
                   different food types, areas, or prices. Respond in a coherent, non-numbered, and fluent text format
                   based only on the conversation history, state, and database information.
                    State: restaurant {"food": "mexican", "area": "centre"}
                   Database: restaurant {Count:0}
   User:           I am looking for a mexican restaurant in the centre.
   TOD:            I’m sorry, but I couldn’t find any mexican restaurants in the centre. However, there may be other
                   options available. Would you like me to suggest some other types of cuisine or different areas within
                   the centre?
   Truth:          I’m sorry, there aren’t any Mexican restaurants in the centre of town. Would you like to try a different
                   cuisine?

   Example 2:
   Prompt:         Assist the user in finding a restaurant. No restaurant matches the user’s criteria. Suggest trying
                   different food types, areas, or prices. Respond in a coherent, non-numbered, and fluent text format
                   based only on the conversation history, state, and database information.
                    State: restaurant {"food": "venetian", "area": "centre"}
                   Database: restaurant {Count:0}
   User:           I am looking at a place to eat that serves venetian food in the centre.
   TOD:            I apologize, but it seems that there are no restaurants in our database that serve Venetian food in the
                   centre at the moment. However, I can suggest a few alternatives for you. You might consider trying
                   some other types of cuisine such as Italian, seafood, or Mediterranean, which are commonly found in
                   the centre. Alternatively, you could try searching for Venetian restaurants in nearby areas or expanding
                   your search to include restaurants that are not strictly in the centre.
   Truth:          Unfortunately, I don’t see any restaurants that serve Venetian in the centre of town. Would you like to
                   try a different area or type of cuisine?
Table 7
Showcasing the ability of the system to generate diverse responses for a similar scenario.




    Prompt:      Assist the user in finding a restaurant. Numerous restaurants match the user’s criteria. Ask for missing
                 details of food type, area, or price range to narrow down the search before providing the restaurant
                 options. Respond in a coherent, non-numbered, and fluent text format based only on the conversation
                 history, state, and database information.
                  State: restaurant {"food": "modern european", "pricerange": "moderate"}
                  Database: restaurant {Count:5, Options: {"address": "152 - 154 Hills Road", "area": "south", "food": "modern european", "id": "14731", "name": "restaurant alimentum", "phone": "01223413000", "postcode": "cb28pb", "pricerange": "moderate", "signature": "slowroast sirloin of beef red onion celeriac and garlic", "type": "restaurant"}, {"address": "83 Regent Street", "area": "centre", "food": "modern european", "id": "3697", "name": "de luca cucina and bar", "phone": "01223356666", "postcode": "cb21aw", "pricerange": "moderate", "signature": "roasted barbary duck breast served with sweet potato wedges and mange tout with a red wine sauce", "type": "restaurant"}
    User:        Doesn’t really matter. Is there one in the centre?
    TOD:         Yes, there is a modern European restaurant called “De Luca Cucina and Bar" located in the centre at 83
                 Regent Street. They offer a variety of dishes including their signature dish of roasted Barbary duck
                 breast served with sweet potato wedges and mange tout with a red wine sauce. The price range is
                 moderate. Would you like more information about this restaurant?
    Truth:       Yes, there are 2: De Luca Cucina and Bar & Riverside Brasserie. Would you like further information?
Table 8
Showcasing the ability of the system to adapt to an NLU error.
   Example 1:
   User:          Are there restaurants that serve Mediterranean food?
   TOD:           Yes, there are restaurants that serve Mediterranean food. The options available are “La Mimosa" and
                  “Shiraz Restaurant." Both of them are located in the center area of town. “La Mimosa" is situated on
                  Thompsons Lane in Fen Ditton, and you can reach them at 01223362525. On the other hand, “Shiraz
                  Restaurant" is located at 84 Regent Street in the City Centre, with a contact number of 01223307581.
                  Both restaurants offer Mediterranean cuisine and fall under the expensive price range. Let me know if
                  you need any additional information!
   Truth:         I have a couple. Both are in the centre. la mimosa, shiraz restaurant.

   Example 2:
   User:          Hi, I’m looking for a restaurant with modern european food. It should be in the moderate price range.
   TOD:           Sure, I can help you with that. Can you please provide me with any specific area you prefer or is it fine
                  if I search for restaurants in any area?
   Truth:         What area are you wanting to be in?

   Example 3:
   User:          Thank you so much. That is all I need for now. Have a great day.
   TOD:           You’re welcome! If you have any more questions or need further assistance, feel free to ask. Have a
                  great day too!
   Truth:         Thank you. Goodbye.
Table 9
Showcasing the ability of the system to generate polite and engaging responses.