<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Prompting: Large Language Models for Task Oriented Dialog</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Nehring</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akhil Juneja</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adnan Ahmad</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roland Roller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietrich Klakow</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI), Alt-Moabit 91c</institution>
          ,
          <addr-line>10559 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saarland University</institution>
          ,
          <addr-line>Campus, 66123 Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>TU Berlin</institution>
          ,
          <addr-line>Straße des 17. Juni 135, 10623 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Large Language Models show impressive results in many different applications, most notably in the context of question answering and open dialog. However, it is still an open question how to use those models for task-oriented dialogs, such as booking or customer information systems. In this work, we propose Dynamic Prompting, an architecture for task-oriented dialog that integrates the benefits of Large Language Models, and we showcase the approach on the MultiWOZ 2.2 dataset. Our architecture leads to a high task success rate, provides sensible and specific answers, and is resistant to hallucinations. Further, we show that Dynamic Prompting is able to answer questions that were not anticipated by the dialog system's designer and that it can correct several types of errors and other characteristics of the system.</p>
      </abstract>
      <kwd-group>
<kwd>Dialog Systems</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Task-Oriented Dialog</kwd>
        <kwd>Dynamic Prompting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>
          Task-Oriented Dialog Systems (TODS) assist users in
completing a task within a conversation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], for instance,
in the context of customer information and bookings
(train/restaurant). In an applied setting with real users, it
is important that those systems provide correct answers,
that tasks can be solved quickly, and that they ideally lead
to high user satisfaction. To ensure this, TODS often
provide system developers with a high level of control over
dialog management and answer behavior. Existing
solutions normally either implement a dialog manager
manually to control the complete interaction, or train it
on large amounts of dialog interactions [
        <xref ref-type="bibr" rid="ref2">2, 3, 4, 5</xref>
        ].
      </p>
      <p>In contrast, Large Language Models (LLMs) are very
good at open-domain dialog and provide fluent and
convincing messages in different styles. However, those
answers might be misleading or even false (hallucination)
[6, 7, 8]. In task-oriented dialog, the model could
possibly ‘break out’ of the given dialog task.</p>
      <p>Using LLMs for task-oriented dialog is still in its
infancy. Madotto et al. [9] used LLMs for the whole pipeline
of Natural Language Understanding, Dialog State
Tracking, Dialog Policy and Natural Language Generation.
Hudeček and Dusek [10] expand on this idea by
evaluating the abilities of LLMs to generate complete
task-oriented multi-turn dialogs. They also used LLMs for NLU
and DST but, unlike our work, they used a static prompt.
Other approaches to LLMs for task-oriented dialog are
presented by Cao [11], Hu et al. [12], Wei et al. [13],
and Li et al. [14].</p>
      <p>To address those limitations and concerns, we propose
Dynamic Prompting, a technique to combine a traditional
task-oriented dialog system pipeline with the benefits of
LLMs. Showcased and tested in the context of restaurant
booking, we present the advantages and limitations of
our approach.</p>
      <sec id="sec-1-1">
        <title>2. Experimental Setup</title>
        <sec id="sec-1-2">
          <title>2.1. Dataset</title>
          <p>For our experiments, we use the Multi-domain
Wizard-of-Oz 2.2 (MultiWOZ) dataset [15, 16]. The dataset contains
conversations between users and task-oriented chatbots
that were collected using the Wizard-of-Oz [17] method.
For our setup, we only use conversations from the
restaurant domain. In the conversations, users ask chatbots
about restaurants in Cambridge, England. The systems
can recommend various restaurants based on attributes
such as the type of food, location or price range. Also,
the system can perform bookings.</p>
          <p>Table 1 shows an example dialog from the dataset.
The restaurant domain encompasses two intents (find,
book) and ten slots. For each user utterance, the dataset
contains annotations for intent, slots in the utterance and
dialog state (detected slots over the whole dialog).</p>
          <p>Table 1 (example dialog):
User: Can you find me an expensive place serves panasian food?
Bot: i haven’t found any expensive panasain restaurants. should i try another price range?
User: Are there restaurants that serve Mediterranean food?
Bot: I have a couple. Both are in the centre. la mimosa, shiraz restaurant
User: I am interested in La Mimosa. Please book a table for 7 at 15:15 on Wednesday. Thanks for your help!
Bot: Booking was successful. The table will be reserved for 15 minutes. Reference number is: ER5DFE50. Is there anything else I can help you with?
User: No, that’s it. Thank you so much. Goodbye.
Bot: Have a great day!</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
<title>2.2. Dialog System Architecture</title>
        <p>
          Task-Oriented Dialog Systems (TODS) traditionally
consist of the following main components [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: Natural
Language Understanding (NLU), Dialog State Tracking
(DST), Dialog Manager (DM) and Response Generation
(RG). The NLU performs intent detection and slot filling
on the current user utterance, whereas DST keeps track
of the slots over the whole dialog. The DM selects the
next action of the bot based on the results of NLU and
DST. Lastly, RG constructs the response to the user.
Dynamic Prompting. In the following, we introduce
Dynamic Prompting, a TODS architecture extended by the
capabilities of an LLM. Figure 1 shows the architecture.
We chose to use a trained model for the NLU component
to handle intent recognition and entity extraction, as
Hudeček and Dusek [10] highlighted the limited
performance of LLMs in these tasks. For NLU, we use the RASA
NLU component, powered by the DIET classifier [ 18],
while for DST, we use a simple hashmap that stores the
most recent NLU results. We trained the NLU component
on user utterances only from the restaurant domain of the
train split of the dataset, focusing on the find and book
intents. During training, we also provided use case-specific
entities, including categorical (pricerange, area,
bookday, bookpeople), as well as non-categorical (food, name,
booktime, address, phone, postcode, and reference).
        </p>
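<p>The hashmap-based DST described above can be sketched as follows; the NLU output format (an intent plus an entity list) is an assumption modeled on RASA-style output, not the system's actual data structures:</p>
<p>
```python
# Sketch of the hashmap DST described above: the dialog state simply
# stores the most recent NLU value for each slot, overwriting older ones.
def update_dialog_state(state: dict, nlu_result: dict) -> dict:
    """Overwrite each detected slot with its most recent value."""
    for entity in nlu_result.get("entities", []):
        state[entity["entity"]] = entity["value"]
    return state

state = {}
turn1 = {"intent": "find_restaurant",
         "entities": [{"entity": "food", "value": "mediterranean"},
                      {"entity": "pricerange", "value": "expensive"}]}
turn2 = {"intent": "book_restaurant",
         "entities": [{"entity": "bookpeople", "value": "7"},
                      {"entity": "booktime", "value": "15:15"}]}
update_dialog_state(state, turn1)
update_dialog_state(state, turn2)
print(state)  # slots from both turns accumulate in the state
```
</p>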
        <p>We replace/extend the Dialog Manager and Response
Generation with a Prompt Generation and an LLM.
The prompt generation uses input from the DST and NLU
and uses a series of rules, similar to a dialog manager.
Also, it fetches data from a database and generates a
prompt. It uses prompting templates that consist of three
parts: 1) A general task description, 2) content from the
database, NLU states in JSON format, and 3) the previous
conversation in a theater script style. Table 2 shows an
example prompt of our system in the restaurant domain
with the detected intent ‘book restaurant’. The prompt is
then sent to an LLM to generate the corresponding system reply.
In our experiment, we use GPT-3.5-Turbo (ChatGPT) by
accessing the model via API calls, as it has demonstrated
leading performance in the results presented by Hudeček
and Dusek [10].</p>
        <p>Appendix A shows the prompting templates and the
rules that we developed for our prompt generator. Our
prompt construction approach involved multiple
iterations of a ‘trial and error’ process on the training set,
evaluating their effectiveness based on the system’s task
completion and relevance to the conversation. Initially,
we introduced single instructions in the prompts.
However, scenarios such as having no available restaurants,
multiple options, or booking a restaurant required more
specific instructions. This led us to implement dynamic
prompts with tailored rules for each situation.</p>
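<p>The rule-based prompt generation described above can be sketched as follows; the template wording is abridged from Appendix A, while the function and variable names are our own illustration, not the authors' code:</p>
<p>
```python
# Sketch of dynamic prompt generation: a rule selects an instruction
# template from the detected intent and the number of database matches,
# then the State/Database/Conversation History parts are appended.
import json

FIND_TEMPLATES = [  # (minimum database size, abridged instruction)
    (5, "Assist the user in finding a restaurant. Numerous restaurants match the "
        "user's criteria. Ask for missing details to narrow down the search."),
    (2, "Assist the user in finding a restaurant. Multiple restaurants match the "
        "user's criteria. Provide requested details from the available options."),
    (1, "Assist the user in finding a restaurant. A single restaurant matches the "
        "user's criteria. Provide requested details of the restaurant."),
    (0, "Assist the user in finding a restaurant. No restaurant matches the user's "
        "criteria. Suggest trying different food types, areas, or prices."),
]
BOOK_TEMPLATE = ("Assist the user in booking a restaurant. Always assume the "
                 "restaurant is available to confirm a successful booking. Provide "
                 "a reference number when name, bookday, bookpeople, and booktime "
                 "are given.")

def build_prompt(intent, state, db_results, history):
    """Assemble the three prompt parts: instruction, state/database, history."""
    parts = []
    if intent == "book_restaurant":
        parts.append(BOOK_TEMPLATE)  # booking prompts omit the database section
        parts.append("Dialog State: " + json.dumps(state))
    else:
        # first rule whose minimum database size is satisfied wins
        parts.append(next(t for n, t in FIND_TEMPLATES if len(db_results) >= n))
        parts.append("Dialog State: " + json.dumps(state))
        parts.append("Database: " + json.dumps(
            {"Count": len(db_results), "Options": db_results}))
    parts.append("Conversation History:\n" + "\n".join(history))
    return "\n".join(parts)

prompt = build_prompt("find_restaurant",
                      {"food": "mediterranean"},
                      [{"name": "la mimosa"}, {"name": "shiraz restaurant"}],
                      ["User: Are there restaurants that serve Mediterranean food?"])
```
</p>
<p>The resulting string would then be sent to the LLM as a single prompt; in the paper's setup, the NLU/DST results fill the state, and the database lookup fills the options.</p>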
        <sec id="sec-1-4-1">
          <title>Assist the user in booking a restaurant. Always assume</title>
          <p>the restaurant is available to confirm a successful
booking. Provide a reference number when the restaurant
name, bookday, bookpeople, and booktime are given.</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>Prompt if these details are missing. Omit information about fictional bookings.</title>
        </sec>
        <sec id="sec-1-4-3">
<title>Dialog State: {"food": "mediterranean", "pricerange": "expensive", "bookday": "wednesday", "bookpeople": "7", "booktime": "15:15"}</title>
        </sec>
        <sec id="sec-1-4-4">
          <title>Conversation History</title>
        </sec>
        <sec id="sec-1-4-5">
          <title>User: Can you find me an expensive place serves panasian food?</title>
        </sec>
        <sec id="sec-1-4-6">
          <title>Bot: i haven’t found any expensive panasian restaurants. should i try another price range?</title>
        </sec>
        <sec id="sec-1-4-7">
          <title>User: Are there restaurants that serve Mediterranean food?</title>
        </sec>
        <sec id="sec-1-4-8">
          <title>Bot: I have a couple. Both are in the centre. la mimosa, shiraz restaurant</title>
        </sec>
        <sec id="sec-1-4-9">
          <title>User: I am interested in La Mimosa. Please book a table for 7 at 15:15 on Wednesday. Thanks for your help!</title>
          <p>2.3. Evaluation
In our experiment, the chatbot generates a response using
our dynamic prompting system for each dialog turn. We
evaluated its performance on the test split of the dataset’s
restaurant domain. To evaluate the responses on different
levels, two human annotators labeled them according to
the following criteria. The annotation guidelines can be
found in the supplementary materials.</p>
          <p>• Task Success Rate describes the ratio of
successful dialogs to the total number of dialogs.
Following the definition of Wen et al. [19] and Nekvinda
and Dušek [20], we mark a dialog as successful
if 1) the system’s recommendation aligns with
the user’s requests (such as price range, location,
and cuisine) and 2) the system adequately
addressed subsequent requests by the user, such as
providing the telephone number or confirming a
successful booking.
• Prompt Instruction Performance, a binary
scale to assess whether responses aligned with
the prompt instructions.
• Information Extraction Performance, a
binary scale to assess whether the system fetches
the relevant information (e.g., an address) from
the JSON data in the prompt.
• Response Slot Accuracy, the ratio of correctly
predicted slot values to the number of slot
values in the response. It measures whether our system
returns all desired slots to the user. For these
metrics, we compute ratios across all annotated
turns.
• Sensibleness describes if the utterance makes
sense given the context [21, 22].
• Specificity describes if the utterance is specific
regarding the context [21, 22]. LMs tend to
generate unspecific answers such as “this is great”,
which are sensible but not desired.
• Interestingness describes if the utterance
captures someone’s attention, arouses curiosity or
exhibits traits such as unexpectedness, wit, or
insightfulness [22]. Interestingness contributes to
a compelling and engaging user experience.</p>
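<p>As an illustration of the Response Slot Accuracy metric defined above, the following sketch computes the ratio over annotated turns; the data layout and the case-insensitive substring matching are our own simplifying assumptions:</p>
<p>
```python
# Sketch of Response Slot Accuracy: correctly returned slot values
# divided by the number of expected slot values, over all annotated turns.
def response_slot_accuracy(turns):
    correct = total = 0
    for expected_values, response in turns:
        for value in expected_values:
            total += 1
            if value.lower() in response.lower():  # simplistic match
                correct += 1
    return correct / total if total else 0.0

turns = [
    (["la mimosa", "centre"], "La Mimosa is in the centre."),
    (["ER5DFE50", "15:15"], "Your table at 15:15 is booked."),  # reference missing
]
print(response_slot_accuracy(turns))  # 3 of 4 expected slot values -> 0.75
```
</p>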
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Results</title>
      <p>Table 3 shows the task success rate of our system
compared to other TODS on the MultiWOZ 2.2 dataset.
Although the other systems use the whole dataset and, thus,
are not perfectly comparable to ours, it still shows that
Dynamic Prompting has a similar performance compared
to SOTA systems. This is remarkable, particularly as we
use a relatively simple NLU component, which by itself
might produce errors. However, if we do not use the
NLU system of our pipeline but instead use the entity
annotations from the dataset, we get a ‘perfect’ NLU
without any errors. In this case, Dynamic Prompting
achieves a task success rate of 0.94, which highlights
the efficiency of the LLM solution.</p>
<p>Table 5 shows further performance metrics. The dialog
success rate is supported by the high sensibleness and
specificity scores, which indicate that the system answers
on point and does not deviate from the dialog’s goal.
However, the response slot accuracy is only 80% and needs
to be improved, but this is not the focus of this work.
Extracting information from the database works almost
perfectly (Information Extraction Performance=0.98).
Although the system does not always follow all instructions
from the prompt (Prompt Instruction Performance=0.82),
the task success is still quite high, so we assume that only
minor errors cause the relatively low Prompt Instruction
Performance.</p>
      <sec id="sec-2-0">
        <title>3.1. Qualitative Analysis</title>
        <p>In the following, we analyze the conversations and,
particularly, the generated responses of our Dynamic
Prompting in more detail.</p>
      </sec>
      <sec id="sec-2-1">
        <title>3.1.1. Handling Unusual Requests</title>
<p>In one situation, the user asked to send the information via
email, which the designers of the original dataset did not
anticipate. In such situations, traditional dialog systems
can only answer with “I did not understand”. Our
approach instead was able to produce a sensible response,
although it was never trained for this case (see Table 4).</p>
      </sec>
      <sec id="sec-2-2">
        <title>System</title>
<p>Yang et al. [23], Lee [3], Su et al. [24]</p>
        <sec id="sec-2-2-1">
<title>Dynamic Prompting, perfect NLU + Dynamic Prompting</title>
<p>Postcodes appear in the database in formats such as
“cb17aa”, although the correct format would be
“CB1 7AA” in the Cambridge area. Our approach
consistently fixes these errors out of the box.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.1.4. Diverse Responses</title>
<p>Table 3: Comparison of task success rates on MultiWOZ 2.2
data, with an inter-annotator agreement of 1 for Dynamic
Prompting.</p>
        <p>Dynamic Prompting produces responses that are, on
average, 2.41 times longer and more diverse than the
responses of the crowd workers in the WOZ dataset, with
lexical diversity measured by an MTLD score [25] of 80.41,
compared to 72.26 for the WOZ dataset. We assume that
the crowd workers were interested in providing fast and
minimalistic answers. However, while diverse replies
might be considered positive, as they make the
interaction more enjoyable, it remains to be determined whether
long answers are preferable to short answers containing
all necessary information. Table 7 in the appendix shows
examples of this behavior.</p>
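<p>The MTLD score [25] reported above can be approximated as follows; this is a simplified sketch (partial final factors are ignored, and fully diverse texts fall back to the token count) rather than the reference implementation:</p>
<p>
```python
# Sketch of MTLD: count how many sequential "factors" it takes for the
# running type-token ratio (TTR) to fall below 0.72, divide the token
# count by the factor count, and average forward and reversed passes.
def mtld_one_direction(tokens, threshold=0.72):
    factors = 0
    window = []
    for tok in tokens:
        window.append(tok)
        ttr = len(set(window)) / len(window)
        if ttr < threshold:   # diversity exhausted: close a factor
            factors += 1
            window = []
    if factors == 0:          # TTR never dropped; simplistic fallback
        return float(len(tokens))
    return len(tokens) / factors

def mtld(tokens):
    forward = mtld_one_direction(tokens)
    backward = mtld_one_direction(list(reversed(tokens)))
    return (forward + backward) / 2

repetitive = "thank you thank you thank you".split()
diverse = "the chef serves fresh seasonal dishes nightly".split()
print(mtld(repetitive), mtld(diverse))  # repetitive text scores lower
```
</p>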
      </sec>
      <sec id="sec-2-4">
        <title>3.1.2. Politeness and Engagement</title>
        <p>Similar to our findings in Section 3.1.4, the responses of
our system are not only longer but also more engaging
compared to the ground truth. For example, in one
situation, our system produced an answer such as “You’re
welcome! If you have any more questions or need further
assistance, feel free to ask. Have a great day too!” while
the crowd worker wrote only “Thank you. Goodbye”.
Overall, we counted ‘polite’ phrases in the responses and
found out that dynamic prompting uses them more often
than the ground truth, such as “enjoy your meal” (15.5
more often), “have a great day” (2.2), “you’re welcome”
(4.8), “certainly!” (61.0), “great!” (20.0). Table 9 in the
appendix shows more detailed examples.</p>
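<p>The polite-phrase comparison above can be sketched as a simple frequency ratio between system and ground-truth responses; the toy responses below are invented for illustration:</p>
<p>
```python
# Sketch of the polite-phrase count: how often a phrase occurs in the
# system's responses relative to the ground-truth responses.
def phrase_ratio(phrase, system_responses, truth_responses):
    sys_count = sum(r.lower().count(phrase) for r in system_responses)
    truth_count = sum(r.lower().count(phrase) for r in truth_responses)
    return sys_count / truth_count if truth_count else float("inf")

system = ["You're welcome! Have a great day too!",
          "Certainly! Have a great day!"]
truth = ["Thank you. Goodbye.",
         "Have a great day!"]
print(phrase_ratio("have a great day", system, truth))  # 2.0
```
</p>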
      </sec>
      <sec id="sec-2-5">
        <title>3.1.3. Formatting Addresses and Names</title>
<p>The database entries are formulated in a different
format: names are often lowercase, and the crowd workers
did not correct this issue when they wrote the system
responses. Also, postcodes are stored in the format “cb17aa”.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3.1.5. Adaptive NLU Error Recovery</title>
        <p>We detected a user dialog in which the NLU failed to
detect the correct area. The user asked for a restaurant
in the center of the city, but due to the NLU error, the
prompt generator wrote two restaurants in the prompt,
one in the center and one in the south. The LLM was
able to detect that the user wanted a restaurant in the
center and presented only the correct restaurant in its
response (see Table 8 in the Appendix).</p>
<sec id="sec-2-6-1">
          <title>3.1.6. Error Analysis</title>
          <p>Our system produced two types of problems. First, it was
not able to display the total count of available restaurants
correctly. Although the total count of available
restaurants is encoded in the JSON prompt, the LLM opted to
count the number of elements within the JSON options
array instead.</p>
          <p>The second common problem was the generation of
fake booking reference numbers. At the end of the dialog,
users often ask for a booking reference number. We do
not include booking reference numbers in the prompt;
therefore, the system hallucinated them. We evaluated
including reference numbers in JSON format in the
database section of the prompt on ten samples, which
eliminated all occurrences of fake reference number
generation in this small sample set. Still, we did not
evaluate this on the whole dataset, so our results serve
only as an indicator that this approach could solve the
problem of hallucinated booking reference numbers.</p>
        </sec>
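<p>The mitigation evaluated above, putting a real reference number into the database section of the prompt so the LLM can copy it rather than invent one, can be sketched as follows; the field names are our own illustration:</p>
<p>
```python
# Sketch of the reference-number mitigation: ground the booking
# reference in the prompt's database section instead of leaving the
# LLM to hallucinate one.
import json

def db_section(restaurant, reference=None):
    entry = dict(restaurant)
    if reference is not None:
        entry["reference"] = reference  # grounded value the LLM may quote
    return "Database: " + json.dumps({"Count": 1, "Options": entry})

restaurant = {"name": "la mimosa", "area": "centre"}
print(db_section(restaurant, reference="ER5DFE50"))
```
</p>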
      </sec>
      <sec id="sec-2-7">
<title>Table 4: Example of an unanticipated user request (User, TOD response, Ground truth)</title>
<p>Prompt: Assist the user in finding a restaurant. A single restaurant matches the user’s criteria. Provide requested
details of the restaurant. If no specific request is made, offer general information like food type, area,
and price range. Respond in a coherent, non-numbered, and fluent text format based only on the
conversation history, state, and database information.</p>
        <p>State: restaurant {"area": "centre", "food": "italian", "name": "stazione restaurant and coffee bar",
"bookday": "saturday", "bookpeople": "1", "booktime": "12:45"}
Database: restaurant {"Count": 1, "Options": {"address": "Market Hill City Centre", "area": "centre", "food":
"italian", "id": "19240", "name": "stazione restaurant and coffee bar", "phone": "01223352607", "postcode":
"cb23nj", "pricerange": "expensive", "type": "restaurant"}}</p>
        <sec id="sec-2-7-1">
          <title>Terrific! You are great, please send me all this information to my e-mail, thanks.</title>
<p>I apologize, but as a language model AI, I don’t have the capability to send emails. However, you can
easily save or take a screenshot of this conversation for future reference. Remember to have a great
time at stazione restaurant and coffee bar on Saturday at 12:45!</p>
        </sec>
        <sec id="sec-2-7-2">
          <title>You’re welcome, was there anything else you need?</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
<p>We presented Dynamic Prompting, a technique
integrating LLMs into task-oriented dialog. The results show
high sensibleness and specificity values, which indicate
that the system answers on point and does not deviate
from the dialog’s goal. Despite the comparatively low
Prompt Instruction Performance and Response Slot
Accuracy values, the system still achieves excellent task
success. The Information Extraction Performance of 0.98
shows that the system can very well reuse the database
information embedded in the prompt in JSON format,
and indicates that the LLM follows the task-oriented
guidance of the dynamic prompts.</p>
      <p>In addition, our system shows various ways to correct
errors, such as NLU errors, user requests not anticipated
by the dialog system’s designer, and errors in the format of the
database entries. Moreover, the generated system
answers are more diverse (Section 3.1.4) and more polite
(Section 3.1.2) than the human-generated responses in
the dataset. We would like to examine these qualitative
results in future research in a more quantitative way.</p>
      <p>Overall, we find that the widespread problem of
hallucinations in LLMs is not an issue in our system as long
as we present the correct information to the LLM. As
soon as the user asks the system for information that is
not present in the prompt, such as the booking reference
numbers, the LLM starts to hallucinate.</p>
      <p>Although we assess the system’s performance solely
on the restaurant domain, the dynamic prompting
method can be extended to other domains in the
MultiWOZ 2.2 dataset, such as hotel, taxi, and train. Expanding
to new domains will require updating the prompt
generation module to accommodate new intents and state
values, ensuring smooth integration with these additional
domains.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work has been supported by the Federal Joint
Committee of Germany (Gemeinsamer Bundesausschuss) as
part of the project smartNTX (01NVF21116).
</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[2] … policy injection, Proceedings of the AAAI Conference on Artificial Intelligence (2022).</p>
      <p>[3] Y. Lee, Improving end-to-end task-oriented dialog system with a simple auxiliary task, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 1296–1303. URL: https://aclanthology.org/2021.findings-emnlp.112. doi:10.18653/v1/2021.findings-emnlp.112.</p>
      <p>[4] H. Sun, J. Bao, Y. Wu, X. He, Mars: Modeling context &amp; state representations with contrastive learning for end-to-end task-oriented dialog, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 11139–11160. URL: https://aclanthology.org/2023.findings-acl.708. doi:10.18653/v1/2023.findings-acl.708.</p>
      <p>[5] Q. Wu, D. Alnuhait, D. Chen, Z. Yu, Using textual interface to align external knowledge for end-to-end task-oriented dialogue systems, 2023. arXiv:2305.13710.</p>
      <p>[6] W. Sun, Z. Shi, S. Gao, P. Ren, M. de Rijke, Z. Ren, Contrastive learning reduces hallucination in conversations, in: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23, AAAI Press, 2023. URL: https://doi.org/10.1609/aaai.v37i11.26596. doi:10.1609/aaai.v37i11.26596.</p>
      <p>[7] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, P. Fung, A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity, in: J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, A. A. Krisnadhi (Eds.), Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Nusa Dua, Bali, 2023, pp. 675–718. URL: https://aclanthology.org/2023.ijcnlp-main.45. doi:10.18653/v1/2023.ijcnlp-main.45.</p>
      <p>[8] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.</p>
      <p>[9] A. Madotto, Z. Liu, Z. Lin, P. Fung, Language models as few-shot learner for task-oriented dialogue systems, 2020. arXiv:2008.06239.</p>
      <p>[10] V. Hudeček, O. Dusek, Are large language models all you need for task-oriented dialogue?, in: S. Stoyanchev, S. Joty, D. Schlangen, O. Dusek, C. Kennington, M. Alikhani (Eds.), Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, Prague, Czechia, 2023, pp. 216–228. URL: https://aclanthology.org/2023.sigdial-1.21. doi:10.18653/v1/2023.sigdial-1.21.</p>
      <p>[11] L. Cao, DiagGPT: An LLM-based and multi-agent dialogue system with automatic topic management for flexible task-oriented dialogue, 2024. arXiv:2308.08043.</p>
      <p>[12] Z. Hu, Y. Feng, Y. Deng, Z. Li, S.-K. Ng, A. T. Luu, B. Hooi, Enhancing large language model induced task-oriented dialogue systems through look-forward motivated goals, 2023. arXiv:2309.08949.</p>
      <p>[13] J. Wei, S. Kim, H. Jung, Y.-H. Kim, Leveraging large language models to power chatbots for collecting user self-reported data, 2023. arXiv:2301.05843.</p>
      <p>[14] Z. Li, B. Peng, P. He, M. Galley, J. Gao, X. Yan, Guiding large language models via directional stimulus prompting, in: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 62630–62656. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/c5601d99ed028448f29d1dae2e4a926d-Paper-Conference.pdf.</p>
      <p>[15] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, M. Gašić, MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 5016–5026. URL: https://aclanthology.org/D18-1547. doi:10.18653/v1/D18-1547.</p>
      <p>[16] X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, J. Chen, MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines, in: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Association for Computational Linguistics, Online, 2020, pp. 109–117. URL: https://aclanthology.org/2020.nlp4convai-1.13. doi:10.18653/v1/2020.nlp4convai-1.13.</p>
      <p>[17] J. F. Kelley, An iterative design methodology for user-friendly natural language office information applications, ACM Trans. Inf. Syst. 2 (1984) 26–41. URL: https://doi.org/10.1145/357417.357420. doi:10.1145/357417.357420.</p>
      <p>[18] T. Bunk, D. Varshneya, V. Vlasov, A. Nichol, DIET: Lightweight language understanding for dialogue systems, 2020. arXiv:2004.09936.</p>
      <p>[19] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. Rojas-Barahona, P.-H. Su, S. Ultes, S. Young, A network-based end-to-end trainable task-oriented dialogue system, in: M. Lapata, P. Blunsom, A. Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 438–449. URL: https://aclanthology.org/E17-1042.</p>
      <p>[20] T. Nekvinda, O. Dušek, Shades of BLEU, flavours of success: The case of MultiWOZ, in: A. Bosselut, E. Durmus, V. P. Gangal, S. Gehrmann, Y. Jernite, L. Perez-Beltrachini, S. Shaikh, W. Xu (Eds.), Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), Association for Computational Linguistics, Online, 2021, pp. 34–46. URL: https://aclanthology.org/2021.gem-1.4. doi:10.18653/v1/2021.gem-1.4.</p>
      <p>[21] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, Q. V. Le, Towards a human-like open-domain chatbot, 2020. arXiv:2001.09977.</p>
      <p>[22] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al., LaMDA: Language models for dialog applications, arXiv preprint arXiv:2201.08239 (2022).</p>
      <p>[23] Y. Yang, Y. Li, X. Quan, UBAR: Towards fully end-to-end task-oriented dialog system with GPT-2, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, AAAI Press, 2021, pp. 14230–14238. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17674.</p>
      <p>[24] Y. Su, L. Shu, E. Mansimov, A. Gupta, D. Cai, Y.-A. Lai, Y. Zhang, Multi-task pre-training for plug-and-play task-oriented dialogue system, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4661–4676. URL: https://aclanthology.org/2022.acl-long.319. doi:10.18653/v1/2022.acl-long.319.</p>
      <p>[25] P. McCarthy, S. Jarvis, MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment, Behavior Research Methods 42 (2010) 381–392. doi:10.3758/BRM.42.2.381.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix</title>
      <p>A. Prompting templates: Table 6 shows the prompting templates that we developed for our system.</p>
      <p>B. Example Conversations: Tables 7–9 show different example conversations.</p>
      <p>intent == find_restaurant:
len(database) ≥ 5:</p>
      <p>Assist the user in finding a restaurant. Numerous restaurants match the user’s criteria. Ask for
missing details of food type, area, or price range to narrow down the search before providing the
restaurant options. Respond in a coherent, non-numbered, and fluent text format based only on the
conversation history, state, and database information.</p>
      <p>{State + Database + Conversation History}
len(database) ≥ 1:</p>
      <p>Assist the user in finding a restaurant. Multiple restaurants match the user’s criteria. Provide requested details from the available restaurant options. If no specific request is made, offer general information like food type, area, and price range. Respond in a coherent, non-numbered, and fluent text format based only on the conversation history, state, and database information.</p>
      <p>{State + Database + Conversation History}
len(database) == 1:</p>
      <p>Assist the user in finding a restaurant. A single restaurant matches the user’s criteria. Provide requested details of the restaurant. If no specific request is made, offer general information like food type, area, and price range. Respond in a coherent, non-numbered, and fluent text format based only on the conversation history, state, and database information.</p>
      <p>{State + Database + Conversation History}
len(database) == 0:</p>
      <p>Assist the user in finding a restaurant. No restaurant matches the user’s criteria. Suggest trying different food types, areas, or prices. Respond in a coherent, non-numbered, and fluent text format based only on the conversation history, state, and database information.</p>
      <p>{State + Database + Conversation History}
intent == book_restaurant:</p>
      <p>Assist the user in booking a restaurant. Always assume the restaurant is available to confirm a
successful booking. Provide a reference number when the restaurant name, bookday, bookpeople, and
booktime are given. Prompt if these details are missing. Omit information about fictional bookings.
{State + Conversation history}
intent == none:</p>
      <p>You’re assisting the user in finding and booking a restaurant. Provide helpful information and
guidance based on their requests.</p>
      <p>{Conversation history}</p>
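      <p>The intent- and database-conditioned dispatch above can be sketched as a small lookup keyed on the intent and the number of database matches; the selected template is then filled with the turn context. A minimal illustration (the function and dictionary names are ours, and the prompt texts are abbreviated stand-ins for the full prompts above):</p>

```python
# Sketch of dynamic prompt selection: the template is chosen from the
# predicted intent and the size of the database result, then filled with
# the current state, database entries, and conversation history.
# All identifiers are illustrative; prompt texts are abbreviated.

TEMPLATES = {
    ("find_restaurant", "many"): (
        "Assist the user in finding a restaurant. Numerous restaurants match "
        "the user's criteria. Ask for missing details to narrow down the "
        "search.\n{state}\n{database}\n{history}"
    ),
    ("find_restaurant", "several"): (
        "Assist the user in finding a restaurant. Multiple restaurants match "
        "the user's criteria. Provide requested details.\n"
        "{state}\n{database}\n{history}"
    ),
    ("find_restaurant", "one"): (
        "Assist the user in finding a restaurant. A single restaurant matches "
        "the user's criteria.\n{state}\n{database}\n{history}"
    ),
    ("find_restaurant", "none"): (
        "Assist the user in finding a restaurant. No restaurant matches the "
        "user's criteria. Suggest trying different food types, areas, or "
        "prices.\n{state}\n{database}\n{history}"
    ),
    ("book_restaurant", None): (
        "Assist the user in booking a restaurant. Provide a reference number "
        "when the restaurant name, bookday, bookpeople, and booktime are "
        "given.\n{state}\n{history}"
    ),
    ("none", None): (
        "You're assisting the user in finding and booking a restaurant.\n"
        "{history}"
    ),
}

def select_prompt(intent, database, state, history):
    """Build the LLM prompt for one dialog turn."""
    if intent == "find_restaurant":
        n = len(database)
        # Bucket the result count as in the template listing above.
        bucket = ("many" if n >= 5 else
                  "one" if n == 1 else
                  "several" if n > 1 else "none")
        template = TEMPLATES[(intent, bucket)]
    else:
        template = TEMPLATES[(intent, None)]
    return template.format(state=state, database=database, history=history)
```

      <p>At each turn, only the prompt matching the current situation is sent to the model, keeping the instruction short and specific to the dialog state.</p>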
      <p>Assist the user in finding a restaurant. No restaurant matches the user’s criteria. Suggest trying different food types, areas, or prices. Respond in a coherent, non-numbered, and fluent text format based only on the conversation history, state, and database information.</p>
      <sec id="sec-4-1">
        <title>State: restaurant {“food": “mexican", “area": “centre"}</title>
      </sec>
      <sec id="sec-4-2">
        <title>Database: restaurant {Count:0}</title>
      </sec>
      <sec id="sec-4-3">
        <title>I am looking for a mexican restaurant in the centre.</title>
        <p>I’m sorry, but I couldn’t find any mexican restaurants in the centre. However, there may be other options available. Would you like me to suggest some other types of cuisine or different areas within the centre?</p>
        <p>I’m sorry, there aren’t any Mexican restaurants in the centre of town. Would you like to try a different cuisine?</p>
        <p>Assist the user in finding a restaurant. No restaurant matches the user’s criteria. Suggest trying different food types, areas, or prices. Respond in a coherent, non-numbered, and fluent text format based only on the conversation history, state, and database information.</p>
      </sec>
      <sec id="sec-4-4">
        <title>State: restaurant {“food": “venetian", “area": “centre"}</title>
      </sec>
      <sec id="sec-4-5">
        <title>Database: restaurant {Count:0}</title>
      </sec>
      <sec id="sec-4-6">
        <title>I am looking at a place to eat that serves venetian food in the centre.</title>
        <p>I apologize, but it seems that there are no restaurants in our database that serve Venetian food in the
centre at the moment. However, I can suggest a few alternatives for you. You might consider trying
some other types of cuisine such as Italian, seafood, or Mediterranean, which are commonly found in
the centre. Alternatively, you could try searching for Venetian restaurants in nearby areas or expanding
your search to include restaurants that are not strictly in the centre.</p>
      <p>Unfortunately, I don’t see any restaurants that serve Venetian in the centre of town. Would you like to try a different area or type of cuisine?</p>
        <p>Assist the user in finding a restaurant. Numerous restaurants match the user’s criteria. Ask for missing
details of food type, area, or price range to narrow down the search before providing the restaurant
options. Respond in a coherent, non-numbered, and fluent text format based only on the conversation
history, state, and database information.</p>
      </sec>
      <sec id="sec-4-7">
        <title>State: restaurant {“food": “modern european", “pricerange": “moderate"}</title>
        <p>Database: restaurant {Count:5, Options: {“address": “152 - 154 Hills Road", “area": “south", “food":
“modern european", “id": “14731", “name": “restaurant alimentum", “phone": “01223413000", “postcode":
“cb28pb", “pricerange": “moderate", “signature": “slowroast sirloin of beef red onion celeriac and garlic",
“type": “restaurant"}, {“address": “83 Regent Street", “area": “centre", “food": “modern european", “id":
“3697", “name": “de luca cucina and bar", “phone": “01223356666", “postcode": “cb21aw", “pricerange":
“moderate", “signature": “roasted barbary duck breast served with sweet potato wedges and mange tout
with a red wine sauce", “type": “restaurant"}</p>
      </sec>
      <sec id="sec-4-8">
        <title>Doesn’t really matter. Is there one in the centre?</title>
        <p>Yes, there is a modern European restaurant called “De Luca Cucina and Bar" located in the centre at 83
Regent Street. They offer a variety of dishes including their signature dish of roasted Barbary duck
breast served with sweet potato wedges and mange tout with a red wine sauce. The price range is
moderate. Would you like more information about this restaurant?</p>
        <sec id="sec-4-8-1">
          <title>Truth:</title>
          <p>Yes, there are 2: De Luca Cucina and Bar &amp; Riverside Brasserie. Would you like further information?</p>
        </sec>
      </sec>
      <sec id="sec-4-9">
        <title>Are there restaurants that serve Mediterranean food?</title>
        <p>Yes, there are restaurants that serve Mediterranean food. The options available are “La Mimosa" and
“Shiraz Restaurant." Both of them are located in the center area of town. “La Mimosa" is situated on
Thompsons Lane in Fen Ditton, and you can reach them at 01223362525. On the other hand, “Shiraz
Restaurant" is located at 84 Regent Street in the City Centre, with a contact number of 01223307581.
Both restaurants offer Mediterranean cuisine and fall under the expensive price range. Let me know if
you need any additional information!</p>
      </sec>
      <sec id="sec-4-10">
        <title>I have a couple. Both are in the centre. la mimosa, shiraz restaurant.</title>
        <p>Hi, I’m looking for a restaurant with modern european food. It should be in the moderate price range.</p>
        <p>Sure, I can help you with that. Can you please provide me with any specific area you prefer or is it fine if I search for restaurants in any area?</p>
      </sec>
      <sec id="sec-4-11">
        <title>What area are you wanting to be in?</title>
      </sec>
      <sec id="sec-4-12">
        <title>Thank you so much. That is all I need for now. Have a great day.</title>
        <p>You’re welcome! If you have any more questions or need further assistance, feel free to ask. Have a great day too!</p>
      </sec>
      <sec id="sec-4-13">
        <title>Thank you. Goodbye.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <source>Speech and Language Processing (Third Edition draft)</source>
          , https://web.stanford.edu/~jurafsky/slp3/ed3bookfeb3_2024.pdf,
          <year>2024</year>
          . Accessed: 2024-3-10.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Si</surname>
          </string-name>
          , et al.,
          <article-title>Galaxy: A generative pre-trained model for task-oriented dialog with semi-supervised learning and explicit policy injection</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>