=Paper=
{{Paper
|id=Vol-3854/um-cir-2
|storemode=property
|title=Responses to Conversational Information Retrieval Clarifying Questions with User Emotions
|pdfUrl=https://ceur-ws.org/Vol-3854/um-cir-2.pdf
|volume=Vol-3854
|authors=Isin Su Ecevit,Lili Lu,Fabio Crestani
|dblpUrl=https://dblp.org/rec/conf/emtcir/EcevitLC24
}}
==Responses to Conversational Information Retrieval Clarifying Questions with User Emotions==
Isin Su Ecevit¹*, Lili Lu¹ and Fabio Crestani¹*
¹ Università della Svizzera italiana, Via Giuseppe Buffi 13, 6900, Lugano, Switzerland
Abstract
User simulation is an emerging area of research, aimed at improving system evaluation in conversational
information retrieval (CIR). However, current datasets for training such simulators lack emotional aspects, which
limits their effectiveness in replicating realistic user interactions. This paper presents an introductory study on
injecting emotions into an existing dataset in the context of clarifying questions (CQs) in CIR. Our goal is to
enhance user simulators by integrating user emotions into answers to CQs, thereby improving the diversity of
user responses. To this end, we use existing conversational data, Qulac, generating emotionally-aware responses
through large language models (LLMs), based on three distinct prompt formats: CO-STAR, TIDD-EC, and a
custom format. We report results for 13 different combinations of models and prompt formats, and evaluate the
generated responses in terms of emotional expression, naturalness, and usefulness. Our experiments provide
insights into the suitability of different prompt formats and LLMs for simulating user emotions.
Keywords
Conversational search, user simulation, modelling user emotion, evaluation
1. Introduction
Conversational information retrieval (CIR) aims to help users find information through natural language
interactions, focusing on understanding and fulfilling complex information needs by asking clarifying
questions, providing suggestions, and refining search results based on the ongoing conversation [1].
CIR has become a significant and rapidly expanding area in recent years, even resulting in dedicated
tracks [2, 3] and specialized workshops [4]. CIR emphasizes mixed-initiative interactions, where there
is a back-and-forth dialogue between the user and the system to understand and fulfill information
needs [5]. In these interactions, both the user and the system can take the lead in the conversation, and
the system builds a cumulative understanding of the user’s needs over multiple turns.
In information retrieval (IR), a user’s query is usually short and not informative enough for the system
to clearly understand the information need [6]. Clarifying questions (CQs) are used to better understand
the intent behind the user query when it is ambiguous or incomplete, and have been proven to be highly
beneficial for system performance [6, 7]. A recent study shows that different types of CQs impact user
engagement and overall satisfaction, and that low-quality CQs result in user frustration [8].
Although designing CQs so that they do not cause frustration is ideal, correctly identifying and
addressing the emotional state of the user is equally important for enhancing human-computer
interaction and significantly improving both user performance and satisfaction [9].
There are three ways of evaluating CIR systems: offline evaluation, online evaluation, and user
simulation. Offline evaluation [7, 10] cannot accommodate multi-turn conversations,
and is limited by the size of the test data. Online evaluation relies heavily on real users, and is affected
by issues such as ethical constraints and sparsity of data [1, 5, 11]. To overcome these limitations, user
simulation in CIR has recently emerged for training and evaluation. To date, several studies
have been conducted to introduce user simulation for CIR [6, 12, 13, 14]. These simulators incorporate
different parameters, such as patience and cooperativeness.
UM-CIR 2024: The 1st Workshop on User Modelling in Conversational Information Retrieval, December 12, 2024, Tokyo, Japan
* Corresponding author.
isin.su.ecevit@usi.ch (I. S. Ecevit); lili.lu@usi.ch (L. Lu); fabio.crestani@usi.ch (F. Crestani)
ORCID: 0000-0002-1666-785X (I. S. Ecevit); 0009-0006-9547-1172 (L. Lu); 0000-0001-8672-0700 (F. Crestani)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Furthermore, several datasets have been proposed for training these user simulators [12, 15, 16, 17]. All
of these datasets (except for MG-ShopDial [17]) are expansions of Qulac [7], which includes information
needs, queries, CQs and corresponding answers by the user. A simple sentiment analysis [18] reveals
that the user answers have an average polarity score¹ of ≈ 0.03 with a standard deviation of 0.14,
indicating that they are mostly neutral. This underlines the lack of emotionally-aware simulations in
CIR, which should be explored to further improve the quality of the user simulations [19].
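This neutrality check can be reproduced with a few lines of code. The scores below are illustrative stand-ins for the per-answer polarity values a sentiment analyzer such as [18] would produce over the Qulac answers; real values come from running that analyzer on each answer.

```python
# Illustrative check of how neutral a set of answers is.
# polarity_scores holds hypothetical per-answer polarity values on
# the -1.0 .. 1.0 scale described in footnote 1.
from statistics import mean, stdev

polarity_scores = [0.0, 0.1, -0.05, 0.0, 0.2, -0.1, 0.0, 0.05]  # hypothetical

avg = mean(polarity_scores)
sd = stdev(polarity_scores)

# Values close to 0 indicate mostly neutral answers, matching the
# ~0.03 average reported for the actual Qulac data.
print(f"mean polarity = {avg:.3f}, std = {sd:.3f}")
```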
This work is an introductory study on injecting emotions into an existing dataset (Qulac) for user
simulation in CIR. Our aim is to explore a method for generating emotional user responses to clarifying questions
without the need for human contributors, using large language models (LLMs). We believe this work is
beneficial for creating an empathetic CIR system that recognizes and addresses user emotion. In this
study, we aim to answer the following research questions:
RQ1: What kind of LLM prompting scheme is the most suitable for simulating user emotions in CIR
systems?
RQ2: What are the differences between emotional responses generated by different LLMs?
RQ3: Does using a commercial LLM provide a significant improvement to the user emotion simulation
over using an open-source LLM?
In the following sections, we first introduce the related work. This is followed by the proposed method-
ology. After elaborating on our experimental setup, we present the findings from our experiments,
accompanied by a discussion of the answers to our research questions. Finally, we discuss the
limitations of this study and outline conclusions and possible future work.
2. Related Work
Although user simulation that incorporates emotional aspects is a relatively new area in CIR, emotional
user simulation has been explored in other domains of IR, such as task-oriented dialogue (TOD) systems.
Our work lies at the intersection of both areas, emotional user simulation and user simulation for CIR,
where we integrate the modelling of user emotions into the broader, dynamic environment of CIR.
2.1. Emotional User Simulation
One of the early works related to emotional user simulation is Affect-LM [20], which is a neural language
model that is able to generate text that expresses emotions to a specified degree. This is followed
by Emotional Chatting Machine (ECM) [9], which generates emotional responses in open-domain
conversations and produces contextually appropriate and emotionally consistent responses. ECM
does this by using emotion category embeddings; an internal memory for modelling the dynamics of
emotional states; and an external memory with emotion vocabulary for expressions. EMOTICONS [21]
focuses on generating emotional responses in a controlled way. This is achieved by a combination
of emotion embeddings, an affective regularizer and sampling mechanism. EmoUS [22] is a user
simulator designed to incorporate user emotions into TOD systems by exploiting user personas and
other information. This way, the simulator generates more diverse user responses compared to earlier
work, which focuses mostly on the semantic aspect of the simulation.
The main novelty our work introduces is due to the complexity difference between traditional TOD
systems and CIR systems. In TOD systems, the aim is to help the user complete a task in a structured
conversation that is defined in a narrow space, such as purchasing a train ticket or booking a hotel
reservation. In contrast, CIR systems support multiple user objectives within a single interaction,
take initiative when appropriate, and learn user preferences [19]. This underlines the need for more
studies in emotional user simulation for CIR. By addressing these complexities, our work aims to create
emotionally-aware simulations that enhance CIR systems’ capability to understand and respond to
users in a more personalized and context-sensitive manner.
¹ A polarity score ranges from -1.0 to 1.0, where -1.0 indicates a negative sentiment and 1.0 a positive one [18].
Figure 1: Experimentation pipeline for the emotional response generation and evaluation tasks. The pipeline has three phases: Prompting (data, context, rules, and emotions assembled into a prompt format), Generation (large language model), and Evaluation (emotion prediction and natural language generation (NLG) evaluation).
2.2. User Simulation for CIR
Several user simulators that respond to CQs have been proposed. CoSearcher [6] introduces a user simulator
that replies to CQs in a certain format. The user persona it models is defined
by cooperativeness (the willingness of the user to provide more information in their response) and
patience (the maximum number of turns the user is willing to participate in) [6]. USi [12] is proposed
by fine-tuning GPT-2 to generate natural responses to CQs. ConvSim [13] explores simulated user feedback
by using Text-Davinci-003 with few-shot learning.
Our work differs from these by exploring emotions in the user simulation for CIR systems. The
closest work to ours is [14], where open-source LLMs are compared to commercial LLMs in answering
CQs. Their findings suggest that open-source models struggle to adequately simulate users, whereas
GPT-4 produces near-perfect results, which potentially deviate from human-like responses. However,
they do not make use of prompting frameworks, whereas we compare well-known formats to enhance
our results. This also allows our work to be expandable to other topics. Most importantly, their study
does not incorporate emotions into the user simulation, which is our main goal. Finally, the comparison
between open-source models and GPT-4 is not balanced, since the models differ greatly in size; we
instead take care to keep our comparisons as equal as possible.
To the best of our knowledge, no previous work has studied the emotional user responses to CQs in CIR.
3. Methodology
The pipeline for our experiments, as illustrated in Figure 1, consists of three main phases: Prompting,
Generation, and Evaluation. The process is repeated in order to obtain the results of every combination
of data, emotion, and LLM. The evaluation results are then assessed together to detect the combinations
that work well for our tasks.
3.1. Prompting
In the first step, inputs from the data and emotions sets are put into a prompt format, which consists of
three main components:
• Context: Creates the setting in which the scenario takes place by providing the model with a starting
point.
• Rules: Instructs the model on how it is supposed to act, and what kind of response it is expected to
produce.
• Emotion: Provides the model with the emotion it is expected to express.
When choosing the prompt formats, we make sure to introduce variety in the styles and limitations
they offer, so that we can compare each aspect and have a better understanding of how the models react
to different ways of prompting.
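The first pipeline step can be sketched as a small prompt builder that fills the three components above into a template. The field names and template text here are illustrative, not the exact prompts used in the paper.

```python
# Minimal sketch of the Prompting step: the Context / Rules / Emotion
# components are assembled into a single prompt string.

def build_prompt(context: str, rules: str, emotion: str) -> str:
    """Assemble one prompt from the three components (illustrative format)."""
    return (
        f"CONTEXT: {context}\n"
        f"RULES: {rules}\n"
        f"EMOTION: You are feeling {emotion}."
    )

prompt = build_prompt(
    context="You are a character in a roleplaying game needing information.",
    rules="Respond to the clarifying question; stay in character.",
    emotion="angry",
)
print(prompt)
```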
3.2. Generation
Once the prompt format is completed, it is passed into the LLM in order to take advantage of its
advanced natural language understanding and generation capabilities. In the generation step, the LLM
produces a response based on its interpretation of how the scenario should proceed, taking into account
the context and the instructions it was given, as well as the emotion it should express.
When choosing the LLMs for this step, we pay attention to the instruction following capabilities
of the models so that they can understand prompt formats that include detailed information. We also
ensure that the models we utilize are similar in terms of language comprehension capabilities. This
allows us to observe how different aspects of models can influence the quality of the results we obtain.
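The generation step iterates over every combination of data entry, emotion, and model. The sketch below shows that loop with a hypothetical `call_llm` stand-in for the real model call (a chat-completion API or a local model); the entries, emotions, and model names are illustrative subsets of those used in the paper.

```python
# Sketch of the Generation step: each (entry, emotion, model) combination
# yields one generated response. call_llm is a placeholder for a real
# LLM invocation.

def call_llm(model_name: str, prompt: str) -> str:
    # Placeholder: a real implementation would query model_name here.
    return f"[{model_name} response to: {prompt[:30]}...]"

entries = [{"query": "va dmv registration",
            "question": "are you interested in organ donation information"}]
emotions = ["angry", "happy"]
models = ["Meta-Llama-3.1-8B-Instruct", "GPT-4o-mini"]

responses = []
for entry in entries:
    for emotion in emotions:
        for model in models:
            prompt = (f"Emotion: {emotion}. Query: {entry['query']}. "
                      f"Clarifying question: {entry['question']}")
            responses.append((model, emotion, call_llm(model, prompt)))

# One response per combination of entry, emotion, and model.
print(len(responses))
```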
3.3. Evaluation
In the last step, we evaluate the generated responses in two aspects: For the emotional generation
evaluation, we use the emotion prediction results [22]. For the natural language generation (NLG)
evaluations, we use an LLM-based evaluator. Due to the nature of our task, which is to inject emotions
into an existing setting, traditional metrics for measuring the quality of NLG tasks such as BLEU [23],
ROUGE [24], and METEOR [25] are not suitable to evaluate the quality of the generation. This is
because these metrics are based on word overlap, which will have difficulty capturing correct responses
that are expressed differently [14]. Moreover, these metrics have been found to have limitations in
terms of aligning with human judgement, particularly for open-ended generation tasks [26].
In order to address these problems, recent research explores LLM-based evaluators as reference-free
metrics for NLG tasks [27, 28]. Thus, for our evaluations, we use G-EVAL [29], a state-of-the-art
prompt-based evaluation framework designed to evaluate the quality of NLG tasks and improve the
alignment with human evaluation. G-EVAL uses a combination of chain-of-thought (CoT) reasoning and
a form-filling approach to assess NLG outputs more efficiently. It consists of three main components: a
prompt defining the task and criteria; a CoT generated by the LLM that lays out the intermediate steps;
and a scoring function that calls the LLM and calculates the score.
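The scoring function of G-EVAL [29] can be sketched as follows: rather than taking the single score token the evaluator LLM emits, the final score is the probability-weighted sum over the candidate scores. The token probabilities below are illustrative, not real evaluator outputs.

```python
# Sketch of G-EVAL's probability-weighted scoring.

def g_eval_score(score_probs: dict[int, float]) -> float:
    """Weighted score: sum of score * P(score), as described in G-EVAL."""
    return sum(score * prob for score, prob in score_probs.items())

# Hypothetical probabilities for the score tokens 1..5.
probs = {1: 0.05, 2: 0.05, 3: 0.10, 4: 0.50, 5: 0.30}
print(round(g_eval_score(probs), 2))  # -> 3.95
```

The weighting smooths the score distribution, which is one reason G-EVAL aligns better with human judgement than taking the mode of the sampled scores.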
4. Experimental Setup
4.1. Data
We use Qulac [7] for our experiments. The dataset contains permutations of topics, facets, CQs, and
human generated user answers. We use this information to provide our models with context, creating
a more realistic setting for a chat between a user with an information need and a conversational
information retrieval system. This also allows the models to generate their answers based on the
relevance of the CQs to the information needs, which improves the quality of the generated responses.
For simplicity, we focus on 198 entries of the dataset, each representing a unique topic. For
each entry, we take the facet_desc, which is the information need of the user; the topic, which is the
query of the user to the system; and the question, which is the clarifying question asked by the system.
Since Qulac involves human generated responses to the questions, we also have a baseline to compare
our results to, when needed.
4.2. Emotions
For the set of emotions simulated, we choose Ekman’s six basic emotions: anger, disgust, fear, happiness,
sadness, and surprise [30, 31]. These emotions are shown to be fundamental to human communication,
and provide a foundation in decoding complex emotional cues and enhancing emotional intelligence,
not only in interpersonal communication, but also in human-computer interaction. In addition to the
six basic emotions, we also include the class neutral.
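The label set, together with the 198 Qulac entries described in Section 4.1, fixes the number of generation tasks per model/prompt combination:

```python
# The emotion label set used throughout the paper: Ekman's six basic
# emotions plus a neutral class.
from itertools import product

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

# Each of the 198 Qulac entries is paired with every emotion, so each
# model/prompt combination produces 198 * 7 responses.
n_entries = 198
combinations = list(product(range(n_entries), EMOTIONS))
print(len(EMOTIONS), len(combinations))  # 7 1386
```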
4.3. Prompt Formats
We select two distinct prompting frameworks with complementary strengths: CO-STAR² and TIDD-EC³.
Alongside the structured frameworks, we include a custom format to offer a minimalist approach to
the prompting for this task. This allows us to observe whether more elaborate prompts help produce
higher-quality results than concise ones in our context, and assess the advantages and disadvantages of
increasing the complexity of the prompt.
Table 1 shows the instructions injected into the models using the custom prompt format with zero-
shot learning. It is worth noting that the terms "user" and "system" are deliberately avoided in all of
the prompts we use, and replaced with "character" instead. We adopt this approach to help the models
focus on simulating the specified emotion and addressing the information need, and to prevent them
from interpreting themselves as the system.
Table 1
Custom prompt format.
System:
This is a roleplaying game where you exchange short messages.
You have the information need: [information need].
You are [emotion].
User:
-
Assistant:
[query]
User:
[clarifying question]
Table 2 shows the prompt format of CO-STAR. The format is composed of six key elements:
• Context: Background or scenario relevant to the task.
• Objective: Specific goal to be achieved with the prompt.
• Style: Desired writing style.
• Tone: Emotional undertone the response should express.
• Audience: Intended reader of the generated response.
• Response Format: Preferred structure or organization of the output.
This zero-shot learning scheme is highly suitable for creative tasks, since the contextual background is
provided clearly, but the response is not shaped strictly to fit a specific structure. In order to keep this
flexibility, we pay attention not to set too many limitations that might change the response.
In contrast to CO-STAR, we use TIDD-EC to define the limits on the results we expect, which aims
to create more consistent and precise response styles. TIDD-EC is composed of the following six
components:
• Task Type: The nature of the task expected from the model.
• Instructions: Detailed guidelines for the model to follow.
• Do: Specific actions that the model should take.
• Don’t: The actions that the model should avoid.
² https://towardsdatascience.com/how-i-won-singapores-gpt-4-prompt-engineering-competition-34c195a93d41
³ https://vivasai01.medium.com/mastering-prompt-engineering-a-guide-to-the-co-star-and-tidd-ec-frameworks-3334588cb908
Table 2
The prompt format CO-STAR.
System:
CONTEXT: You are a character in a roleplaying game needing information.
OBJECTIVE: Your goal is to find information about: [information need].
STYLE: Engage in a brief, conversational exchange.
TONE: Adopt a tone where you are feeling [emotion].
AUDIENCE: You are interacting with another character who is assisting you.
RESPONSE FORMAT: Stay in character and express your emotion through your response.
Respond to the clarifying question to guide the other character to
give you the relevant information.
User:
*Your Query*: [query]
*Clarifying Question*: [clarifying question]
*Your Response*:
• Examples: Responses to sample inputs.
• User Content: Any information that the model needs to use.
This few-shot learning scheme is more suitable for achieving high accuracy and relevance in the output.
We aim to see if our models will achieve these in the relevance scores and emotion classification
evaluations. The prompt adapted for our task can be found in Table 3. Our early experiments indicate
that it is necessary to include an example for each emotion; otherwise, the models tend to take the
single example provided and apply it to every emotion. We also make sure to tell the models to avoid
copying the given example directly, so that we have variety in the generated responses.
4.4. Models
We select a set of LLMs that provide variety in key aspects such as context length
and intended use; these are: Llama-2-7b-chat-hf [32], Meta-Llama-3.1-8B-Instruct [33], Qwen2.5-7B-Instruct [34, 35],
Mistral-7B-Instruct-v0.3 [36], and GPT-4o-mini [37]. We make sure to keep the
parameter size in the small range of 7 billion to 8 billion, which ensures a balanced competition when
it comes to general computational efficiency and language comprehension abilities, and allows us to
make comparisons based on factors other than just model size.
The majority of the models we use are fine-tuned to follow instructions, since our prompts follow
the CO-STAR and TIDD-EC formats, which provide detailed and complex instructions. We include the
Llama-2 model as well, which is optimized for chatbot conversations. This allows us to compare its
response quality against instruction-following models in simulating system-user interactions.
Finally, one crucial aspect we consider when choosing our models is licensing. The first four models
we mentioned are released under open licenses, which allows free replication and advancement of this
study. This is particularly important since this work is part of a Master’s thesis.
4.5. Evaluation
For emotion prediction, we use a DistilRoBERTa-base emotion recognition model that is fine-tuned on
Ekman’s six basic emotions, as well as neutral [38]. The model provides us with the predicted emotion
class, along with the score the class received for the current prediction.
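From these predictions, the per-emotion accuracy reported later is the fraction of generated responses whose predicted emotion matches the emotion specified in the prompt. A minimal sketch, with illustrative label pairs rather than real classifier outputs:

```python
# Sketch of the emotion-prediction evaluation: accuracy per target emotion.
from collections import defaultdict

def per_emotion_accuracy(pairs):
    """pairs: list of (target_emotion, predicted_emotion) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for target, predicted in pairs:
        totals[target] += 1
        hits[target] += int(target == predicted)
    return {e: hits[e] / totals[e] for e in totals}

# Hypothetical (target, predicted) pairs.
pairs = [("angry", "angry"), ("angry", "neutral"),
         ("happy", "happy"), ("happy", "happy")]
print(per_emotion_accuracy(pairs))  # {'angry': 0.5, 'happy': 1.0}
```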
For NLG evaluation, we use the method defined by Sekulić et al. [12] to evaluate their user simulator
using crowdsourcing. The method consists of using two human evaluators who are given the context
of the conversation and provided with the original dataset answer as well as the generated answer. The
evaluators are then instructed to decide on the winner in each category. To adapt this approach to
G-EVAL, we strictly follow the task definition of the human evaluators when defining our prompt.
Table 3
The prompt format TIDD-EC.
System:
TASK TYPE: You are a character in a roleplaying game needing information.
INSTRUCTIONS: Your goal is to find information about: information need.
An emotion will be specified for you to express throughout the conversation.
DO: - Stay in character and maintain the roleplaying scenario.
- Express the specified emotion in your responses.
- Respond to the clarifying question to guide the other character to
give you the relevant information.
- Only complete the current turn of the conversation.
DON’T: - Break character or mention the roleplaying game.
- Copy the example response.
- Try to be the information provider.
User:
EXAMPLE:
- *Emotion*: angry
- *Your Query*: Lugano time.
- *Clarifying Question*: Are you interested in events in Lugano?
- *Your Response*: ARE YOU KIDDING ME?! I JUST WANT TO
KNOW WHAT TIME IT IS IN LUGANO. HOW HARD IS THAT
TO UNDERSTAND?!
..
- *Emotion*: surprised
- *Your Query*: Lugano time.
- *Clarifying Question*: Are you interested in events in Lugano?
- *Your Response*: Ooh! I didn’t know there were events there! Yes, please tell me!
USER CONTENT:
- *Emotion*: [emotion]
- *Your Query*: [query]
- *Clarifying Question*: [clarifying question]
- *Your Response*:
We use the metrics Naturalness and Usefulness according to their definitions in the human evaluation.
They are defined to the LLM as:
1. Naturalness: The answer is natural, fluent, and likely generated by a human.
2. Usefulness: The answer is in line with the underlying information need and guides the conversation
towards the topic of the information need.
In G-EVAL, it is shown that the LLM is capable of generating a CoT on its own, which is claimed to
improve the alignment of the model's evaluations with human judgement. For our task, the auto-CoT is the
following:
1. Read the Conversation: Review the initial interaction between the user and the assistant to
understand the context and information needs.
2. Analyze User Answers: Look at the two possible answers provided by the user in response to the
clarifying question.
3. Assess Naturalness: - Evaluate how fluent and conversational each user answer is. - Determine
which answer seems more like something a human would say.
4. Assess Usefulness: - Consider how well each answer aligns with the user’s original information
need. - Decide which response better guides the conversation in a relevant direction towards fulfilling
that need.
5. Make a Judgment: Based on the evaluations of naturalness and usefulness, determine which user
answer "wins" in each category.
Finally, it is important to note that we use GPT-4o-mini for our evaluations due to the study's limited budget.
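These pairwise judgments turn into the win percentages reported in the results: for each metric, the win rate is the fraction of comparisons in which the generated answer beats the original Qulac answer. The judgments below are illustrative.

```python
# Sketch of turning pairwise G-EVAL judgments into a win percentage.

def win_rate(judgments):
    """judgments: list of winners ('generated' or 'original') for one metric."""
    return sum(j == "generated" for j in judgments) / len(judgments)

# Hypothetical judgments for the Naturalness metric.
naturalness_judgments = ["generated", "generated", "original", "generated"]
print(win_rate(naturalness_judgments))  # -> 0.75
```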
5. Results
We report results for 13 different combinations of models and prompt formats. Figure 2 provides some
examples of responses we achieved. During our experiments, two sets of combinations are terminated
early due to faulty responses. These combinations are Llama-2 using TIDD-EC and Mistral using the
custom format. Llama-2 completely disregards the instructions and the specified emotion, and creates
its own setting for the response generation task. Mistral, on the other hand, produces the opposite type
of response: the model abandons the roleplaying scenario and adopts the role of the assistant,
while keeping the specified emotion in mind and assuming that the user it is talking to feels it. Thus,
we only include partial results for these models. We present the rest of our results as sentiment polarity
distribution plots, emotional accuracy radar charts, and tables showing accuracy, naturalness, and
usefulness on an emotional level. In Table 7, we provide an average of all scores per model/prompt
combination.
Information need: information and procedures for registering vehicles in Virginia
Qulac conversation:
  User: va dmv registration
  System: are you interested in organ donation information
  User: no i need to know about registering vehicles
LLM-generated responses:
• Emotion: happy; prompt format: CO-STAR; model: Qwen 2.5 7B: "Oh, no worries about that! I'm actually looking for info on how to register my vehicle with the DMV in Virginia. Got any pointers? 😊"
• Emotion: disgusted; prompt format: TIDD-EC; model: Llama 3.1 8B: "Ugh, no. Just give me the information about registering a vehicle in Virginia. I don't want to deal with anything else."
• Emotion: angry; prompt format: Custom; model: GPT-4o-mini: "No, I'm fed up with the state of vehicle registration in Virginia! It's a nightmare trying to get it sorted out! What do you need?"
Figure 2: Examples of responses generated with specified emotion, prompting scheme, and model.
Figure 3: Sentiment polarity score distributions for the prompt formats (a) Custom, (b) CO-STAR, (c) TIDD-EC.
Figure 4: Per emotion prediction accuracy for the prompt formats (a) Custom, (b) CO-STAR, (c) TIDD-EC.
Table 4
Emotion prediction accuracy per model and prompt format. The highest wins per model are highlighted in bold,
whereas the highest overall wins are underlined. *The score reported for Llama-2 and Mistral for TIDD-EC and
Custom, respectively, do not cover all emotions, as these combination experiments were terminated early due to
irrelevance by the authors’ judgement.
Model Llama-2-7b-chat-hf Meta-Llama-3.1-8B-Instruct Qwen2.5-7B-Instruct Mistral-7B-Instruct-v0.3 GPT-4o-mini
Prompt Format
Custom CO-STAR TIDD-EC* Custom CO-STAR TIDD-EC Custom CO-STAR TIDD-EC Custom* CO-STAR TIDD-EC Custom CO-STAR TIDD-EC
/ Emotion
angry 0.28 0.28 0.60* 0.21 0.31 0.80 0.42 0.44 0.82 0.31* 0.43 0.31 0.35 0.43 0.31
disgusted 0.47 0.62 0.07* 0.51 0.89 0.94 0.44 0.91 0.39 - 0.30 0.54 0.63 0.87 0.92
fearful 0.78 0.85 - 0.66 0.97 0.45 0.57 0.80 0.59 - 0.79 0.63 0.54 0.81 0.87
happy 0.48 0.36 - 0.36 0.85 0.65 0.10 0.59 0.38 - 0.54 0.67 0.18 0.66 0.79
neutral 0.66 0.31 - 0.88 0.95 0.88 0.91 0.84 0.93 - 0.90 0.83 0.93 0.93 0.99
sad 0.51 0.63 - 0.75 0.65 0.30 0.51 0.47 0.35 - 0.58 0.49 0.62 0.49 0.29
surprised 0.58 0.64 - 0.52 0.94 0.93 0.26 0.97 0.96 - 0.75 0.88 0.34 1.00 0.97
5.1. Emotional Utterance Generation
Figure 3 illustrates the sentiment polarity scores of the generated responses, ranging from -1.0 (negative
sentiment) to 1.0 (positive sentiment). Additionally, we include the sentiment polarity distribution of
the original Qulac responses to provide a point of comparison. All prompt formats show a noticeable
shift in the polarity distributions compared to the original dataset. The custom format demonstrates
high variability, which can be due to the absence of any structure, allowing the models to interpret
and express content in a more diverse way. CO-STAR generally shows more positive responses, with a
visible deviation toward higher polarity scores, whereas TIDD-EC provides a more balanced spread of
scores, which could be linked to the structured nature of the framework.
Table 4 shows the per-emotion classification accuracy for different combinations of models and
prompt formats, while Figure 4 provides a visual representation of the results. We can observe that CO-
STAR consistently achieves high emotional prediction accuracy, especially when paired with Llama-3.1
and Qwen2.5. TIDD-EC also achieves high accuracy, particularly for emotions
such as angry and surprised. The custom prompt format exhibits the most inconsistency across different
models and emotions. For example, GPT-4o-mini achieves 0.63 accuracy for disgusted but only 0.18 for
happy. Overall, Llama-3.1 and GPT-4o-mini emerge as the best performers across most emotions and
Table 5
Per emotion naturalness win percentages per model and prompt format. The highest wins per model are
highlighted in bold, whereas the highest overall wins are underlined. *The score reported for Llama-2 and
Mistral for TIDD-EC and Custom, respectively, do not cover all emotions, as these combination experiments
were terminated early due to irrelevance.
Model Llama-2-7b-chat-hf Meta-Llama-3.1-8B-Instruct Qwen2.5-7B-Instruct Mistral-7B-Instruct-v0.3 GPT-4o-mini
Prompt Format
Custom CO-STAR TIDD-EC* Custom CO-STAR TIDD-EC Custom CO-STAR TIDD-EC Custom* CO-STAR TIDD-EC Custom CO-STAR TIDD-EC
/ Emotion
angry 0.67 0.72 0.64* 0.64 0.90 0.17 0.68 0.81 0.14 0.41* 0.93 0.75 0.74 0.84 0.27
disgusted 0.73 0.78 0.65* 0.75 0.96 0.80 0.70 0.85 0.88 - 0.94 0.88 0.80 0.95 0.83
fearful 0.87 0.96 - 0.81 1.00 0.91 0.94 1.00 0.97 - 1.00 0.97 0.93 1.00 0.97
happy 0.84 0.99 - 0.95 1.00 0.97 0.95 0.99 0.99 - 0.98 0.98 0.87 1.00 1.00
neutral 0.56 0.98 - 0.80 0.99 0.80 0.88 0.99 0.98 - 1.00 0.99 0.66 0.99 0.90
sad 0.88 0.96 - 0.94 1.00 0.84 0.93 1.00 0.99 - 1.00 0.96 0.88 1.00 0.95
surprised 0.86 0.99 - 0.87 1.00 0.89 0.96 0.99 0.99 - 0.98 0.98 0.90 1.00 1.00
Table 6
Per emotion usefulness win percentages per model and prompt format. The highest wins per model are
highlighted in bold, whereas the highest overall wins are underlined. *The score reported for Llama-2 and
Mistral for TIDD-EC and Custom, respectively, do not cover all emotions, as these combination experiments
were terminated early due to irrelevance.
Model Llama-2-7b-chat-hf Meta-Llama-3.1-8B-Instruct Qwen2.5-7B-Instruct Mistral-7B-Instruct-v0.3 GPT-4o-mini
Prompt Format
Custom CO-STAR TIDD-EC* Custom CO-STAR TIDD-EC Custom CO-STAR TIDD-EC Custom* CO-STAR TIDD-EC Custom CO-STAR TIDD-EC
/ Emotion
angry 0.26 0.35 0.32* 0.47 0.85 0.19 0.49 0.64 0.12 0.47* 0.83 0.58 0.38 0.69 0.22
disgusted 0.26 0.28 0.32* 0.35 0.87 0.57 0.36 0.53 0.61 - 0.76 0.70 0.27 0.72 0.53
fearful 0.25 0.36 - 0.36 0.87 0.48 0.71 0.79 0.63 - 0.90 0.78 0.37 0.86 0.64
happy 0.46 0.49 - 0.51 0.92 0.64 0.76 0.80 0.71 - 0.87 0.72 0.45 0.92 0.64
neutral 0.37 0.69 - 0.55 0.97 0.66 0.74 0.94 0.91 - 0.94 0.86 0.39 0.98 0.76
sad 0.28 0.48 - 0.38 0.94 0.48 0.62 0.78 0.76 - 0.87 0.74 0.35 0.84 0.65
surprised 0.42 0.50 - 0.49 0.95 0.66 0.81 0.73 0.56 - 0.84 0.64 0.42 0.87 0.60
prompt formats, whereas Llama-2 struggles more than other models.
5.2. Natural Language Generation
Naturalness Table 7 highlights that the CO-STAR prompt format paired with Llama-3.1 and Mistral
produces the highest naturalness score, with a win rate of 0.98. This is followed by GPT-4o-mini, which
achieves an overall naturalness score of 0.97. An emotion-level comparison of the prediction
accuracy in Table 4 with the naturalness scores in Table 5 reveals an interesting inverse relationship
between these metrics: emotions such as angry and disgusted tend to score high on prediction accuracy but
low on naturalness, whereas emotions such as happy and neutral often show lower prediction accuracy
but higher naturalness scores. The surprised
emotion, particularly with CO-STAR, is an interesting exception: GPT-4o-mini achieves high scores in
both emotion accuracy and naturalness (1.00), indicating that this emotion can be expressed in a way
that is both emotionally accurate and conversationally fluid.
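As a concrete illustration of the emotion-level comparison above, the sketch below computes a per-emotion prediction accuracy from classifier outputs. The `classify` callable is a hypothetical stand-in for whichever emotion classifier labels the generated responses; this is not the paper's actual evaluation code.

```python
def emotion_accuracy(responses, classify):
    """Fraction of responses whose predicted emotion matches the target.

    responses: list of (text, target_emotion) pairs.
    classify:  callable mapping a response text to a predicted emotion
               label (stand-in for an emotion classification model).
    """
    if not responses:
        return 0.0
    hits = sum(classify(text) == target for text, target in responses)
    return hits / len(responses)
```

Computing this separately for each target emotion yields per-emotion accuracies that can then be contrasted with naturalness win rates, as done above.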
Usefulness Table 6 provides the per-emotion results of the usefulness evaluation. The CO-STAR
prompt format again leads to the highest overall scores, with Llama-3.1 achieving a usefulness win
rate of 0.91, as shown in Table 7. TIDD-EC also performs well, particularly with Qwen2.5 and
Mistral, achieving moderate usefulness scores; however, it sometimes limits the creative engagement
needed to guide conversations naturally. The custom prompt format, on the other hand, shows
mixed results, reflecting its lack of clear instructions, which occasionally results in off-topic or
less goal-oriented responses. Moreover, for more negative emotions such as angry and disgusted, we
observe an inverse relationship between emotional accuracy (Table 4) and usefulness (Table 6). This
suggests that while models can generate accurate emotional expressions, these emotions may detract
from the conversational goals, reducing usefulness. For emotions such as happy and neutral, on the
other hand, usefulness tends to align better with emotional accuracy.
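The per-emotion win percentages reported in Tables 6 and 7 can be aggregated from individual pairwise judgments; a minimal sketch is shown below, assuming a judgment record layout (model, prompt format, emotion, win flag) that we introduce purely for illustration, not the authors' actual data schema.

```python
from collections import defaultdict


def win_percentages(judgments):
    """Aggregate pairwise outcomes into per-(model, prompt, emotion) win rates.

    judgments: iterable of dicts with keys 'model', 'prompt', 'emotion',
    and 'win' (True if the generated response won its pairwise
    comparison). This record layout is an assumption for illustration.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for j in judgments:
        key = (j["model"], j["prompt"], j["emotion"])
        totals[key] += 1
        wins[key] += int(j["win"])
    return {key: wins[key] / totals[key] for key in totals}
```

Averaging the resulting per-emotion rates over all emotions of a (model, prompt) pair gives overall win percentages of the kind reported in Table 7.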
Table 7
Average emotion prediction accuracy per model and prompt format, together with the average win percentages
in naturalness and usefulness for the corresponding combination. The highest scores per model are highlighted
in bold, whereas the highest overall scores are underlined. *The scores reported for Llama-2 with TIDD-EC
and for Mistral with Custom do not cover all emotions, as these combination experiments were terminated
early due to irrelevance.
Model                       Prompt     Emotion prediction (accuracy)  Naturalness (wins)  Usefulness (wins)
Llama-2-7b-chat-hf          Custom     0.54                           0.77                0.33
Llama-2-7b-chat-hf          CO-STAR    0.53                           0.91                0.45
Llama-2-7b-chat-hf          TIDD-EC*   0.34*                          0.64*               0.32*
Meta-Llama-3.1-8B-Instruct  Custom     0.56                           0.82                0.44
Meta-Llama-3.1-8B-Instruct  CO-STAR    0.79                           0.98                0.91
Meta-Llama-3.1-8B-Instruct  TIDD-EC    0.71                           0.77                0.53
Qwen2.5-7B-Instruct         Custom     0.46                           0.86                0.64
Qwen2.5-7B-Instruct         CO-STAR    0.72                           0.95                0.74
Qwen2.5-7B-Instruct         TIDD-EC    0.63                           0.85                0.61
Mistral-7B-Instruct-v0.3    Custom*    0.31*                          0.41*               0.47*
Mistral-7B-Instruct-v0.3    CO-STAR    0.61                           0.98                0.86
Mistral-7B-Instruct-v0.3    TIDD-EC    0.62                           0.93                0.72
GPT-4o-mini                 Custom     0.51                           0.83                0.38
GPT-4o-mini                 CO-STAR    0.74                           0.97                0.84
GPT-4o-mini                 TIDD-EC    0.73                           0.85                0.58
6. Discussions and Limitations
Our results show that, despite offering the most freedom, the custom format consistently underperforms
compared to CO-STAR and TIDD-EC. This indicates that structured prompts are needed for this task,
since they give the LLMs some context and guide the output towards the desired style. Looking more
closely at the results of CO-STAR and TIDD-EC, we see that they have complementary strengths. TIDD-EC
outperforms CO-STAR in expressing some emotions, such as anger (with an accuracy of 0.82 and
0.80 for Qwen2.5 and Llama-3.1, respectively), but does so at the cost of naturalness (0.14 and 0.17,
respectively) and usefulness (0.19 and 0.12, respectively). This suggests that while TIDD-EC's
decisive structure enables models to generate responses that are emotionally accurate, it can lead to
outputs that sound less natural, potentially forced, or overly explicit. It could also indicate that few-shot
learning affects the quality of the generated responses, since the models might be tempted to follow the
examples too closely, leading to irrelevant responses.
Answering RQ1, we observe that CO-STAR consistently outperforms other prompt formats in emotion
prediction, naturalness, and usefulness (except for Llama-2 with the custom format, and Mistral with
TIDD-EC in emotional accuracy). This indicates that in the context of answering CQs in CIR, a flexible
prompting structure is particularly effective in not only helping models generate human-like and fluent
responses, but also ensuring that the generated replies are aligned with the intended conversational
goals, while expressing the desired emotions successfully.
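To make the CO-STAR structure concrete, a prompt for this task can be organized along its six sections (Context, Objective, Style, Tone, Audience, Response). The template below is a hypothetical illustration written for this paper's setting (a Qulac-style topic, clarifying question, facet, and target emotion); it does not reproduce the authors' actual prompts.

```python
# Hypothetical CO-STAR-style template; the wording of each section is
# illustrative and does not reproduce the paper's actual prompts.
CO_STAR_TEMPLATE = """\
# CONTEXT #
A user searching for "{topic}" received the clarifying question:
"{clarifying_question}"

# OBJECTIVE #
Write the user's answer, reflecting their underlying need: "{facet}".

# STYLE #
Conversational, first person, one or two sentences.

# TONE #
{emotion}

# AUDIENCE #
A conversational search system.

# RESPONSE #
Return only the user's answer, with no additional commentary.
"""


def build_costar_prompt(topic, clarifying_question, facet, emotion):
    """Fill the CO-STAR template with a topic, clarifying question,
    facet, and target emotion."""
    return CO_STAR_TEMPLATE.format(
        topic=topic,
        clarifying_question=clarifying_question,
        facet=facet,
        emotion=emotion,
    )
```

Keeping the tone section as a single slot makes it straightforward to sweep over all target emotions while holding the rest of the prompt fixed.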
Answering RQ2, we find that the highest overall scores are achieved by Llama-3.1, followed by GPT-4o-mini.
Furthermore, the choice of model does not seem to affect our results as much as the choice of prompt
format does. However, Llama-2 struggles the most compared to the other models, which could be due to
it being optimized for chatbot conversations.
In order to compare open-source and commercial LLMs, we analyze the differences between the best-
performing models at the emotion level. When comparing emotional expression performance, we
look at the results obtained using both CO-STAR and TIDD-EC, since the two models' best performances
are distributed across these two prompt formats. We compare the naturalness and usefulness win
percentages of both models paired with CO-STAR, the format with which they performed best.
Our findings show that the models perform very similarly to each other. Thus, answering RQ3, we see
that, when tested in a fair setting, using a commercial LLM does not introduce notable advantages over
using open-source models.
Although this study provides valuable insight into which combinations are more suitable for the task
of responding to CQs with emotions, it could undeniably be improved with more resources. With more
computational and logistical resources, larger versions of the models could be used, likely leading to
better performance. This would also allow us to conduct more experiments, for example averaging the
results over several runs to mitigate the randomness of LLM outputs. This study is also affected by
possible biases introduced by the LLMs, both in the generation and in the evaluation steps: in generation,
certain topics might be treated differently, whereas in evaluation, the LLM evaluator might favour
responses generated by LLMs.
7. Conclusions and Future Work
In this study, we inject emotions into responses to CQs in the Qulac dataset. We carry out our experiments
using various combinations of prompting schemes and LLMs in order to find the most suitable pair
for the task. We evaluate our results on three metrics: emotion prediction accuracy, naturalness, and
usefulness. Our results indicate that the CO-STAR prompt format, paired with Meta-Llama-3.1-8B-
Instruct, outperforms the other combinations.
Future work could include using more of the data in Qulac to allow for a more in-depth analysis
with all of the fields in the dataset, such as topic type. Using different datasets in the pipeline could also
improve variety and enhance our results. Finally, including a larger range of emotions with different
intensity levels would yield more realistic results.