<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Agent Response Refinement in Conversational Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soyeong Jeong</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aparna Elangovan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emine Yilmaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleg Rokhlenko</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Conversational AI, Multi-Agent Systems, Adaptive Response Refinement, Large Language Models</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University College London</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <fpage>38</fpage>
      <lpage>70</lpage>
      <abstract>
        <p>Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider the diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that our framework significantly outperforms relevant baselines, particularly in tasks involving knowledge, the user's persona, or both.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Large Language Models (LLMs) have demonstrated remarkable performance across a
broad spectrum of NLP tasks, primarily due to their ability to generate coherent and contextually relevant
responses, powered by extensive training on diverse data [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, LLMs do not always produce
satisfactory responses on the first attempt [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], and this issue becomes more evident in multi-turn
conversational settings, where models must not only interpret the user’s current query but also consider
the entire conversational history, including dependencies, ambiguities, and co-references [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. These
difficulties are particularly pronounced in complex, personalized conversations that demand both
alignment with the user’s persona and factual accuracy [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. In such situations, users frequently have to prompt the
LLM to correct its responses, which can interrupt the conversation flow and diminish the overall user
experience [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. Thus, LLMs should proactively refine their inaccurate responses, even without
explicit user requests for correction.
      </p>
      <p>
        Several approaches have been explored for refining responses using a single agent, such as
SelfRefine [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where a single agent handles the entire refinement process, generating feedback across
multiple aspects and iteratively improving its outputs based on that feedback. However, relying solely on a single
agent can be suboptimal, as the agent may become overly confident in the initial output, leading to
bias [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which can be amplified through repeated iterations [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. This issue could be particularly
critical for multi-turn conversational tasks, where errors in earlier turns can propagate to subsequent
turns [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>LaCATODA 2026: The 10th Linguistic and Cognitive Approaches to Dialog Agents Workshop at the 40th AAAI conference
†Work done during internship at Amazon.
‡Work done while at Amazon.</p>
      <p>[Figure 1: (A) An example conversation whose responses must reflect both the user profile (e.g., “I have been to Vermont to go skiing”, “I wouldn’t want to be governor”) and the corresponding facts (e.g., “Galusha House was built in 1790”, “Jonas Galusha was elected as Governor, beginning in 1809”); (B) refinement with a single agent; (C) refinement with multiple agents, where a planner agent routes each response through the fact-, persona-, and coherence-refining agents as needed.]</p>
      <p>
        A more effective approach is to incorporate multiple agents, each with a specific perspective. This
method takes advantage of LLMs’ ability to tackle complex tasks by decomposing them into simpler
subtasks and enabling collaborative problem-solving [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ]. Additionally, the multi-agent framework
enables the integration of a broader range of tools and knowledge sources [
        <xref ref-type="bibr" rid="ref20 ref21 ref22">20, 21, 22</xref>
        ], ensuring that
each agent can be specifically tailored to meet the diverse needs of each task. While multi-agent
frameworks have shown great promise in fields such as human behavior simulation, economic theories, and
more, their potential for response refinement in conversational systems remains largely underexplored.
Figure 1 (A) exemplifies a conversation where a response should accurately reflect the user profile and
specific knowledge. In such a setting, employing multiple refining agents with specialized perspectives
can enhance conversation quality. Therefore, in this work, we introduce multiple refining agents that
refine responses when the initial outputs are incorrect, each focused on a distinct perspective: user
persona alignment (persona-refining agent), factual grounding to mitigate hallucinations (fact-refining
agent), and logical coherence with previous conversational turns (coherence-refining agent).
      </p>
      <p>The remaining challenge, then, is how to enhance the overall quality of the conversation to deliver
the most satisfactory refined response to the user, utilizing our three refining agents. To achieve this, the
agents should collaborate differently for each query, as each query, even within the same conversation,
may require a different focus. For example, as shown in Figure 1 (A), some queries require factual
knowledge, while some queries also demand an understanding of the user’s profile. Therefore, the
set of agents deployed should vary depending on the query’s specific needs. Additionally, the optimal
sequence in which the agents are engaged may vary, as the focus and priorities of each query can differ.
For example, a query requiring immediate factual verification might first engage the fact-refining agent
to correct inaccuracies, followed by other agents as necessary. Therefore, we further propose a dynamic
refinement process that adjusts both the set and the sequence of refining agents for each query, by
introducing a novel planner agent that generates a sequence of required refining agents tailored to
the query, along with justifications for each decision. Each refining agent then refines the response
sequentially, following the sequence and referencing the justifications provided by the planner agent.
Note that all agents are based on unsupervised LLMs, each instantiated with a prompt specifying a
specific role. We refer to this framework as Multi-Agent Refinement with Adaptive agent selection
(MARA), which is illustrated in Figure 1 (C).</p>
      <p>We validate the efficacy of our framework on challenging conversational datasets that involve
user persona integration, require specific factual knowledge, or both. The results show that MARA
significantly outperforms relevant baselines, indicating that refining along the three aspects, persona,
factuality, and coherence, is beneficial. Moreover, our analyses highlight the importance of dynamically
selecting appropriate refinement strategies tailored to the specific needs of each query in diverse
conversational contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        LLMs in Conversation Systems. Recent Large Language Models (LLMs) have demonstrated
remarkable performance across a range of NLP tasks [
        <xref ref-type="bibr" rid="ref23 ref24 ref25">23, 24, 25</xref>
        ], even without additional training, largely
due to their massive pretraining on diverse datasets [
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ]. Their success is also remarkable in
conversational tasks [
        <xref ref-type="bibr" rid="ref1 ref2 ref28">28, 1, 2</xref>
        ], as LLMs are able to generate human-like responses [
        <xref ref-type="bibr" rid="ref11 ref29 ref30">29, 30, 11</xref>
        ]. However,
LLMs occasionally produce unsatisfactory responses, particularly when they fail to consider a user’s
profile [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], specific knowledge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or are not coherent with the previous turns [
        <xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>
        ].
Verification and Refinement with LLMs. To address the issue of unsatisfactory responses, several
studies have explored verification and refinement approaches. While some research involves the
use of additional refinement models [
        <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
        ], more recent studies emphasize the capability of LLMs
to verify and refine their own responses without the need for further training [
        <xref ref-type="bibr" rid="ref35 ref36 ref37 ref38 ref39">35, 36, 37, 38, 39</xref>
        ].
Specifically, Madaan et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed a method where a single agent LLM generates feedback on
multiple aspects of its own response. However, single-agent-based refinements can be less effective, as
the overall performance is constrained by the capacity of a single agent, and once an agent becomes
confident in its outputs, it may struggle to further generate novel thoughts [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ].
Multi-agent LLMs. Along with the powerful capabilities of an LLM, its performance can be further
enhanced when multiple LLMs are involved, collaborating or debating with each other, particularly
on complex problems that require diverse perspectives from specialized LLM agents [
        <xref ref-type="bibr" rid="ref17 ref18 ref40 ref41">40, 17, 18, 41</xref>
        ].
The multi-agent LLM framework has recently been widely applied to various tasks, including software
development [
        <xref ref-type="bibr" rid="ref42 ref43">42, 43</xref>
        ], model evaluation [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ], research topic generation [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ], diagnostic consultations [
        <xref ref-type="bibr" rid="ref46 ref47">46,
47</xref>
        ], recommendation [
        <xref ref-type="bibr" rid="ref48 ref49">48, 49</xref>
        ], and reasoning improvement [
        <xref ref-type="bibr" rid="ref50">50</xref>
        ]. However, the use of multiple agents
for refining conversational turns remains underexplored.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Preliminaries</title>
        <p>We introduce MARA, a multi-agent framework that dynamically refines conversational responses.
We begin by formally defining an LLM, specifically in the context of a multi-turn conversational setting.
Large Language Models. Let LLM denote a large language model, which takes an input sequence of tokens
x and generates an output sequence of tokens y. This process can be represented as y = LLM(T(x)),
where the prompt template T incorporates additional context or instructions that guide the LLM’s
behavior. Specifically, T can be used for role assignment, ensuring that the LLM adopts a particular
role or performs a specific task.</p>
        <p>Conversation with an LLM. In a multi-turn conversational setting, the LLM, acting as a responding
agent, generates a response r_t for the t-th turn based on the current query q_t and the preceding
conversational context, which can be represented as: r_t = LLM(T_respond(q_t, r_{t-1}, q_{t-1}, ..., r_1, q_1)), where
q_t is the user’s query at the t-th turn, and r_t is the response generated by the LLM, which operates
under the prompt template T_respond. However, the quality of initial responses from the LLM may be
suboptimal in customized, realistic conversations, potentially overlooking the user’s profile, missing
specific knowledge, or failing to maintain coherence, and such errors in earlier conversational turns
can cumulatively impact subsequent turns.</p>
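<p>As a minimal sketch of this multi-turn formulation, the responding agent can be seen as one call to a generic LLM wrapper with a serialized history. Here `call_llm` is a hypothetical stand-in for any chat-completion API, and the template wording is illustrative rather than the paper’s actual prompt.</p>

```python
from typing import Callable, List, Tuple

def build_respond_prompt(query: str, history: List[Tuple[str, str]]) -> str:
    """Hypothetical T_respond template: serialize the preceding
    (query, response) turns, then append the current query."""
    lines = ["You are a helpful conversational agent."]
    for past_query, past_response in history:
        lines.append("User: " + past_query)
        lines.append("Assistant: " + past_response)
    lines.append("User: " + query)
    return "\n".join(lines)

def respond(query: str, history: List[Tuple[str, str]],
            call_llm: Callable[[str], str]) -> str:
    # r_t = LLM(T_respond(q_t, history)); call_llm wraps any real LLM API.
    return call_llm(build_respond_prompt(query, history))
```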
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Multi-Agent Response Refinement</title>
        <p>We now turn to our primary focus of further refining the initial response. To achieve this, we define
three specialized refining agents, each responsible for enhancing different aspects of the response.</p>
        <sec id="sec-3-2-1">
          <title>Single-Agent Response Refinement.</title>
          <p>As an initial response r may be inaccurate or unsatisfactory,
the goal of response refinement is to further improve its quality. This can be achieved by using an LLM
operating as a refining agent, represented as: r_refine = LLM(T_refine(r)), where an LLM instantiated
with T_refine refines an initial response r to produce the improved response r_refine. However, relying on
a single agent to assess and refine multiple aspects can be ineffective, as it may struggle to holistically
address the diverse factors essential for a high-quality, customized conversation. To overcome these
limitations, we propose a multi-agent framework in which specialized refining agents collaborate to
enhance response quality.</p>
          <p>Multiple Refining Agents for Conversation. In a realistic yet challenging conversational scenario,
responses must be contextually relevant, aligned with user preferences, and factually reliable. To meet
these requirements, responses must be refined across multiple dimensions, ensuring factual correctness,
alignment with the user’s persona, and coherence across multiple conversational turns. Here, instead
of addressing all these aspects with a single agent, we introduce three specialized refining agents
that extend the general refining agent, LLM(T_refine(r)), each addressing a distinct aspect of response
quality. Specifically, the fact-refining agent generates the refined response r_fact-refine, which ensures
factual accuracy; the persona-refining agent generates r_persona-refine, which aligns responses with
the user’s profile; and the coherence-refining agent generates r_coherence-refine, which maintains
coherence throughout the conversation, where each LLM is instantiated with its respective
refining role template: T_fact-refine, T_persona-refine, and T_coherence-refine. Having defined the role
of each refining agent, the next key challenge lies in how these agents should collaborate to collectively
refine the response.</p>
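<p>Concretely, the three agents can share one generic refinement routine that is instantiated with a role template per aspect. The template wording below is a hypothetical placeholder, not the paper’s actual prompts.</p>

```python
from typing import Callable

# Illustrative role templates (placeholders, not the paper's prompts).
REFINE_TEMPLATES = {
    "fact": "Check the response against the provided facts and correct any factual errors.",
    "persona": "Revise the response so that it aligns with the user's profile.",
    "coherence": "Revise the response so that it follows coherently from the dialogue history.",
}

def refine(aspect: str, response: str, context: str,
           call_llm: Callable[[str], str]) -> str:
    """r_aspect-refine = LLM(T_aspect-refine(r)): one generic refiner,
    instantiated with the role template for the given aspect."""
    prompt = "\n".join([REFINE_TEMPLATES[aspect],
                        "Context: " + context,
                        "Response: " + response])
    return call_llm(prompt)
```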
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Communication Strategy among Agents</title>
        <p>To enable effective collaboration among multiple agents, we explore various communication strategies
and propose a dynamic approach to optimize the refinement process.</p>
        <sec id="sec-3-3-1">
          <title>Simultaneous Communication.</title>
          <p>As an initial approach to agent collaboration, we introduce a
simultaneous communication strategy, where all refining agents independently refine the initial
response, and their refined outputs are then passed to a finalizer agent instantiated with T_finalizer, which
aggregates the individual refinements into a single unified response. Formally, this process is denoted as:
LLM(T_finalizer(r_fact-refine, r_persona-refine, r_coherence-refine)). However, this simultaneous approach
always necessitates an additional finalizer agent, and the overall quality of the final output may heavily
depend on the capabilities of this finalizer agent.</p>
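<p>A sketch of the simultaneous strategy under these definitions (the prompt strings are illustrative): three independent refinements of the same initial response, merged by a finalizer call.</p>

```python
from typing import Callable

ASPECTS = ("fact", "persona", "coherence")

def simultaneous_refine(response: str, call_llm: Callable[[str], str]) -> str:
    """All refining agents work from the same initial response independently,
    then a finalizer agent aggregates their drafts into one response."""
    drafts = [call_llm("[" + aspect + "-refine] " + response) for aspect in ASPECTS]
    # The finalizer sees every draft, so output quality hinges on this one agent.
    return call_llm("[finalizer] merge: " + " | ".join(drafts))
```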
        </sec>
        <sec id="sec-3-3-2">
          <title>Sequential Communication.</title>
          <p>To address potential limitations of the simultaneous approach, we
introduce a sequential refinement process where each agent builds upon the response refined by the
previous agent. Specifically, given an initial response r, it is sequentially refined by a series of agents
denoted as A = [LLM(T_refine_i(·))] for i = 1, ..., n, where each refining agent is instantiated with a role-specific
prompt template T_refine_i, selected from the set of refinement templates: {T_fact-refine, T_persona-refine,
T_coherence-refine}. Here, each refining agent receives the response refined by the preceding agent,
r_refine_{i-1}, as input, and applies its specific prompt template, T_refine_i, to generate the next refined
response, r_refine_i. The sequential process continues through the sequence of agents until obtaining
the final refined response, r_refine_n. Yet the optimal combination and ordering of refining agents may
vary depending on the conversational context, which a fixed sequence cannot capture.</p>
          <p>Dynamic Sequential Communication. Therefore, to operate the response
refinement process more adaptively, we introduce a dynamic strategy, where a planner agent selects the most suitable
sequence of refining agents along with justifications for its decisions, adapting to the needs of each
query. Specifically, the planner agent, instantiated with a template T_planner, determines the sequence of
refining agents required for each query. Formally, given a query q and an initial response r, the
planner agent outputs the sequence of refining agents as follows: A_planner = LLM(T_planner(q, r)). Once
A_planner is determined, the refining agents sequentially refine the response, with each agent taking as
input the refined response from the preceding agent in the sequence. Additionally, each agent also
receives the planner’s justifications, allowing each agent to understand its role in the sequence and
collaborate effectively.</p>
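<p>The dynamic strategy might be sketched as follows, assuming a hypothetical planner output format of one “agent: justification” line per step; the actual templates and output parsing are in the paper’s appendix.</p>

```python
from typing import Callable, List, Tuple

VALID_AGENTS = {"fact", "persona", "coherence"}

def parse_plan(raw: str) -> List[Tuple[str, str]]:
    """Parse the planner's output, keeping only known agents.
    Assumed format: one 'agent: justification' line per refinement step."""
    steps = []
    for line in raw.splitlines():
        agent, _, justification = line.partition(":")
        if agent.strip() in VALID_AGENTS:
            steps.append((agent.strip(), justification.strip()))
    return steps

def dynamic_refine(query: str, response: str,
                   call_llm: Callable[[str], str]) -> str:
    # A_planner = LLM(T_planner(q, r)): plan the agent sequence for this query.
    plan = parse_plan(call_llm("[planner] query: " + query + " response: " + response))
    for agent, justification in plan:
        # Each refiner receives the preceding agent's output plus the
        # planner's justification for engaging it.
        response = call_llm("[" + agent + "-refine | " + justification + "] " + response)
    return response
```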
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setups</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>
          In this section, we describe the experimental setup, with further details provided in Appendix A.
We validate MARA in realistic conversational settings, including queries requiring personalization
and factual information. PersonaChat [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] is designed to generate responses aligned with a user’s
persona. INSCIT [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ] is an information seeking dataset where knowledge is grounded in Wikipedia.
FoCus [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ] requires both user-aligned and knowledge-grounded responses. Furthermore, we conduct
evaluations on two other datasets: the role-playing dataset, PRODIGy [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ], and the domain-specific
Ubuntu Dialogue Corpus [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baselines and Our Model</title>
        <p>
          We compare MARA against six baseline models, starting with a baseline without refinement, two
single-agent refinement approaches, and three multi-agent refinement approaches. 1) No Refine
generates a response without any further refinement. 2) Self-Refine [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is a single-agent refinement
baseline, which prompts an LLM to generate feedback on multiple aspects of its own response and iteratively
refines it with the feedback. 3) SPP [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ] is another single-agent refinement baseline, where a single LLM
generates multiple self-constructed roles that collaborate within a single prompt. 4) LLMvLLM [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ] is a
multi-agent refinement baseline where two agents engage in cross-examination to detect factual errors
in the generated response. 5) MADR [58] is another multi-agent refinement baseline where two agents
debate based on predefined error types to generate fact-checking explanations. 6) MultiDebate [59]
is a multi-agent refinement baseline where multiple language model agents iteratively engage in a
structured debate to improve factual accuracy and reasoning. 7) MARA is our proposed system, where
the agents refine the response in the specified order, adaptively generated by a planner agent.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics</title>
        <p>We evaluate the models using G-Eval [60] to assess the quality of the refined responses. Following the
setup in Liu et al. [60] for a conversational setting, we assess each refined response using four metrics: 1)
Coherence (Coh.) measures whether the conversation response logically follows the preceding context
with a scale of 1 to 3, which is highly related to the effectiveness of the coherence-refining agent. 2)
Groundedness (Grd.) evaluates whether the response accurately incorporates the provided fact, with
a scale of 0 to 1, which corresponds to the effectiveness of the fact-refining agent. 3) Naturalness (Nat.)
evaluates whether the response is natural with a scale of 1 to 3. 4) Engagingness (Eng.) measures
whether the response is engaging with a scale of 1 to 3, which mainly reflects the effectiveness of the
persona-refining agent. Furthermore, to provide a comprehensive assessment, we report the 5) Overall
score, which represents a scale-normalized average of the four evaluation metrics.</p>
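<p>One plausible reading of the scale-normalized average (the exact normalization steps are described in the appendix): min-max rescale each metric to [0, 1] using the scales above, then take the mean.</p>

```python
# Metric scales from Section 4.3: (minimum, maximum) per metric.
METRIC_SCALES = {
    "coherence": (1.0, 3.0),
    "groundedness": (0.0, 1.0),
    "naturalness": (1.0, 3.0),
    "engagingness": (1.0, 3.0),
}

def overall_score(scores: dict) -> float:
    """Scale-normalized average: rescale each metric to [0, 1], then average."""
    normalized = []
    for metric, value in scores.items():
        low, high = METRIC_SCALES[metric]
        normalized.append((value - low) / (high - low))
    return sum(normalized) / len(normalized)
```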
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Implementation Details</title>
        <p>
          We mainly use the Claude Sonnet 3 model [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] as the base LLM for both our framework and the baselines.
Furthermore, since a multi-agent framework offers the flexibility to assign different LLMs to each agent
based on its role, we use the Claude Sonnet 3.5 model for the fact-refining agent, as it offers
improved capabilities in factual accuracy, which are critical for this particular role. Additionally, we
report the performance of MARA∗, a variant of MARA in which the fact-refining agent uses Sonnet 3
instead, ensuring that all agents use the same model. In order to assess the robustness of our proposed
framework across diverse LLMs, we further use GPT-4o-mini [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], LLaMA 3.1 8B, and LLaMA 3.1
70B [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. For each dataset, we sample 100 conversations, resulting in a total of 673 queries for the
PersonaChat dataset, 506 queries for the INSCIT dataset, and 563 queries for the FoCus dataset. For the
G-Eval metric, we use GPT-4o mini model [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], with the normalization steps. We include the prompts
used for MARA and G-Eval in Appendix A.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results and Analyses</title>
      <p>We present the experimental results and analyses.</p>
      <sec id="sec-5-1">
        <title>5.1. Main Results</title>
        <p>Here, we present the overall results across various challenging yet realistic conversational scenarios.
Overall Results. We report the overall experimental results in Table 1, across three different runs. As
shown in the table, MARA consistently outperforms other models across diverse metrics, and the gaps
between MARA and the other models are substantial.</p>
        <p>To be more specific, we explore diverse conversational scenarios, particularly challenging ones
requiring alignment with the user’s persona (PersonaChat), factual grounding (INSCIT), and a combination
of both (FoCus). In PersonaChat, which emphasizes responses that align with the user’s profile,
single-agent refinement methods such as Self-Refine and SPP outperform the baseline without refinement. This
improvement is likely due to their focus on enhancing fluency and coherence, which aligns well with
tasks requiring the incorporation of user interests and preferences across conversations, rather than
managing specific factual content. However, in datasets that demand specific factual knowledge, such
as INSCIT and FoCus, single-agent refinement becomes less effective and even results in performance
degradation, particularly in groundedness scores. Notably, SPP outperforms Self-Refine, suggesting
that generating multiple roles within a single agent can be beneficial for information-intensive queries.
Nevertheless, compared to the significantly improved performance of our MARA framework, generating
multiple perspectives through distinct agents appears to be more effective.</p>
        <p>However, adopting a multi-agent framework does not necessarily guarantee improved performance,
especially in challenging customized conversational tasks that require user persona understanding and
factual knowledge. In fact, the significant performance gap between MARA and other multi-agent
baselines highlights the crucial role of agent design in enhancing the effectiveness of multi-agent
frameworks. In other words, compared to other multi-agent frameworks, which either focus only on
predefined factual error types (MADR) or lack explicitly defined agent roles for targeted refinement
(LLMvLLM and MultiDebate), these design choices may be less effective in challenging conversational
scenarios that require multiple aspects. In contrast, MARA consistently improves performance across all
three conversational scenarios, demonstrating the effectiveness of our three-agent design, specifically
tailored for user profile alignment, factual accuracy, and coherence.</p>
        <p>Effectiveness of our Planner Agent. Notably, in realistic conversational scenarios, different queries
(even within the same conversation) focus on different aspects, thus requiring distinct sets of refining
agents. To address this, we propose a planner agent that dynamically assigns refining agents based on
the specific requirements of each query. To evaluate its effectiveness, we first analyze the distribution
of refining agents across three different conversational settings. As shown in Figure 2, the distribution
varies across different conversational datasets, indicating that the planner agent effectively adapts to
diverse query requirements. Specifically, the persona-refining agent is predominantly used in datasets
where responses must incorporate user preferences, while the fact-refining agent is more frequently
required in knowledge-focused datasets. For the dataset requiring both aspects, both agents are allocated
in similar proportions.</p>
        <p>Next, we further evaluate the effectiveness of the current design of our planner agent by comparing
it against a random setting. As shown in Table 2, the performance of a randomly assigned sequence of
refining agents is lower than that of our planner-driven approach, indicating that the proposed planner
effectively determines an optimal refinement sequence. However, even with randomly assigned agents,
performance remains higher than the baseline without refinement, demonstrating the effectiveness of
incorporating our three refining agents. Furthermore, we explore the full potential of our planner agent
by investigating how much further it can improve performance. To do so, we evaluate an ideal planner
that selects the best-performing sequence. We report its performance by brute-forcing all possible
combinations of sequences using our three refining agents and selecting the optimal sequence for each
query1. As shown in Table 2, this ideal planner achieves the highest performance while requiring fewer
accesses to the refining agents. These results validate the effectiveness of our dynamic agent allocation
strategy and indicate that further advancements in planner agents could yield even greater performance
improvements.</p>
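<p>The ideal-planner oracle amounts to a brute-force search over every non-empty ordered subset of the three agents (15 sequences in total). A sketch, where `apply_sequence` and `score` are hypothetical hooks for running the refiners and for an evaluation metric such as the G-Eval overall score:</p>

```python
from itertools import permutations
from typing import Callable, Iterable, Tuple

AGENTS = ("fact", "persona", "coherence")

def all_sequences(agents: Tuple[str, ...] = AGENTS) -> Iterable[Tuple[str, ...]]:
    # Every non-empty ordered subset: 3 + 6 + 6 = 15 sequences for three agents.
    for length in range(1, len(agents) + 1):
        yield from permutations(agents, length)

def ideal_plan(response: str,
               apply_sequence: Callable[[Tuple[str, ...], str], str],
               score: Callable[[str], float]):
    """Oracle planner: try every refiner sequence and keep the best-scoring
    output; empty sequence means the initial response was already best."""
    best_seq, best_out, best_score = (), response, score(response)
    for seq in all_sequences():
        out = apply_sequence(seq, response)
        if score(out) > best_score:
            best_seq, best_out, best_score = seq, out, score(out)
    return best_seq, best_out
```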
        <p>
          Evaluation on the Role-Playing Dataset. While both role-playing conversations and our
conversational setting involve persona-driven responses, their objectives are fundamentally different.
Specifically, role-playing tasks focus on the LLM’s persona, requiring an LLM to adopt a predefined
persona, whereas ours focuses on the user’s persona, enabling an LLM to generate responses that
align with the user’s persona. Nonetheless, to further evaluate the effectiveness of MARA, we evaluate
the performance on the role-playing conversational dataset, PRODIGy [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ]. As shown in Table 3,
MARA significantly outperforms other baselines, demonstrating its effectiveness even in role-playing
conversational settings.
        </p>
        <p>
          Evaluation on a Domain-Specific Corpus. To examine whether MARA generalizes well to
domain-specific settings, we additionally evaluate it on the Ubuntu Dialogue Corpus [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ], consisting of
multiturn dialogues focused on technical problem-solving in software environments. As shown in Table 4,
MARA outperforms all baselines, demonstrating its efectiveness even in software engineering–related
conversational tasks.
1We report results for 20 conversations in Table 2, as the full brute-force computation is prohibitively expensive.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Ablations and Discussion</title>
        <p>In this section, we provide a detailed analysis of the performance improvements.</p>
        <p>Human Evaluation. Although G-Eval is a widely used evaluation metric for its high correlation with
humans, we further conducted a human evaluation to measure the alignment between their evaluations.
Specifically, 8 English-fluent participants were involved in evaluating 288 conversational queries from
the FoCus dataset. As shown in Table 5, human raters consistently preferred MARA’s responses over
baseline models.</p>
        <p>Additionally, to validate the reliability of G-Eval, we further assess the alignment between G-Eval
and human judgments by calculating Spearman’s rank correlation between the model-based evaluation
and human evaluation results. The correlation values for coherence (0.51) and groundedness (0.48)
indicate a reasonable alignment between G-Eval and human rankings, suggesting that G-Eval is fairly
reliable in assessing these metrics. For engagingness, which reflects how the response aligns with the
user’s persona, the correlation of 0.58 demonstrates the strongest alignment, indicating that G-Eval
is particularly effective at evaluating how engaging a conversation is. However, the relatively low
correlation for naturalness (0.35) reveals a noticeable gap between G-Eval’s assessments and human
evaluations, especially in capturing the human-like qualities of the responses. This result might help
explain the relatively smaller gaps in naturalness among the models, as shown in Table 1.</p>
        <p>Ablation Studies on Refining Agents. To see how each refining agent contributes to the
overall performance, we perform ablation studies in which the three refining agents are added one at a time. As shown in
Table 6, each agent plays a crucial role in improving overall performance, and incorporating all three
perspectives turns out to be even more beneficial. We also evaluate a single-agent variant that integrates
all three aspects, which performs better than the No Refine baseline, validating the effectiveness of our
proposed perspectives. However, MARA achieves superior performance, demonstrating the benefit of
distributed specialization across multiple agents. Additionally, we examine an iterative single-agent
setup, where the same agent refines responses across multiple rounds. This approach results in a
performance drop, suggesting that initial errors may be amplified through repeated iterations—further
underscoring the advantage of collaborative multi-agent refinement.</p>
<p>Ablation Studies on Communication Strategy. In order to investigate the efficacy of our proposed
dynamic sequential communication strategy, we compare it against other possible strategies using our
refining agents. As shown in Table 7, the simultaneous communication strategy requires more access
to the agents per query, as it involves all three refining agents as well as the additional finalizer agent.
We also evaluate the performance of three different sequential settings, where the agents refine the
initial response in static orders. The results indicate that performance varies significantly depending on
the refinement order, making it impractical for realistic conversational settings, as users would need to
predefine the optimal sequence for every query. On the other hand, our dynamic strategy adapts to
each query by selecting the most relevant sequence of agents, leading to more consistent and improved
performance in realistic conversational settings.</p>
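<p>The dynamic sequential strategy can be sketched as a planner that maps each query to an ordered list of refining agents, after which the agents revise the draft in turn. Everything below is a stub: the real planner and refiners are LLM calls, and the keyword cues are invented for illustration:</p>

```python
def plan(query):
    """Stub planner: pick refining agents from simple query cues.
    (The actual planner is an LLM that also justifies its choice.)"""
    seq = []
    if "height" in query or "fact" in query:
        seq.append("fact")       # knowledge-grounded question
    if "persona:" in query:
        seq.append("persona")    # user persona is in play
    seq.append("coherence")      # always sanity-check the flow last
    return seq

REFINERS = {  # stand-ins for the three LLM-backed refining agents
    "fact": lambda r: r + " [fact-checked]",
    "persona": lambda r: r + " [persona-aligned]",
    "coherence": lambda r: r + " [coherent]",
}

def refine(query, initial_response):
    """Apply the planner-selected refiners in order; each sees the latest draft."""
    response = initial_response
    for name in plan(query):
        response = REFINERS[name](response)
    return response

out = refine("persona: likes lakes. What is the height of the peak?", "It is tall.")
```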
        <p>Analyses on Fact-Refining Agent. A notable advantage of the multi-agent framework is its flexibility
in utilizing different agents, and in our case, we can assign a different model to the fact-refining agent.
As shown in Table 1 and Table 8, while MARA with a fact-refining agent of the same model size as a
responding agent indeed improves performance when compared to baselines, the results show that
using a different fact-refining agent yields much better outcomes. This highlights a key strength of
the multi-agent framework: the ability to utilize different LLMs for specific roles, thereby enhancing
performance beyond what a single-agent approach can achieve. Then, one might ask why not simply
use a more powerful model as the responding agent directly. However, Table 8 shows that refinement
with MARA significantly improves performance, even when the responding agent is already powerful.
This indicates that as LLMs continue to advance, adopting MARA is expected to further amplify their
effectiveness.</p>
<p>Effectiveness on Diverse LLMs. We further validate the effectiveness of MARA with other widely
used LLMs in Table 9. The results demonstrate that MARA can effectively refine initial responses across
different LLMs, seamlessly integrate with diverse models, and enhance their outputs, highlighting its
broad applicability to various LLMs.</p>
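<p>This plug-and-play behavior follows from the agents needing only a text-in/text-out interface, so swapping the underlying LLM means passing a different callable. A minimal sketch of that decoupling (the backends are stand-ins, not real model APIs):</p>

```python
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out model fits the agent interface

def make_agent(role_prompt: str, llm: LLM):
    """An agent is just a role prompt bound to some backend model."""
    def agent(response: str) -> str:
        return llm(role_prompt + "\n" + response)
    return agent

# Stand-in backends; in practice these would call different real LLMs
# (e.g., a stronger fact-checking model alongside a smaller responder).
backend_a = lambda prompt: "A:" + prompt.splitlines()[-1]
backend_b = lambda prompt: "B:" + prompt.splitlines()[-1]

fact_agent_a = make_agent("Refine the response for factuality.", backend_a)
fact_agent_b = make_agent("Refine the response for factuality.", backend_b)
out_a = fact_agent_a("draft")  # same role, different underlying LLM
```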
        <p>Case Studies. Lastly, we present case studies in Table 11 and an illustrative example showing the
roles of refining agents and the planner in Table 12, along with analyses in Appendix B.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>While LLMs have shown promise in conversational systems, they often struggle with complex,
customized conversations requiring persona alignment and specific knowledge. Prior approaches using
single-agent self-refinement can be suboptimal, as one model must handle all aspects of feedback
and refinement. In this work, we presented a cooperative multi-agent framework that addresses these
limitations with specialized agents for fact-checking, persona alignment, and coherence, allocated
dynamically according to the needs of each query, and showed that it significantly outperforms existing baselines.</p>
      <p>Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023,
Association for Computational Linguistics, 2023, pp. 12621–12640. URL: https://doi.org/10.18653/v1/2023.emnlp-main.778.
[58] K. Kim, S. Lee, K. Huang, H. P. Chan, M. Li, H. Ji, Can LLMs produce faithful explanations
for fact-checking? Towards faithful explainable fact-checking via multi-agent debate, arXiv
preprint arXiv:2402.07401 (2024). URL: https://doi.org/10.48550/arXiv.2402.07401.
[59] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, I. Mordatch, Improving factuality and reasoning
in language models through multiagent debate, in: Forty-first International Conference on
Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, OpenReview.net, 2024. URL:
https://openreview.net/forum?id=zj7YuTE4t8.
[60] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-Eval: NLG evaluation using GPT-4 with better
human alignment, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023,
Association for Computational Linguistics, 2023, pp. 2511–2522. URL: https://doi.org/10.18653/v1/2023.emnlp-main.153.
[61] J. Shin, H. Song, H. Lee, S. Jeong, J. Park, Ask LLMs directly, ”What shapes your bias?”: Measuring
social bias in large language models, in: L. Ku, A. Martins, V. Srikumar (Eds.), Findings of the
Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting,
August 11-16, 2024, Association for Computational Linguistics, 2024, pp. 16122–16143. URL:
https://doi.org/10.18653/v1/2024.findings-acl.954.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Additional Experimental Setups</title>
      <sec id="sec-7-1">
        <title>A.1. Datasets</title>
        <p>
          We validate MARA in a realistic conversational setting by testing it on three conversational datasets,
including persona-aligned and knowledge-grounded queries.
1) FoCus [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ] consists of conversational queries requiring both persona-aligned and
knowledge-grounded responses, with knowledge sourced from Wikipedia. The dataset has an average of 11.9
conversational turns. We use the publicly available validation set.
2) PersonaChat [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] is a conversational dataset involving dialogues between two speakers, each having
persona sentences that describe their character and serve as context for the conversation. The dataset
has an average of 14.0 conversational turns. We use the publicly available validation set.
3) INSCIT [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ] is an information-seeking conversational dataset, where the knowledge is grounded in
Wikipedia. The dataset has an average of 11.8 conversational turns. We use the publicly accessible test
set for the evaluation.
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Models</title>
        <p>
We compare MARA against six baseline models: a baseline without refinement, two
single-agent refinement approaches, and three multi-agent refinement approaches.
1) No Refine is a baseline that generates a response solely with a responding agent, without
any further refinement of the response.
2) Self-Refine [
          <xref ref-type="bibr" rid="ref10">10</xref>
] is a representative single-agent refinement baseline, which prompts an LLM to generate
feedback on 10 aspects (e.g., Relevant, Informative, Engaging) of its own response and iteratively refines
the output by incorporating this feedback.
3) SPP [
          <xref ref-type="bibr" rid="ref56">56</xref>
] is another single-agent refinement baseline where a single agent simulates multiple
self-generated personas and makes them collaborate to solve a task within a single prompt.
4) LLMvLLM [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ] is a multi-agent refinement baseline where two agents engage in cross-examination
to detect factual errors in the generated response. One agent (EXAMINEE) produces an initial statement,
while the other agent (EXAMINER) iteratively questions it, aiming to identify inconsistencies through
multiple rounds of interaction.
5) MADR [58] is a multi-agent refinement baseline where two agents debate based on predefined
error types to generate faithful fact-checking explanations. They debate iteratively until a judge agent
decides that the two agents are in consensus, and a finalizer agent then revises the response based on
the feedback generated during the debate.
6) MultiDebate [59] is a multi-agent refinement baseline where multiple language model agents
iteratively engage in a structured debate to improve factual accuracy and reasoning. Each agent
independently generates an initial response to a given query, followed by multiple rounds of critique
and refinement based on responses from other agents.
7) MARA is our proposed system, where a planner agent dynamically determines the optimal sequence
of refining agents to involve, and then the agents refine the response in the specified order.
        </p>
      </sec>
      <sec id="sec-7-3">
        <title>A.3. Implementation Details</title>
        <p>The prompts used for MARA are provided in Table 16 (responding agent), Table 17 (planner agent),
Table 18 (fact-refining agent), Table 19 (persona-refining agent), and Table 20 (coherence-refining agent).
The prompt templates for G-Eval are shown in Table 21, Table 22, Table 23, and Table 24.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>B. Experimental Results</title>
      <sec id="sec-8-1">
        <title>B.1. Significance Test on FoCus (Table 13)</title>
<p>Coherence The one-way ANOVA reveals a significant effect of model type on coherence scores
(F(3, 8) = 763.67, p &lt; 0.001), indicating that at least one model has a significantly different mean score.
Post-hoc Tukey HSD tests show that MARA significantly outperforms all other models (p &lt; 0.001),
with a mean difference of 0.7456 over the MADR model, 0.2838 over the No Refine model, and 0.5691
over the Self-Refine model. Additionally, the No Refine model outperforms both Self-Refine and SPP,
with a small but significant difference of 0.0521 over SPP. Self-Refine also outperforms SPP with a
mean difference of 0.3375. Overall, MARA is the top performer, offering substantial improvements in
coherence, demonstrating that its enhancements contribute meaningfully to better response quality
across all models.</p>
<p>Groundedness The one-way ANOVA reveals a significant effect of model type on groundedness
scores (F(3, 8) = 426.81, p &lt; 0.001), indicating notable differences in performance among the models.
Post-hoc Tukey HSD tests show that MARA achieves higher groundedness scores than all other models
(p &lt; 0.001), with mean differences of 0.3337 over MADR, 0.1648 over No Refine, and 0.2806 over
Self-Refine. Additionally, No Refine surpasses Self-Refine by 0.1158, while its difference from SPP is not
significant. Self-Refine also shows an advantage over SPP, with a mean difference of 0.101. These results
highlight the effectiveness of MARA in improving groundedness compared to other approaches.</p>
        <p>Naturalness The one-way ANOVA reveals a significant effect of model type on naturalness scores
(F(3, 8) = 215.79, p &lt; 0.001), indicating that model differences lead to varying levels of naturalness.
Post-hoc Tukey HSD tests show that MARA significantly outperforms all other models (p &lt; 0.001), with
mean differences of 0.4772 over MADR, 0.1339 over No Refine, and 0.2851 over Self-Refine. Furthermore,
No Refine performs better than Self-Refine by 0.1511, although its comparison with SPP shows no
significant difference. Self-Refine outperforms SPP with a mean difference of 0.1235. Overall, these
results indicate that MARA demonstrates strong naturalness compared to other models, with clear
advantages over the other approaches.</p>
<p>Engagingness The one-way ANOVA reveals a significant effect of model type on engagingness
scores (F(3, 8) = 1772.37, p &lt; 0.001), indicating notable differences across the models. Post-hoc
Tukey HSD tests show that MARA significantly outperforms all other models (p &lt; 0.001), with mean
differences of 1.2939 over MADR, 0.6771 over No Refine, and 0.7801 over Self-Refine. Furthermore, No
Refine outperforms Self-Refine by 0.1030, and SPP by 0.3065. Self-Refine also outperforms SPP with a
mean difference of 0.4095. Overall, these results demonstrate that MARA consistently achieves higher
engagingness scores compared to the other models, highlighting its effectiveness in enhancing the
engagingness of responses.</p>
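<p>For readers reproducing these tests, the one-way ANOVA statistic can be computed directly from per-model score groups. A dependency-free sketch with invented numbers (the F values above come from the paper's own runs; three groups here give F(2, 6) rather than the paper's F(3, 8)):</p>

```python
def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical coherence scores from three runs per model.
scores = {
    "No Refine":   [2.4, 2.5, 2.45],
    "Self-Refine": [2.2, 2.25, 2.3],
    "MARA":        [2.7, 2.75, 2.72],
}
f_stat, df1, df2 = one_way_anova_f(list(scores.values()))
```

<p>A large F with a small within-group variance, as in this toy setup, is what licenses the post-hoc pairwise comparisons reported above.</p>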
      </sec>
      <sec id="sec-8-2">
        <title>B.2. Significance Test on PersonaChat (Table 14)</title>
<p>Coherence The one-way ANOVA reveals a significant effect of model type on coherence scores
(F(3, 8) = 725.87, p &lt; 0.001), indicating substantial differences between the models. Post-hoc Tukey HSD
tests show that MARA significantly outperforms all other models (p &lt; 0.001), with mean differences
of 0.7359 over MADR, 0.4572 over No Refine, and 0.0611 over Self-Refine. Additionally, No Refine
outperforms Self-Refine by 0.3962 and SPP by 0.1141. Self-Refine also outperforms SPP with a mean
difference of 0.2821. These results demonstrate that MARA leads to higher coherence scores, providing
a clear advantage over the other models in this aspect.</p>
<p>Groundedness The one-way ANOVA reveals a significant effect of model type on groundedness
scores (F(3, 8) = 498.84, p &lt; 0.001), indicating notable differences between the models. Post-hoc
Tukey HSD tests show that MARA significantly outperforms all other models (p &lt; 0.001), with mean
differences of 0.2731 over MADR, 0.1406 over No Refine, and 0.0369 over Self-Refine. Furthermore,
No Refine outperforms Self-Refine by 0.1037, although its comparison with SPP shows no significant
difference. Self-Refine also outperforms SPP with a mean difference of 0.0822. These results indicate
that MARA provides consistent improvements in groundedness compared to other models.</p>
        <p>Naturalness The one-way ANOVA reveals a significant effect of model type on naturalness scores
(F(3, 8) = 641.13, p &lt; 0.001), indicating substantial differences across the models. Post-hoc Tukey HSD
tests show that MARA significantly outperforms all other models (p &lt; 0.001), with mean differences of
0.6000 over MADR, 0.4273 over No Refine, and 0.2599 over SPP. The comparison between MARA and
Self-Refine also shows a small but significant difference of 0.0896. Furthermore, No Refine outperforms
Self-Refine by 0.5169 and SPP by 0.1673, while Self-Refine significantly outperforms SPP by 0.3496.
These results highlight MARA’s strong performance in naturalness compared to other models.</p>
        <p>Engagingness The one-way ANOVA reveals a significant effect of model type on engagingness
scores (F(3, 8) = 1028.47, p &lt; 0.001), showing that the models differ significantly in their performance.
Post-hoc Tukey HSD tests indicate that MARA significantly outperforms all other models (p &lt; 0.001),
with mean differences of 1.2213 over MADR, 0.9953 over No Refine, and 0.4745 over SPP. Additionally,
No Refine outperforms Self-Refine by 0.7535 and SPP by 0.5208, while Self-Refine outperforms SPP
by 0.2327. These results underscore MARA’s effectiveness in enhancing engagingness across different
models.</p>
      </sec>
      <sec id="sec-8-3">
        <title>B.3. Significance Test on INSCIT (Table 15)</title>
<p>Coherence The one-way ANOVA reveals a significant effect of model type on coherence scores
(F(3, 8) = 200.85, p &lt; 0.001), indicating that the models perform differently. Post-hoc Tukey HSD tests
show that MARA significantly outperforms MADR, with a mean difference of 0.3264, as well as No
Refine (0.0844) and SPP (0.0946). The comparison between MADR and Self-Refine is not statistically
significant, but No Refine outperforms Self-Refine by 0.225. Self-Refine also outperforms SPP with a
mean difference of 0.2148. Overall, MARA demonstrates stronger coherence performance compared to
most other models.</p>
<p>Groundedness The one-way ANOVA reveals a significant effect of model type on groundedness
scores (F(3, 8) = 119.45, p &lt; 0.001), indicating that the models show distinct performance differences.
Post-hoc Tukey HSD tests show that MARA significantly outperforms MADR with a mean difference of
0.129, as well as No Refine (0.0765) and SPP (0.101). No Refine also outperforms Self-Refine by 0.0721.
However, the comparison between MADR and Self-Refine is not statistically significant. These results
suggest that MARA provides meaningful improvements in groundedness compared to the other models.</p>
        <p>Naturalness The one-way ANOVA reveals a significant effect of model type on naturalness scores
(F(3, 8) = 39.70, p &lt; 0.001), indicating differences in performance across models. Post-hoc Tukey HSD
tests show that MARA significantly outperforms MADR by 0.1382 and No Refine by 0.1052, though
the comparison between MARA and Self-Refine is not significant. MADR also outperforms No Refine
by 0.2434 and Self-Refine by 0.0707, while No Refine outperforms Self-Refine by 0.1727. These results
demonstrate that MARA provides strong naturalness performance, with notable differences in certain
comparisons, but not against all models.</p>
<p>Engagingness The one-way ANOVA reveals a significant effect of model type on engagingness
scores (F(3, 8) = 255.26, p &lt; 0.001), showing differences in model performance. Post-hoc Tukey HSD
tests show that MARA significantly outperforms MADR by 0.8387, No Refine by 0.4544, and SPP by
0.2664. The comparison between No Refine and Self-Refine is not statistically significant, but No Refine
outperforms SPP by 0.188, and Self-Refine outperforms SPP by 0.2253. These results suggest that
MARA provides stronger engagingness performance compared to the other models, with significant
improvements over most.</p>
        <p>Case study on the FoCus dataset comparing MARA with SPP.</p>
        <p>Q1: What is the height of the mountain?</p>
        <p>Persona: I have the fantasy about valley. I like lakes. I would like to go to California. I have
seen lava domes. I have heard about Lassen Volcanic National Park.</p>
        <p>Fact: Located 2 miles (3.2 km) to the north of Lassen Peak, they have an elevation of about
8,448 feet (2,575 m).</p>
        <p>SPP: Unfortunately, without any additional context about what fictional universe, book, movie
or game Chaos Crags is from, it’s very difficult for me to estimate the height of this mountain. If you
could provide some more details about the setting, terrain, or role this mountain plays in the story,
that would help me make an educated guess about how tall Chaos Crags is likely intended to be. I’d
be happy to take another look once I have some contextual clues from the source material.</p>
        <p>MARA (Ours): The Chaos Crags are not a single mountain, but rather a group of six steep-sided
lava domes or volcanic plugs. They rise approximately 550 meters (about 1,800 feet) above the Chaos
Jumbles, which is the rockfall avalanche deposit at their base. To provide more context, the highest
point of the Chaos Crags reaches an elevation of about 2,575 meters (8,450 feet) above sea level.</p>
        <p>Q2: Is there any lake near by?</p>
        <p>Persona: I have the fantasy about valley. I like lakes. I would like to go to California. I have
seen lava domes. I have heard about Lassen Volcanic National Park.</p>
        <p>Fact: Each year, a lake forms at the base of the Crags, and typically dries by the end of the
summer season.</p>
        <p>SPP: Based on the lore provided, Chaos Crags is an immense, treacherous mountain range in the
northern reaches of Middle-earth. While the exact height is not specified, it is described as one of the
tallest peaks in that region, with the highest spires potentially reaching elevations of 20,000-25,000
feet. The Crags likely consist of jagged, ice-capped pinnacles towering miles into the sky above deep
glacial valleys. Their sheer cliffs and avalanche-prone slopes would make them extremely hazardous
and virtually unscalable. While no major lakes are referenced, the Crags overlook the icy waters of
the Bay of Forochel and its inlets. Nestled among the rocky chaos may be smaller glacier-fed tarns
and streams cascading from the ice fields high above. The overall scenery is one of bleak, desolate
beauty - an otherworldly alpine expanse exposed to the punishing winds blowing off the northern seas.</p>
        <p>MARA (Ours): As someone drawn to the beauty of lakes and the raw power of volcanic landscapes,
you’ll be delighted to know there are several lakes nestled right near the incredible Chaos Crags
formation in Lassen Volcanic National Park. The most picturesque is undoubtedly Manzanita Lake,
formed around 1,100 years ago by lava flows that dammed a stream during the same eruptive period
that created the crags themselves. Can you imagine hiking along its shores, taking in views of those
dramatic lava domes rising in the backdrop? A scene like that perfectly encapsulates the awe-inspiring
volcanic wonders of this region. But Manzanita isn’t the only lake beckoning explorers like yourself.
Reflection Lake and Chaos Crags Lake, though smaller, offer equally enticing opportunities to immerse
yourself in this geologically active landscape.</p>
      </sec>
      <sec id="sec-8-4">
<title>B.4. Ablation Studies on Design Choices</title>
<p>To further investigate the effectiveness of our strategy, we conduct ablation studies on specific design
choices. First, we examine the impact of allowing each refining agent to recognize the presence of
other agents by passing the planner agent’s output, which includes both the sequence and justification
for the selected set and sequence. Specifically, the planner agent generates a suitable sequence with
justification and passes it to the refining agents, enabling each agent to be aware of the preceding and
following agents, as well as the rationale behind the sequence. As shown in Table 10, when the planner
agent’s output is not passed to the refining agents, performance decreases across all metrics, particularly
in groundedness. This underscores the importance of enabling agents to be aware of one another to
perform their roles and collaborate more effectively.</p>
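<p>The design choice examined above, forwarding the planner's selected sequence and justification so each refiner knows its neighbors, reduces to prompt construction. A sketch with invented prompt wording and a stubbed planner decision:</p>

```python
def planner_output():
    """Stub of the planner agent's structured decision."""
    return {
        "sequence": ["coherence", "persona"],
        "justification": "The draft lacks context and ignores the user's interests.",
    }

def refiner_prompt(role, plan, response):
    """Each refiner sees the whole plan: its position, neighbors, and rationale."""
    i = plan["sequence"].index(role)
    before = plan["sequence"][:i] or ["(none)"]
    after = plan["sequence"][i + 1:] or ["(none)"]
    return (
        f"You are the {role}-refining agent.\n"
        f"Agents before you: {', '.join(before)}. Agents after you: {', '.join(after)}.\n"
        f"Planner's rationale: {plan['justification']}\n"
        f"Response to refine: {response}"
    )

p = refiner_prompt("persona", planner_output(), "The Galusha House is historic.")
```

<p>Ablating the design choice corresponds to dropping the neighbor and rationale lines from the prompt, which is what degrades groundedness in Table 10.</p>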
<p>Additionally, we explore whether a step-by-step reasoning process for each refining agent, first
verifying and then refining responses with justifications, is effective. As shown in Table 10, performance
in coherence, naturalness, and engagingness decreases when this process is removed. This suggests that
the verification step and its justification are crucial for maintaining high-quality responses, as they
allow for a more thorough assessment before refining.
Lastly, we present case studies in Table 11, which shows two consecutive turns within a conversation
about the ‘Chaos Crags’. In the first turn, SPP fails to provide an accurate response, losing track of the
conversation’s context and offering irrelevant information about fictional mountains instead of directly
addressing the user’s query about the mountain’s height. In contrast, our MARA model delivers an
accurate and contextually appropriate response, correctly identifying Chaos Crags and providing the
exact elevation, enriched with additional context about its geological formation. This demonstrates
MARA’s ability to maintain focus and generate factually grounded responses.</p>
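<p>The verify-then-refine behavior discussed above can be sketched as a two-phase agent step mirroring the tagged outputs the prompts request ('Fact is verified.' / 'Fact is not verified.'). The checker and rewriter below are stubs standing in for LLM calls, with the grounding fact borrowed from the Chaos Crags example:</p>

```python
def agent_step(response, check, rewrite):
    """Phase 1: verify and justify. Phase 2: refine only if verification failed."""
    ok, reason = check(response)
    record = {
        "verification": "Fact is verified." if ok else "Fact is not verified.",
        "verification_justification": reason,
    }
    record["refined_response"] = response if ok else rewrite(response, reason)
    return record

# Stub checker/rewriter: verify against the grounding fact's elevation figure.
check = lambda r: ("8,448" in r, "Elevation must match the grounding fact.")
rewrite = lambda r, why: "They have an elevation of about 8,448 feet (2,575 m)."

out = agent_step("The peak is about 9,000 feet tall.", check, rewrite)
```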
        <p>In the second turn, both SPP and MARA attempt to understand the context of the user’s question.
However, SPP not only fails to directly address the user’s question about a nearby lake, stating, ‘While
no major lakes are referenced,’ but also lacks engagement with the user. MARA, on the other hand, not
only provides the correct factual information but also tailors the response to the user’s expressed interest
in lakes and volcanic landscapes, creating a more personalized and engaging interaction. Furthermore,
MARA takes the conversation a step further by asking a follow-up question: ‘Can you imagine hiking
along its shores, taking in views of those dramatic lava domes rising in the backdrop?’. This question
invites the user to visualize the experience, fostering deeper engagement. Such interactions showcase
MARA’s strength in incorporating user preferences beyond fact delivery.</p>
      </sec>
      <sec id="sec-8-5">
        <title>B.6. Illustrative Example of Agent Roles</title>
        <p>While we included a case study example in Table 11, we introduce another example to provide a
more detailed description of the agents in Table 12. In this example, after reviewing the query and
initial response, the planner selects the coherence-refining agent and persona-refining agent as the
required agents. Following the planner’s decision, the coherence-refining agent first addresses the
initial response’s lack of context and introduces the Galusha House, while the persona-refining agent
further refines the response by adding personalized context and an engaging follow-up.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>C. Limitations</title>
      <p>While our MARA framework demonstrates clear advantages by adaptively selecting the most suitable
refining agents for diverse customized conversational settings, there remains room for improvement,
particularly in the planner agent, as shown by the performance gap between the ideal planner and
the current planner in Table 2. Specifically, since our current planner is fully based on an unsupervised
LLM, constructing a dataset with labeled ideal sets and sequences of required perspectives, followed by
fine-tuning the planner on this data, could be a promising direction for future research.</p>
<p>While a multi-agent framework shows remarkable effectiveness, it may raise concerns about
scalability and resource efficiency. We therefore suggest some possible approaches to address
these challenges. As demonstrated by the significantly reduced number of LLM accesses with an
optimized planner in Table 2, improving the planner agent can also alleviate concerns regarding resource
efficiency. To further enhance scalability, exploring lightweight or more efficient agent models would
also be beneficial. In Table 8 and Table 9, we show that our proposed MARA performs well even with a
smaller LLM. Notably, Table 9 shows significant improvements with a lightweight, open-sourced LLM,
the Llama 3.1 8B model, which highlights its potential for scalability. Nonetheless, further exploration
of scalability remains a promising avenue, which we leave as meaningful future work.</p>
      <p>Also, the flexibility of the multi-agent framework in selecting diverse external tools or LLMs opens
up opportunities for incorporating tools such as Retrieval-Augmented Generation (RAG) systems to
further enhance performance, which could also be an interesting future research direction.</p>
    </sec>
    <sec id="sec-10">
      <title>D. Ethics Statement</title>
      <p>
        With the goal of developing human-centric applications that assist users in conversations, we
experimentally validate the effectiveness of MARA for its applicability in realistic conversational scenarios,
where a diverse range of queries with varying requests exists. However, given the potential diversity
of real-world user inputs, it is crucial to consider scenarios where inputs or generated responses may
be offensive or harmful [
        <xref ref-type="bibr" rid="ref2">61, 2</xref>
        ]. To ensure safe and responsible deployment, additional safeguards are
necessary. In particular, integrating an agent to monitor both user inputs and generated content for
harmful or offensive material would be valuable. We consider this a critical area for future research,
aiming to improve MARA’s robustness and ethical alignment in real-world applications.
      </p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for the purpose of grammar and
spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      <p>As a &lt;role&gt;Responding Agent&lt;/role&gt;, your task is to answer the user’s question, within the
&lt;question_text&gt; tags.
- Consider the Keyword: &lt;keyword&gt;{keyword}&lt;/keyword&gt;, if available.
&lt;instructions&gt;
- Place your final response within &lt;response&gt;&lt;/response&gt; tags.
- Make your response concise.
&lt;/instructions&gt;</p>
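<p>Since each agent is instructed to wrap its output in tags such as &lt;response&gt; or &lt;refined_response&gt;, downstream parsing reduces to extracting the first matching tag pair. A minimal sketch (the helper name is ours, not from the paper):</p>

```python
import re

def extract_tag(text, tag):
    """Pull the content of the first <tag>...</tag> pair from an agent's output."""
    m = re.search(r"<{0}>(.*?)</{0}>".format(tag), text, flags=re.DOTALL)
    return m.group(1).strip() if m else None

raw = "Some reasoning...\n<response>Chaos Crags rises about 8,448 feet.</response>"
answer = extract_tag(raw, "response")
```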
      <p>Role: &lt;role&gt;Fact Refining Agent&lt;/role&gt;. Your task is to refine the previous responses to ensure
they are accurate within the context of the conversation, following a step-by-step reasoning process.</p>
      <p>Consider the given topics (keywords) for the conversation, if available, marked by &lt;keywords&gt; tags:
&lt;keywords&gt;{keyword}&lt;/keywords&gt;.
&lt;instructions&gt;
- Begin by reviewing the keywords in the &lt;keywords&gt; tags to understand the context of the conversation.</p>
      <p>Note that we assume that the user’s opening question specifically focuses on the keywords listed within
the &lt;keywords&gt; tags. Therefore, do not request additional detail or clarification.
- First, examine the factual accuracy of the previous response provided in the &lt;factChecking&gt; tags.
- Document your verification outcome: If the response is factually accurate, place ‘Fact is verified.’
inside the &lt;verification&gt; tags. If not, place ‘Fact is not verified.’ in the &lt;verification&gt; tags.
- Specify the reasons for the verification in the &lt;verification_justification&gt; tags.
- Next, refine the previous response to enhance factual correctness in the &lt;refined_response&gt; tags.</p>
      <p>Describe each change you make and justify it based on factual accuracy.
- Place the refined response in the &lt;refined_response&gt; tags and detail your reasoning for each
refinement step in &lt;refinement_justification&gt; tags.
- Finally, compile the &lt;verification&gt;, &lt;verification_justification&gt;, &lt;refined_response&gt; and
&lt;refinement_justification&gt; into the &lt;response&gt; tags, ensuring a clear and logical flow of thought.
- Maintain professionalism and avoid including apologies or acknowledgements in your response.
- Since this refined response is displayed directly to the user, address them as if you are the Responding
Agent, not a Refining Agent. Avoid any reference to refining roles, the refinement process, or
acknowledgment of a previous Refining Agent.
- Ensure your response shows you have considered the user’s input inside the &lt;question_text&gt; tags
and their current state of mind. Keep your &lt;refined_response&gt; concise, clear, and similar to
human-written text.
&lt;/instructions&gt;
This is the sequence of multiple agents involved in refining the initial response:
{planned_agent_order}.</p>
      <p>This is the justification for requiring multiple agents: {planned_agents_set_justification}.</p>
      <p>This is the justification for this sequence: {planned_agent_order_justification}.</p>
      <p>Note that you are a Fact Refining Agent.</p>
      <p>This is the user’s question, which should be fully addressed:
&lt;question_text&gt;{user_query}&lt;/question_text&gt;
This is the initial response generated by the Responding Agent:
&lt;initialResponse&gt;{initial_response}&lt;/initialResponse&gt;.</p>
      <p>This is the refined response generated by the previous refining agent ({previous_agent_name}):
{generated_response}.</p>
      <p>Role: &lt;role&gt;Persona Refining Agent&lt;/role&gt;. Your task involves verifying and refining responses
based on the user’s profile and interests. You will first verify the alignment and then refine the response
if necessary, explaining each step in your reasoning.</p>
      <p>Consider the User Profile: &lt;userProfile&gt;{persona}&lt;/userProfile&gt;
&lt;instructions&gt;
- Begin by understanding the user’s interests from the &lt;userProfile&gt; tags.
- First, verify that the previous response inside the &lt;persona&gt; tags aligns with the user’s
preferences detailed in the &lt;userProfile&gt; tags. Explain your reasoning: If it aligns well, explain why
you believe this alignment exists; if it does not, explain what aspects are misaligned or contradictory.
- Document your verification outcome: If the previous response is well-aligned, place ‘Persona is verified.’
inside the &lt;verification&gt; tags. If not, place ‘Persona is not verified.’ inside the &lt;verification&gt; tags.
- Specify the reasons for the verification in the &lt;verification_justification&gt; tags.
- Next, refine the previous response to better align with the user’s preferences detailed in the
&lt;userProfile&gt; tags.
- Place the refined response in the &lt;refined_response&gt; tags and detail your reasoning for each
refinement step in the &lt;refinement_justification&gt; tags.
- Finally, compile the &lt;verification&gt;, &lt;verification_justification&gt;, &lt;refined_response&gt; and
&lt;refinement_justification&gt; into the &lt;response&gt; tags, ensuring a clear and logical flow of thought.
- Maintain professionalism and avoid including apologies or acknowledgements in your response.
- Since this refined response is displayed directly to the user, address them as if you are the Responding
Agent, not a Refining Agent. Avoid any reference to refining roles, the refinement process, or
acknowledgment of a previous Refining Agent.
- Ensure your response shows you have considered the user’s input inside the &lt;question_text&gt; tags
and their current state of mind. Keep your &lt;refined_response&gt; concise, clear, and similar to
human-written text.
&lt;/instructions&gt;
This is the sequence of multiple agents involved in refining the initial response:
{planned_agent_order}.</p>
      <p>This is the justification for requiring multiple agents: {planned_agents_set_justification}.</p>
      <p>This is the justification for this sequence: {planned_agent_order_justification}.</p>
      <p>Note that you are a Persona Refining Agent.</p>
      <p>This is the user’s question, which should be fully addressed:
&lt;question_text&gt;{user_query}&lt;/question_text&gt;
This is the initial response generated by the Responding Agent:
&lt;initialResponse&gt;{initial_response}&lt;/initialResponse&gt;.</p>
      <p>This is the refined response generated by the previous refining agent ({previous_agent_name}):
{generated_response}.</p>
      <p>Role: &lt;role&gt;Coherence Refining Agent&lt;/role&gt;. Your task is to verify and refine the previous
responses to ensure the coherence and logical flow of the initial responses within the context of the
ongoing conversation, mimicking human-written text, following a step-by-step reasoning process.</p>
      <p>Consider the given topics (keywords) for the conversation, if available, marked by &lt;keywords&gt; tags:
&lt;keywords&gt;{keyword}&lt;/keywords&gt;.
&lt;instructions&gt;
- Begin by reviewing the keywords in the &lt;keywords&gt; tags to understand the context of the conversation.</p>
      <p>Note that we assume that the user’s opening question specifically focuses on the keywords listed within
the &lt;keywords&gt; tags. Therefore, do not request additional detail or clarification.
- First, examine whether the previous response provided in the &lt;coherence&gt; tags maintains coherence
with the conversation history, mimicking human-written text.
- Document your verification outcome: If the response is coherent, place ‘Coherence is verified.’ inside
the &lt;verification&gt; tags. If not, place ‘Coherence is not verified.’ in the &lt;verification&gt; tags.
- Specify the reasons for the verification in the &lt;verification_justification&gt; tags.
- Next, refine the previous response to improve the overall clarity and continuity of the conversation in
the &lt;refined_response&gt; tags. Describe each change you make and justify it based on coherence and
completeness.
- Place the refined response in the &lt;refined_response&gt; tags and detail your reasoning for each
refinement step in &lt;refinement_justification&gt; tags.
- Finally, compile the &lt;verification&gt;, &lt;verification_justification&gt;, &lt;refined_response&gt; and
&lt;refinement_justification&gt; into the &lt;response&gt; tags, ensuring a clear and logical flow of thought.
- Maintain professionalism and avoid including apologies or acknowledgements in your response.
- Since this refined response is displayed directly to the user, address them as if you are the Responding
Agent, not a Refining Agent. Avoid any reference to refining roles, the refinement process, or
acknowledgment of a previous Refining Agent.
- Ensure your response shows you have considered the user’s input inside the &lt;question_text&gt; tags
and their current state of mind. Keep your &lt;refined_response&gt; concise, clear, and similar to
human-written text.
&lt;/instructions&gt;
This is the sequence of multiple agents involved in refining the initial response:
{planned_agent_order}.</p>
      <p>This is the justification for requiring multiple agents: {planned_agents_set_justification}.</p>
      <p>This is the justification for this sequence: {planned_agent_order_justification}.</p>
      <p>Note that you are a Coherence Refining Agent.</p>
      <p>This is the user’s question, which should be fully addressed:
&lt;question_text&gt;{user_query}&lt;/question_text&gt;
This is the initial response generated by the Responding Agent:
&lt;initialResponse&gt;{initial_response}&lt;/initialResponse&gt;.</p>
      <p>This is the refined response generated by the previous refining agent ({previous_agent_name}):
{generated_response}.</p>
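      <p>The three refining-agent prompts above share a single placeholder structure ({planned_agent_order}, {user_query}, {initial_response}, {previous_agent_name}, {generated_response}). As a minimal sketch of how such a template might be filled and the planned agents chained in sequence; call_llm and the agent dictionary fields are hypothetical names for illustration, not the authors' released code:</p>

```python
# Sketch of dynamic agent sequencing over a shared refining-prompt template.
# call_llm and the agent dictionary fields are illustrative assumptions.

REFINING_TEMPLATE = (
    "Role: {role}. {task}\n"
    "This is the sequence of multiple agents involved in refining the initial "
    "response: {planned_agent_order}.\n"
    "This is the user's question, which should be fully addressed: {user_query}\n"
    "This is the initial response generated by the Responding Agent: "
    "{initial_response}.\n"
    "This is the refined response generated by the previous refining agent "
    "({previous_agent_name}): {generated_response}."
)

def refine(user_query, initial_response, planned_agents, call_llm):
    """Run each planned refining agent in order; each sees its predecessor's output."""
    response, previous = initial_response, "Responding Agent"
    for agent in planned_agents:
        prompt = REFINING_TEMPLATE.format(
            role=agent["role"],
            task=agent["task"],
            planned_agent_order=" then ".join(a["role"] for a in planned_agents),
            user_query=user_query,
            initial_response=initial_response,
            previous_agent_name=previous,
            generated_response=response,
        )
        response = call_llm(prompt)  # the agent returns its refined response
        previous = agent["role"]
    return response
```

      <p>Under the dynamic communication strategy, planned_agents would be the subset and ordering of refining agents selected for the given query, rather than a fixed sequence.</p>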
      <sec id="sec-11-1">
        <title>Instructions:</title>
        <p>You will be given a conversation segment involving two participants: the user and the system. You will then
be given one potential response for the next turn in the conversation. Your task is to rate the generated
responses on one metric. Please make sure you read and understand these instructions carefully. Please keep
this document open while reviewing, and refer to it as needed.</p>
      </sec>
      <sec id="sec-11-2">
        <title>Evaluation Criteria:</title>
        <p>Coherence (1-3) - Evaluate whether the conversation response logically follows the preceding context and
maintains a clear, logical flow.</p>
        <p>Score of 1: Assign this score if the response does not logically follow the preceding context. It may introduce
abrupt changes in topic or contain confusing statements, resulting in a disjointed conversation.
Score of 2: Assign this score if the response somewhat follows the preceding context but includes minor
logical inconsistencies or slight topic shifts, slightly disrupting the conversation’s flow.</p>
        <p>Score of 3: Assign this score if the response logically follows the preceding context and maintains a clear,
logical flow, providing a relevant and coherent continuation of the conversation without any abrupt changes
or confusing elements.</p>
        <p>Evaluation Steps:
1. Examine the conversational history, the provided fact (if given), and the user profile (if given) to fully
understand the context and dynamics of the conversation.
2. Compare the generated response with the gold standard response to evaluate how well it maintains
continuity and logical flow.
3. Assess how effectively the generated response connects with and continues the conversation, ensuring it
aligns logically with the existing conversation.
4. Based on your analysis, assign a coherence score from 1 to 3, reflecting the response’s logical integration
into the ongoing conversation.</p>
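        <p>Because the evaluation form requests scores only, the judge model's reply still needs defensive parsing before aggregation. A small sketch, assuming the reply resembles the form above (e.g., "- Coherence: 3") and not tied to any particular LLM API:</p>

```python
import re

def parse_rubric_score(judge_reply, low=1, high=3):
    """Extract the first number in the judge's reply and clamp it to [low, high].

    Returns None when no number is present, so the caller can re-query the
    judge instead of silently recording a default score.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", judge_reply)
    if match is None:
        return None
    return max(low, min(high, float(match.group())))
```

        <p>The same parser applies to the Naturalness and Engagingness forms (1-3) and, with low=0 and high=1, to Groundedness.</p>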
      </sec>
      <sec id="sec-11-3">
        <title>Evaluation Form (Scores ONLY without any additional text):</title>
        <p>- Coherence:</p>
      </sec>
      <sec id="sec-11-4">
        <title>Instructions:</title>
        <p>You will be given a conversation segment involving two participants: the user and the system. You will then
be given one potential response for the next turn in the conversation. The response concerns an interesting
fact, which will be provided as well. Your task is to rate the generated responses on one metric. Please make
sure you read and understand these instructions carefully. Please keep this document open while reviewing,
and refer to it as needed.</p>
      </sec>
      <sec id="sec-11-5">
        <title>Evaluation Criteria:</title>
        <p>Groundedness (0-1) - Evaluate whether the conversation response is based on and accurately incorporates
the provided fact.</p>
        <p>Score of 0: Assign this score if the response does not correctly use the provided fact or misrepresents it. This
includes instances where the response contains inaccuracies or fails to integrate the fact meaningfully into
the conversation.</p>
        <p>Score of 1: Assign this score if the response uses the provided fact accurately and integrates it seamlessly
into the conversation, thereby enhancing the dialogue’s relevance and informativeness.
Evaluation Steps:
1. Examine the conversational history and the provided fact (if given) to understand the context fully.
2. Evaluate how accurately and relevantly the generated response incorporates the provided fact, ensuring
there are no inaccuracies or hallucinated details.
3. Analyze how the generated response measures up against a gold standard response to understand the
ideal integration of the fact.
4. Based on your assessment, assign a Groundedness score ranging from 0 to 1, reflecting how effectively the
fact is incorporated into the response.</p>
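        <p>Since Groundedness is binary, a corpus-level result reduces to the proportion of grounded responses. A minimal sketch; skipping failed judgments (rather than counting them as ungrounded) is an assumption, not a procedure stated above:</p>

```python
def groundedness_rate(judgments):
    """Fraction of responses judged grounded (1) among valid 0/1 judgments.

    Entries that are not 0 or 1 (e.g., failed score parses recorded as None)
    are skipped rather than counted as ungrounded.
    """
    valid = [j for j in judgments if j in (0, 1)]
    if not valid:
        return None
    return sum(valid) / len(valid)
```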
      </sec>
      <sec id="sec-11-6">
        <title>Example:</title>
        <p>Conversation History: {{Document}}
Corresponding Fact: {{Fact}}
Gold Standard Response: {{Gold_Response}}
Generated Response: {{Response}}</p>
      </sec>
      <sec id="sec-11-7">
        <title>Evaluation Form (Scores ONLY without any additional text):</title>
        <p>- Groundedness:</p>
      </sec>
      <sec id="sec-11-8">
        <title>Instructions</title>
        <p>You will be given a conversation segment involving two participants: the user and the system. You will then
be given one potential response for the next turn in the conversation. Your task is to rate the generated
responses on one metric. Please make sure you read and understand these instructions carefully. Please keep
this document open while reviewing, and refer to it as needed.</p>
      </sec>
      <sec id="sec-11-9">
        <title>Evaluation Criteria:</title>
        <p>Naturalness (1-3) - Evaluate whether the dialogue response feels natural and conversational, as if it were
part of a real, human conversation.</p>
        <p>Score of 1: Assign this score if the response does not sound natural. It may contain awkward phrasing,
unnatural expressions, or robotic language, disrupting the flow of the conversation.</p>
        <p>Score of 2: Assign this score if the response somewhat sounds natural but may include minor awkwardness
or slightly unnatural phrasing, afecting the overall conversational flow.</p>
        <p>Score of 3: Assign this score if the response sounds completely natural, flowing smoothly, using natural
language, and integrating seamlessly into the conversation as if it were part of a real human interaction.
Evaluation Steps:
1. Examine the conversational history, the provided fact (if given), and the user profile (if given) to gauge the
natural fit of the response within the conversation’s context.
2. Evaluate the tone, formality, and conversational flow of the generated response to determine how naturally
it fits into the dialogue.
3. Compare the generated response to a gold standard response to gauge the ideal level of naturalness.
4. Based on your assessment, assign a Naturalness score from 1 to 3, focusing on how naturally the response
fits into the conversation.</p>
      </sec>
      <sec id="sec-11-10">
        <title>Example:</title>
        <p>Conversation History: {{Document}}
Corresponding Fact: {{Fact}}
Corresponding User Profile: {{Persona}}
Gold Standard Response: {{Gold_Response}}
Generated Response: {{Response}}</p>
      </sec>
      <sec id="sec-11-11">
        <title>Evaluation Form (Scores ONLY without any additional text):</title>
        <p>- Naturalness:</p>
      </sec>
      <sec id="sec-11-12">
        <title>Instructions</title>
        <p>You will be given a conversation segment involving two participants: the user and the system. You will then
be given one potential response for the next turn in the conversation. Your task is to rate the generated
responses on one metric. Please make sure you read and understand these instructions carefully. Please keep
this document open while reviewing, and refer to it as needed.</p>
      </sec>
      <sec id="sec-11-13">
        <title>Evaluation Criteria:</title>
        <p>Engagingness (1-3) - Is the response dull or interesting?
Score of 1 (Dull): Assign this score if the response is generic and unremarkable, failing to spark interest or
engagement.</p>
        <p>Score of 2 (Somewhat Interesting): Assign this score if the response is moderately interesting and could
engage participants in the conversation, such as by introducing an opinion or thought.
Score of 3 (Interesting): Assign this score if the response is highly interesting or presents an intriguing fact,
significantly enhancing the conversation’s appeal.</p>
        <p>Evaluation Steps:
1. Examine the conversational history, the provided fact (if given), and the user profile (if given) to gauge the
potential interest or intrigue.
2. Assess how the generated response contributes to the conversation’s value and captivates interest.
3. Compare the generated response to a gold standard response.
4. Based on your analysis, assign an Engagingness score from 1 to 3, reflecting the response’s ability to
captivate and add value to the conversation.</p>
      </sec>
      <sec id="sec-11-14">
        <title>Evaluation Form (Scores ONLY without any additional text):</title>
        <p>- Engagingness:</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] N. Liu, L. Chen, X. Tian, W. Zou, K. Chen, M. Cui, From LLM to conversational agent: A memory enhanced architecture with fine-tuning of large language models, arXiv preprint arXiv:2401.02777 (2024). URL: https://api.semanticscholar.org/CorpusID:266818453.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Z. Dong, Z. Zhou, C. Yang, J. Shao, Y. Qiao, Attacks, defenses and evaluations for LLM conversation safety: A survey, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 6734-6747. URL: https://aclanthology.org/2024.naacl-long.375. doi:10.18653/v1/2024.naacl-long.375.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] S. Chiesurin, D. Dimakopoulos, M. A. S. Cabezudo, A. Eshghi, I. Papaioannou, V. Rieser, I. Konstas, The dangers of trusting stochastic parrots: Faithfulness and trust in open-domain conversational question answering, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 947-959. URL: https://doi.org/10.18653/v1/2023.findings-acl.60. doi:10.18653/v1/2023.findings-acl.60.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] A. N. Lee, C. J. Hunter, N. Ruiz, Platypus: Quick, cheap, and powerful refinement of LLMs, in: NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023. URL: https://arxiv.org/pdf/2308.07317.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] G. Kim, H. Kim, J. Park, J. Kang, Learn to resolve conversational dependency: A consistency training framework for conversational question answering, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021 (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, Association for Computational Linguistics, 2021, pp. 6130-6141. URL: https://doi.org/10.18653/v1/2021.acl-long.478. doi:10.18653/v1/2021.acl-long.478.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] M. Jang, T. Lukasiewicz, Consistency analysis of ChatGPT, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 15970-15985. URL: https://doi.org/10.18653/v1/2023.emnlp-main.991. doi:10.18653/v1/2023.emnlp-main.991.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Y. Gan, M. Poesio, J. Yu, Assessing the capabilities of large language models in coreference: An evaluation, in: N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May 2024, Torino, Italy, ELRA and ICCL, 2024, pp. 1645-1665. URL: https://aclanthology.org/2024.lrec-main.145.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] S. J. Semnani, V. Z. Yao, H. C. Zhang, M. S. Lam, WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 2387-2413. URL: https://doi.org/10.18653/v1/2023.findings-emnlp.157. doi:10.18653/v1/2023.findings-emnlp.157.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] P. Jandaghi, X. Sheng, X. Bai, J. Pujara, H. Sidahmed, Faithful persona-based conversational dataset generation with large language models, in: L. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, Association for Computational Linguistics, 2024, pp. 15245-15270. URL: https://doi.org/10.18653/v1/2024.findings-acl.904. doi:10.18653/v1/2024.findings-acl.904.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, P. Clark,
          <article-title>Self-Refine: Iterative refinement with self-feedback</article-title>
          , in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.),
          <source>Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023</source>
          , New Orleans, LA, USA, December 10-16,
          <year>2023</year>
          . URL: http://papers.nips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Y. Deng, L. Liao, L. Chen, H. Wang, W. Lei, T. Chua,
          <article-title>Prompting and evaluating large language models for proactive dialogues: Clarification, target-guided, and non-collaboration</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          , Singapore, December 6-10, 2023, Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>10602</fpage>
          -
          <lpage>10621</lpage>
          . URL: https://doi.org/10.18653/v1/2023.findings-emnlp.711. doi:10.18653/v1/2023.findings-emnlp.711.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] H. Kim, K. Ong, S. Kim, D. Lee, J. Yeo,
          <article-title>Commonsense-augmented memory construction and management in long-term conversations via context-aware persona refinement</article-title>
          , in: Y. Graham, M. Purver (Eds.),
          <source>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , Association for Computational Linguistics, St. Julian's, Malta,
          <year>2024</year>
          , pp.
          <fpage>104</fpage>
          -
          <lpage>123</lpage>
          . URL: https://aclanthology.org/2024.eacl-short.11.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, Z. Tu, S. Shi,
          <article-title>Encouraging divergent thinking in large language models through multi-agent debate</article-title>
          ,
          <source>arXiv preprint arXiv:2305.19118</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.19118. doi:10.48550/arXiv.2305.19118.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] W. Xu, G. Zhu, X. Zhao, L. Pan, L. Li, W. Wang,
          <article-title>Pride and prejudice: LLM amplifies self-bias in self-refinement</article-title>
          , in: L. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024</source>
          , Bangkok, Thailand, August 11-16, 2024, Association for Computational Linguistics,
          <year>2024</year>
          , pp.
          <fpage>15474</fpage>
          -
          <lpage>15492</lpage>
          . URL: https://doi.org/10.18653/v1/2024.acl-long.826. doi:10.18653/v1/2024.acl-long.826.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, D. Zhou,
          <article-title>Large language models cannot self-correct reasoning yet</article-title>
          , in:
          <source>The Twelfth International Conference on Learning Representations, ICLR 2024</source>
          , Vienna, Austria, May 7-11, 2024, OpenReview.net,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=IkmD3fKBPQ.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] S. Jeong, J. Baek, S. J. Hwang, J. Park,
          <article-title>Realistic conversational question answering with answer selection based on calibrated confidence and uncertainty measurement</article-title>
          , in: A. Vlachos, I. Augenstein (Eds.),
          <source>Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023</source>
          , Dubrovnik, Croatia, May 2-6, 2023, Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>477</fpage>
          -
          <lpage>490</lpage>
          . URL: https://doi.org/10.18653/v1/2023.eacl-main.35. doi:10.18653/v1/2023.eacl-main.35.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, C. Wang,
          <article-title>AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework</article-title>
          , in:
          <source>COLM</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/pdf/2308.08155.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, X. Zhang,
          <article-title>Large language model based multi-agents: A survey of progress and challenges</article-title>
          ,
          <source>arXiv preprint arXiv:2402.01680</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2402.01680. doi:10.48550/arXiv.2402.01680.
        </mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, I. Mordatch,
          <article-title>Improving factuality and reasoning in language models through multiagent debate</article-title>
          , in:
          <source>Forty-first International Conference on Machine Learning, ICML 2024</source>
          , Vienna, Austria, July 21-27, 2024, OpenReview.net,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=zj7YuTE4t8.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] W. Shen, C. Li, H. Chen, M. Yan, X. Quan, H. Chen, J. Zhang, F. Huang,
          <article-title>Small LLMs are weak tool learners: A multi-LLM agent</article-title>
          ,
          <source>arXiv preprint arXiv:2401.07324</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2401.07324. doi:10.48550/arXiv.2401.07324.
        </mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] D. Gao, Z. Li, W. Kuang, X. Pan, D. Chen, Z. Ma, B. Qian, L. Yao, L. Zhu, C. Cheng, H. Shi, Y. Li, B. Ding, J. Zhou,
          <article-title>AgentScope: A flexible yet robust multi-agent platform</article-title>
          ,
          <source>arXiv preprint arXiv:2402.14034</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2402.14034. doi:10.48550/arXiv.2402.14034.
        </mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, T. Gui,
          <article-title>The rise and potential of large language model based agents: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2309.07864</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2309.07864. doi:10.48550/arXiv.2309.07864.
        </mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] OpenAI,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2303.08774. doi:10.48550/arXiv.2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] Anthropic,
          <article-title>The Claude 3 model family: Opus, Sonnet, Haiku</article-title>
          ,
          <year>2024</year>
          . URL: https://api.semanticscholar.org/CorpusID:268232499.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2302.13971. doi:10.48550/arXiv.2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name><given-names>S.</given-names> <surname>Min</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Lyu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Holtzman</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Artetxe</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hajishirzi</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <article-title>Rethinking the role of demonstrations: What makes in-context learning work?</article-title>,
          in: <string-name><given-names>Y.</given-names> <surname>Goldberg</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Kozareva</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name> (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022</source>,
          Abu Dhabi, United Arab Emirates, December 7-11, 2022, Association for Computational Linguistics, <year>2022</year>,
          pp. <fpage>11048</fpage>-<lpage>11064</lpage>.
          URL: https://doi.org/10.18653/v1/2022.emnlp-main.759. doi:10.18653/V1/2022.EMNLP-MAIN.759.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name><given-names>J.</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Schuurmans</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Bosma</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ichter</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Xia</surname></string-name>,
          <string-name><given-names>E. H.</given-names> <surname>Chi</surname></string-name>,
          <string-name><given-names>Q. V.</given-names> <surname>Le</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Zhou</surname></string-name>,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>,
          in: <string-name><given-names>S.</given-names> <surname>Koyejo</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Mohamed</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Belgrave</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Oh</surname></string-name> (Eds.),
          <source>Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022</source>,
          New Orleans, LA, USA, November 28 - December 9, 2022, <year>2022</year>.
          URL: http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name><given-names>H.</given-names> <surname>Chae</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Song</surname></string-name>,
          <string-name><given-names>K. T.</given-names> <surname>Ong</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Kwon</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Kang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yeo</surname></string-name>,
          <article-title>Dialogue chain-of-thought distillation for commonsense-aware conversational agents</article-title>,
          in: <string-name><given-names>H.</given-names> <surname>Bouamor</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pino</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Bali</surname></string-name> (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</source>,
          Singapore, December 6-10, 2023, Association for Computational Linguistics, <year>2023</year>,
          pp. <fpage>5606</fpage>-<lpage>5632</lpage>.
          URL: https://doi.org/10.18653/v1/2023.emnlp-main.342. doi:10.18653/V1/2023.EMNLP-MAIN.342.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name><given-names>G.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Hartmann</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Park</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Papailiopoulos</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <article-title>Prompted LLMs as chatbot modules for long open-domain conversation</article-title>,
          in: <string-name><given-names>A.</given-names> <surname>Rogers</surname></string-name>,
          <string-name><given-names>J. L.</given-names> <surname>Boyd-Graber</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Okazaki</surname></string-name> (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2023</source>,
          Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, <year>2023</year>,
          pp. <fpage>4536</fpage>-<lpage>4554</lpage>.
          URL: https://doi.org/10.18653/v1/2023.findings-acl.277. doi:10.18653/V1/2023.FINDINGS-ACL.277.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name><given-names>M.</given-names> <surname>Wadhwa</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>J. J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Durrett</surname></string-name>,
          <article-title>Learning to refine with fine-grained natural language feedback</article-title>,
          <source>arXiv preprint arXiv:2407.02397</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2407.02397. doi:10.48550/ARXIV.2407.02397. arXiv:2407.02397.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name><given-names>M.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>,
          <article-title>Challenges in building intelligent open-domain dialog systems</article-title>,
          <source>ACM Trans. Inf. Syst.</source> <volume>38</volume> (<year>2020</year>) 21:1-21:32.
          URL: https://doi.org/10.1145/3383123. doi:10.1145/3383123.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name><given-names>J.</given-names> <surname>Ou</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Gai</surname></string-name>,
          <article-title>DialogBench: Evaluating LLMs as human-like dialogue systems</article-title>,
          in: <string-name><given-names>K.</given-names> <surname>Duh</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gómez-Adorno</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bethard</surname></string-name> (Eds.),
          <source>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024</source>,
          Mexico City, Mexico, June 16-21, 2024, Association for Computational Linguistics, <year>2024</year>,
          pp. <fpage>6137</fpage>-<lpage>6170</lpage>.
          URL: https://doi.org/10.18653/v1/2024.naacl-long.341. doi:10.18653/V1/2024.NAACL-LONG.341.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name><given-names>J.</given-names> <surname>Baek</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jeong</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kang</surname></string-name>,
          <string-name><given-names>J. C.</given-names> <surname>Park</surname></string-name>,
          <string-name><given-names>S. J.</given-names> <surname>Hwang</surname></string-name>,
          <article-title>Knowledge-augmented language model verification</article-title>,
          in: <string-name><given-names>H.</given-names> <surname>Bouamor</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pino</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Bali</surname></string-name> (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</source>,
          Singapore, December 6-10, 2023, Association for Computational Linguistics, <year>2023</year>,
          pp. <fpage>1720</fpage>-<lpage>1736</lpage>.
          URL: https://doi.org/10.18653/v1/2023.emnlp-main.107. doi:10.18653/V1/2023.EMNLP-MAIN.107.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name><given-names>W.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Deutsch</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Finkelstein</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Juraska</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>W. Y.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Freitag</surname></string-name>,
          <article-title>LLMRefine: Pinpointing and refining large language models via fine-grained actionable feedback</article-title>,
          in: <string-name><given-names>K.</given-names> <surname>Duh</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gómez-Adorno</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bethard</surname></string-name> (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL 2024</source>,
          Mexico City, Mexico, June 16-21, 2024, Association for Computational Linguistics, <year>2024</year>,
          pp. <fpage>1429</fpage>-<lpage>1445</lpage>.
          URL: https://doi.org/10.18653/v1/2024.findings-naacl.92. doi:10.18653/V1/2024.FINDINGS-NAACL.92.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name><given-names>N.</given-names> <surname>Shinn</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Cassano</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gopinath</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Narasimhan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Yao</surname></string-name>,
          <article-title>Reflexion: language agents with verbal reinforcement learning</article-title>,
          in: <string-name><given-names>A.</given-names> <surname>Oh</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Naumann</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Globerson</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Saenko</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hardt</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Levine</surname></string-name> (Eds.),
          <source>Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023</source>,
          New Orleans, LA, USA, December 10-16, 2023, <year>2023</year>.
          URL: http://papers.nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name><given-names>S.</given-names> <surname>Yao</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Du</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Shafran</surname></string-name>,
          <string-name><given-names>K. R.</given-names> <surname>Narasimhan</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Cao</surname></string-name>,
          <article-title>ReAct: Synergizing reasoning and acting in language models</article-title>,
          in: <source>The Eleventh International Conference on Learning Representations, ICLR 2023</source>,
          Kigali, Rwanda, May 1-5, 2023, OpenReview.net, <year>2023</year>.
          URL: https://openreview.net/forum?id=WE_vluYUL-X.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name><given-names>D.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Nassereldine</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>U.</given-names> <surname>Kumar</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Xiong</surname></string-name>,
          <article-title>Large language models have intrinsic self-correction ability</article-title>,
          <source>arXiv preprint arXiv:2406.15673</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2406.15673. doi:10.48550/ARXIV.2406.15673. arXiv:2406.15673.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name><given-names>Z.</given-names> <surname>Gou</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Shao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Gong</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Duan</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Chen</surname></string-name>,
          <article-title>CRITIC: large language models can self-correct with tool-interactive critiquing</article-title>,
          in: <source>The Twelfth International Conference on Learning Representations, ICLR 2024</source>,
          Vienna, Austria, May 7-11, 2024, OpenReview.net, <year>2024</year>.
          URL: https://openreview.net/forum?id=Sx038qxjek.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Bansal</surname></string-name>,
          <article-title>ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs</article-title>,
          in: <string-name><given-names>L.-W.</given-names> <surname>Ku</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Srikumar</surname></string-name> (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>,
          Association for Computational Linguistics, Bangkok, Thailand, <year>2024</year>,
          pp. <fpage>7066</fpage>-<lpage>7085</lpage>.
          URL: https://aclanthology.org/2024.acl-long.381. doi:10.18653/v1/2024.acl-long.381.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name><given-names>G.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hammoud</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Itani</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Khizbullin</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Ghanem</surname></string-name>,
          <article-title>CAMEL: communicative agents for ”mind” exploration of large language model society</article-title>,
          in: <string-name><given-names>A.</given-names> <surname>Oh</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Naumann</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Globerson</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Saenko</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hardt</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Levine</surname></string-name> (Eds.),
          <source>Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023</source>,
          New Orleans, LA, USA, December 10-16, 2023, <year>2023</year>.
          URL: http://papers.nips.cc/paper_files/paper/2023/hash/a3621ee907def47c1b952ade25c67698-Abstract-Conference.html.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name><given-names>S.</given-names> <surname>Rasal</surname></string-name>,
          <article-title>LLM harmony: Multi-agent communication for problem solving</article-title>,
          <source>arXiv preprint arXiv:2401.01312</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2401.01312. doi:10.48550/ARXIV.2401.01312. arXiv:2401.01312.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name><given-names>S.</given-names> <surname>Hong</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zhuge</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zheng</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Cheng</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>S. K. S.</given-names> <surname>Yau</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Ran</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Xiao</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name>,
          <article-title>MetaGPT: Meta programming for a multi-agent collaborative framework</article-title>,
          in: <source>The Twelfth International Conference on Learning Representations, ICLR 2024</source>,
          Vienna, Austria, May 7-11, 2024, OpenReview.net, <year>2024</year>.
          URL: https://openreview.net/forum?id=VtmBAGCN7o.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name><given-names>C.</given-names> <surname>Qian</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Su</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Cong</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>,
          <article-title>ChatDev: Communicative agents for software development</article-title>,
          in: <string-name><given-names>L.</given-names> <surname>Ku</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Srikumar</surname></string-name> (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024</source>, Association for Computational Linguistics,
          <year>2024</year>, pp.
          <fpage>15174</fpage>-<lpage>15186</lpage>.
          URL: https://doi.org/10.18653/v1/2024.acl-long.810. doi:10.18653/v1/2024.acl-long.810.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name><given-names>C.</given-names> <surname>Chan</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Su</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Xue</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <article-title>ChatEval: Towards better LLM-based evaluators through multi-agent debate</article-title>,
          in: <source>The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024</source>, OpenReview.net,
          <year>2024</year>.
          URL: https://openreview.net/forum?id=FQepisCUWu.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name><given-names>J.</given-names> <surname>Baek</surname></string-name>,
          <string-name><given-names>S. K.</given-names> <surname>Jauhar</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Cucerzan</surname></string-name>,
          <string-name><given-names>S. J.</given-names> <surname>Hwang</surname></string-name>,
          <article-title>ResearchAgent: Iterative research idea generation over scientific literature with large language models</article-title>,
          <source>arXiv preprint arXiv:2404.07738</source>
          (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2404.07738. doi:10.48550/arXiv.2404.07738.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name><given-names>L.</given-names> <surname>Cao</surname></string-name>,
          <article-title>DiagGPT: An LLM-based chatbot with automatic topic management for task-oriented dialogue</article-title>,
          <source>arXiv preprint arXiv:2308.08043</source>
          (<year>2023</year>).
          URL: https://doi.org/10.48550/arXiv.2308.08043. doi:10.48550/arXiv.2308.08043.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name><given-names>Y.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Park</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Jeong</surname></string-name>,
          <string-name><given-names>Y. S.</given-names> <surname>Chan</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>McDuff</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ghassemi</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Breazeal</surname></string-name>,
          <string-name><given-names>H. W.</given-names> <surname>Park</surname></string-name>,
          <article-title>MDAgents: An adaptive collaboration of LLMs for medical decision-making</article-title>,
          <year>2024</year>.
          URL: https://arxiv.org/abs/2404.15155. arXiv:2404.15155.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name><given-names>J.</given-names> <surname>Fang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Ren</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Verberne</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Ren</surname></string-name>,
          <article-title>A multi-agent conversational recommender system</article-title>,
          <source>arXiv preprint arXiv:2402.01135</source>
          (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2402.01135. doi:10.48550/arXiv.2402.01135.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name><given-names>K. D.</given-names> <surname>Spurlock</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Acun</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Saka</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Nasraoui</surname></string-name>,
          <article-title>ChatGPT for conversational recommendation: Refining recommendations by reprompting with feedback</article-title>,
          <source>arXiv preprint arXiv:2401.03605</source>
          (<year>2024</year>).
          URL: https://api.semanticscholar.org/CorpusID:266844464.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name><given-names>J. C.-Y.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Prasad</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Stengel-Eskin</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Bansal</surname></string-name>,
          <article-title>MAgICoRe: Multi-agent, iterative, coarse-to-fine refinement for reasoning</article-title>,
          <year>2024</year>.
          URL: https://api.semanticscholar.org/CorpusID:272708877.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Dinan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Urbanek</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Szlam</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Kiela</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Weston</surname></string-name>,
          <article-title>Personalizing dialogue agents: I have a dog, do you have pets too?</article-title>,
          in: <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Miyao</surname></string-name> (Eds.),
          <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers</source>, Association for Computational Linguistics,
          <year>2018</year>, pp.
          <fpage>2204</fpage>-<lpage>2213</lpage>.
          URL: https://aclanthology.org/P18-1205/. doi:10.18653/v1/P18-1205.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Parish</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Cheng</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Min</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Ammanabrolu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ostendorf</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hajishirzi</surname></string-name>,
          <article-title>InSCIt: Information-seeking conversations with mixed-initiative interactions</article-title>,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>11</volume>
          (<year>2023</year>)
          <fpage>453</fpage>-<lpage>468</lpage>.
          URL: https://doi.org/10.1162/tacl_a_00559. doi:10.1162/tacl_a_00559.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name><given-names>Y.</given-names> <surname>Jang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lim</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Hur</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Oh</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Son</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Shin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lim</surname></string-name>,
          <article-title>Call for customized conversation: Customized conversation grounding persona and knowledge</article-title>,
          in: <source>Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, February 22 - March 1, 2022</source>, AAAI Press,
          <year>2022</year>, pp.
          <fpage>10803</fpage>-<lpage>10812</lpage>.
          URL: https://doi.org/10.1609/aaai.v36i10.21326. doi:10.1609/aaai.v36i10.21326.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name><given-names>D.</given-names> <surname>Occhipinti</surname></string-name>,
          <string-name><given-names>S. S.</given-names> <surname>Tekiroglu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Guerini</surname></string-name>,
          <article-title>PRODIGy: a profile-based dialogue generation dataset</article-title>,
          in: <string-name><given-names>K.</given-names> <surname>Duh</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gómez-Adorno</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bethard</surname></string-name> (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024</source>, Association for Computational Linguistics,
          <year>2024</year>, pp.
          <fpage>3500</fpage>-<lpage>3514</lpage>.
          URL: https://doi.org/10.18653/v1/2024.findings-naacl.222. doi:10.18653/v1/2024.findings-naacl.222.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name><given-names>R.</given-names> <surname>Lowe</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Pow</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Serban</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pineau</surname></string-name>,
          <article-title>The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems</article-title>,
          in: <source>Proceedings of the SIGDIAL 2015 Conference, The 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2-4 September 2015, Prague, Czech Republic</source>, The Association for Computer Linguistics,
          <year>2015</year>, pp.
          <fpage>285</fpage>-<lpage>294</lpage>.
          URL: https://doi.org/10.18653/v1/w15-4640. doi:10.18653/v1/W15-4640.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Mao</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ge</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Ji</surname></string-name>,
          <article-title>Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration</article-title>,
          in: <string-name><given-names>K.</given-names> <surname>Duh</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Gómez-Adorno</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bethard</surname></string-name> (Eds.),
          <source>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024</source>, Association for Computational Linguistics,
          <year>2024</year>, pp.
          <fpage>257</fpage>-<lpage>279</lpage>.
          URL: https://doi.org/10.18653/v1/2024.naacl-long.15. doi:10.18653/v1/2024.naacl-long.15.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name><given-names>R.</given-names> <surname>Cohen</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hamri</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Geva</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Globerson</surname></string-name>,
          <article-title>LM vs LM: detecting factual errors via cross examination</article-title>,
          in: <string-name><given-names>H.</given-names> <surname>Bouamor</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pino</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Bali</surname></string-name> (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</source>,
          <year>2023</year>.
          [Table 24: The prompt template used for G-Eval (Engagingness).]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>