<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MedSyn: Enhancing Diagnostics with Human-AI Collaboration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Burcu Sayin</string-name>
          <email>burcu.sayin@unitn.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ipek Baris Schlicht</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ngoc Vo Hong</string-name>
          <email>ngoc.vohong@apss.tn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Allievi</string-name>
          <email>sara.allievi@apss.tn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Staiano</string-name>
          <email>jacopo.staiano@unitn.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Minervini</string-name>
          <email>p.minervini@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Passerini</string-name>
          <email>andrea.passerini@unitn.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Santa Chiara Hospital</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The University of Edinburgh</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Clinical decision-making is inherently complex, often influenced by cognitive biases, incomplete information, and case ambiguity. Large Language Models (LLMs) have shown promise as tools for supporting clinical decision-making, yet their typical one-shot or limited-interaction usage may overlook the complexities of real-world medical practice. In this work, we propose a hybrid human-AI framework, MedSyn, where physicians and LLMs engage in multi-step, interactive dialogues to refine diagnoses and treatment decisions. Unlike static decision-support tools, MedSyn enables dynamic exchanges, allowing physicians to challenge LLM suggestions while the LLM highlights alternative perspectives. Through simulated physician-LLM interactions, we assess the potential of open-source LLMs as physician assistants. Results suggest that open-source LLMs are promising candidates for real-world physician assistance. Future work will involve interactions with real physicians to further validate MedSyn&#8217;s usefulness for diagnostic accuracy and patient outcomes.</p>
      </abstract>
      <kwd-group>
        <kwd>hybrid decision making</kwd>
        <kwd>medical decision making</kwd>
        <kwd>hybrid intelligence</kwd>
        <kwd>clinical NLP</kwd>
        <kwd>LLM agents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In traditional clinical practice, a physician’s diagnosis and treatment plan may be influenced by cognitive
biases, incomplete information, or the inherent complexity of the case [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Additionally, physicians
often work in time-sensitive, high-pressure environments (e.g., emergency departments), where cognitive
overload can increase the risk of misdiagnosis. Recent advancements in Large Language Models (LLMs)
ofer new opportunities for AI-assisted medical decision-making [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ]. We propose that physicians
and LLMs can effectively cooperate within multi-step interactive scenarios wherein the LLM’s
suggestions – whether accurate or flawed – serve as opportunities for deeper inquiry and reflection. Thus, in
this work, we investigate to what extent such a hybrid cooperative human-AI setup allows physicians
to uncover potential oversights, recognize overlooked symptoms, and reconsider treatment options.
Unlike static systems that provide one-time recommendations, we propose a dynamic conversational
framework that evolves based on real-time interactions, ensuring that physicians maintain control over
the clinical decision-making process. Specifically, we explore the collaboration of physicians and LLMs
on a specific and sensitive topic: a patient’s diagnosis. For instance, if the physician overlooks key
symptoms or suggests a suboptimal treatment, the LLM can ask patient-specific follow-up questions
or recommend reconsidering the diagnosis. Conversely, if an LLM proposes an incorrect diagnosis,
the physician can critically examine its reasoning, prompting the model to refine its suggestion. This
iterative exchange improves diagnostic accuracy and therapeutic decision-making, serving as a cognitive
safety net that aids physicians in complex, ambiguous cases with a higher risk of error.
      </p>
      <sec id="sec-1-1">
        <title>Figure: The MedSyn pipeline</title>
        <p>[Figure: panels labeled “Clinical note”, “Chief Physician”, “Physician Assistant”, and “Discharge text”. Turn 1: the physician requests an initial evaluation of the patient based on the provided clinical note. Turn 2: the assistant provides the initial evaluation. Turn N: having collected sufficient information and being confident in the diagnosis, the physician stops the discussion and writes the discharge text.]</p>
      </sec>
      <sec id="sec-1-4">
        <p>
          This working paper presents our initial efforts in building MedSyn, a medical synergy framework
that positions LLMs as conversational partners in clinical decision-making. By fostering human-AI
collaboration, MedSyn aims to enhance diagnostics while preserving the physician’s critical role in
patient care. To evaluate MedSyn, we curate and merge data from MIMIC-IV [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and MIMIC-IV-Note
[
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ], creating a diverse set of patient records for model assessment. We then investigate 25 open-source
chat-based and medical-domain LLMs to evaluate their capacity for multi-turn engagement. Our analysis
highlights both the challenges and opportunities in developing open-source medical dialogue systems.
While several models struggled to maintain coherent, multi-turn interactions, others demonstrated the
ability to engage in sustained, in-depth discussions about patient conditions. From the 25 evaluated
models, we selected three promising candidates for further experimentation—LLaMA3 (8B and 70B)
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Gemma2 (27B) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. We also included DeepSeek-R1 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], distilled to Llama3.3-70B-Instruct
available via Ollama, as a representative of state-of-the-art open-source models that currently fall
short in handling complex medical multi-turn dialogues. To assess the role of iterative questioning and
collaborative reasoning, we simulate physician–LLM conversations in a controlled setting. Preliminary
results show that interactive, multi-step exchanges yield more comprehensive patient assessments and
enhance diagnostic clarity. These findings are qualitatively supported by physician analysis of both
LLM decisions and their corresponding dialogue traces. As a next step, we aim to replace the simulated
physician LLM with real clinicians, enabling direct interaction with the assistant LLM. This will help
refine MedSyn for clinical deployment and further validate its utility in real-world medical settings.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. MedSyn</title>
      <p>In MedSyn, the virtual assistant is assumed to have access to all the details in the clinical note, while the physician is assumed to have access only to the patient’s
chief complaint. To gather necessary information about the patient and engage in a collaborative
discussion, the physician initiates a multi-turn interaction. In the first turn, the physician asks the
assistant for an initial evaluation of the patient. In response, the assistant carefully analyzes the clinical
note and provides a detailed observation. Following this, the physician and the virtual assistant engage
in a dynamic discussion about the patient’s condition. This exchange continues until the physician feels
they have gathered all the necessary information and is confident in their understanding of the patient’s
condition. At this point, the physician concludes the discussion and drafts the discharge text for the
patient. The discharge text may include several sections, such as the discharge diagnosis, condition,
medications, and the follow-up instructions. For this study, however, we focus solely on the “diagnosis”
and the corresponding “ICD-10 codes”2 used by clinicians to code and classify medical diagnoses.</p>
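      <p>To make the interaction protocol concrete, the turn loop can be sketched as follows. This is a minimal illustration with stubbed agents: the two turn functions and the stopping rule are hypothetical placeholders for LLM calls, not the actual MedSyn implementation.</p>
      <preformat>
```python
# Minimal sketch of the MedSyn turn loop; the agent functions below are
# stubs standing in for LLM calls (hypothetical, for illustration only).

def assistant_turn(clinical_note, message):
    # A real assistant LLM would answer using the full clinical note.
    return "evaluation based on the note, replying to: " + message

def physician_turn(chief_complaint, history):
    # A real physician LLM sees only the chief complaint plus the dialogue so far.
    if len(history) >= 4:  # toy stand-in for "confident in the diagnosis"
        return None        # None signals that the physician ends the discussion
    return "follow-up question about: " + chief_complaint

def medsyn_dialogue(chief_complaint, clinical_note):
    history = []
    message = "Please provide an initial evaluation of the patient."  # Turn 1
    while message is not None:
        history.append(("physician", message))
        history.append(("assistant", assistant_turn(clinical_note, message)))
        message = physician_turn(chief_complaint, history)
    # Turn N: the physician drafts the discharge text (diagnosis + ICD-10 codes)
    return history, {"diagnosis": "...", "codes": "..."}

history, discharge = medsyn_dialogue("chest pain", "full clinical note text")
```
      </preformat>
      <p>In the actual pipeline both roles are played by LLM agents, and the loop ends when the physician agent submits the discharge text through its tool.</p>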
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Work</title>
      <p>
        We combined MIMIC-IV3 [
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ] and MIMIC-IV-Note4 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] datasets by selecting records with ICD-10
coding [
        <xref ref-type="bibr" rid="ref13">13</xref>
          ], which covers diseases from the coarse “chapter” level (e.g., E00-E90) to finer granularities
(e.g., E10.9, where E10 is the disease category and E10.9 the specific disease code). The resulting merged
dataset contained 122,266 records spanning 5,802 unique diagnoses. Upon analyzing the discharge text
field in these records, we observed that most followed a common structure, though certain subsections
varied (e.g., the “major surgical or invasive procedure” section was present in some records but absent
in others). Samples with missing headings or free-form discharge notes hindered effective parsing and
prevented the establishment of a standardized format across all records. After consulting with three
physicians, we identified the most important sections for our experiments and excluded samples that
did not conform to the expected format. Specifically, we selected records that include the following
sections in their discharge texts: “chief complaint, history of present illness, social history, physical
exam, pertinent results, major surgical or invasive procedure, brief hospital course, medications on
admission, discharge medications, discharge diagnosis, discharge condition, and discharge instructions”.
Furthermore, we removed records where the patient’s status was “deceased” or “expired”. This filtering
process resulted in a final dataset of 74,850 records. Then, we randomly (seed=13) selected 1,000 records
as our test set. It consists of 2,350 unique diagnoses (out of a total of 13,384). The average number of
ICD-10 codes appearing in a sample is 5.61. The most common diagnosis is ‘E78.5’ (Hyperlipidemia,
Unspecified), while 1,112 diagnoses are identified as the rarest (e.g. ‘H53.40’: Unspecified visual field
defects). Since access to this dataset requires completing specialized training, CITI,5 we are unable to
publicly share our test set and LLM outputs. However, we have detailed our preprocessing steps above
and made our code available.6
      </p>
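      <p>The filtering and sampling steps above can be sketched in a few lines. The record layout (a dict with ‘discharge_text’ and ‘status’ fields) and the helper names are our assumptions for illustration; the released code remains the reference implementation.</p>
      <preformat>
```python
import random

# Section headings required in a discharge text (as listed above); records
# missing any of them, or with a deceased/expired patient, are dropped.
REQUIRED_SECTIONS = [
    "chief complaint", "history of present illness", "social history",
    "physical exam", "pertinent results", "major surgical or invasive procedure",
    "brief hospital course", "medications on admission", "discharge medications",
    "discharge diagnosis", "discharge condition", "discharge instructions",
]

def keep_record(record):
    # Hypothetical record layout: {"discharge_text": str, "status": str}.
    if record["status"].lower() in {"deceased", "expired"}:
        return False
    text = record["discharge_text"].lower()
    return all(section in text for section in REQUIRED_SECTIONS)

def make_test_set(records, n=1000, seed=13):
    # seed=13, as in the paper, for a reproducible random test split.
    kept = [r for r in records if keep_record(r)]
    rng = random.Random(seed)
    return rng.sample(kept, min(n, len(kept)))
```
      </preformat>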
      <sec id="sec-3-1">
        <title>3.1. Models &amp; Frameworks</title>
        <p>
          We investigated 25 open-source models7 across general-purpose, chat-based, and medical domains,
finding that most struggled with multi-turn dialogues. Some chat-based models (e.g., OpenChat:7B [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ])
performed poorly in medical conversations, while certain medical domain models (e.g., Meditron:7B
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and MedLlama2:7B)8 exhibited limitations in handling real-world dialogues. Among the evaluated
models, we identified three promising candidates within our experimental setup: Llama3 (8B and 70B)
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Gemma2:27B [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. To illustrate the challenges even state-of-the-art models face in medical
dialogues, we present results with DeepSeek-R1:70B [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (distilled to Llama-70B, available via Ollama9).
We implemented our multi-agent environment using Ollama10 and Langroid.11
2https://www.icd10data.com/ICD10CM/Codes/
3https://physionet.org/content/mimiciv/3.0/
4https://physionet.org/content/mimic-iv-note/2.2/
5https://physionet.org/about/citi-course/
6See our source code here: https://github.com/burcusayin/MedSyn
7command-r-plus:104b, command-r:35b, openchat:7b, mistral:7b, mistrallite:7b, mixtral:8x7b, qwen2:7b, meditron:7b, meditron:70b, medllama2:7b, llama3-chatqa:8b and 70b, llama3:8b and 70b, llama3.1:8b, llama3.2:3b, dolphin-llama3:8b, dolphin-llama3:70b, phi3:14b, nemotron:70b, alfred:40b, deepseek-R1-Distill-Llama-70B, tulu3:8b and 70b, gemma2:27b
8https://huggingface.co/llSourcell/medllama2_7b
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Use cases</title>
        <p>To assess the potential of our framework for real-world deployment in medical decision-making systems,
we simulated interactions using LLMs—one serving as the chief physician and another as the physician
assistant. As a baseline, we defined the “phy w/complaint” scenario, in which the physician LLM
receives only the patient’s chief complaint from the clinical note and generates the discharge text
without any interaction or dialogue. In contrast, the “two-agent” setup simulates the collaboration
between physicians and assistants in the real world by implementing the MedSyn pipeline (Section 2).
Here, the physician agent is limited to the chief complaint, while the assistant agent has access to the
complete clinical note, including the history of present illness, physical examination, and pertinent
results. Both configurations employ zero-shot prompting, with full prompt details provided below.
Baseline Case</p>
        <p>We use the baseline prompt in the “phy w/complaint” case.</p>
        <p>Baseline Prompt
You are Dr. Ellis, the chief physician responsible for reviewing clinical notes and writing a discharge text
for patients.
**Here is the clinical note for the patient:** {clinicalNote}.
### Instructions:
1. Carefully analyze the given clinical note to ensure that no symptoms are overlooked.
2. You are not allowed to ask any questions or make assumptions beyond the information provided
in the clinical note.
3. Once you are ready, write the discharge text for the patient.
4. The discharge text should include only the ‘diagnosis’ and ‘codes’ fields:
• ‘diagnosis’ field should specify the patient’s final diagnosis. Please note that you should
decide the final diagnosis.
• ‘codes’ field should list the ICD-10 codes corresponding to the diagnosis specified in the
‘diagnosis’ field. Keep in mind that this field is a string, do not use ‘[]’ while listing the codes.
5. Remember to refer to the clinical note while writing the discharge text. Ensure that the ‘diagnosis’
and ‘codes’ fields are complete and unambiguous; they must not be left empty or unclear.</p>
        <p>6. Return your dischargeText using the TOOL ‘baseline_discharge_text_tool’.</p>
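        <p>The contract enforced on the two fields can be sketched as a small validation schema. This plain-Python stand-in is illustrative only: the actual framework returns the discharge text through a tool call, and the specific checks below are our assumptions based on the prompt wording.</p>
        <preformat>
```python
from dataclasses import dataclass

@dataclass
class DischargeText:
    # Hypothetical stand-in for the 'baseline_discharge_text_tool' payload.
    diagnosis: str
    codes: str  # a plain comma-separated string, e.g. "E78.5, I10" -- not a list

    def validate(self):
        # Both fields must be non-empty, per instruction 5 of the prompt.
        if not self.diagnosis.strip() or not self.codes.strip():
            raise ValueError("'diagnosis' and 'codes' must not be empty")
        # Instruction 4: 'codes' is a string, not a bracketed '[]' list.
        if self.codes.strip().startswith("["):
            raise ValueError("'codes' must be a plain string, not a '[]' list")
        return True
```
        </preformat>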
        <p>Two-agent Case</p>
        <p>We use different prompts for the chief physician and physician assistant LLMs.</p>
        <p>Chief Physician Prompt
You are Dr. Ellis, the Chief Physician, collaborating with Dr. Lee, your assistant. Your task is to review a
clinical note by initiating an evaluation from Dr. Lee and engaging in a natural, focused conversation to
assess the patient’s condition. Avoid fabricating interactions or simulating dialogue with Dr. Lee. Instead,
clearly articulate your questions or follow-ups, analyze Dr. Lee’s responses, and use this information to
guide your decision-making.</p>
        <p>Your responsibilities include the following:
• Verify the patient’s condition, symptoms, and diagnosis.
• Ensure all symptoms are accounted for and thoroughly understand the patient’s condition to
deliver optimal care.
• Address doubts regarding the diagnosis or treatment plan by conducting further evaluations with
Dr. Lee to achieve accurate and effective results.</p>
        <p>**Here is the clinical note for the patient:** {clinicalNote}.
### Instructions:
1. Begin by requesting an initial evaluation of the patient from Dr. Lee.
2. Engage in a collaborative discussion to confirm the patient’s diagnosis. Please note that Dr. Lee
has access to a more detailed clinical note, so you MUST consult Dr. Lee to obtain the necessary
information for making the diagnosis.
3. Keep in mind that you have limited time for every patient. Please avoid duplicate recommendations,
conversations, and questions related to treatments. Keep each message CONCISE and to the point.
4. Once you have gathered sufficient information and are confident in the diagnosis, stop the
discussion and write the patient’s discharge text.
5. The discharge text should include only the ‘diagnosis’ and ‘codes’ fields:
• ‘diagnosis’ field should specify the patient’s final diagnosis. Please note that you should
decide the final diagnosis.
• ‘codes’ field should list the ICD-10 codes corresponding to the diagnosis specified in the
‘diagnosis’ field.
6. Remember to refer to your discussion with Dr. Lee and the clinical note while writing the discharge
text. Ensure that the ‘diagnosis’ and ‘codes’ fields are complete and unambiguous; they must not
be left empty or unclear.
7. Do NOT ask Dr. Lee to check or write your dischargeText. It is YOUR RESPONSIBILITY to write
and submit the dischargeText.
8. Return your dischargeText using the TOOL ‘discharge_text_tool’. Do NOT mention the TOOL
‘discharge_text_tool’ to Dr. Lee.
9https://ollama.com/library/deepseek-r1
10https://github.com/ollama
11https://github.com/langroid/langroid</p>
        <p>Physician Assistant Prompt
You are Dr. Lee, an assistant physician working under the supervision of Dr. Ellis, the chief physician.
Your role is to review the patient’s clinical notes to perform an initial evaluation, which will support Dr.
Ellis in assessing the patient’s condition and writing the discharge text. Following your evaluation, you
will engage in a collaborative discussion with Dr. Ellis to confirm the diagnosis and determine the next
steps.
**Here is the clinical note for the patient:** {clinicalNote}.
### Task: Thoroughly analyze the clinical note and provide a structured summary that includes:
• Key symptoms: Highlight notable symptoms that may require further investigation.
• Preliminary diagnosis: Offer an initial diagnosis based on the patient’s symptoms and medical
history.
• Potential complications: Identify any critical issues or risks Dr. Ellis should consider.
• Recommendations: Suggest further evaluations if uncertainties remain about the patient’s
condition.
### Instructions:
1. Ensure your evaluation is clear, precise, and structured to facilitate an informed discussion.
2. In each round of the discussion, limit yourself to a CONCISE message.
3. Keep in mind that you have limited time for every patient. Please avoid duplicate recommendations,
conversations, and questions related to treatments.
### Process: You will first receive a message from Dr. Ellis, asking for your initial assessment. Afterward,
you can follow up in each discussion round to collaboratively refine the diagnosis.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        Directly comparing discharge texts with LLM responses using standard metrics presents several
challenges: (i) Discharge texts lack the conversational tone of LLM responses, (ii) LLMs may generate
lists of ICD-10 codes and diagnoses of variable length, including occasional hallucinated codes,12 (iii)
Physicians often employ abbreviations and specialized formatting in discharge texts, whereas LLMs produce
more standard, conversational sentences, and (iv) The ground truth for diagnoses and ICD-10 codes is
longer than LLM outputs. According to two of the in-house physician annotators, this discrepancy
arises because physicians include codes for current and past illnesses based on system recommendations,
while LLMs are limited to the information provided in the prompt, which in our case focuses on current
symptoms rather than a comprehensive patient history. Thus, specific metrics designed for ICD code
detection [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] are unsuitable.
      </p>
      <p>ICD-10 Classification As stated in §3, ICD-10 contains coarse and fine-grained definitions of diseases.
In preliminary experiments, we observed that all LLMs tended not to generate fine-grained codes, which
could be expected in our zero-shot multi-label classification setup. We explored this issue by discussing
several ground-truth examples with physicians: they brought to our attention that, when selecting
ICD-10 subcodes – often very specific to the diagnosis – different physicians might choose different
codes among those corresponding to the same primary diagnosis; most importantly, they highlighted
that physicians tend to include in the patient’s medical record codes for all the acute or chronic conditions
a patient is affected by, hence including several codes actually unrelated to the specific chief complaint.
This characteristic of the ground truth makes the selection of evaluation metrics challenging, as it is
impossible to selectively remove the ICD codes unrelated to the chief complaint. For this reason, we
resort to computing precision, recall, F1-score, and Jaccard similarity score13 on a per-sample basis, and
report the mean values in Table 1. F1 and Recall show that the agents struggled to accurately predict
disease categories, frequently missing ICD codes present in the ground truth. Regarding Precision, all
models performed better in predicting disease chapters, a simpler task than detecting disease categories.
DeepSeek-R1 and Llama3:70B performed best in the “phy w/complaint” case (in terms of precision),
with the former excelling in Disease Category and the latter in Disease Chapter.</p>
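      <p>The per-sample scores can be sketched as set comparisons between predicted and ground-truth codes after truncating them to the desired granularity. The truncation below handles only the category level (chapter-level evaluation would additionally map categories to chapter ranges such as E00-E90); it is a simplified sketch under these assumptions, and the released metrics script is the reference implementation.</p>
      <preformat>
```python
def category(code):
    # "E10.9" -> "E10": keep the three-character ICD-10 category prefix.
    return code.split(".")[0][:3].upper()

def per_sample_scores(pred_codes, gold_codes):
    pred, gold = set(pred_codes), set(gold_codes)
    tp = len(pred.intersection(gold))
    union = pred.union(gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / len(union) if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}

# Category-level evaluation for a single sample; mean values over all
# samples would then be reported, as in Table 1.
pred = [category(c) for c in ["E10.9", "I10"]]
gold = [category(c) for c in ["E10.2", "E78.5", "I10"]]
scores = per_sample_scores(pred, gold)
```
      </preformat>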
      <p>In the “two-agent” case, we observed that DeepSeek-R1 struggled to engage in dialogue. Despite explicitly
stating in the prompt that it must consult the assistant before making a diagnosis, it often relied on
internal reasoning and directly generated the discharge text, with minimal interaction with its assistant.
Figure 2 shows the number of turns each &lt;chief physician agent, Llama3:8B&gt; pair produced per sample
in the “two-agent” case. Notably, DeepSeek-R1:70B engaged in conversations infrequently, whereas
Llama3:70B exhibited higher interaction, averaging 19.2 turns per sample. Both the Llama3:70B and
Gemma2:27B models demonstrated strong performance in engaging in effective dialogues with their
assistants and generating well-structured discharge summaries. However, Gemma2:27B was more
efficient in dialogue, generating the discharge text in 9.3 turns on average. Additionally, Llama3:8B
proved to be an effective physician assistant, responding concisely to the chief physician and extracting
the necessary information from the clinical note. This is evident from the two-agent performance, which
closely approaches that of the “phy w/full_note” case, in which the physician receives the full clinical note
and generates the discharge text without any interaction with an assistant. Our preliminary findings suggest
that open-source LLMs hold promise as physician assistants in real-world clinical settings. However, further
analysis is needed to clarify their limitations and improve performance.</p>
      <p>Qualitative Analysis by Physicians The use of LLMs in a healthcare setting has shown interesting
results from a clinical perspective. The “phy w/complaint” case showed that, starting from the main
symptom, LLM was able to identify a possible diagnosis despite having no access to additional clinical
and instrumental information. However, it could only align with a subset of the physician’s diagnostic
hypothesis and was unable to provide a detailed diagnosis. On the other hand, the “two-agent” scenario
yielded better results in terms of diagnostic precision and completeness. In particular, the Gemma2:27B
model made precise diagnoses when interacting with the Llama3:8B model, identifying even rare
conditions that could be overlooked by a physician (e.g., Ludwig’s angina). The interaction between
the physician LLM and the assistant LLM allowed for a more complete diagnosis, as the physician
could obtain additional information regarding the patient’s characteristics and instrumental exams. In
this case, the main challenge was distinguishing between acute and chronic conditions, as there were
instances where the chief physician agent identified a pre-existing condition as the primary diagnosis.
DeepSeek-R1 did not perform well in the “two-agent” case and did not improve the diagnosis compared
to the “phy w/complaint” case, often merely repeating the diagnosis already made. Regarding the
identification of ICD-10 codes, LLMs were consistently able to identify the general category of the clinical
condition, although the specific subcode often differed from the one in the dataset. The two-agent scenario
proved to be a valuable resource for physicians, as it allows them to interact with an assistant that provides
information and often suggests difficult diagnoses. It can be a useful tool for speeding up the diagnostic process.
12For instance, writing the code M3459 for the diagnosis “Multiple Sclerosis Flare”: the code M3459 does not exist; “M34”
corresponds to the “systemic sclerosis” disease, which is unrelated to “multiple sclerosis” (“G35”).
13Please see our code for the evaluation: https://github.com/burcusayin/MedSyn/blob/main/src/evaluation/metrics.py
[Figure 2: Distribution of the number of dialogue turns per sample for DeepSeek-R1:70B, Llama3:70B, and Gemma2:27B (y-axis: frequency).]</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>
        Prior studies explored multi-LLM frameworks to enhance accuracy and reasoning, primarily focusing
on closed-ended questions [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21 ref22 ref23 ref24">17, 18, 19, 20, 21, 22, 23, 24</xref>
        ]. However, their applications remain confined to
controlled settings, with limited exploration of real-world human-LLM collaboration. Evaluating LLMs’
multi-turn dialogue capabilities is a step toward practical applications. Kwan et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] introduced
the MT-Eval benchmark, finding that closed-source models outperform open-source ones, though
multi-turn dialogues degrade performance due to retrieval difficulties and error propagation. Bai et al.
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] proposed MT-Bench-101 to assess LLMs in multi-turn dialogues, noting issues with adaptability
and interactivity. Alignment techniques like RLHF [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and DPO [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], as well as chat-specific designs,
ofered limited benefits for multi-turn tasks. Campedelli et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] examined open-source LLMs in
goal-driven collaborations and observed mixed success, with models like Mixtral [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and Mistral [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]
exhibiting higher failure rates.
      </p>
      <p>
        In healthcare, LLMs have been explored for clinical note summarization [
        <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
        ], aiming to assist
physicians, though issues such as hallucinations and missing information persist [
        <xref ref-type="bibr" rid="ref34 ref35">34, 35</xref>
        ]. Additionally,
metrics like ROUGE [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] and BLEU [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] used to assess summary quality have faced criticism regarding
their effectiveness in evaluating clinical content. Furthermore, simulated patient-doctor interactions
have been explored to enhance diagnostic accuracy. Liao et al. [38] improved accuracy by prompting
LLMs to ask clarifying questions, though hallucinations persisted. Liu et al. [39] introduced the
LLM-specific clinical pathway (LCP) to evaluate diagnostic performance using subjective and objective patient
data, revealing challenges in handling multi-turn dialogues and clinical specialties, though their study
focused solely on the Chinese language. Xie et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] emphasized LLMs as supportive tools rather than
replacements, developing the DoctorFLAN dataset and DotaBench to benchmark medical tasks. While
most LLMs underperformed, DotaGPT, trained on DoctorFLAN, achieved superior results, demonstrating
the dataset’s effectiveness. However, its availability only in Chinese limits the generalizability of the
findings to other languages. Kim et al. [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] proposed MDAgents, a framework that improves LLM
effectiveness in complex medical decision-making by dynamically structuring collaboration models.
It adapts to clinical needs by assigning LLMs independently or in groups based on task complexity.
However, it fails to consider the critical role of physicians in medical decisions. Finally, Fan et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
proposed the AI Hospital framework for simulated clinical diagnostics, whereas our approach focuses
on iterative physician-LLM collaboration to refine clinical reasoning and decision-making.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This work-in-progress paper introduced MedSyn, a dynamic human-AI collaboration framework
designed to enhance clinical decision-making through multi-turn, conversational interactions between
physicians and LLMs. Unlike traditional, static decision-support tools, MedSyn fosters an iterative
diagnostic process where human expertise and AI-generated insights evolve together, aiming to create
a safety net in complex medical scenarios. Through controlled simulations and qualitative analysis, we
showed that open-source LLMs can meaningfully assist physicians by uncovering overlooked information,
proposing alternative hypotheses, and contributing to more comprehensive diagnostic reasoning. While
performance varies across models, they show promise in improving diagnostic completeness and in
identifying rare conditions. In addition,
physician evaluations highlighted the value of AI assistants not only in information retrieval, but also
in hypothesis generation and diagnostic refinement. Despite encouraging results, challenges remain in
aligning model outputs with clinical standards, particularly in generating accurate ICD-10 codes and in
handling nuances such as chronic versus acute conditions. These findings underscore the importance of
continued iteration on evaluation metrics and dialogue strategies.</p>
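One such nuance is that strict exact-match scoring of ICD-10 codes penalizes predictions that land in the right disease category but differ in their extension (for instance, a chronic versus an acute form of the same condition). A minimal sketch of a more lenient, category-level comparison is shown below; the function names and example codes are illustrative assumptions, not part of MedSyn's actual evaluation pipeline.

```python
# Hypothetical sketch of lenient ICD-10 comparison: credit is also given when
# predicted and gold codes share the same 3-character category, even if their
# extensions differ (e.g. J44.0 vs. J44.1, both forms of COPD).
# Names and example codes are illustrative, not taken from MedSyn.

def icd10_category(code: str) -> str:
    """Return the 3-character ICD-10 category, e.g. 'J44.1' -> 'J44'."""
    return code.replace(".", "")[:3]

def score_codes(predicted: set[str], gold: set[str]) -> dict:
    """Count exact code matches and category-level matches."""
    exact = predicted & gold
    pred_cats = {icd10_category(c) for c in predicted}
    gold_cats = {icd10_category(c) for c in gold}
    return {
        "exact_matches": len(exact),
        "category_matches": len(pred_cats & gold_cats),
    }

# J44.1 vs. J44.0 misses at the exact level but matches at the category level.
print(score_codes({"J44.1", "I10"}, {"J44.0", "I10"}))
```

Reporting both counts separates outright coding errors from near-misses within the correct disease category, which is the distinction the evaluation-metric iteration above is concerned with.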
      <p>Future work will involve human-in-the-loop evaluations, enabling real physicians to engage with
MedSyn in real-world settings and provide feedback on usability, relevance, and trustworthiness. We
also plan to enhance MedSyn’s factual accuracy in clinical reasoning and coding, ensuring more robust
and reliable support. This line of research is critical for the responsible integration of AI into clinical
workflows—aiming to reduce diagnostic errors, support clinician decision-making, and ultimately
improve patient outcomes. MedSyn represents a step toward more adaptive, intelligent healthcare
systems where AI serves not as a replacement, but as a reliable and responsive partner in healthcare.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Funded by the European Union. Views and opinions expressed are however those of the author(s) only
and do not necessarily reflect those of the European Union or the European Health and Digital Executive
Agency (HaDEA). Neither the European Union nor the granting authority can be held responsible for
them. Grant Agreement no. 101120763 - TANGO. Andrea Passerini also acknowledges the support of
the MUR PNRR project FAIR - Future AI Research (PE00000013) funded by the NextGenerationEU.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this manuscript, the authors utilized ChatGPT and Grammarly to assist with
paraphrasing, improving writing style, and refining grammar. After using these tools, the authors
reviewed and edited the content as needed and took full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1] G. Saposnik, D. Redelmeier, C. C. Ruff, P. N. Tobler,
<article-title>Cognitive biases associated with medical decisions: a systematic review</article-title>,
<source>BMC Medical Informatics and Decision Making</source>
<volume>16</volume>
(<year>2016</year>)
<fpage>138</fpage>. doi:10.1186/s12911-016-0377-1.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
[2] A. N. Meyer, T. D. Giardina, L. Khawaja, H. Singh,
<article-title>Patient and clinician experiences of uncertainty in the diagnostic process: Current understanding and future directions</article-title>,
<source>Patient Education and Counseling</source>
<volume>104</volume>
(<year>2021</year>)
<fpage>2606</fpage>-<lpage>2615</lpage>. URL: https://www.sciencedirect.com/science/article/pii/S0738399121004870. doi:10.1016/j.pec.2021.07.028.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[3] W. Xie, Q. Xiao, Y. Zheng, X. Wang, J. Chen, K. Ji, A. Gao, X. Wan, F. Jiang, B. Wang,
<article-title>LLMs for doctors: Leveraging medical LLMs to assist doctors, not replace them</article-title>,
<source>arXiv abs/2406.18034</source>
(<year>2024</year>). URL: https://arxiv.org/abs/2406.18034.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4] Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, H. W. Park,
<article-title>MDAgents: An adaptive collaboration of LLMs for medical decision-making</article-title>,
<source>arXiv abs/2404.15155</source>
(<year>2024</year>). URL: https://arxiv.org/abs/2404.15155.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
[5] Y. Kim, C. Park, H. Jeong, C. Grau-Vilchez, Y. S. Chan, X. Xu, D. McDuff, H. Lee, C. Breazeal, H. W. Park,
<article-title>A demonstration of adaptive collaboration of large language models for medical decision-making</article-title>,
<source>arXiv abs/2411.00248</source>
(<year>2024</year>). URL: https://arxiv.org/abs/2411.00248.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
[6] Z. Fan, L. Wei, J. Tang, W. Chen, W. Siyuan, Z. Wei, F. Huang,
<article-title>AI hospital: Benchmarking large language models in a multi-agent medical interaction simulator</article-title>,
in: O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, S. Schockaert (Eds.),
<source>Proceedings of the 31st International Conference on Computational Linguistics</source>,
Association for Computational Linguistics, Abu Dhabi, UAE,
<year>2025</year>, pp. <fpage>10183</fpage>-<lpage>10213</lpage>. URL: https://aclanthology.org/2025.coling-main.680/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
[7] A. Johnson, L. Bulgarelli, L. Shen, A. Gayles, A. Shammout, S. Horng, T. Pollard, S. Hao, B. Moody, B. Gow, L.-w. Lehman, L. Celi, R. Mark,
<article-title>MIMIC-IV, a freely accessible electronic health record dataset</article-title>,
<source>Scientific Data</source>
<volume>10</volume>
(<year>2023</year>) 1. doi:10.1038/s41597-022-01899-x.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[8] A. Johnson, T. Pollard, S. Horng, L. A. Celi, R. Mark,
<article-title>MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2)</article-title>,
<source>PhysioNet</source>
(<year>2023</year>). doi:10.13026/1n74-ne17.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[9] A. Goldberger, L. Amaral, L. Glass, S. Havlin, J. Hausdorff, P. Ivanov, R. Mark, J. Mietus, G. Moody, C.-K. Peng, H. Stanley,
<article-title>PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals</article-title>,
<source>Circulation</source>
<volume>101</volume>
(<year>2000</year>).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[10] Llama Team, AI @ Meta,
<source>The Llama 3 herd of models</source>,
<year>2024</year>. URL: https://arxiv.org/abs/2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
[11] Gemma Team, Google DeepMind,
<article-title>Gemma 2: Improving open language models at a practical size</article-title>,
<source>arXiv abs/2408.00118</source>
(<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
[12] DeepSeek-AI Team,
<article-title>DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning</article-title>,
<source>arXiv abs/2501.12948</source>
(<year>2025</year>).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
[13] WHO,
<source>International Classification of Diseases (ICD)</source>,
<year>2016</year>. URL: http://www.who.int/classifications/icd/en/, accessed on 2021-04-14.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
[14] G. Wang, S. Cheng, X. Zhan, X. Li, S. Song, Y. Liu,
<article-title>OpenChat: Advancing open-source language models with mixed-quality data</article-title>,
<source>arXiv preprint arXiv:2309.11235</source>
(<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
[15] Z. Chen, A. H. Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini, S. Fan, A. Köpf, A. Mohtashami, A. Sallinen, A. Sakhaeirad, V. Swamy, I. Krawczuk, D. Bayazit, A. Marmet, S. Montariol, M.-A. Hartley, M. Jaggi, A. Bosselut,
<article-title>Meditron-70B: Scaling medical pretraining for large language models</article-title>,
<source>arXiv abs/2311.16079</source>
(<year>2023</year>). URL: https://arxiv.org/abs/2311.16079.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
[16] J. Edin, A. Junge, J. D. Havtorn, L. Borgholt, M. Maistro, T. Ruotsalo, L. Maaløe,
<article-title>Automated medical coding on MIMIC-III and MIMIC-IV: A critical review and replicability study</article-title>,
in: <source>Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>,
SIGIR '23, Association for Computing Machinery, New York, NY, USA,
<year>2023</year>, pp. <fpage>2572</fpage>-<lpage>2582</lpage>. URL: https://doi.org/10.1145/3539618.3591918. doi:10.1145/3539618.3591918.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
[17] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, Z. Liu,
<article-title>ChatEval: Towards better LLM-based evaluators through multi-agent debate</article-title>,
in: <source>The Twelfth International Conference on Learning Representations</source>,
<year>2024</year>. URL: https://openreview.net/forum?id=FQepisCUWu.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
[18] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, I. Mordatch,
<article-title>Improving factuality and reasoning in language models through multiagent debate</article-title>,
in: <source>Proceedings of the 41st International Conference on Machine Learning, ICML'24</source>,
JMLR.org, <year>2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
[19] D. Jiang, X. Ren, B. Y. Lin,
<article-title>LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion</article-title>,
in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
<source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>,
Association for Computational Linguistics, Toronto, Canada,
<year>2023</year>, pp. <fpage>14165</fpage>-<lpage>14178</lpage>. URL: https://aclanthology.org/2023.acl-long.792/. doi:10.18653/v1/2023.acl-long.792.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
[20] G. Li, H. A. Al Kader Hammoud, H. Itani, D. Khizbullin, B. Ghanem,
<article-title>CAMEL: Communicative agents for "mind" exploration of large language model society</article-title>,
in: <source>Proceedings of the 37th International Conference on Neural Information Processing Systems</source>,
NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
<year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <article-title>Encouraging divergent thinking in large language models through multi-agent debate</article-title>
          , in:
          <string-name><given-names>Y.</given-names> <surname>Al-Onaizan</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Bansal</surname></string-name>
          ,
          <string-name><given-names>Y.-N.</given-names> <surname>Chen</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>17889</fpage>
          -
          <lpage>17904</lpage>
          . URL: https://aclanthology.org/2024.emnlp-main.992/. doi:10.18653/v1/2024.emnlp-main.992.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Dynamic LLM-agent network: An LLM-agent collaboration framework with agent team optimization</article-title>
          ,
          <source>arXiv abs/2310.02170</source>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:276421095.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Kong</surname></string-name>
          ,
          <article-title>Corex: Pushing the boundaries of complex reasoning through multi-model collaboration</article-title>
          ,
          <source>arXiv abs/2310.00280</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Bansal</surname></string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Awadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Burger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>AutoGen: Enabling next-gen LLM applications via multi-agent conversations</article-title>
          ,
          <source>in: First Conference on Language Modeling</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=BAakY1hNKS.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name><given-names>W.-C.</given-names> <surname>Kwan</surname></string-name>
          ,
          <string-name><given-names>X.</given-names> <surname>Zeng</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Jiang</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Shang</surname></string-name>
          ,
          <string-name><given-names>X.</given-names> <surname>Jiang</surname></string-name>
          ,
          <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name><given-names>K.-F.</given-names> <surname>Wong</surname></string-name>
          ,
          <article-title>MT-Eval: A multi-turn capabilities evaluation benchmark for large language models</article-title>
          , in:
          <string-name><given-names>Y.</given-names> <surname>Al-Onaizan</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Bansal</surname></string-name>
          ,
          <string-name><given-names>Y.-N.</given-names> <surname>Chen</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>20153</fpage>
          -
          <lpage>20177</lpage>
          . URL: https://aclanthology.org/2024.emnlp-main.1124/. doi:10.18653/v1/2024.emnlp-main.1124.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Ouyang</surname></string-name>
          ,
          <article-title>MT-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues</article-title>
          , in:
          <string-name><given-names>L.-W.</given-names> <surname>Ku</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>
          ,
          <string-name><given-names>V.</given-names> <surname>Srikumar</surname></string-name>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7421</fpage>
          -
          <lpage>7454</lpage>
          . URL: https://aclanthology.org/2024.acl-long.401/. doi:10.18653/v1/2024.acl-long.401.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bengs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          ,
          <article-title>A survey of reinforcement learning from human feedback</article-title>
          ,
          <source>arXiv abs/2312.14925</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2312.14925.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rafailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Mitchell</surname></string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Direct preference optimization: your language model is secretly a reward model</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          , NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Campedelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Penzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dessì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lepri</surname>
          </string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Staiano</surname></string-name>
          ,
          <article-title>I want to break free! Persuasion and anti-social behavior of LLMs in multi-agent settings with social hierarchy</article-title>
          ,
          <source>arXiv abs/2410.07109</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2410.07109.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Savary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>de las Casas</surname></string-name>
          ,
          <string-name><given-names>E. B.</given-names> <surname>Hanna</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Bressand</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Lengyel</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Bour</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Lample</surname></string-name>
          ,
          <string-name><given-names>L. R.</given-names> <surname>Lavaud</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Saulnier</surname></string-name>
          ,
          <string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Stock</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Subramanian</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Yang</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Antoniak</surname></string-name>
          ,
          <string-name><given-names>T. L.</given-names> <surname>Scao</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Gervet</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Lavril</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Lacroix</surname></string-name>
          ,
          <string-name><given-names>W. E.</given-names> <surname>Sayed</surname></string-name>
          ,
          <article-title>Mixtral of experts</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2401.04088.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>de las Casas</surname></string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Lengyel</surname></string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          ,
          <string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Stock</surname></string-name>
          ,
          <string-name><given-names>T. L.</given-names> <surname>Scao</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Lavril</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Lacroix</surname></string-name>
          ,
          <string-name><given-names>W. E.</given-names> <surname>Sayed</surname></string-name>
          ,
          <article-title>Mistral 7B</article-title>
          ,
          <source>arXiv abs/2310.06825</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2310.06825.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bigham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <article-title>Generating SOAP notes from doctor-patient conversations using modular summarization techniques</article-title>
          , in:
          <string-name><given-names>C.</given-names> <surname>Zong</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Xia</surname></string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>
          (Eds.),
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Online,
          <year>2021</year>
          , pp.
          <fpage>4958</fpage>
          -
          <lpage>4972</lpage>
          . URL: https://aclanthology.org/2021.acl-long.384/. doi:10.18653/v1/2021.acl-long.384.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bajracharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Berlowitz</surname>
          </string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Levy</surname></string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Generation of patient after-visit summaries to support physicians</article-title>
          , in:
          <string-name><given-names>N.</given-names> <surname>Calzolari</surname></string-name>
          ,
          <string-name><given-names>C.-R.</given-names> <surname>Huang</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Kim</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Pustejovsky</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Wanner</surname></string-name>
          ,
          <string-name><given-names>K.-S.</given-names> <surname>Choi</surname></string-name>
          ,
          <string-name><given-names>P.-M.</given-names> <surname>Ryu</surname></string-name>
          ,
          <string-name><given-names>H.-H.</given-names> <surname>Chen</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Donatelli</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Ji</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Kurohashi</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Paggio</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Xue</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Hahm</surname></string-name>
          ,
          <string-name><given-names>Z.</given-names> <surname>He</surname></string-name>
          ,
          <string-name><given-names>T. K.</given-names> <surname>Lee</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Santus</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Bond</surname></string-name>
          ,
          <string-name><given-names>S.-H.</given-names> <surname>Na</surname></string-name>
          (Eds.),
          <source>Proceedings of the 29th International Conference on Computational Linguistics</source>
          ,
          International Committee on Computational Linguistics
          , Gyeongju, Republic of Korea,
          <year>2022</year>
          , pp.
          <fpage>6234</fpage>
          -
          <lpage>6247</lpage>
          . URL: https://aclanthology.org/2022.coling-1.544/.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name><given-names>W.-w.</given-names> <surname>Yim</surname></string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>An empirical study of clinical note generation from doctor-patient encounters</article-title>
          , in:
          <string-name><given-names>A.</given-names> <surname>Vlachos</surname></string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Augenstein</surname></string-name>
          (Eds.),
          <source>Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>2291</fpage>
          -
          <lpage>2302</lpage>
          . URL: https://aclanthology.org/2023.eacl-main.168/. doi:10.18653/v1/2023.eacl-main.168.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>F.</given-names>
            <surname>Moramarco</surname>
          </string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Papadopoulos Korfiatis</surname></string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Juric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Flann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Belz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Savkov</surname>
          </string-name>
          ,
          <article-title>Human evaluation and correlation with automatic metrics in consultation note generation</article-title>
          , in:
          <string-name><given-names>S.</given-names> <surname>Muresan</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Nakov</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Villavicencio</surname></string-name>
          (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5739</fpage>
          -
          <lpage>5754</lpage>
          . URL: https://aclanthology.org/2022.acl-long.394/. doi:10.18653/v1/2022.acl-long.394.
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name><given-names>C.-Y.</given-names> <surname>Lin</surname></string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          ,
          <source>in: Text Summarization Branches Out</source>
          , Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013/.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name><given-names>W.-J.</given-names> <surname>Zhu</surname></string-name>
          ,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          , in:
          <string-name><given-names>P.</given-names> <surname>Isabelle</surname></string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Charniak</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Lin</surname></string-name>
          (Eds.),
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>