<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Behavioural Study of an Intelligent Agent Guided by Local LLM Models in a Doctor-Patient Scenario</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Borque-Benedí</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Tellería</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paula Peña-Larena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael del-Hoyo-Alonso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aragon Institute of Technology (ITA), María de Luna</institution>
          ,
          <addr-line>7-8, 50018 Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Aragonés de Ciencias de la Salud (IACS), C. de San Juan Bosco</institution>
          ,
          <addr-line>13, 50009 Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study analyzes the behaviour of intelligent agents powered by Large Language Models (LLMs) in a doctor-patient interaction scenario. Various configurations were developed and tested, where agents played specific roles using both open-source and proprietary models, such as AI-Growth-Lab_llama-2-7b-clinical-innovation, meditron-7b and GPT-4. The main objective was to evaluate these models' ability to generate realistic and consistent interactions in a clinical context. Anonymized medical records were used and a conversational interface based on AutoGen was implemented. The results indicate that while proprietary models like GPT-4 exhibit greater coherence and accuracy in medical reasoning, certain open-source models demonstrated promising capabilities, achieving reasonable performance particularly with careful prompt engineering, although requiring further optimization to match the robustness observed in proprietary counterparts. This work lays the foundation for improving conversational agents in medical training and opens the door for future research on optimizing open-source models for use in healthcare environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Medical Simulation</kwd>
        <kwd>Conversational Agents</kwd>
        <kwd>Healthcare Training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have experienced rapid advancements recently, enabling their
application in diverse fields, including healthcare [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These models, trained on vast amounts of text
data, are capable of generating human-like responses and engaging in complex interactions. In the
medical domain, AI-driven conversational agents hold significant potential to revolutionize training [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
diagnosis support [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and patient interaction [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>This study focuses on developing and evaluating intelligent agents powered by LLMs to simulate
doctor-patient interactions. Originating from a collaboration between ITA and IACS, this work evaluates
the feasibility of using LLM-based agents for realistic clinical dialogue simulation. The primary goal is
to determine effective configurations and compare the performance of different LLMs in generating
realistic and coherent interactions between a virtual patient and a virtual or human doctor.</p>
      <p>
        To accomplish this, we explored various LLMs, including the proprietary model GPT-4 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], known
for its strong performance on medical benchmarks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and several open-source alternatives specifically
tuned or relevant for medical contexts, such as AI-Growth-Lab_llama-2-7b-clinical-innovation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
meditron-7b [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The agents were designed to participate in simulated medical consultations, where a
virtual patient describes symptoms and a simulated or human doctor responds with diagnostic reasoning
and recommendations. Different configurations were tested to evaluate the models’ ability to simulate
realistic behaviours, maintain contextual coherence, and adapt to potentially specified emotional states
within the conversation.
      </p>
      <p>
        Additionally, an interactive interface was developed using the AutoGen framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], enabling
dynamic role-based interactions between agents. The system was tested with anonymized medical
records representing common clinical scenarios (e.g., congestive heart failure) to ground the simulations.
      </p>
      <p>Our evaluation provides comparative insights into the strengths and limitations of these LLMs
for medical dialogue simulation. While proprietary models like GPT-4 demonstrated generally higher
coherence and clinical reasoning alignment, certain open-source models exhibited promising capabilities,
achieving reasonable performance with careful prompt engineering, though often requiring further
fine-tuning for consistent role adherence and nuanced responses. This research lays the groundwork for
improving AI-driven medical simulations and highlights the challenges and opportunities in optimizing
open-source LLMs for reliable use within clinical training environments. The remainder of this paper
reviews related work (Sec. 2), describes the system architecture (Sec. 3), presents our evaluation
methodology and findings (Sec. 4), and discusses implications and future directions (Sec. 5).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The use of Large Language Models (LLMs) in healthcare has rapidly expanded [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These models have been
explored for various purposes, including diagnostic assistance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], summarizing clinical notes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
enhancing patient interaction [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and notably, for training simulations for healthcare professionals [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Central to simulation is the management of conversational agents, leading researchers to investigate
different frameworks for managing intelligent agents in complex, role-based dialogue scenarios.
      </p>
      <p>
        One generative agent framework is Smallville [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which simulates believable agent behaviours
in a virtual town. Smallville enables complex multi-agent interactions. However, our analysis and
preliminary tests [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] suggested its architecture, emphasizing memory and planning over long horizons,
was less suited for the immediate, reactive dialogue flow typical of medical consultations compared to
frameworks designed specifically for conversational agent orchestration.
      </p>
      <p>
        An alternative approach adopted in this work is AutoGen [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a framework designed for
orchestrating multiple LLM-driven agents with distinct roles and interactive capabilities. AutoGen provides
mechanisms for defining agent roles, communication patterns, and tool usage, making it particularly
suitable for simulating multi-turn conversations like doctor-patient dialogues. Unlike Smallville’s focus
on long-term planned behaviour, AutoGen facilitates flexible conversational flows where agents react
to each other’s utterances, while still adhering to predefined personas and instructions.
      </p>
      <p>
        Additionally, our work involved evaluating various open-source LLMs released through platforms
like Hugging Face [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], focusing on models with potential medical relevance like
AI-Growth-Lab_llama-2-7b-clinical-innovation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and meditron-7b [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These models were compared against GPT-4, a
high-performing proprietary model whose capabilities in medical knowledge and reasoning have been
documented [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. While benchmarks like the USMLE provide one dimension of evaluation [
        <xref ref-type="bibr" rid="ref3 ref6">6, 3</xref>
        ],
assessing performance in nuanced, interactive dialogue simulation remains crucial.
      </p>
      <p>Despite the growing interest in AI-driven medical agents, rigorous comparative studies focusing
specifically on the conversational fidelity and role consistency of different LLMs (open-source vs.
proprietary) in simulated clinical dialogues are still emerging.</p>
    </sec>
    <sec id="sec-3">
      <title>3. AI-Driven Doctor-Patient Interaction System</title>
      <p>This section describes the overall architecture (Sec. 3.1), the data processing pipeline (Sec. 3.2), and the
framework used for recommendation generation and evaluation (Sec. 3.3).</p>
      <sec id="sec-3-1">
        <title>3.1. Architecture</title>
        <p>
          The system architecture is designed to support structured doctor-patient conversations, enabling
simulations involving various configurations: two virtual agents (AI Doctor-AI Patient), a human
doctor interacting with a virtual patient, or three virtual agents (AI Doctor-AI Patient-AI Nurse). The
architecture is built using the AutoGen framework [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which facilitates multi-agent interactions and
integrates different Large Language Models (LLMs) to drive the agents’ behaviour.
        </p>
        <sec id="sec-3-1-1">
          <title>3.1.1. System Components</title>
            <p>The system consists of four main components, illustrated in Figure 1:</p>
            <list list-type="bullet">
              <list-item>
                <p>Agent Manager: Orchestrates the conversation flow using AutoGen’s capabilities, ensuring that agents adhere to their assigned roles (e.g., doctor, patient, nurse) and communication protocols.</p>
              </list-item>
              <list-item>
                <p>LLM Engine: Generates agent responses based on role-specific prompts, conversation history, and contextual medical data. It supports multiple models, including GPT-4, AI-Growth-Lab_llama-2-7b-clinical-innovation, meditron-7b, and mistral [<xref ref-type="bibr" rid="ref14">14</xref>].</p>
              </list-item>
              <list-item>
                <p>Conversational Interface: Implemented with Chainlit, providing a user-friendly chat interface for real-time interaction and visualization of agent conversations.</p>
              </list-item>
              <list-item>
                <p>Medical Data Processor: Handles anonymized clinical source data (provided by IACS), transforming them into structured patient profiles or case summaries suitable for initializing agent interactions.</p>
              </list-item>
            </list>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Role-Based Agent Configuration</title>
          <p>Each agent operates based on a predefined role and behavior, defined via customized prompts. Examples include:</p>
          <list list-type="bullet">
            <list-item>
              <p>Virtual Patient: Simulates a patient with specific medical conditions, instructed to respond based on provided case data and potentially exhibit predefined emotional states (e.g., anxious, calm).</p>
            </list-item>
            <list-item>
              <p>Virtual Doctor: Engages in diagnostic reasoning, asks relevant questions, and suggests potential diagnoses or next steps based on the interaction and its underlying LLM’s capabilities.</p>
            </list-item>
            <list-item>
              <p>Virtual Nurse (Optional): Can be configured to assist in patient care coordination, act as an intermediary, or provide specific information when needed.</p>
            </list-item>
          </list>
          <p>The AutoGen framework allows flexible configuration of these roles and their interaction patterns (e.g.,
turn-taking sequences) based on the specific simulation scenario.</p>
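          <p>As an illustration of how such role definitions could be expressed, the following minimal Python sketch assembles a system prompt from a role name, its duties, an optional emotional state, and optional case data. The names RoleSpec and build_system_prompt, the prompt wording, and the sample case text are assumptions made for this example; they are not the prompts used in the study.</p>
          <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSpec:
    """Hypothetical container for one agent's role definition."""
    name: str
    duties: str
    emotional_state: str = ""  # optional; empty string means none specified


def build_system_prompt(role: RoleSpec, case_summary: str = "") -> str:
    """Assemble a role-specific system prompt from its parts."""
    parts = [
        f"You are the {role.name} in a simulated medical consultation.",
        role.duties,
    ]
    if role.emotional_state:
        parts.append(f"Stay {role.emotional_state} throughout the conversation.")
    if case_summary:
        parts.append(f"Case data: {case_summary}")
    parts.append("Never break character or speak for another participant.")
    return " ".join(parts)


# Illustrative roles; the case text below is invented for the example.
patient = RoleSpec("patient", "Answer only from the case data provided.", "anxious")
doctor = RoleSpec("doctor", "Ask relevant questions and propose next steps.")

patient_prompt = build_system_prompt(patient, "Shortness of breath on exertion.")
doctor_prompt = build_system_prompt(doctor)
print(patient_prompt)
```
          </preformat>
          <p>Keeping the persona in a small data structure makes it straightforward to vary a single attribute, such as the emotional state, between otherwise identical runs.</p>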
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Model Selection</title>
          <p>
            The system architecture, leveraging AutoGen, supports flexible integration of various LLMs to power
the agents, enabling comparative evaluations. Proprietary models such as GPT-4 offer strong baseline
performance due to extensive training. Open-source models like
AI-Growth-Lab_llama-2-7b-clinical-innovation, meditron-7b, and mistral [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] are evaluated for their ability to simulate realistic medical
dialogues, particularly given potential domain-specific tuning. These models can be accessed through
API endpoints (e.g., for GPT-4) or local deployments (e.g., for open-source models using frameworks
like Ollama or vLLM), ensuring flexibility in system configuration and experimentation.
          </p>
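          <p>A configuration along these lines might be sketched as follows. The dictionary layout follows the OpenAI-style endpoint entries (model, api_key, base_url) that AutoGen 0.2 accepts in a config_list; whether the study used this version or layout is an assumption. The registry name MODEL_ENDPOINTS, the helper llm_config_for, the model tags, the temperature values, and the local URLs (the usual defaults for Ollama and vLLM) are likewise illustrative and not taken from the study.</p>
          <preformat>
```python
import os

# Hypothetical registry: short label -> OpenAI-style endpoint entry.
# URLs are common local defaults; model tags are assumptions for illustration.
MODEL_ENDPOINTS = {
    "gpt-4": {
        "model": "gpt-4",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    },
    "mistral": {
        "model": "mistral",
        "base_url": "http://localhost:11434/v1",  # Ollama, OpenAI-compatible route
        "api_key": "not-needed",  # placeholder; typically ignored by local servers
    },
    "meditron-7b": {
        "model": "epfl-llm/meditron-7b",
        "base_url": "http://localhost:8000/v1",  # vLLM server, default port
        "api_key": "not-needed",
    },
}


def llm_config_for(label: str, temperature: float = 0.7) -> dict:
    """Return a config dict for one model, failing loudly on unknown labels."""
    if label not in MODEL_ENDPOINTS:
        raise KeyError(f"unknown model label: {label}")
    # Copy the entry so callers cannot mutate the shared registry.
    return {"config_list": [dict(MODEL_ENDPOINTS[label])], "temperature": temperature}


doctor_cfg = llm_config_for("gpt-4")
patient_cfg = llm_config_for("mistral", temperature=0.9)
print(patient_cfg["config_list"][0]["base_url"])
```
          </preformat>
          <p>Selecting the model by label keeps the agent definitions unchanged when the same scenario is rerun with a different LLM, which is what a comparative evaluation requires.</p>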
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pipeline</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Data Processing</title>
          <p>The interaction pipeline follows a structured sequence to ensure coherence and traceability in the
simulated conversations.</p>
          <p>The system processes anonymized clinical data sourced from IACS’s medical data infrastructure. This data is prepared into three primary formats for agent initialization:</p>
          <list list-type="bullet">
            <list-item>
              <p>Full medical history: Provides comprehensive background, potentially used for grounding more complex simulations (though less frequently used in prompts due to length limitations).</p>
            </list-item>
            <list-item>
              <p>Summarized patient profiles: Concise summaries, often automatically generated (e.g., using another LLM pass), highlighting key symptoms, history, and demographics for initializing the patient agent’s persona.</p>
            </list-item>
            <list-item>
              <p>Clinical case questionnaires: Structured Question-Answer formats representing key facts or expected responses, used to constrain or guide the patient agent’s replies in specific scenarios.</p>
            </list-item>
          </list>
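          <p>The three formats above could be held together in a single record, as in the following minimal sketch. The class PatientCase, its field names, the character budget, and the sample values are hypothetical; the sketch only illustrates the idea of preferring the summary and adding questionnaire facts while they fit within a prompt length limit.</p>
          <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class PatientCase:
    """Hypothetical record holding the three prepared formats for one case."""
    full_history: str
    summary: str
    questionnaire: dict[str, str] = field(default_factory=dict)

    def context_for_prompt(self, max_chars: int = 1500) -> str:
        """Start from the summary, then add Q/A facts while the budget allows."""
        lines = [self.summary]
        used = len(self.summary)
        for question, answer in self.questionnaire.items():
            entry = f"Q: {question} A: {answer}"
            if used + len(entry) + 1 > max_chars:  # +1 for the newline separator
                break
            lines.append(entry)
            used += len(entry) + 1
        return "\n".join(lines)


# Invented sample values, for illustration only.
case = PatientCase(
    full_history="(long anonymized record, not placed in the prompt)",
    summary="Summary.",
    questionnaire={"Smoker?": "No"},
)
print(case.context_for_prompt())
```
          </preformat>
          <p>The full history stays available in the record for more complex simulations but is deliberately left out of the default prompt context.</p>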
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Interaction Workflow</title>
          <p>The conversation process typically follows four main stages:</p>
          <list list-type="order">
            <list-item>
              <p>Patient Initialization: A virtual patient agent is instantiated, assigned a medical condition (based on the processed data) and potentially an initial emotional state.</p>
            </list-item>
            <list-item>
              <p>Doctor-Patient Conversation: The doctor agent (human or AI) interacts with the patient agent, asking questions and gathering information. The virtual patient responds based on its persona, instructions, and the underlying LLM’s generation capabilities.</p>
            </list-item>
            <list-item>
              <p>Decision Making / Recommendation: The virtual doctor agent synthesizes information and may propose diagnostic hypotheses, suggest further tests, or outline potential treatment directions. In multi-agent setups, the virtual nurse may intervene according to predefined rules.</p>
            </list-item>
            <list-item>
              <p>Logging and Analysis Output: The entire conversation transcript, including agent roles and timestamps, is logged. Key interaction points or generated recommendations are flagged for subsequent evaluation (detailed in Section 4).</p>
            </list-item>
          </list>
          <p>This pipeline ensures that conversations are structured and reproducible, facilitating the analysis of AI
agent behaviour in simulated medical interactions for training or research purposes.</p>
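          <p>The four stages can be sketched as a simple turn-taking loop. This is a simplified stand-in written in plain Python rather than the AutoGen orchestration used in the system: the function run_consultation, the stub reply functions, and the keyword-based flagging of recommendations are assumptions made for illustration.</p>
          <preformat>
```python
from datetime import datetime, timezone
from typing import Callable

# A reply function receives the transcript so far and returns the next utterance.
Reply = Callable[[list[dict]], str]


def run_consultation(doctor: Reply, patient: Reply, opening: str,
                     max_turns: int = 3) -> list[dict]:
    """Alternate doctor and patient turns, logging each with role and timestamp."""
    transcript: list[dict] = []

    def log(role: str, text: str) -> None:
        transcript.append({
            "role": role,
            "text": text,
            "time": datetime.now(timezone.utc).isoformat(),
            # Naive keyword check: marks doctor turns that look like recommendations.
            "flagged": role == "doctor" and "recommend" in text.lower(),
        })

    log("patient", opening)            # stage 1: the initialized patient speaks
    for _ in range(max_turns):         # stage 2: doctor-patient conversation
        log("doctor", doctor(transcript))
        if transcript[-1]["flagged"]:  # stage 3: a recommendation ends the loop
            break
        log("patient", patient(transcript))
    return transcript                  # stage 4: logged transcript for analysis


def stub_doctor(history: list[dict]) -> str:
    """Stand-in for an LLM-backed doctor: one question, then a recommendation."""
    doctor_turns = sum(1 for turn in history if turn["role"] == "doctor")
    if doctor_turns == 0:
        return "How long have you felt short of breath?"
    return "I recommend an echocardiogram."


def stub_patient(history: list[dict]) -> str:
    """Stand-in for an LLM-backed patient with a fixed answer."""
    return "About two weeks, mostly when climbing stairs."


consultation = run_consultation(stub_doctor, stub_patient, "I get out of breath easily.")
for turn in consultation:
    print(turn["role"], "-", turn["text"])
```
          </preformat>
          <p>Replacing the stub functions with calls to an LLM-backed agent leaves the logging and flagging logic untouched, which is what keeps runs comparable across models.</p>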
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Recommendation Generation and Evaluation Framework</title>
        <p>This section outlines the framework within our system for (Sec. 3.3.1) generating medical
recommendations by the virtual doctor agent and (Sec. 3.3.2) evaluating the quality of these recommendations and
the associated patient agent reactions. The goal is to assess the coherence, medical plausibility, and
contextual appropriateness of the simulated interaction outputs.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Generation of Medical Recommendations</title>
          <p>The virtual doctor agent, powered by its configured LLM, generates medical recommendations
dynamically within the conversation based on the patient’s reported symptoms, the ongoing dialogue, and
its internal knowledge derived from training data. These recommendations can include diagnostic
hypotheses, treatment suggestions, referrals to specialists, or suggestions for further medical tests.
Crucially, the agent simulates the reasoning process of a medical professional rather than performing
actual clinical decision-making.</p>
          <p>The generation process is guided by the agent’s prompt, which typically instructs it to follow a logical
flow: gather patient information, potentially form differential diagnoses (implicitly or explicitly), and
then propose next steps or recommendations based on the synthesized information.</p>
          <p>The effectiveness of this recommendation generation depends on the LLM’s ability to produce logical,
medically plausible, and contextually appropriate responses while adhering to its assigned role and
avoiding clinically unsafe or nonsensical suggestions.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Evaluation of Recommendation Quality and Patient Reaction</title>
          <p>The evaluation framework assesses the generated recommendations and patient reactions along several
dimensions, serving as the basis for the results presented in Section 4.</p>
          <p>Recommendation Quality: This involves evaluating whether the doctor agent’s recommendations are:</p>
          <list list-type="bullet">
            <list-item>
              <p>Clinically Plausible: Consistent with general medical knowledge for the presented symptoms, avoiding major contradictions or harmful suggestions.</p>
            </list-item>
            <list-item>
              <p>Contextually Relevant: Appropriate given the specific information exchanged during the dialogue.</p>
            </list-item>
            <list-item>
              <p>Coherent: Logically consistent with the preceding conversation turns.</p>
            </list-item>
          </list>
          <p>These criteria were assessed qualitatively by expert human raters, who applied a structured rubric covering
clinical plausibility, contextual relevance, coherence, and the role consistency and interactional
plausibility of the patient agent’s reactions. Preliminary feedback from medical professionals supplemented these ratings.</p>
          <p>Patient Agent Reaction: We analyze the virtual patient agent’s response to the doctor’s
recommendations. Given that the patient agent is also LLM-driven and guided by its persona (including potential
emotional states), we evaluate if its reaction is:</p>
          <list list-type="bullet">
            <list-item>
              <p>Role-Consistent: Aligns with the defined patient profile and emotional state.</p>
            </list-item>
            <list-item>
              <p>Interactionally Plausible: Represents a believable human-like reaction (e.g., seeking clarification, expressing understanding or concern) in the context of the recommendation.</p>
            </list-item>
          </list>
          <p>This provides insight into the LLMs’ ability to handle interaction dynamics.</p>
          <p>The evaluation methodology (detailed in Section 4) applies these criteria across multiple simulated
cases and different LLM configurations (both AI-AI and Human-AI setups) to compare performance
systematically.</p>
          <p>This framework enables a structured assessment of key interaction aspects, providing quantitative
and qualitative data (see Section 4) to compare LLM behaviours and guide future improvements in
AI-based medical simulation.</p>
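          <p>One way such rubric ratings could be tabulated for comparison is sketched below. The 1 to 5 scale, the function summarize, the criterion keys, and the sample scores are assumptions introduced for illustration; the scores are placeholders for anonymous models and do not represent results of this study.</p>
          <preformat>
```python
from collections import defaultdict
from statistics import mean

# Criterion keys mirror the rubric dimensions named in the text.
CRITERIA = (
    "clinical_plausibility",
    "contextual_relevance",
    "coherence",
    "role_consistency",
    "interactional_plausibility",
)


def summarize(ratings: list[dict]) -> dict:
    """Average each criterion per model across all raters and cases."""
    by_model = defaultdict(lambda: defaultdict(list))
    for rating in ratings:
        for criterion in CRITERIA:
            by_model[rating["model"]][criterion].append(rating[criterion])
    return {
        model: {criterion: round(mean(values), 2) for criterion, values in scores.items()}
        for model, scores in by_model.items()
    }


# Placeholder scores on an assumed 1-5 scale; not data from the study.
sample = [
    {"model": "model_a", "rater": "r1", "clinical_plausibility": 4,
     "contextual_relevance": 4, "coherence": 4, "role_consistency": 5,
     "interactional_plausibility": 3},
    {"model": "model_a", "rater": "r2", "clinical_plausibility": 2,
     "contextual_relevance": 4, "coherence": 2, "role_consistency": 4,
     "interactional_plausibility": 3},
]
print(summarize(sample))
```
          </preformat>
          <p>Averages of this kind only condense the raters’ qualitative judgements; they do not replace reading the transcripts.</p>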
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>The evaluation phase of this study was focused on assessing the effectiveness, accuracy, and realism
of the AI-driven doctor-patient interaction system described previously. Several experiments were
conducted to analyze the performance of different language models in simulated medical conversations,
the consistency of agent behaviours, and the system’s ability to generate plausible and context-aware
medical recommendations. The evaluation process considered multiple factors, including the nature
of the input data, model performance, experimental settings, and the overall quality of AI-generated
recommendations within the simulated clinical context.</p>
      <p>The dataset used in the experiments consisted of anonymized clinical records, primarily focusing on
congestive heart failure. These records served as a realistic foundation for testing the AI agents’ ability
to simulate authentic medical interactions. The data was pre-processed to extract relevant information
for agent initialization. Additionally, summarization techniques were applied to condense medical
histories, ensuring that the AI models received structured and focused inputs, primarily through the
summarized patient profile format mentioned in Section 3.2.1.</p>
      <p>Two main experimental scenarios were employed: The first scenario involved a human user (acting as
a doctor) interacting with an AI patient agent via the Chainlit interface, aimed at assessing the patient
agent’s interactive realism. The second scenario focused on AI-to-AI interactions, where both the doctor
and patient roles were played by LLM-driven agents managed by AutoGen, designed to evaluate the
models’ ability to maintain coherent dialogue and role consistency over multiple turns. Throughout
these experiments, different LLMs were tested, including GPT-4,
AI-Growth-Lab_llama-2-7b-clinical-innovation, meditron-7b, and mistral. The generated interactions for each model were qualitatively
analyzed based on criteria including apparent clinical accuracy, logical consistency, adaptability to the
simulated context, and emotional responsiveness, where applicable.</p>
      <p>Qualitative analysis of the results revealed observable differences in model performance. GPT-4
generally produced more coherent and contextually appropriate responses compared to the tested
open-source models, often demonstrating a clearer link between symptoms and potential diagnoses
while maintaining a structured conversational flow. AI-Growth-Lab_llama-2-7b-clinical-innovation
demonstrated reasonable capacity for understanding the medical context but exhibited occasional
inconsistencies, particularly in longer multi-turn interactions. Meditron-7b and mistral, while capable of
processing medical terminology, appeared to struggle more frequently to maintain contextual accuracy
and often generated generic rather than case-specific recommendations in comparison. Another notable
observation was the increased difficulty of multi-agent scenarios that included a nurse agent: role
confusion and inconsistent responses appeared more frequently across models when more than two
agents interacted.</p>
      <p>To further assess system performance, a subjective evaluation analysis was undertaken, focusing
on three key aspects derived from the interactions: clinical plausibility, consistency/coherence, and
user acceptance. Clinical plausibility was considered by qualitatively assessing whether AI-generated
recommendations aligned with general medical knowledge. Consistency and coherence were evaluated
by examining the AI’s ability to maintain logical continuity across multiple conversation turns without
obvious contradictions. Finally, aspects related to user acceptance were gauged via preliminary feedback
from a small group of medical professionals and trainees, who provided comments on the perceived
realism and potential utility of the interactions.</p>
      <p>Table 1 shows an excerpt from a test. In this AI-AI scenario, two virtual agents
(doctor and patient) utilized the AI-Growth-Lab_llama-2-7b-clinical-innovation model. The patient
information used the summary style, and the virtual doctor was prompted with instructions intended to
evoke a ’lack of seriousness’ persona. As the excerpt shows, the conversation progresses, and the
doctor’s response beginning "Of course, my friend..." illustrates the model adopting the prompted
informal persona, which serves as an example of behaviour control evaluated during the study. On the
other hand, in Table 2, a second excerpt using GPT-4 is displayed in an AI-AI-AI scenario with three
virtual agents (doctor, patient, and nurse). In this second excerpt, the roles are correctly assigned by the
virtual agents and the agent manager, as reflected in the behavior displayed in the chat interactions.</p>
      <p>The evaluation process highlighted both the potential and the current limitations of using LLMs for
simulating medical interactions via this system. While the experiments demonstrated the feasibility
of generating doctor-patient interactions using different models, challenges persist for improved
contextual awareness, convincing emotional portrayal, and consistent generation of clinically plausible
recommendations. The analysis suggests that future work should address fine-tuning approaches,
refining prompt engineering techniques, and potentially incorporating more structured domain knowledge
or feedback mechanisms to enhance reliability for training applications.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Agent</th>
              <th>Message</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Patient</td>
              <td>Oh, thank goodness! I’m so glad to hear that my blood pressure is stable, and I’ll be sure to drink moderately from now on. My wife says I’ve been drinking too much water lately, and she’s worried that I might overhydrate or even start having electrolyte imbalances. I don’t want that to happen, so I’ll try to drink more smartly.</td>
            </tr>
            <tr>
              <td>Doctor</td>
              <td>Excellent, I’m glad you understand the importance of drinking water in moderation. Now, let me ask you something else. Have you ever been diagnosed with congestive heart failure?</td>
            </tr>
            <tr>
              <td>Patient</td>
              <td>Congestive heart failure? I’m not sure, Doc. You see, lately, I’ve been feeling short of breath a lot, especially during exercise or when I exert myself physically. My doctor in Teruel has told me that I have chronic heart failure, but he said nothing about congestive heart failure. Can you explain what it means?</td>
            </tr>
            <tr>
              <td>Doctor</td>
              <td>Certainly. Congestive heart failure (CHF) is a condition in which the heart is unable to pump blood effectively, causing fluid buildup in the body. It can lead to shortness of breath, fatigue, swelling in the legs, and weight gain. Congestive heart failure is a more advanced stage of heart failure, which means that the heart has lost its pumping power and is unable to deliver adequate blood flow to the body.</td>
            </tr>
            <tr>
              <td>Patient</td>
              <td>Oh, I see. So that’s why I’ve been feeling so short of breath lately. It’s a good thing I came here to see you, Doc, because I had no idea that my heart was in such bad shape. I promise to take my treatment seriously and do everything you say to help me feel better. Can you please tell me more about my condition and what you plan to do to help me?</td>
            </tr>
            <tr>
              <td>Doctor</td>
              <td>Of course, my friend. I’m glad you’re taking this seriously. Based on your medical history, it seems that you have advanced heart failure, which means that your heart is incapable of pumping blood effectively throughout your body. We need to develop a treatment plan that will help alleviate your symptoms and slow down the progression of your condition.</td>
            </tr>
            <tr>
              <td>Patient</td>
              <td>Okay, I see. So you’re going to develop a treatment plan to help me feel better. Great! What kind of treatment do you think would be best for me?</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>This study explored the potential of AI-driven intelligent agents in simulated medical consultations,
focusing on their ability to simulate realistic doctor-patient interactions using the AutoGen framework
and comparing different LLMs. The evaluation indicated that while agents powered by models like GPT-4,
AI-Growth-Lab_llama-2-7b-clinical-innovation, meditron-7b, and mistral can generate medically relevant
dialogue, performance varies notably. GPT-4 generally exhibited greater coherence and consistency
in generating plausible responses, whereas the tested open-source models, exemplified by
AI-Growth-Lab_llama-2-7b-clinical-innovation, while promising, showed more variability, particularly regarding
consistency in longer dialogues and in multi-agent settings. The findings suggest that LLM-based
simulation holds potential for medical training and related applications, but achieving consistent
reliability requires further development.</p>
      <p>One of the main challenges observed during evaluation was variability in model performance when
handling multi-turn medical dialogues. The tested open-source models sometimes showed
inconsistencies in role adherence and in consistently portraying specified emotional states. Additionally,
multi-agent interactions introduced challenges related to role confusion, where agents sometimes
appeared to deviate from their assigned identities. These issues highlight the need for continued work
on prompt engineering, consideration of model fine-tuning strategies, and potentially integrating more
explicit mechanisms for maintaining dialogue state and role clarity.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr>
              <th>Agent</th>
              <th>Message</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Doctor</td>
              <td>I’m Dr. Smith, your attending cardiologist for today. Could you tell me about your medical history? Have you ever been diagnosed with congestive heart failure or do you have any family history of it?</td>
            </tr>
            <tr>
              <td>Patient</td>
              <td>No, never been diagnosed with that. Got no family history either. I’ve been having trouble with my left leg though, real bad pain, can’t walk right. Got me scared, Doc.</td>
            </tr>
            <tr>
              <td>Doctor</td>
              <td>I understand your concern. Let’s take a closer look at this. In terms of heart health, congestive heart failure can cause swelling in the legs, resulting from fluid accumulation. Are you experiencing any other symptoms like shortness of breath, fatigue, rapid heart rate, persistent cough or wheezing, or loss of appetite?</td>
            </tr>
            <tr>
              <td>Patient</td>
              <td>Well, Doc, now that you mention it, I’ve been feeling kind of tired lately. Can’t say I’ve noticed anything else strange though, just the leg mainly.</td>
            </tr>
            <tr>
              <td>Doctor</td>
              <td>Thank you for letting me know. The combination of pain and swelling in your leg along with fatigue could be signs related to heart health but these symptoms can also arise from other conditions. To ensure we have a complete picture, I am going to order some tests including a blood test and echocardiogram right away which will help me determine the health of your heart. Please wait for a moment while the nurse gets those arranged.</td>
            </tr>
            <tr>
              <td>Nurse</td>
              <td>Sure, Doctor. I’ll arrange for the patient’s blood test and echocardiogram right away.</td>
            </tr>
            <tr>
              <td>Patient</td>
              <td>Thanks, nurse. I ain’t never liked tests, but I know they gotta be done. I’m just hoping to get all this sorted soon.</td>
            </tr>
            <tr>
              <td>Nurse</td>
              <td>I understand your concerns. Rest assured, these tests are crucial in helping us understand your condition and how to best address it. We will do our best to make the process as comfortable and easy as possible for you. Once the test results are available, the doctor will discuss the next steps with you.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>A key area for future work is the further investigation and potential fine-tuning of open-source models
specifically for clinical dialogue simulation. Techniques involving high-quality medical dialogue data or
reinforcement learning with clinician feedback could enhance accuracy and contextual understanding.
Additionally, further experiments are needed to optimize the interaction protocols for multiple agents
within AutoGen, ensuring better coordination in complex scenarios involving doctors, patients, and
potentially nurses or medical assistants.</p>
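      <p>To make the coordination problem concrete, the following minimal sketch shows one possible
turn-taking rule for a doctor, a patient and a nurse. It is written in plain Python and does not use
the AutoGen API. The role names, the <monospace>ALLOWED_NEXT</monospace> rule table and the
<monospace>Consultation</monospace> class are hypothetical and were introduced only for illustration;
they are not the protocol used in the experiments reported here.</p>
      <preformat>
```python
from dataclasses import dataclass, field

# Hypothetical rule table: who may speak after whom. Not part of AutoGen.
ALLOWED_NEXT = {
    None: ("doctor",),               # assumption: the doctor opens the consultation
    "doctor": ("patient", "nurse"),
    "patient": ("doctor", "nurse"),
    "nurse": ("patient", "doctor"),
}


@dataclass
class Consultation:
    """Stores (speaker, text) turns and enforces the rule table above."""

    transcript: list = field(default_factory=list)

    def last_speaker(self):
        # None means that nobody has spoken yet.
        return self.transcript[-1][0] if self.transcript else None

    def allowed_speakers(self):
        return ALLOWED_NEXT[self.last_speaker()]

    def say(self, speaker, text):
        """Record a turn, rejecting speakers the rules do not allow."""
        if speaker not in self.allowed_speakers():
            raise ValueError(f"{speaker} may not speak after {self.last_speaker()}")
        self.transcript.append((speaker, text))


# Replays the speaker order of the closing turns of the dialogue excerpt.
chat = Consultation()
chat.say("doctor", "I am going to order a blood test and an echocardiogram.")
chat.say("nurse", "Sure, Doctor. I'll arrange the tests right away.")
chat.say("patient", "Thanks, nurse.")
chat.say("nurse", "Once the results are available, the doctor will discuss the next steps.")
```
</preformat>
      <p>In this sketch no role can take two consecutive turns and only the doctor can open the
consultation. A practical protocol would probably need richer rules, for example letting the
language model choose among the allowed speakers according to who was addressed in the last turn.</p>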
      <p>Another promising direction is the development of adaptive feedback mechanisms that allow the
simulation environment to refine its behaviour based on user input during interaction. Furthermore,
implementing more sophisticated emotion simulation models could enhance patient-agent interactions,
making them potentially more realistic and engaging for medical training. From an ethical perspective,
this study underscores the importance of addressing potential biases, transparency, and patient privacy
when using AI in medical settings. Ensuring fairness and avoiding the propagation of biases present
in training data is crucial. Moreover, patient data privacy is paramount, requiring strict adherence to
anonymization techniques and data protection regulations. Transparency is also important, so that users
understand the capabilities and limitations of such systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was funded by the Department of Big Data and Cognitive Systems at the Aragon Institute
of Technology, under the Retech Tourism-Spain Living Lab Agreement, and by the Government of Aragon.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used both GPT-4 and Gemini to ensure accurate
grammar and spelling in this work. After using these tools/services, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Thirunavukarasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. J.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Elangovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. W.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <article-title>Large language models in medicine</article-title>
          ,
          <source>Nature Medicine</source>
          <volume>29</volume>
          (
          <year>2023</year>
          )
          <fpage>1930</fpage>
          -
          <lpage>1940</lpage>
          . doi:<pub-id pub-id-type="doi">10.1038/s41591-023-02448-8</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Scherr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Halaseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Andalib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rivera</surname>
          </string-name>
          ,
          <article-title>ChatGPT interactive medical simulations for early clinical education: Case study</article-title>
          ,
          <source>JMIR Medical Education</source>
          <volume>9</volume>
          (
          <year>2023</year>
          )
          <elocation-id>e49877</elocation-id>
          . doi:<pub-id pub-id-type="doi">10.2196/49877</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Tanwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cole-Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pfohl</surname>
          </string-name>
          , et al.,
          <article-title>Large language models encode clinical knowledge</article-title>
          ,
          <source>Nature</source>
          <volume>620</volume>
          (
          <year>2023</year>
          )
          <fpage>172</fpage>
          -
          <lpage>180</lpage>
          . doi:<pub-id pub-id-type="doi">10.1038/s41586-023-06291-2</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Ayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poliak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Leas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Kelley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Faix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Longhurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hogarth</surname>
          </string-name>
          , et al.,
          <article-title>Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum</article-title>
          ,
          <source>JAMA Internal Medicine</source>
          <volume>183</volume>
          (
          <year>2023</year>
          )
          <fpage>589</fpage>
          -
          <lpage>596</lpage>
          . doi:<pub-id pub-id-type="doi">10.1001/jamainternmed.2023.1838</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] OpenAI, GPT-4,
          <year>2025</year>
          . URL: https://openai.com/index/gpt-4/, accessed: 2025-04-03.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Palangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <source>Sparks of artificial general intelligence: Early experiments with GPT-4</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.12712. arXiv:<pub-id pub-id-type="arxiv">2303.12712</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>AI-Growth-Lab</surname>
          </string-name>
          ,
          <article-title>Hugging Face page of model llama-2-7b-clinical-innovation</article-title>
          ,
          <year>2023</year>
          . URL: https://huggingface.co/AI-Growth-Lab/llama-2-7b-clinical-innovation, accessed: 2025-04-03.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] EPFL,
          <article-title>EPFL's new large language model for medical knowledge</article-title>
          ,
          <year>2025</year>
          . URL: https://actu.epfl.ch/news/epfl-s-new-large-language-model-for-medical-knowle/?utm_source=chatgpt.com, accessed: 2025-04-03.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2308.08155. arXiv:<pub-id pub-id-type="arxiv">2308.08155</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Liévin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Hother</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Motzfeldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Winther</surname>
          </string-name>
          ,
          <source>Can large language models reason about medical questions?</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2207.08143. arXiv:<pub-id pub-id-type="arxiv">2207.08143</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <article-title>Generative agents: Interactive simulacra of human behavior</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2304.03442.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          nickm980, Smallville GitHub,
          <year>2023</year>
          . URL: https://github.com/nickm980/smallville, accessed: 2025-04-03.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          Contrary Research, Hugging Face,
          <source>Contrary Research</source>
          (
          <year>2025</year>
          ). URL: https://research.contrary.com/company/hugging-face, accessed: 2025-04-03.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Anaissi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Braytee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Akram</surname>
          </string-name>
          ,
          <article-title>Fine-tuning LLMs for reliable medical question-answering services</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.16088. arXiv:<pub-id pub-id-type="arxiv">2410.16088</pub-id>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>