<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI-Driven Clinical Reporting: A Case Study on IQLINIQ</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seyedeh Leili Mirtaheri</string-name>
          <email>leili.mirtaheri@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reza Shahbazian</string-name>
          <email>reza.shahbazian@unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Narges Movahedkor</string-name>
          <email>narges.movahed@unical.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina Trubitsyna</string-name>
          <email>i.trubitsyna@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Greco</string-name>
          <email>greco@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Modeling, Electronics and System Engineering (DIMES), University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mechanical, Energy, and Management Engineering (DIMEG), University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Leveraging the power of generative artificial intelligence (AI) can help clinicians provide prompt care and reduce the financial pressure on patients. Although current generative AI technologies present a possible route to automation, they often fall short in report accuracy and quality, and in addressing privacy concerns. This paper presents a novel framework for the automatic creation of high-quality reports from clinical meeting audio transcripts. The framework is part of our clinic management platform (IQLINIQ) and relies on a three-stage process comprising precise audio-to-text transcription, internal anonymization, and a refinement phase that ensures consistency and conformity with clinical standards. Using both quantitative measures, including cost and time analysis, and qualitative evaluations, we compare the AI-driven reports against expert-generated ones, individual large language models (LLMs), and a state-of-the-art baseline model, GPT-4o. Our findings show that our framework markedly streamlines report preparation, reducing the time and cost of generating a complete report from an expert-surveyed 3 hours and 750 US dollars to less than 5 minutes and 1 US dollar, respectively. Improving report quality by over 10 points compared to existing techniques further underlines the effectiveness of the proposed solution.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical Reports</kwd>
        <kwd>Generative AI Report Generation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Mental Health</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Generative AI is one of the most revolutionary trends in technology, quickly becoming mainstream
across multiple use cases. This disruptive technology has the potential to overhaul
industries across multiple sectors, including healthcare [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] and cybersecurity [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Generative AI promises to make processes more efficient and streamlined, with recent
applications in the generation of clinical reports [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Physicians and clinicians today devote
a significant amount of their time to preparing medical reports and paperwork. Physicians reportedly
spend on average 2 to 6 hours per day on this activity, a substantial amount
of time that could be put to better use in more important tasks such as patient care and patient
throughput enhancement. This administrative time not only decreases face-to-face time
with patients; it can be a further cause of burnout among physicians, and it possibly has an impact
on the overall quality of the healthcare system.
      </p>
      <p>
        © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>
        Freeing this wasted time would surely allow
practitioners to better focus on their core competency, which is diagnosis and treatment. For
example, ecological momentary assessment (EMA) deployed via a smartphone application [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
has been used to capture the working life of 61 U.S. physicians across 28 ambulatory practice
sites. Physicians' time broke down as 66.5% on direct clinical contact; 20.7% on the electronic health
record (EHR) alone; 7.7% on administrative functions; and 5% on other activities. Most striking
was total EHR time: a staggering 44.9% of all physician time was spent on the EHR. The study
finds the relatively low figure for direct patient care to be an efficiency concern and points
toward an imperative need for EHR and workflow redesign.
      </p>
      <p>
        The unparalleled success of generative AI models in text generation has made them prevalent
in a wide range of applications, and healthcare is no different [
        <xref ref-type="bibr" rid="ref4 ref6">6, 4</xref>
        ]. The ability of these models
to learn and transfer knowledge has created huge interest in how they can revolutionize medical
practice. In medicine generally, and especially in the delicate field of mental health, there
have been investigations into using generative AI to speed up the creation of clinical reports after
psychiatric assessments and clinical visits. Although these early forays offer
reason to be hopeful about the healthcare record of the future, many obstacles remain.
While existing generative AI models have shown powerful text generation capabilities, they
still lack the consistency of correct responses and fine-grained comprehension required to
generate truly high-quality clinical reports. Their use also raises critical ethical and practical
questions about the privacy and security of highly sensitive patient information. Consequently,
the consistency gap and the privacy issues clearly represent major obstacles to their broad
clinical use, and their use in critical applications requires overcoming such problems.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Problem Statement</title>
        <p>Typically, clinicians are weighed down by an overabundance of administrative tasks,
especially clinical report preparation, which takes on average 2 to 6 hours a day and
significantly restricts the time spent on direct patient care. EHR systems have added to this
overload, with physicians spending up to nearly 45% of their time on EHR-related
activities, which calls for a redesign of clinical workflows. Generative AI models have
recently drawn a wave of interest from both academic and clinical applications.
Unfortunately, despite progress toward effective means of
automatically generating clinical reports, their deficiency in fine-grained understanding, coupled with
inconsistencies in accuracy, limits report quality. They also usually overlook privacy and
security concerns in the way sensitive patient information is handled. These
limitations present obstacles to the widespread clinical adoption of such models and call for
solutions that:
• Improve the quality and accuracy of AI-generated reports.
• Ensure the privacy and security of patient data.</p>
        <p>• Reduce the time spent on EHR use and report preparation.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Contributions</title>
        <p>To overcome these challenges in report generation, this study presents a novel architecture
for producing high-quality reports in clinical settings, with emphasis on the mental health
domain. We focus on audio transcripts from patient meetings, utilizing multiple distinct
generative models. By combining fast and mid-range generative models from
diverse vendors, we harness their strengths and compensate for their limitations;
our multimodel approach thus supports generating high-quality initial reports. Most critically, we
offer a novel refinement step performed on these initial reports, in which they are combined
and polished. The refinement step brings the resulting outcome into accordance with expert-written
reports, achieving professional precision and maximizing the appropriate content.
We comprehensively assess our system on key performance metrics, such as
report quality, expenses and costs, and processing times. Moreover, we compare the consistency
of the produced reports with professional clinician-produced reports through direct human
evaluation. Lastly, we compare against state-of-the-art models, benchmarking performance
against the high-powered generative model GPT-4o, and highlight the performance benefits of
our system. Our contributions are summarized as follows:
• We present a new multimodel architecture for generating clinical reports, with a
refinement step for improved quality and report-appropriate content.
• Rigorous quality, cost, and time evaluations allow extensive performance assessment.
• The clinical applicability of the generated reports is validated by their similarity to expert
clinician reports.</p>
        <p>• Evaluation against GPT-4o demonstrates the effectiveness and benefits of our proposed method.</p>
        <p>The paper is organized as follows: Section 2 explores the
related literature and existing clinical applications of generative AI models and LLMs. We then
present a detailed description of the proposed architecture in Section 3; the results of
the comprehensive assessments we conducted, along with detailed descriptions of the utilized dataset
and the official clinical assessment procedure, are presented in Section 4. Finally, we conclude the
study with a discussion in Section 4.4 and a conclusion in Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        The need for automated clinical report generation has been widely recognized and
actively researched over the past several years [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This stems from the increasing demand
for reducing the administrative burden on healthcare professionals and improving efficiency in
clinical documentation. In this section, we present the relevant literature, in a general sense,
on the applications of generative AI in healthcare, across the intersection of natural language
processing (NLP), automated report creation, and the use of LLMs.
      </p>
      <p>
        Automating clinical report generation has been a focus due to its potential to improve
efficiency [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. Transformer models like Biobart-V2 have been used for radiology report
summarization, demonstrating their effectiveness in medical text processing [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. LLMs also
offer powerful tools for healthcare tasks such as clinical documentation and diagnosis, although
concerns about trust and safety exist [
        <xref ref-type="bibr" rid="ref11 ref6">11, 6</xref>
        ]. To address these challenges, we utilize multimodel
approaches and data anonymization. However, more research into ethical considerations like
privacy and bias in LLM use is required [
        <xref ref-type="bibr" rid="ref12 ref4 ref6">6, 4, 12</xref>
        ]. Moreover, they should still be comprehensively
evaluated for their effectiveness in specific contexts [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], particularly for medical purposes
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Multimodal learning across different data modalities has also been studied for medical
imaging applications [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
        ]. The implementation of LLMs in healthcare necessitates robust
privacy measures and regulatory frameworks to ensure responsible and secure deployment [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Concentrating on clinical reports, the literature has witnessed growing interest in the utilization
of LLMs in the medical field. For instance, Google has introduced Med-PaLM 2, a model
fine-tuned on medical data for medical tasks [
        <xref ref-type="bibr" rid="ref19">19, 20</xref>
        ]. Additionally, AI-based systems like Nabla1,
Nanonets2, Notable3, Amelia4, Cognigy5, and the AI for Health research conducted by Stanford
University have opened new areas for AI to be integrated into the healthcare field and assist its
professionals. Researchers have also conducted studies on evaluating and fine-tuning
models like ChatGPT and InstructGPT for specific medical applications. BioGPT [21],
pretrained on PubMed abstracts, has demonstrated superior performance in question answering,
relation extraction, and document classification. Similarly, BioMedLM 2.7B and GPT-4-based
multi-modal LLMs showcase advancements in this area [22, 23]. Domain-specific versions
of BERT, such as BioBERT, PubMedBERT, ClinicalBERT, and BioLinkBERT, have also been
developed for scientific and clinical text, demonstrating the adaptability of BERT architectures
for medical tasks [24]. Google’s PaLM, fine-tuned as Flan-PaLM and Med-PaLM, has achieved
state-of-the-art results in medical question answering and clinical reasoning [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Additionally,
proprietary and open-source medical LLMs like GatorTron [25], Claude [26], and PMC-LLaMA
[27] are emerging, contributing to the field’s growth. These models, including DRAGON [28],
Megatron [29], and Vicuna [30], are also enabling the development of multi-modal LLMs [
        <xref ref-type="bibr" rid="ref19">19, 31</xref>
        ].
      </p>
      <p>We recognize that refining the produced text and assessing the clinical reports are two key
components of guaranteeing the quality and dependability of automated systems.
We address this by including a dedicated refinement phase in which the outputs of several LLMs
are synthesized and polished to meet clinical standards. Moreover, to give a thorough analysis
of our proposed approach, we combine quantitative measures (cost, time, tokens) with
qualitative assessments (quality, clarity, alignment).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Architecture</title>
      <p>This section explores the architecture of our approach, proposed for generating clinical reports
from meeting audio transcripts. As shown in Figure 1, our approach comprises three
main stages: real-time transcription, independent and individual
report generation models, and a high-level report integration and refinement stage that enhances
the report’s quality and clarity. The following goes through each of these components in
detail. Using this architecture, we aim to provide clinicians with precise and detailed clinical
reports, following specified formats. Through this multi-stage architecture, we can ensure the quality,
consistency, and privacy of sensitive patient data while also reducing the administrative burden
on clinicians.
1https://www.nabla.com/
2https://www.nanonets.health/
3https://www.notablehealth.com/
4https://amelia.ai/solutions/healthcare/
5https://www.cognigy.com/solutions/healthcare
Real-time Transcription The initial component of our proposed architecture concentrates on
transcribing the input clinical meeting audio into written form. Various models, either offline
or online, can serve as the audio transcription tool in the proposed architecture, including OpenAI
Whisper [32], Deepgram Nova, Google Speech-to-Text, AssemblyAI, Nabla, Amazon Transcribe,
and Azure AI Speech. Several features of audio transcription tools should
be taken into account, among which pricing, processing time, and accuracy can be
considered the most important. Selecting an appropriate tool is therefore a matter of
balancing these features. Some tools may be more expensive to run or slower to deliver, but
richer in features or greater in accuracy, while others may be quick to respond but inaccurate. In
this regard, the choice is a compromise between speed and accuracy, according to what best fits
the task at hand. Accordingly, we analyzed and evaluated various audio transcription tools
with respect to user needs and resources, and adopted the AssemblyAI real-time transcription
API, as this model is highly accurate and can deal with long audio files. Greater detail on these
findings can be found in Section 4.2.</p>
      <p>For real-time transcription, the input audio must first be divided into
small, manageable pieces by a chunking process. This enables faster processing and supports
real-time transcription. The transcript generated by the transcription module is then run
through an Anonymization module, which maintains patient confidentiality by removing any
personally identifiable information. The anonymized transcript then forms the input for the
subsequent report generation processes.</p>
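The chunking and anonymization steps described above can be sketched as follows. This is a minimal Python illustration: the chunk size, placeholder tokens, and regex patterns are assumptions for demonstration only, not the actual IQLINIQ rules; a production system would rely on a dedicated clinical de-identification service rather than simple regexes.

```python
import re

# Illustrative chunk size; real-time transcription APIs typically expect
# small audio frames, and the exact size depends on the provider.
CHUNK_SIZE = 32_000  # bytes

def chunk_audio(audio: bytes, size: int = CHUNK_SIZE):
    """Split raw audio bytes into fixed-size pieces for streaming transcription."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]

# Hypothetical redaction patterns: emails, North-American phone numbers, dates.
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "[DATE]": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def anonymize(transcript: str) -> str:
    """Replace personally identifiable information with placeholder tokens."""
    for placeholder, pattern in PATTERNS.items():
        transcript = pattern.sub(placeholder, transcript)
    return transcript
```

The anonymized text, not the raw transcript, is what reaches the third-party report generators, so no identifiable information leaves the platform.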
      <p>Independent Report Generators We utilize a range of LLMs from different vendors to create
the initial draft reports. We recommend the use of mid-range LLMs, to keep costs affordable while
still maintaining quality, such as Google’s Gemini Flash 2.0, Anthropic’s Haiku 3.5, and OpenAI’s
GPT-4o-mini. Once anonymized, the audio transcript is fed into every LLM as input, along
with a prompt designed to guide the generative models toward producing clinical reports.
We use officially designed questionnaires along with the prompt message to extract relevant
information from the audio transcript. Moreover, to enable more accurate analysis by the LLMs,
we divide the audio transcript and questionnaires into sections with a focused subject, such as
history of present illness, concerns, anxiety, and allergies.</p>
      <p>A specific report template can also be given to every LLM according to the report type
to be produced, e.g., a psychiatric evaluation or a treatment session note, to ensure that the
output has a standard form. Each LLM thus processes the provided transcript, clinical
questionnaires, designed system prompt, and template, producing an initial report. Multiple
LLMs generate reports simultaneously, which enables us to leverage each model’s inherent
advantages and obtain a diversity of viewpoints. In addition, since each model has
its own strengths and weaknesses, we can benefit from their natural strengths and
compensate for their potential weaknesses. We have used the following system prompt:
Report Generator System Prompt: You are given a list of psychiatric questions and the
transcript of a session with the patient. According to the given questions and answers, write
the following information as a paragraph in the format of a psychiatric report. Don’t add any
further information that is not included in the answers.</p>
      <p>Content:
Questions: {Questionnaire}
Transcript: {Audio Transcript}</p>
      <p>Report Integration and Refinement The final component focuses on unifying the
individually generated reports into one consistent, polished, and high-quality final report. For this
integration and refinement operation, we utilize the o3-mini model. We use a particular
prompt meant to steer the integration and refinement process by emphasizing consistency,
thoroughness, and fidelity to medical norms; the report template used in the prior stage also
guarantees consistency of format and structure. Following the provided medical template,
the refined clinical report draws on the information provided in the individual reports, and is
well organized and thorough, presenting the significant findings and notes from the
clinical meeting. We have used the following system prompt for the refinement:
Refinement Prompt: Please combine the following three psychiatric reports for a single
patient into one comprehensive and professionally written report, presented in paragraph
form. Maintain accuracy and consistency across all information, and use appropriate medical
terminology.</p>
      <p>First Report: {Content of the first report}
Second Report: {Content of the second report}
Third Report: {Content of the third report}</p>
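The two generation stages described above amount to a fan-out/fan-in pattern: the same prompt is sent to several draft models in parallel, and their outputs are merged by a refinement model. A minimal sketch follows, under the assumption that each model is wrapped as a callable taking (system_prompt, user_prompt) and returning text; the vendor SDK calls themselves are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# A draft model is any callable (system_prompt, user_prompt) -> report text.
ModelFn = Callable[[str, str], str]

GENERATOR_SYSTEM_PROMPT = (
    "You are given a list of psychiatric questions and the transcript of a "
    "session with the patient. According to the given questions and answers, "
    "write the following information as a paragraph in the format of a "
    "psychiatric report. Don't add any further information that is not "
    "included in the answers."
)

REFINEMENT_PROMPT = (
    "Please combine the following three psychiatric reports for a single "
    "patient into one comprehensive and professionally written report, "
    "presented in paragraph form. Maintain accuracy and consistency across "
    "all information, and use appropriate medical terminology."
)

def generate_drafts(models: Sequence[ModelFn], questionnaire: str,
                    transcript: str) -> List[str]:
    """Fan-out: send the same anonymized content to every draft model in parallel."""
    user_prompt = f"Content:\nQuestions: {questionnaire}\nTranscript: {transcript}"
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(m, GENERATOR_SYSTEM_PROMPT, user_prompt)
                   for m in models]
        return [f.result() for f in futures]

def refine(refiner: ModelFn, drafts: Sequence[str]) -> str:
    """Fan-in: merge the individual drafts into one polished report."""
    labels = ("First", "Second", "Third")
    body = "\n".join(f"{lab} Report: {d}" for lab, d in zip(labels, drafts))
    return refiner(REFINEMENT_PROMPT, body)
```

In the configuration described in this paper, the three draft callables would wrap Gemini Flash 2.0, Haiku 3.5, and GPT-4o-mini, and the refiner would wrap o3-mini; running the drafts concurrently means the generation stage costs roughly the latency of the slowest model rather than the sum of all three.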
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Results</title>
      <p>To evaluate the performance of the proposed architecture, we have conducted an extensive
assessment covering cost and time analysis and the quality of the generated reports. We assess
our proposed architecture against several LLMs, the state-of-the-art GPT-4o model, which is
currently the most powerful OpenAI model, and reports written by professional clinicians.
We have specified the following criteria for the qualitative assessments. By Quality, we mean
the overall quality of the report, including thoroughness, organization, and use of appropriate
terminology. Similarity refers to how close the report is to the expert report with respect to
diagnoses, treatment recommendations, and general evaluation. Finally, Clarity assesses the
writing style and logical flow, i.e., how straightforward the report is to grasp.</p>
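The three criteria can be operationalized as simple per-report rating records averaged across reviewers. The sketch below is illustrative only: the 0–100 scale and the plain averaging are assumptions for demonstration, not the scoring sheet actually used by the reviewing clinicians.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable

@dataclass
class Rating:
    """One reviewer's scores for a generated report (0-100 scale, assumed)."""
    quality: float     # thoroughness, organization, terminology
    similarity: float  # closeness to the expert-written report
    clarity: float     # writing style and logical flow

def aggregate(ratings: Iterable[Rating]) -> dict:
    """Average each criterion across reviewers to get the report's final scores."""
    ratings = list(ratings)
    return {
        "quality": mean(r.quality for r in ratings),
        "similarity": mean(r.similarity for r in ratings),
        "clarity": mean(r.clarity for r in ratings),
    }
```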
      <p>To assess the practical effect of our proposed design, we surveyed mental health clinicians6 to
establish a baseline for the time and cost linked with conventional report creation. The survey
provided the average time spent and costs accrued by practitioners when manually generating
clinical reports. Using the survey results as a guide to evaluate the potential gains in
efficiency and cost-effectiveness that our system might provide in a medical environment, we
then measured the performance of our proposed design against these real-world benchmarks.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>To assess the performance of our proposed architecture, we used a dataset comprising five
real-case clinical sessions focused on Autism Spectrum Disorder (ASD) evaluations for children
aged 6-17 7. Each session featured audio recordings of the medical interaction, along with related
records such as patient demographics, clinical notes, and clinician notes. It is noteworthy that we have
obtained appropriate ethical approvals and consents for using this dataset. Examining our
system’s performance on this dataset offers a sensible and medically applicable starting point.
The emphasis on ASD evaluations enables a focused and thorough examination of the system’s
capacity to produce precise and complete reports within a specific clinical setting.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Expert Assessment Procedure</title>
          <p>The clinic in our use case follows a thorough, multi-stage process for autism spectrum
disorder (ASD) evaluations, tailored to several age groups. Every patient first has a compulsory
online interview and autism diagnostic interview-revised (ADI-R) session via Microsoft Teams,
administered by a psychologist. Lasting up to three hours, this session also involves a clinical
assistant and a psychiatrist; the ADI-R component varies somewhat depending on the psychiatrist.
Clients may then be given further evaluations depending on their age and particular requirements.
The autism diagnostic observation schedule (ADOS), cognitive assessment tests, motor evaluations,
school well-being assessments, and speech-language pathology (SLP) evaluations are among
the tools. While the others are face-to-face, the SLP and school well-being assessments can be
carried out online. Once all evaluations have been completed, a telephone feedback meeting
is held to go over the results. The facility also offers support groups, including parent training and
resource referral. Important features of this approach are its flexibility across age groups, the
compulsory first interview and ADI-R session, the range of extra evaluations available, and the
organization of feedback and assistance meetings.
6The medical data is processed through a clinic management software called IQLiniQ, www.iqliniq.com
7This study utilizes anonymized medical data provided by clinicians from a mental health clinic in North America.
Due to privacy and confidentiality considerations, the name of the clinic cannot be disclosed.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Audio Transcription</title>
        <p>Considering cost, processing time, and word error rate (WER), Table 1 presents a comparative
assessment of different audio transcription tools for a 60-minute audio file. For a clear comparison,
the models are further split into real-time and offline modes. The findings indicate that
the Assembly AI Real-time model has the lowest price among the real-time transcription tools,
at merely $0.58, while also exhibiting a higher WER. Speechmatics, by contrast, has a higher
price but a lower error rate, while Nabla is free for only 30 consultations per month and then
charges $120 per month. Thus, despite its slightly higher WER, the Assembly AI Real-time tool
appears the most cost-efficient and reasonably accurate option for large-scale AI-based report generation.</p>
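WER, the accuracy metric used in Table 1, is the word-level edit distance between the hypothesis transcript and a reference transcript, divided by the number of reference words. A minimal implementation of the standard metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance computed over words,
    # keeping only the previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions; the values in Table 1 are expressed as percentages.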
        <p>Turning to offline transcription tools, the Deepgram Enhanced Model is the cheapest at
$0.1, taking around 0.5 minutes to process. However, it has a high WER of 43, which
means it has the lowest accuracy among the offline models. Similarly, the other offline models also
exhibit high error rates of 30 to 43, except for the AssemblyAI tool, which showed a remarkably
lower WER of merely 9. This model, together with the OpenAI Whisper and Deepgram tools, also
costs reasonably little, below 1 dollar, which shows their cost-effectiveness compared
to Google Speech-to-Text, Amazon Transcribe, and Azure AI Speech. Meanwhile, the Deepgram
tools, including both Nova and the Enhanced model, achieved significantly lower processing
times, requiring merely 10 to 50 seconds for a 60-minute audio file. This is particularly important
when prompt report generation is demanded after the psychiatric session with a patient.</p>
        <p>Selecting an appropriate audio transcription model is a matter of balancing cost, precision,
and processing rate. Although processing times vary greatly and Deepgram Nova provides
the fastest offline processing, offline models like AssemblyAI generally show better accuracy
(lower WER) than real-time alternatives. Cost also has a key influence, since budget-friendly
alternatives frequently sacrifice accuracy, whereas more costly models tend to be more accurate.
Particularly with its high-performance offline model, AssemblyAI offers a balanced
cost-to-accuracy ratio; however, its real-time service falls behind in accuracy. Ultimately, the
decision depends on the demands of the particular application: whether critical tasks need more
accuracy or time-sensitive activities call for more speed. Budget limits must also be taken
into account. All things considered, and noting that we focused on real-time transcription in
our application, we selected the real-time AssemblyAI tool to provide the audio transcriptions
fed as input to the report generator stage of our architecture.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Report Generation</title>
        <p>This section discusses the performance analysis of the proposed report generation architecture,
regarding both quantitative criteria (cost and time) and qualitative evaluation criteria, including
quality, clarity, and similarity to real clinical reports.</p>
        <p>Table 1: Comparison of audio transcription tools. Real-time: Assembly AI Real-time, Speechmatics Real-time, Nabla Real-time. Offline: AssemblyAI, OpenAI Whisper, Deepgram Nova, Deepgram Enhanced Model, Google Speech-to-Text, Amazon Transcribe, Azure AI Speech.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Cost and Time Analysis</title>
          <p>Table 2 compares the cost, time, and output tokens of the various language
models used in our report generation system, covering the individual generators, our proposed
approach including refinement, and the GPT-4o baseline. For a simple comparison of
performance across the several stages and models, each part of this table is devoted to one of
these models. The results are averages for a complete clinical report. The findings from the
surveyed clinicians indicate that it takes around 3 hours for an expert to prepare a clinical
assessment report. Moreover, the report provision process costs patients on average
about $750 8, which is noticeably higher than the average expense of the AI-generated reports.</p>
          <p>For the production of individual reports, three different models are used: Haiku 3.5,
GPT-4o-mini, and Gemini Flash 2.0. While GPT-4o-mini produced the most output tokens
(meaning the longest responses), it imposes the second-lowest cost, at $0.003. Still, this model
required nearly as much time as Haiku, and about twice that of Gemini. By contrast, Haiku has a
somewhat faster processing time (52.518 seconds), and it produced fewer tokens (2,174) than
the GPT-4o-mini model. However, it had a greater cost of $0.028 for a complete clinical report.
With 1,675 tokens, Gemini Flash 2.0 yielded the fewest output tokens, with the fastest
processing rate (28.165 seconds) and the lowest cost ($0.003). This implies that Gemini Flash
2.0 is the most cost-effective alternative for producing first drafts in terms of both speed and
expense, although it is noteworthy that the lower token count might affect the
thoroughness of the report.</p>
          <p>The refinement stage (with the o3-mini model) required the most processing time (68.169
seconds) and a medium cost of $0.025, while producing the most output tokens (3,853), much
like GPT-4o-mini. The main goal of this stage is to integrate the information from the individual
reports and improve the quality of the clinical report. Hence, it requires a longer processing time
and a heavier output token count, as it attempts to expand the content and thoroughness
of the reports. By comparison, the GPT-4o baseline’s average processing time was about 53
seconds, in line with Haiku and GPT-4o-mini. Even so, its cost for
a complete report ($0.077) was higher than that of the individual report-generator models. At the
same time, this model had the lowest token count (2,069), which means that it produces
briefer report texts.
8The reported value is based on a 3+ hour evaluation of ASD in North America and does not necessarily reflect
values in the EU.</p>
          <p>The findings reveal a trade-off between processing speed and cost. The model that was
both the fastest and the cheapest is Gemini Flash 2.0, while the most expensive one was
GPT4-o; the Haiku and GPT4-o-mini models fell in between. Looking at the output token
counts gives a clue about the length, as well as the level of detail, of the generated reports.
In this regard, Gemini Flash 2.0 produced the most concise reports, while o3-mini generated
the most verbose; GPT4-o favored conciseness as well. The refinement stage tends to
synthesize the reports and expand the initial information from the individual ones, as can be
seen in its increased token count, even though this means a longer processing time and extra
cost. The most expensive of the models was the baseline, meaning that it may not be the most
cost-effective choice when the purpose is creating numerous reports.</p>
          <p>Table 3 illustrates the contrasts in report generation time and cost between the proposed
multimodel approach, human experts, and the GPT4-o baseline. The table clearly highlights
the differences between the methods in terms of cost-effectiveness as well as time savings.
The proposed approach is evaluated in two scenarios: the minimum (Min) and maximum (Max)
of the required time, which account for differences in processing the audio transcripts. Here,
Min and Max refer to the minimum and maximum time taken by the models, both in the
individual stages and in refinement; the Max scenario represents the worst case (upper bound)
of sequential execution of the models.</p>
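          <p>The Min/Max construction can be made concrete with a small sketch (illustrative only; the stage timings below are placeholders, not the measured values): the minimum assumes the individual generators run in parallel, so only the slowest one gates the refinement stage, while the maximum assumes fully sequential execution of every stage.</p>

```python
def report_time_bounds(generator_secs: list, refine_secs: float) -> tuple:
    """Bounds on end-to-end report time.

    Min: individual generators run in parallel, so the slowest one
    plus the refinement stage sets the floor.
    Max: worst case, every stage runs sequentially before refinement.
    """
    t_min = max(generator_secs) + refine_secs
    t_max = sum(generator_secs) + refine_secs
    return t_min, t_max

# Placeholder stage timings (seconds), purely illustrative.
t_min, t_max = report_time_bounds([30.0, 50.0, 55.0], 68.0)
```

          <p>Any actual Min also depends on how the audio transcripts are split across the models, which is why the measured range can differ from this idealized bound.</p>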
          <p>The proposed approach showed a minimum of 96 seconds in report-generation processing
time and a maximum of 204.495 seconds, incurring an overall cost of $0.061. For the GPT4-o
baseline, the average time spent per report was 52.5 seconds and the average cost per report
was $0.077. Although the proposed approach takes more time, the time is still comparable to
the baseline model, and it yields a lower cost than the GPT4-o baseline. Notably, both
generative-AI-based approaches markedly surpass the human-based one, which requires
significantly more time and cost. Therefore, for high-volume reporting, the proposed approach
could be more cost-effective and more beneficial at a large scale. In brief, the decision between
the proposed technique and the GPT4-o baseline will depend on the exact needs of the
particular scenario: in terms of speed, GPT4-o prevails; however, if cost is the dominant
factor, the proposed multimodel technique becomes the better alternative notwithstanding its
longer processing times.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Quality Evaluation Module</title>
          <p>We developed a report evaluation module to thoroughly evaluate the performance of our
suggested approach using the GPT4-o AI model. As shown in Figure 2, this module supports a
systematic evaluation of the qualitative components of the produced clinical reports. The first
stage of the assessment is the creation of the input-report message, which covers the compilation
of the AI-generated reports and the expert-authored report that provides the gold-standard
reference.</p>
          <p>At the core of our evaluation method lies the structural comparison of the generated
documents against the expert-written ones. We create a consistent message format that
includes all relevant information for this analysis. This message contains the generated content,
clearly delimited for each model, and the reference point that specifies the corresponding expert
report, differentiating between professional and generated material. Using the following
system prompt, we guide the model through the review process to thoroughly analyze the
generated content along several quality measures. First, the quality of the generated reports is
assessed based on thoroughness and organization, along with their clarity of language and
structure, emphasizing readability and ease of comprehension. Finally, the produced material is
also assessed based on similarity and alignment with the expert-written reports; by
means of this criterion, we can show how closely the generated reports represent the expert’s
patient assessment and advice.</p>
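          <p>A minimal sketch of this message construction (illustrative only; the helper name and delimiter scheme are assumptions, not the system’s actual format) numbers each model’s report and marks the expert report as the reference:</p>

```python
def build_eval_message(generated: dict, expert: str) -> str:
    """Assemble one evaluation message: each model's report is clearly
    delimited and numbered (R1, R2, ...), and the expert-written report
    is marked as the reference the evaluator scores against."""
    parts = []
    for i, (model, report) in enumerate(generated.items(), start=1):
        parts.append(f"<R{i} model={model}>\n{report}\n</R{i}>")
    parts.append(f"<EXPERT>\n{expert}\n</EXPERT>")
    return "\n\n".join(parts)
```

          <p>Keeping the delimiters consistent across all evaluations is what allows the evaluator’s scores to be attributed unambiguously to each generator.</p>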
          <p>Evaluation System Prompt: I have four distinct psychiatric reports and an expert report
for the same patient. Please evaluate each of the four reports individually and assign scores
based on the following criteria:
Quality: Overall quality of the report, including thoroughness, organization, and use of
appropriate terminology. (Scale: 1-100, 100 being the highest)
Clarity: How easy the report is to understand, including its writing style and logical flow.
(Scale: 1-100, 100 being the highest)
Similarity: How closely the report aligns with the expert report in terms of diagnoses,
treatment recommendations, and overall assessment. (Scale: 1-100, 100 being the most
similar)
Present your scores in the following format for each report: &lt;Report Number&gt;:&lt;Q:Quality
Score&gt;; &lt;C:Clarity Score&gt;; &lt;S:Similarity Score&gt;. For example: &lt;R1&gt;:&lt;Q:7&gt;; &lt;C:8&gt;; &lt;S:6&gt;.</p>
          <p>A pre-defined clinical questionnaire can also be included to steer the evaluation process and
give particular attention to major elements of report quality and content. This structured
assessment ensures a systematic and uniform review across all documents.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Quality Analysis</title>
          <p>Different report generation approaches are compared against three evaluation criteria:
quality, clarity, and alignment with expert-written reports. These capture several critical
aspects of the quality evaluation of the generated clinical reports. Quality measures the
completeness of the report, examining thoroughness and organization, logical flow and
structure, as well as the use of proper medical terms, ensuring that the clinical
findings and recommendations are articulated accurately. Clarity measures how
easily the report can be understood by its intended audience, considering aspects like
writing style, sentence construction, and logical flow of ideas. Thus, a clear report is written
concisely, avoids jargon wherever possible, and organizes the information competently. Finally,
alignment with expert-written reports assesses how closely the AI-generated report matches the
content and conclusions of reports written by experienced clinicians. It is, thus, an appraisal of
how well the AI has assessed an issue, how good it is at capturing the essential points, and how
closely its recommendations resemble those given by its human counterparts.
This ensures that what the generative models produce accords with established clinical best
practices. This section considers all the individual models, the proposed multimodel approach,
and the GPT4-o baseline in terms of their quality performance. The findings indicate that the
GPT4-o-mini model achieves a quality score of 81, clarity of 76, and alignment
of 71, similar to the GPT4-o baseline. Haiku improves on this performance, and Gemini Flash 2.0
improves further, by around 5 points on all metrics. From the findings, it can also
be said that the reports produced by Google’s Gemini Flash 2.0 ranked highest among the
individual generator models. The proposed method, on the other hand, with its refinement
step, outperformed all the independent report generators and scored highly on
all metrics, achieving 91.8, 88, and 83 on quality, clarity, and similarity to the expert-written
report, respectively. Therefore, using multiple mid-tier models combined with
refinement can yield a clear integration of different independent reports, enhance the depth and
clarity of the reports, and align them more closely with the reports written by experts.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Discussion</title>
        <p>In this study, we applied quantitative and qualitative evaluation measures to different generative
AI models for the generation of clinical reports, revealing time, cost, and quality differences
attributable to the different models. The findings show that the individual report generator models
provided rather quick processing times at reasonably low cost, making them attractive for
rapid and cheap report generation. Among them, Gemini Flash 2.0 was particularly
notable, having the fastest processing time (28.165 seconds) and the lowest per-report cost ($0.003),
although it also yielded the fewest tokens (1,675) and possibly less thorough reports. Their speed
and cheapness, as shown by the qualitative analysis, came at the cost of report quality and
alignment. In comparison, while the multimodel approach was slower in terms of processing
time, it proved to be more cost-effective with respect to total cost across multiple reports
when compared to the GPT4-o baseline. With a processing time about equal to that of the
individual models (about 53 seconds), the GPT4-o baseline cost $0.077 per report. This indicates
that large-scale deployment would offer a more sustainable solution through the use of the
proposed architecture.</p>
        <p>Regarding the output tokens produced, the proposed approach, with its refinement stage,
naturally produced the longest reports, averaging 3,853 tokens, suggesting a more comprehensive
synthesis of information. On the other hand, Gemini Flash 2.0 and GPT4-o tended to
generate short reports, with Gemini Flash 2.0 at 1,675 tokens and GPT4-o at 2,069 tokens. The
token counts, along with the qualitative analysis, help in understanding the level of detail and
elaboration provided by each model. The qualitative assessment confirmed that the proposed
multimodel system was far ahead in terms of quality, clarity, and alignment with expert
reports. In particular, the suggested approach obtained ratings of 91.8 for quality, 88 for
clarity, and 83 for alignment, significantly surpassing single models like Gemini Flash 2.0, which
scored about 87, 81, and 77, and the GPT4-o baseline, which scored 81, 76, and 80. Therefore, by
merging the outputs of several generative models and then refining them through a second
dedicated stage, our approach achieved substantial improvements on all three quality
metrics.</p>
        <p>The implications are relevant for real-life applications of generative AI within the
healthcare industry. A conceivable use of the proposed multimodel architecture is the
automated production of high-quality clinical reports, which may relieve clinicians of some
administrative burden so they can focus on patient care. This is especially relevant given that
expert-written reports are reported to require about three hours and cost patients nearly $750,
well above the expenses linked with AI-generated reports. Regarding the limitations,
further studies are required to optimize the proposed architecture in terms of speed and
cost, to try out various combinations of LLMs, and to make further improvements to the
refinement step. User studies with clinicians would also be beneficial for testing the practical
usefulness and acceptance of generated reports in real clinical settings.</p>
        <p>We acknowledge the importance of data privacy, in particular in the context of the General
Data Protection Regulation (GDPR). While we utilized cloud-based services for transcription and
LLM inference, our system is modular and portable to offline deployments. Transcription can be
handled locally (e.g., Whisper), and LLM inference can be executed on-premise (e.g., LLaMA2).
It is also important to stress that cloud-based LLM services are not necessarily incompatible
with GDPR. For example, as demonstrated in [33], hybrid approaches can ensure compliance by
filtering sensitive content locally before using external services. Our architecture can
support strict anonymization pipelines to ensure compliance with GDPR principles such as
data minimization, purpose limitation, and integrity/confidentiality (Articles 5 and 32 GDPR).
The anonymization module employs Named Entity Recognition (NER)-based redaction and
pseudonymization.</p>
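        <p>A minimal sketch of such a pseudonymization step (illustrative only: the actual module uses NER, whereas this stand-in uses regular-expression patterns for structured identifiers) replaces each detected value with a stable pseudonym, so repeated references to the same entity remain linked after redaction:</p>

```python
import re

def pseudonymize(text: str, patterns: dict) -> tuple:
    """Replace every match of each PII pattern with a stable pseudonym
    (e.g. 'MRN_1'), returning the redacted text and the value-to-pseudonym map."""
    mapping = {}
    counters = {}

    def make_repl(label):
        def repl(match):
            value = match.group(0)
            if value not in mapping:
                counters[label] = counters.get(label, 0) + 1
                mapping[value] = f"{label}_{counters[label]}"
            return mapping[value]
        return repl

    for label, pattern in patterns.items():
        text = re.sub(pattern, make_repl(label), text)
    return text, mapping

# Hypothetical patterns for structured identifiers; a real NER pass
# would also catch free-text names and locations.
PATTERNS = {
    "MRN":  r"\bMRN-\d+\b",
    "DATE": r"\b\d{4}-\d{2}-\d{2}\b",
}
```

        <p>Keeping the mapping on-premise while sending only the pseudonymized text to external services is what aligns this step with the data-minimization principle mentioned above.</p>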
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper addressed the urgent requirement for high-efficiency, high-quality AI-driven
automation of clinical report generation, recognizing the need to streamline clinical report
production and reduce the pressure on clinicians. Noting the limitations of present
technologies, with many automated systems failing to achieve the optimal combination of
accuracy, quality, and trustworthiness, we proposed an application with a three-stage
architecture comprising real-time transcription, low-level report generation, and high-level
report integration and refinement. Our architecture exploits the best of various generative
AI models, with an added refinement step to synthesize information derived from other
modalities and align it with expert-identified templates. At the same time, this architecture
supports complete report generation while protecting patient data through anonymization
approaches.</p>
      <p>The effectiveness of our approach in delivering significant improvements in report quality, clarity,
and alignment with expert-written reports has been demonstrated in evaluations of the proposed
system against individual AI report generators (GPT4-o-mini, Gemini Flash 2.0, and
Haiku) and the GPT4-o baseline. Processing time is longer for our approach, but
it proves more cost-effective, in particular in large-scale applications, and arguably
offers the most detailed reports. Regarding the qualitative evaluation, the findings show that the
proposed approach attained the highest ratings, over 83 on all metrics. These results
underline how our design might help to free doctors from administrative burden and thereby
improve healthcare.</p>
      <p>As for future directions, we aim to focus on optimizing the proposed architecture in
terms of speed and cost, and to examine more sophisticated model selection and integration
techniques. We will carry out thorough evaluations of its performance under real-world clinical
conditions in a follow-up study. Furthermore, we aim to give top priority to resolving the
privacy concerns raised by these systems by means of techniques including differential privacy
and federated learning, thereby guaranteeing responsible and safe deployment. Evaluating
the clinical usefulness and acceptance of the produced reports will be best achieved through user
studies with professionals, thereby hastening their integration into medical operations.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by PNRR MUR project FAIR(PE0000013) and project SERICS
(PE00000014) under the MUR National Recovery and Resilience Plan funded by the European
Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used the LanguageTool and QuillBot AI tools
for grammar and spelling checking and for text paraphrasing. After using these tools, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
      <p>[20] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani,
H. Cole-Lewis, S. Pfohl, et al., Large language models encode clinical knowledge, arXiv
preprint arXiv:2212.13138 (2022).
[21] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, T.-Y. Liu, BioGPT: generative pre-trained
transformer for biomedical text generation and mining, Briefings in Bioinformatics 23
(2022) bbac409.
[22] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing vision-language
understanding with advanced large language models, arXiv preprint arXiv:2304.10592
(2023).
[23] J. Zhou, X. He, L. Sun, J. Xu, X. Chen, Y. Chu, L. Zhou, X. Liao, B. Zhang, X. Gao, Pre-trained
multimodal large language model enhances dermatological diagnosis using SkinGPT-4,
medRxiv (2023) 2023–06.
[24] S. L. Mirtaheri, S. Greco, R. Shahbazian, A self-attention tcn-based model for suicidal
ideation detection from social media posts, Expert Systems with Applications 255 (2024)
124855.
[25] X. Yang, A. Chen, N. PourNejatian, H. C. Shin, K. E. Smith, C. Parisien, C. Compas, C. Martin,
A. B. Costa, M. G. Flores, et al., A large language model for electronic health records, NPJ
digital medicine 5 (2022) 194.
[26] J. A. Omiye, J. Lester, S. Spichak, V. Rotemberg, R. Daneshjou, Beyond the hype: Large
language models propagate race-based medicine, medRxiv (2023) 2023–07.
[27] C. Wu, X. Zhang, Y. Zhang, Y. Wang, W. Xie, PMC-LLaMA: Further finetuning LLaMA on
medical papers, arXiv preprint arXiv:2304.14454 2 (2023) 6.
[28] M. Yasunaga, A. Bosselut, H. Ren, X. Zhang, C. D. Manning, P. S. Liang, J. Leskovec, Deep
bidirectional language-knowledge graph pretraining, Advances in Neural Information
Processing Systems 35 (2022) 37309–37323.
[29] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-LM:
Training multi-billion parameter language models using model parallelism, arXiv preprint
arXiv:1909.08053 (2019).
[30] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang,
J. E. Gonzalez, et al., Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT
quality, see https://vicuna.lmsys.org (accessed 14 April 2023) 2 (2023) 6.
[31] J. A. Omiye, H. Gui, S. J. Rezaei, J. Zou, R. Daneshjou, Large language models in medicine:
the potentials and pitfalls: a narrative review, Annals of internal medicine 177 (2024)
210–220.
[32] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, I. Sutskever, Robust speech
recognition via large-scale weak supervision, in: A. Krause, E. Brunskill, K. Cho, B.
Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on
Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023,
pp. 28492–28518. URL: https://proceedings.mlr.press/v202/radford23a.html.
[33] S. Montagna, S. Ferretti, L. C. Klopfenstein, M. Ungolo, M. F. Pengo, G. Aguzzi, M. Magnini,
Privacy-preserving llm-based chatbots for hypertensive patient self-management, Smart
Health 36 (2025) 100552.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. R.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Mohr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Holton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. Olmos</given-names>
            <surname>Jr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , et al.,
          <article-title>Large language models generate functional protein sequences across diverse families</article-title>
          ,
          <source>Nature biotechnology 41</source>
          (
          <year>2023</year>
          )
          <fpage>1099</fpage>
          -
          <lpage>1106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Meyers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fabian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <article-title>De novo molecular design and generative models</article-title>
          ,
          <source>Drug discovery today 26</source>
          (
          <year>2021</year>
          )
          <fpage>2707</fpage>
          -
          <lpage>2715</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Mirtaheri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pugliese</surname>
          </string-name>
          ,
          <article-title>Leveraging generative ai to enhance automated vulnerability scoring</article-title>
          ,
          <source>in: 2024 IEEE Conference on Dependable, Autonomic and Secure Computing (DASC)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          . doi:10.1109/DASC64200.2024.00014.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Rashidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pantanowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chamanzar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fennell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Gullapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tafti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deebajah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albahra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Glassy</surname>
          </string-name>
          , et al.,
          <article-title>Generative artificial intelligence in pathology and medicine: A deeper dive</article-title>
          ,
          <source>Modern Pathology</source>
          <volume>38</volume>
          (
          <year>2025</year>
          )
          <fpage>100687</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Toscano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. O</given-names>
            <surname>'Donnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Broderick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Unruh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Messina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Casalino</surname>
          </string-name>
          ,
          <article-title>How physicians spend their work time: an ecological momentary assessment</article-title>
          ,
          <source>Journal of General Internal Medicine</source>
          <volume>35</volume>
          (
          <year>2020</year>
          )
          <fpage>3166</fpage>
          -
          <lpage>3172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N. Kamel</given-names>
            <surname>Boulos</surname>
          </string-name>
          ,
          <article-title>Generative ai in medicine and healthcare: promises, opportunities and challenges</article-title>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>286</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Thirunavukarasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. J.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Elangovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. W.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <article-title>Large language models in medicine</article-title>
          ,
          <source>Nature medicine 29</source>
          (
          <year>2023</year>
          )
          <fpage>1930</fpage>
          -
          <lpage>1940</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. A. K.</given-names>
            <surname>Balouch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <article-title>A transformer based approach for abstractive text summarization of radiology reports</article-title>
          ,
          <source>in: International Conference on Applied Engineering and Natural Sciences</source>
          , volume
          <volume>1</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>476</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lv</surname>
          </string-name>
          , et al.,
          <article-title>Automatic text classification of actionable radiology reports of tinnitus patients using bidirectional encoder representations from transformer (bert) and in-domain pre-training (idpt)</article-title>
          ,
          <source>BMC Medical Informatics and Decision Making</source>
          <volume>22</volume>
          (
          <year>2022</year>
          )
          <fpage>200</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sloan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clatworthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simpson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirmehdi</surname>
          </string-name>
          ,
          <article-title>Automated radiology report generation: A review of recent advances</article-title>
          ,
          <source>IEEE Reviews in Biomedical Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shokrollahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yarmohammadtoosky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Nikahd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of generative ai in healthcare</article-title>
          ,
          <source>arXiv preprint arXiv:2310.00795</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Thirunavukarasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanghera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Barzangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>El Mukashfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Trialling a large language model (chatgpt) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care</article-title>
          ,
          <source>JMIR Medical Education</source>
          <volume>9</volume>
          (
          <year>2023</year>
          )
          <fpage>e46599</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>Gpt-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Medenilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sillos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>De Leon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elepaño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Madriaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aggabao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Diaz-Candido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maningo</surname>
          </string-name>
          , et al.,
          <article-title>Performance of chatgpt on usmle: potential for ai-assisted medical education using large language models</article-title>
          ,
          <source>PLoS Digital Health</source>
          <volume>2</volume>
          (
          <year>2023</year>
          )
          <fpage>e0000198</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Ayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poliak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Leas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Kelley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Faix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Longhurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hogarth</surname>
          </string-name>
          , et al.,
          <article-title>Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum</article-title>
          ,
          <source>JAMA Internal Medicine</source>
          <volume>183</volume>
          (
          <year>2023</year>
          )
          <fpage>589</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Williamson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Chow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ikemura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pouli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          , et al.,
          <article-title>A multimodal generative ai copilot for human pathology</article-title>
          ,
          <source>Nature</source>
          <volume>634</volume>
          (
          <year>2024</year>
          )
          <fpage>466</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Homer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wilsdon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Walsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Goebel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sansano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sonawane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cockenpot</surname>
          </string-name>
          , et al.,
          <article-title>Assessment of pathology domain-specific knowledge of chatgpt and comparison to human performance</article-title>
          ,
          <source>Archives of Pathology &amp; Laboratory Medicine</source>
          <volume>148</volume>
          (
          <year>2024</year>
          )
          <fpage>1152</fpage>
          -
          <lpage>1158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , et al.,
          <article-title>Fit-net: Feature interaction transformer network for pathologic myopia diagnosis</article-title>
          ,
          <source>IEEE Transactions on Medical Imaging</source>
          <volume>42</volume>
          (
          <year>2023</year>
          )
          <fpage>2524</fpage>
          -
          <lpage>2538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <article-title>Our latest health ai research updates</article-title>
          ,
          <source>Google [Internet]</source>
          <volume>14</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>