<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor Schlegel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuping Wu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Subramanian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thanh-Tung Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Ramesh Kashyap</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Beck</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaojun Zeng</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riza Theresa Batista-Navarro</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Winkler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Goran Nenadic</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ASUS Intelligent Cloud Services (AICS)</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science, National University of Singapore</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Computer Science, University of Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computing and Information Systems, University of Melbourne</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes PULSAR, our system submission at the ImageClef 2023 MediQA-Sum task on summarising patient-doctor dialogues into clinical records. The proposed framework relies on domain-specific pre-training to produce a specialised language model, which is trained on task-specific natural data augmented by synthetic data generated by a black-box LLM. We find limited evidence for the efficacy of domain-specific pre-training and data augmentation, while scaling up the language model yields the best performance gains. Our approach was ranked second and third among 13 submissions on task B of the challenge. Our code is available at https://github.com/yuping-wu/PULSAR.</p>
      </abstract>
      <kwd-group>
        <kwd>Abstractive Summarisation</kwd>
        <kwd>AI for Healthcare</kwd>
        <kwd>Dialogue Summarisation</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the recent successes of generative large language models (LLMs) on a variety of tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and domains [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], even in the face of data scarcity [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], there is considerable interest in identifying
potential application scenarios that could benefit from the power of LLMs. One of the promising
domains is healthcare [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as many administrative tasks involve the transformation of textual
data. LLM-based approaches that assist hospital staff in repetitive administrative tasks have
the potential to improve operational efficiency and documentation quality, optimise revenue
streams, reduce cognitive load on healthcare experts, and ultimately result in better and more
effective patient care [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        A range of different scenarios has been investigated for their suitability for LLM-based
assistance, such as summarising patient progress notes into discharge summaries [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or identifying
problems that need treatment during a patient’s hospital course [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. One of the potential tasks
is summarising doctor-patient dialogue as medical records [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Dialogue summarisation, an
established task in the Natural Language Processing (NLP) community, aims to identify salient
topics in a multi-turn dialogue [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. State-of-the-art approaches typically formulate the problem
as abstractive summarisation, making the task a prime candidate for further investigation of
the potential of LLMs in clinical settings. In this scenario, conversations between patients and
doctors need to be transformed into (excerpts of) clinical documentation. For example, if a
27 year old female patient mentions that they are experiencing “Sore throat, runny nose, dry
cough and fever 37.5 ∘ C”, the corresponding entry can be the “Subjective” section of a medical
record excerpt, e.g., “Patient is a 27 year old female who presents with sore throat, runny nose
dry cough and a fever of 37.5 ∘ C.” This documentation is typically performed by the consulting
doctor or an attending nurse. Despite its potential for automation, with clinical
staff spending at least 35 minutes every other day on writing such clinical notes
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], this task has been underexplored by the NLP community compared to other hospital-related
tasks, such as clinical coding [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], or generating radiology reports [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. More recently, the
task has received more attention [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; however, studies thus far have either focused on narrow
department selections [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], did not focus on medical documentation generation [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], or
have not released their data publicly [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        To that end, the ImageClef 2023 MediQA-Sum shared task released a collection of dialogues
and corresponding clinical notes in an effort to spark interest and advance the state of the
art in dialogue-to-clinical-note summarisation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The task revolves around three core
subtasks: (A) identifying the topic of a conversation from a selection of possible medical note
sections (i.e., “Subjective” in the previous example), (B) summarising conversation snippets to
appropriate sections in medical records, and, finally, (C) summarising full conversations to full
medical records. While conversations are synthetic, the corresponding clinical notes are real,
doctor-written documentation.
      </p>
      <p>
        Our guiding objective in participating in this task was to investigate how well a recently
proposed LLM training framework can generalise to new tasks with as little adaptation as
possible [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. At its core, the framework (i) fine-tunes an LLM with a pre-training objective that
learns to reconstruct a pseudo-summary consisting of automatically extracted medical terms
and (ii) employs data augmentation (DA) by instructing black-box LLMs to obtain task-specific
training data. As such, the DA framework supports any LLM, such as BLOOM [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], GPT-3 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
or GPT-3.5 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>Our submission for task B was ranked second best overall among all participants. Although
we did not actively seek to compete in Task C, we observed that our data augmentation
technique could improve performance, particularly when training data is scarce. These
findings underline the potential of LLMs in various settings, as well as the generalisability of
our proposed approach.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Definition</title>
      <p>In this section, we describe and formalise the three tasks of the ImageClef 2023 MediQA-Sum
challenge.</p>
      <p>Task A – Dialogue2Topic Classification In this task, participants need to identify the
topic of a conversation. The list of possible topics corresponds to the 20 different fine-grained
sections that can be part of a medical record, such as “Subjective”, i.e., the subjective description
of symptoms by the patient.</p>
      <p>
        Task B – Dialogue2Note Summarization Here, participating systems need to convert a
conversation on a specific topic into a corresponding section in the medical record. This task
can be regarded as conditional generation, sequence-to-sequence translation or abstractive
summarisation. Approaches are evaluated on multiple natural language generation metrics,
both based on n-gram overlap, i.e., ROUGE [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], as well as semantic similarity [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
          ]. A total of 1201
training and 100 validation examples are provided; 200 examples form the test set.
      </p>
      <p>
        Task C – Full-Encounter Dialogue2Note Summarization This task is formulated similarly
to Task B; however, here the inputs are full conversations and the evaluated systems need to generate
medical record outputs for the four general sections “Subjective”, “Objective Exam”, “Objective
Results” and “Assessment and Plan”. This task features only 67 training and 20 validation
examples, with 40 examples reserved for testing. The systems are evaluated based on their
output for each of the sections using the ROUGE metrics from Task B; the results are averaged
across all sections. An alternative mode of evaluation combines all outputs into one single
record and measures the n-gram overlap by means of the ROUGE score.
      </p>
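      <p>As a rough illustration of the n-gram-overlap scoring used in both tasks, ROUGE-1 F1 can be sketched as follows (a simplified stdlib sketch, not the official ROUGE implementation used by the organisers):</p>

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference note."""
    pred_tokens = Counter(prediction.lower().split())
    ref_tokens = Counter(reference.lower().split())
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)
```

      <p>ROUGE-2 and ROUGE-L follow the same precision/recall/F1 scheme over bigrams and the longest common subsequence, respectively.</p>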
      <p>The tasks appear to be arranged as a progression, where, given a dialogue, a segmentation
and classification model could segment the topics of the conversation (Task A) to be used
as input for a Dialogue Snippet Summarisation Model (Task B), the output of which can be
arranged as a full medical record (Task C). However, as our goal was to evaluate how well the
proposed framework generalises to the tasks with as little adaptation as possible, we decided not
to make any task-specific adaptations, even if they could prove beneficial given the particular
arrangement of the tasks. Thus, we do not rely on any additional information, treat tasks B and
C in isolation, and disregard task A as it is not a generative task.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Language model Pre-training</title>
        <p>
          Motivated by the success of predicting masked words [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and contiguous spans [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] as
self-supervised training objectives, we customised the pre-training objective for the medical-domain
generation task: the model learns to reconstruct a pseudo-summary formed by concatenating masked “gap text spans (sentences)”. Each
masked span is a medical term identified in the input text by the QuickUMLS tool [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] or by an
NER model fine-tuned on an N2C2 dataset (the i2b2-2010 challenge [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]). Specifically, as shown in
Figure 1, pre-training followed three different policies: first, when both the QuickUMLS and
N2C2 NER models identified entities, the QuickUMLS results were used in 70% of cases and the
results of the N2C2 NER model in 30%. Second, when only one of them predicted any
output, that output was used for masking. Third, when neither had any output, 15% of the
sentences were masked at random. These text spans were replaced with “sentinel” mask tokens
&lt;    &gt; to inform the model that the input had been masked. (Figure 1 illustrates the three masking policies, showing how QuickUMLS and N2C2 NER outputs produce masked inputs and pseudo-summary targets for PULSAR-3B/11B.) In order to provide the model with
sufficient medical knowledge, we pre-trained on MIMIC-III [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a corpus of 2 million
clinical records, such as admission notes, discharge
summaries and lab results.
        </p>
      </sec>
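      <p>A minimal sketch of the three masking policies described above (the function and its inputs are hypothetical stand-ins for our implementation; the spans would come from QuickUMLS and the N2C2-fine-tuned NER model):</p>

```python
import random

def select_mask_spans(umls_spans, n2c2_spans, sentences, rng=random):
    """Choose spans to mask for one document, following the three policies.

    umls_spans / n2c2_spans: medical-term spans found by QuickUMLS and the
    N2C2-fine-tuned NER model (hypothetical pre-computed inputs).
    """
    if umls_spans and n2c2_spans:
        # Policy 1: both extractors fired -- use QuickUMLS 70% of the time.
        return umls_spans if rng.random() < 0.7 else n2c2_spans
    if umls_spans or n2c2_spans:
        # Policy 2: only one extractor produced output -- use it.
        return umls_spans or n2c2_spans
    # Policy 3: neither fired -- mask 15% of sentences at random.
    k = max(1, round(0.15 * len(sentences)))
    return rng.sample(sentences, k)
```

      <p>The selected spans are then replaced with sentinel mask tokens, and their concatenation forms the pseudo-summary the model is trained to reconstruct.</p>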
      <sec id="sec-3-2">
        <title>3.2. Data Augmentation (DA)</title>
        <p>
          Both tasks suffer from scarcity of training data, especially Task C, which requires generating
comprehensive clinical notes from lengthy patient-doctor conversations with only 67
training examples. These may be insufficient to train a model capable of performing well on the
task. To address this issue, we adopt data augmentation to generate additional examples for
training, as this has been shown to improve performance in data-scarce scenarios [
          <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
          ].
        </p>
        <p>
          Prompting Strategy We observed that Large Language Models (LLMs) such as ChatGPT
are proficient in understanding clinical context and manipulating clinical data. Therefore,
we utilise a pre-existing LLM to generate data for the model’s training. Ideally, the data
generation approach would involve providing conversations and requesting the LLM to produce
the corresponding medical note. However, we are limited by the fact that we only have 67
full-length conversations in our dataset. Nonetheless, we have access to a significantly larger
number of medical notes. Hence, we invert the task by prompting the LLM with a medical note
(or its snippet) and ask it to generate a hypothetical conversation between the doctor and the
patient. We then use the generated conversations as input to train our model to produce the
corresponding clinical note.
        </p>
        <p>We employ the OpenAI ChatGPT API (gpt-35-turbo) for data augmentation, utilising a
two-stage prompting strategy to generate data effectively. In the first stage, we use in-context learning
with one-shot prompting to prompt the LLM to generate a fictitious conversation between
the doctor and patient based on the medical note, while adhering to important guidelines. We
provide only one example picked from the training set, as we are limited by the token context
window of the API. In the second stage (only performed for task C), we prompt the model to
add conversational fillers such as “ums”, “uh”, and “hmm” to the generated conversation from
the first stage, as we noticed that the model did not include these fillers despite our instructions
in the first stage.</p>
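      <p>The two-stage strategy can be sketched as prompt construction plus two API calls; the prompt wording below is illustrative rather than our exact prompts, and call_llm stands in for the gpt-35-turbo API:</p>

```python
def build_stage1_prompt(note: str, example_note: str, example_dialogue: str) -> str:
    """Stage 1: one-shot prompt asking for a fictitious doctor-patient
    conversation that would have produced the given medical note."""
    return (
        "Given a medical note, write the doctor-patient conversation it summarises.\n"
        f"Note: {example_note}\nConversation: {example_dialogue}\n"
        f"Note: {note}\nConversation:"
    )

def build_stage2_prompt(dialogue: str) -> str:
    """Stage 2 (task C only): ask the model to add conversational fillers."""
    return (
        "Rewrite the conversation, adding natural fillers such as "
        f"'um', 'uh' and 'hmm':\n{dialogue}"
    )

def augment(note, example_note, example_dialogue, call_llm, add_fillers=False):
    dialogue = call_llm(build_stage1_prompt(note, example_note, example_dialogue))
    if add_fillers:
        dialogue = call_llm(build_stage2_prompt(dialogue))
    # The (generated dialogue, real note) pair becomes a training example.
    return dialogue, note
```

      <p>Inverting the task in this way lets us exploit the abundant supply of real notes, even though only 67 full-length conversations are available.</p>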
        <p>
          Dataset Utilised For task B, we extract matching subsection headings from the MIMIC-III
database [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], adapting the pre-processing method from Yang et al. [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] to identify section
headers. We rank the generations based on their average ROUGE similarity to all training
instances and pick the top-scoring  conversations.
        </p>
        <p>For task C, we utilise a corpus of freely available medical notes scraped from MTSamples,
which is available on Kaggle1. Since the dataset contains medical transcriptions of notes from
various medical specialities, we devise a method to pick samples from the dataset that are the
closest to the medical notes in our training set. To do this, we identify and curate a list of the
section headers in the training set through a heuristic approach by exploiting the fact that
section headers are usually written in all capital letters. We split the document by newline
and extract the lines which are fully upper-cased and add these contents to our list of section
headers. We then score the medical notes in MTSamples by the number of headers
from the curated list that each document contains, and pick the top 
documents from MTSamples with the highest scores to use as input for DA. We end up with
a corpus of 746 data samples, as some inputs were flagged as offensive by
OpenAI’s content moderation policy.</p>
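      <p>The header-curation and scoring heuristic described above can be sketched as follows (function names are illustrative):</p>

```python
def extract_headers(note: str) -> set:
    """Collect section headers: lines written fully in upper case."""
    return {
        line.strip() for line in note.splitlines()
        if line.strip() and line.strip().isupper()
    }

def select_top_notes(train_notes, candidate_notes, k):
    """Score MTSamples candidates by how many curated headers they contain,
    then keep the k highest-scoring notes as DA inputs."""
    curated = set()
    for note in train_notes:
        curated |= extract_headers(note)
    scored = sorted(
        candidate_notes,
        key=lambda n: len(extract_headers(n) & curated),
        reverse=True,
    )
    return scored[:k]
```
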
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Empirical Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Experiments set-up</title>
        <p>We aim to empirically evaluate how well our framework can solve the problem of converting
patient dialogues to medical records. We pursue the following questions:
(i) How well can our proposed approach convert doctor-patient dialogues to Medical Records?
(ii) Does the domain-specific pre-training objective improve performance?
(iii) What is the impact of model scale on the performance?
(iv) Does synthetic data augmentation improve performance on the tasks?</p>
        <p>To answer question (i) we empirically evaluate our proposed framework on the task B and
C test sets of the ImageClef Challenge. For evidence towards question (ii), we compare the
performance of PULSAR to that of equally-sized Flan-T5 models. Regarding question (iii), we
compare the performance of variously sized models of the same architecture and for question
(iv), we compare the performance of models trained on available data only to those fine-tuned
on synthetically generated conversation data.
1https://mtsamples.com/ and https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions, respectively</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation Details</title>
        <p>
          Pre-training PULSAR-* is initialised with weights from the corresponding Flan-T5-*
models [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] and pre-trained on four NVIDIA Tesla A100 80GB GPUs for 1 epoch on all MIMIC-III
notes. Huggingface Accelerate is used to optimise GPU memory usage with the Fully Sharded
Data Parallel (FSDP) paradigm. We set the training batch size per GPU device to 4 and the
gradient accumulation steps to 8 to accelerate the training process.
        </p>
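      <p>Under these settings, the effective batch size per optimiser step works out as follows:</p>

```python
per_device_batch = 4   # training batch size per GPU
grad_accum_steps = 8   # gradients accumulated before each optimiser step
num_gpus = 4           # NVIDIA A100 80GB devices

# Effective batch = per-device batch x accumulation steps x number of GPUs.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 128 sequences per optimiser step
```
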
        <p>
          Fine-tuning We fine-tune all models for 3 epochs. We experiment with encoder-decoder
Flan-T5, PULSAR and Clinical-T5 [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] models, with the configurations *-Large (0.9B
parameters), *-3B and *-11B. Unless stated otherwise, the models are trained on two A100
80GB GPUs with a cumulative batch size of 8 and a learning rate of 3e-5. For the largest of them,
i.e., Flan-T5-11B and PULSAR-11B, we use FSDP with CPU offloading. We also experiment
with a decoder-only model, LLAMA-13B, freezing and quantising the base model in 8-bit [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]
and using the parameter-efficient LoRA [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] method. More details on hyper-parameter choices
are reported in the appendix.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results and analysis</title>
        <p>At a glance, Table 1 shows the results of our empirical study, and Table 2 shows the final
ranking of all participating systems according to the official evaluation by the task organisers. In the
following, we discuss our findings in the context of the questions outlined in the motivation of this
empirical study.
Our approach generalises well to the dialogue summarisation task. Overall, our
approach generalises well to Task B, with our best model (Table 1, 11B1) surpassing the 50 ROUGE-1
mark, which means that, on average, half of the predicted tokens are found in the reference
and vice versa. The high ROUGE-L score of 44 suggests that most of these overlapping tokens
indeed form a sequence. However, these scores may be “boosted” by the presence of many short
target sequences in the evaluation data, such as “Noncontributory.” or “No known allergies.”,
when a dialogue revolves around a topic that does not contribute to the patients’ hospital visit.</p>
        <p>We find that utilising the outputs of task A (the section headers) does not contribute to
improving the overall performance, compare Table 1, L2 and L4. We observed the same trend
across all model sizes (not reported here for brevity).</p>
        <p>In the absence of established baselines, we interpret the official rankings of the shared task in
Table 2 as additional evidence towards the success of our approach.</p>
        <p>
          There is no conclusive evidence that domain-specific pre-training is beneficial.
Comparing 11B1 and 11B2, and 3B1 and 3B2 in Table 1, respectively, we observe that domain-specific
pre-training by learning to predict missing medical terms in MIMIC-III notes appears not to be
beneficial, with the gap being smaller for bigger models. One possible reason for this is the domain
mismatch between pre-training and application data. MIMIC-III is dominated by inpatient
progress notes, which track a patient’s status during the hospital stay and contain abbreviations,
repetitions, incomplete sentences and medical jargon. Conversely, the medical records in the
challenge are well written and most likely stem from admission notes or outpatient encounters,
where most of the initial documentation of a new patient’s particulars, such as their chief
complaint, medical history and drug allergies, happens. Additionally, input dialogues have a
colloquial tone, further adding to the domain mismatch between pre-training and fine-tuning.
Model scale yields the biggest performance improvements. Comparing L*, 3B* and 11B*
results in Table 1, we can see a clear trend where larger models of the same family consistently
perform better. The biggest hike in performance is observed between the 3B and 11B models.
This observation is in line with most literature on model scale as driver of performance and the
reason for emergent abilities in LLMs [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ].
        </p>
        <p>We also find that the model trained with adapters can learn to perform the task successfully,
despite the relatively small number of trainable parameters (around 1.1% of the full 7B model).
However, our results suggest that updating all model parameters is more effective, as even
smaller models outperform the 7B adapter model (Table 1, L2, 3B* compared to 7B1).
Data Augmentation can be helpful if training data is extremely scarce. Larger models
obtain enough signal from the training data of Task B, as there is no clear improvement in
scores for the 3B models (Table 1, 3B1 vs. 3B3 and 3B2 vs. 3B4). Meanwhile, data augmentation
can lead to consistent, albeit minor, improvements for smaller models (Table 1, L2 vs. L3).
When training data is scarce (i.e., Task C), data augmentation helps performance.
Subjectively, models exhibit typical generation errors such as hallucination and input copying,
(see Figure 2 in Appendix) and data augmentation seems to alleviate this issue. Quantitatively,
data augmentation improves performance across all metrics (27.64 vs 29.41 R1, 9.79 vs 11.60
R2, 16.24 vs 19.18 RL and 23.63 vs 26.08 RLSum without and with DA, respectively). We find
the results promising, as the optimised model seems to perform well without any task-specific
adaptation. Ultimately, however, this simple approach does not compete with other submissions
that potentially exploit task-specific information, with the best of them scoring almost 20
ROUGE-1 points higher (20.32 R2, 24.30 RL and 45.06 RLSum).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>
        In this work, we present an LLM framework and adapt it to the task of dialogue note
summarisation. While we find that the approach generalises well to this new task, there is mixed evidence
of the efficacy of both domain-specific pre-training and data augmentation. Our experiments
seem to align with the “bitter lesson of AI”2, in that model scale seems to trump domain-specific
adaptations. This, in turn, supports the narrative of the transformative potential of LLMs in
healthcare [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], as larger LLMs become more readily available.
      </p>
      <p>Our findings suggest further avenues for future work: We argued that the pre-training
objective may suffer from domain mismatch. As such, experimenting with other domain-specific
objectives might improve performance on the downstream tasks. Furthermore,
it is unclear how the choice of hyper-parameters for both the training and inference stages (i.e.,
decoding arguments) impacts the overall performance. Finally, we have left it for future work
to investigate whether data augmentation could prove beneficial with a more advanced
filtering strategy, for example by only augmenting examples of a certain length or with specific
section headers. As such, we will expand the work reported in this paper by experimenting with
different pre-training objectives, performing a more rigorous hyper-parameter optimisation
and investigating the impact of data augmentation more closely.
2http://www.incompleteideas.net/IncIdeas/BitterLesson.html</p>
      <p>The results described in this paper should be interpreted within the following context:
• The language of the conversations is English. Due to the dominance of English data
during pre-training, it is expected that all LLMs that we inspected perform better on
English. It is unclear how well the approach will transfer to other languages.
• The conversations are synthetic in that they have been written based on existing medical
notes, rather than transcribed from real patient-doctor dialogues. While the quality has
been evaluated by medical professionals, it is unclear how well the performance would
translate to real-world scenarios.
• The obtained results should be regarded as preliminary, as robust empirical results such
as hyper-parameter optimisation for fine-tuning, pre-training policy selection, exhaustive
search for best-performing prompts for data augmentation and strategies for data selection
are often impossible given the time constraints of academic challenges and shared tasks.</p>
      <p>We initialise LoRA with  = 16,  = 16 on the query, key, value and output projection weights
of all layers of the base model. The model is trained on a single
A100 80GB GPU with a learning rate of 3e-4 for the adapter weights. For both encoder-decoder
and decoder-only settings, during training we optimise the parameters of the language models
to minimise the cross-entropy loss between each token of the prediction and the corresponding
token of the ground-truth answer sequence, using teacher forcing. For encoder-decoder models,
we limit the length of input dialogues to at most 496 tokens and the length of output notes to at most
214 tokens (the 95th percentile). For the decoder-only model, we limit the combined length of input
and output to at most 696 tokens. During inference, we set no limits on input and
output sequence lengths and decode the prediction using beam search with a beam size of 6 (4 for LLaMa),
a temperature of 1.0, top-k of 50 (40 for LLaMa) and top-p of 1.0 (0.7 for LLaMa).</p>
      <p>For task C, we use the same arguments as for task B, with the exception of limiting the input
length to 2048 and the output length to 990 tokens during training, in order to fit the GPU memory.</p>
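      <p>The per-token cross-entropy objective with teacher forcing described above can be illustrated on a toy example (the probabilities below are made up, not model outputs):</p>

```python
import math

def teacher_forced_loss(step_probs, target_ids):
    """Average negative log-likelihood of the gold tokens.

    step_probs[t] maps token id -> probability at decoding step t, computed
    with the gold prefix target_ids[:t] fed in (teacher forcing).
    """
    nll = [-math.log(step_probs[t][tok]) for t, tok in enumerate(target_ids)]
    return sum(nll) / len(target_ids)

# Two decoding steps; gold sequence is [7, 2].
probs = [{7: 0.5, 2: 0.5}, {7: 0.25, 2: 0.25, 3: 0.5}]
loss = teacher_forced_loss(probs, [7, 2])  # (ln 2 + ln 4) / 2
```
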
    </sec>
    <sec id="sec-6">
      <title>B. Qualitative example</title>
      <p>Flan-T5-3B: Alexander is a 62-year-old male, with a past medical history significant for reflux, who presents for follow-up of his chronic problems. He is so thankful you put him on that medicine for his reflux. The protonix that he had, w- made an amazing change in his life. He’s really comfortable now. He eats whatever he wants, and he feels so much better. The doctor is glad to hear that. Okay. How are you doing, kind of, managing your diet? I know, you have to do some lifestyle modifications, like cutting back on caffeine and spicy foods and alcohol. How are you doing with that? [patient] I’m doing really well. I moved over from caffeine, over to green tea. [doctor] Okay. [patient] and it, it is so, m- it doesn’t cause as much problem as it did with, when he was drinking so many energy drinks a day...</p>
      <p>Flan-T5-3B-746DG: HISTORY OF PRESENT ILLNESS: Alexander is a 62-year-old male with a past medical history significant for reflux. He presents for follow-up of his chronic problems. He has a good support system at home. He has a big family. All his kids call and check on him every day. He does not have any symptoms of chest pain, shortness of breath, belly pain, nausea or vomiting. He does not have any symptoms of nausea or vomiting. On physical examination, I do not hear any carotid bruits in his neck. I do hear a slight 2/6 systolic ejection murmur, which I’ve heard in the past, so that’s stable. His lungs are nice and clear, and he does have 1+ pitting edema bilaterally in his lower extremities. I think he is doing a good job watching his diet. He could just be retaining a little bit of fluid, maybe just from standing all day.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
<surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Alyafeai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chafin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stiegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dey</surname>
          </string-name>
          , et al.,
          <article-title>Multitask prompted training enables zero-shot task generalization</article-title>
          ,
<source>in: International Conference on Learning Representations</source>
,
<year>2022</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hegselmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sontag</surname>
          </string-name>
          ,
<article-title>Large language models are few-shot clinical information extractors</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1998</fpage>
          -
          <lpage>2022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tanwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cole-Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pfohl</surname>
          </string-name>
          , et al.,
          <article-title>Large language models encode clinical knowledge</article-title>
          ,
          <source>arXiv preprint arXiv:2212.13138</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Topol</surname>
          </string-name>
          ,
          <article-title>Ai in health and medicine</article-title>
          ,
<source>Nature medicine</source>
<volume>28</volume>
          (
          <year>2022</year>
          )
          <fpage>31</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Searle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Dobson</surname>
          </string-name>
          ,
          <article-title>Discharge summary hospital course summarisation of in patient electronic health record text with clinical concept guided deep pre-trained transformer models</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          <volume>141</volume>
          (
          <year>2023</year>
          )
          <fpage>104358</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Afshar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dligach</surname>
          </string-name>
,
<article-title>Bionlp workshop 2023 shared task 1a: Problem list summarization</article-title>
,
<source>in: Proceedings of the 22nd Workshop on Biomedical Language Processing</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Snider</surname>
          </string-name>
,
<string-name><given-names>G.</given-names> <surname>Adams</surname></string-name>
,
<string-name><given-names>M.</given-names> <surname>Yetisgen</surname></string-name>
,
<article-title>Overview of the mediqa-sum task at imageclef 2023: Summarization and classification of doctor-patient conversations</article-title>
          ,
          <source>in: CLEF 2023 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <article-title>A survey on dialogue summarization: Recent advances and new frontiers</article-title>
          ,
<source>in: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), Survey Track</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hripcsak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Vawdrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Fred</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Bostwick</surname>
          </string-name>
          ,
          <article-title>Use of electronic clinical documentation: time spent and team interactions</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>18</volume>
          (
          <year>2011</year>
          )
          <fpage>112</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Johnson</surname>
          </string-name>
,
<string-name><given-names>T. J.</given-names> <surname>Pollard</surname></string-name>
,
<string-name><given-names>L.</given-names> <surname>Shen</surname></string-name>
,
<string-name><given-names>L.-w. H.</given-names> <surname>Lehman</surname></string-name>
,
<string-name><given-names>M.</given-names> <surname>Feng</surname></string-name>
,
<string-name><given-names>M.</given-names> <surname>Ghassemi</surname></string-name>
,
<string-name><given-names>B.</given-names> <surname>Moody</surname></string-name>
,
<string-name><given-names>P.</given-names> <surname>Szolovits</surname></string-name>
,
<string-name><given-names>L. Anthony</given-names> <surname>Celi</surname></string-name>
,
<string-name><given-names>R. G.</given-names> <surname>Mark</surname></string-name>
,
          <article-title>Mimic-iii, a freely accessible critical care database</article-title>
          ,
<source>Scientific data</source>
<volume>3</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
<string-name><given-names>T.-T.</given-names> <surname>Nguyen</surname></string-name>
,
<string-name><given-names>V.</given-names> <surname>Schlegel</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Kashyap</surname></string-name>
,
<string-name><given-names>S.</given-names> <surname>Winkler</surname></string-name>
,
<string-name><given-names>S.-S.</given-names> <surname>Huang</surname></string-name>
,
<string-name><given-names>J.-J.</given-names> <surname>Liu</surname></string-name>
,
<string-name><given-names>C.-J.</given-names> <surname>Lin</surname></string-name>
          ,
          <article-title>Mimic-iv-icd: A new benchmark for extreme multilabel classification</article-title>
          ,
          <source>arXiv preprint arXiv:2304.13998</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
<string-name><given-names>M. M. A.</given-names> <surname>Monshi</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Poon</surname></string-name>
,
<string-name><given-names>V.</given-names> <surname>Chung</surname></string-name>
          ,
          <article-title>Deep learning in generating radiology reports: A survey</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>106</volume>
          (
          <year>2020</year>
          )
          <fpage>101878</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
<string-name><given-names>A. B.</given-names> <surname>Abacha</surname></string-name>
,
<string-name><given-names>W.</given-names> <surname>Yim</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Fan</surname></string-name>
,
<string-name><given-names>T.</given-names> <surname>Lin</surname></string-name>
          ,
          <article-title>An empirical study of clinical note generation from doctor-patient encounters</article-title>
,
<source>in: EACL, Association for Computational Linguistics</source>
,
          <year>2023</year>
          , pp.
          <fpage>2283</fpage>
          -
          <lpage>2294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kazi</surname>
          </string-name>
          ,
<string-name><given-names>I.</given-names> <surname>Kahanda</surname></string-name>
          ,
          <article-title>Automatically generating psychiatric case notes from digital transcripts of doctor-patient conversations using text mining</article-title>
          ,
<source>PeerJ Preprints</source>
<volume>7</volume>
(
<year>2019</year>
)
<fpage>e27497</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Enarvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Amoia</surname>
          </string-name>
          ,
<string-name><given-names>M. D.-A.</given-names> <surname>Teba</surname></string-name>
,
<string-name><given-names>B.</given-names> <surname>Delaney</surname></string-name>
,
<string-name><given-names>F.</given-names> <surname>Diehl</surname></string-name>
,
<string-name><given-names>S.</given-names> <surname>Hahn</surname></string-name>
,
<string-name><given-names>K.</given-names> <surname>Harris</surname></string-name>
,
<string-name><given-names>L.</given-names> <surname>McGrath</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Pan</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Pinto</surname></string-name>
          , et al.,
          <article-title>Generating medical reports from patient-doctor conversations using sequence-to-sequence models</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Natural Language Processing for Medical Conversations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
[17]
<string-name><given-names>W.-w.</given-names> <surname>Yim</surname></string-name>
,
<string-name><given-names>M.</given-names> <surname>Yetisgen-Yildiz</surname></string-name>
          ,
          <article-title>Towards automating medical scribing: Clinic visit dialogue2note sentence alignment and snippet summarization</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Katariya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Amatriain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kannan</surname>
          </string-name>
,
<article-title>Dr. summarize: Global summarization of medical dialogue by exploiting local structures</article-title>
,
<source>arXiv preprint arXiv:2009.08666</source>
(
<year>2020</year>
).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Batista-Navarro</surname>
          </string-name>
,
<string-name><given-names>T.-T.</given-names> <surname>Nguyen</surname></string-name>
,
          <string-name>
            <given-names>A. Ramesh</given-names>
            <surname>Kashyap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Winkler</surname>
          </string-name>
,
<string-name><given-names>G.</given-names> <surname>Nenadic</surname></string-name>
,
<article-title>Pulsar: Pre-training with extracted healthcare terms for summarising patients' problems and data augmentation with black-box large language models</article-title>
,
<source>arXiv preprint</source>
(
<year>2023</year>
).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Akiki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ilić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hesslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Castagné</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Luccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yvon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gallé</surname>
          </string-name>
          , et al.,
          <article-title>Bloom: A 176b-parameter open-access multilingual language model</article-title>
          ,
          <source>arXiv preprint arXiv:2211.05100</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
<string-name><given-names>T. B.</given-names> <surname>Brown</surname></string-name>
,
<string-name><given-names>B.</given-names> <surname>Mann</surname></string-name>
,
<string-name><given-names>N.</given-names> <surname>Ryder</surname></string-name>
,
<string-name><given-names>M.</given-names> <surname>Subbiah</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Kaplan</surname></string-name>
,
<string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Neelakantan</surname></string-name>
,
<string-name><given-names>P.</given-names> <surname>Shyam</surname></string-name>
,
<string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>
,
<string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Herbert-Voss</surname></string-name>
,
<string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>
,
<string-name><given-names>T.</given-names> <surname>Henighan</surname></string-name>
,
<string-name><given-names>R.</given-names> <surname>Child</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>
,
<string-name><given-names>D. M.</given-names> <surname>Ziegler</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>
,
<string-name><given-names>C.</given-names> <surname>Winter</surname></string-name>
,
<string-name><given-names>C.</given-names> <surname>Hesse</surname></string-name>
,
<string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>
,
<string-name><given-names>E.</given-names> <surname>Sigler</surname></string-name>
,
<string-name><given-names>M.</given-names> <surname>Litwin</surname></string-name>
,
<string-name><given-names>S.</given-names> <surname>Gray</surname></string-name>
,
<string-name><given-names>B.</given-names> <surname>Chess</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>
,
<string-name><given-names>C.</given-names> <surname>Berner</surname></string-name>
,
<string-name><given-names>S.</given-names> <surname>McCandlish</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>
,
<string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>
,
<string-name><given-names>D.</given-names> <surname>Amodei</surname></string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
,
<source>in: NeurIPS</source>
,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>
,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
<string-name><given-names>C.-Y.</given-names> <surname>Lin</surname></string-name>
          ,
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
,
<source>in: Text summarization branches out</source>
,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
<string-name><given-names>D.</given-names> <surname>Das</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Parikh</surname></string-name>
          ,
          <article-title>Bleurt: Learning robust metrics for text generation</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7881</fpage>
          -
          <lpage>7892</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>SpanBERT: Improving pre-training by representing and predicting spans</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>64</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>140:1</fpage>
          -
          <lpage>140:67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Soldaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>QuickUMLS: a fast, unsupervised approach for medical concept extraction</article-title>
          ,
          <source>in: MedIR Workshop, SIGIR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>South</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>DuVall</surname>
          </string-name>
          ,
          <article-title>2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text</article-title>
          ,
          <source>J. Am. Medical Informatics Assoc</source>
          .
          <volume>18</volume>
          (
          <year>2011</year>
          )
          <fpage>552</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>Generating datasets with pretrained language models</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6943</fpage>
          -
          <lpage>6951</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Batista-Navarro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Nenadic</surname>
          </string-name>
          ,
          <article-title>Do you hear the people sing? Key point analysis via iterative clustering and abstractive summarisation</article-title>
          ,
          <source>arXiv preprint arXiv:2305.16000</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Knowledge injected prompt based fine-tuning for multi-label few-shot ICD coding</article-title>
          ,
          <source>arXiv preprint arXiv:2210.03304</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.11416</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>E.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wulf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          ,
          <article-title>Do we still need clinical language models?</article-title>
          ,
          <source>arXiv preprint arXiv:2302.08091</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>LLM.int8(): 8-bit matrix multiplication for transformers at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2208.07339</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <article-title>Emergent abilities of large language models</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. P.-W.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , et al.,
          <article-title>Large ai models in health informatics: Applications, challenges, and the future</article-title>
          ,
          <source>arXiv preprint arXiv:2303.11568</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>