<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Decoder-Based Distillation for Enhancing Multilingual Clinical Case Report Summarization⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicolay Rusnachenko</string-name>
          <email>n.rusnachenko@bournemouth.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoxiao Liu</string-name>
          <email>xliu@bournemouth.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Chang</string-name>
          <email>jchang@bournemouth.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Jun Zhang</string-name>
          <email>jzhang@bournemouth.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>, Faculty of Media and Communications</institution>
          ,
          <addr-line>Bournemouth</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Centre for Applied Creative Technologies (CFACT</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic summarization of clinical reports represent an important field of studies that contribute to shortening long textual narratives written in various languages. Efective report summarization poses numerous challenges, including density of medical terms mentions, semantic interdependency among mentioned entities. The most recent advances of instruction-tuned models illustrate promising capabilities of models at various scale across numerous fields of Natural Language Processing, including textual summarization. A hybrid teacher-student distillation process leverages the power of knowledge distillation by transferring knowledge from a large model (teacher) to a smaller model (student). To our best knowledge, numerous existing studies broadly exploit Seq2seq models. Despite their efectiveness for dialogues and summarization of short texts, such techniques have not become common for supporting multilingual and long input contexts. To bridge the gap in exploring distillation tuning, this paper proposes an adaptation of the teacher-student framework for decoder based systems. In this paper, we experiment with a teacher-student framework for summarising clinical case reports. We adopt the Qwen2.5 models family and evaluate our setup on the MultiClinSumsmall dataset. We demonstrate that ifne-tuning the 0.5B model with the knowledge transferred from the 72B model results in 2.4%-4% performance increment by Rouge metrics compared to the conventional fine-tuning process, highlighting our model's practical benefits in clinical information processing. Our framework is publicly available: https://github.com/nicolay-r/ distil-tuning-llm</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Large Language Model</kwd>
        <kwd>Hybrid Distillation</kwd>
        <kwd>Clinical Report Summarization</kwd>
        <kwd>Multilingual Summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Text summarization is a task of shortening textual content while preserving crucial information. The
approaches on automated shortening of the textual content are commonly divided into: extractive
methods (keeping salient segments) and abstractive methods (essay generation). As a task within
the clinical domain, textual summarization lies at the intersection of various information retrieval
challenges, including but not limited by question-answering [1], entities extraction [2]. The texts to
be summarised may vary in length, ranging from short texts (conversational dialogues [3]) to long
narratives (clinical case reports [4]).</p>
      <p>The advent of transformer-based architectures [5] with appearance of self-attention [5] caused a
significant impact on automated text translation systems and as a result Seq2seq systems [6, 7, 8] and
decoder-based solutions [9]. However, the benefit of attention comes at the cost of quadratic complexity
with respect to the input sequence length. Such tradeof raised a number of further works on attention
sparsification techniques [8, 10]. However, the most recent tendency towards exploiting pretrained
generalized systems [9, 11, 12, 13] shaped architectural concepts towards such factors as (i) scalability,
and (ii) alignment with next-token prediction training; for which decoder-based systems are suited better.
The generalized approach of adoption models for various problems results in so-called instruction-tuned
models [9]. Despite the vast amount of benefits and adaptation for the downstream tasks, the trade-of
of such systems is their scale. Such factor requires adoption of the specific fine-tuning techniques.</p>
      <p>BioASQ [14] represents one of the most recent competition challenges on biomedical semantic
indexing and question answering. The MultiClinSum challenge [4] dedicated to advance automated
long texts summarization systems in multilingual conditions. In this paper we propose a system that
represent a decoder-based distillation framework for multilingual clinical case report summarization [4].
Our approach exploits distillation technique for transferring clinical key information derived from
reports via large (teacher) model to its smaller scaled (student) model. The contribution of these studies
are two fold:
• We propose distillation framework with role-based dialogue modeling notation [15, 9] for enhancing
small-scaled models (student models) with clinical key information derived from reports via
large-scaled model (teacher model); the designed system exploits system, user, and assistant roles
which are commonly supported by instruction-tuned models [12, 11, 13, 16].
• We experiment with decoder-based distillation technique adaptation in clinical case report
summarization task for Qwen-2.5 models family [12]; we demonstrate that extracting clinical key
information from large-scaled (72B params) teacher model (ClinicalKeyInfosmall dataset) and using
this information in tuning of small-scaled (0.5B params) student model results in 2.4%-4% on
MultiClinSumsmall [4] while at evaluation stage.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Medical summarization research has evolved through two key approaches: addressing data scarcity
and refining distillation techniques. Recent progress in Natural Language Processing (NLP), the field of
building systems that understand and generate human language, has been driven by Large Language
Models (LLMs)—neural networks trained on large-scale text corpora. These models have recently been
applied in medical settings: for example, [17] introduced a hybrid distillation framework using LLMs to
enhance medical term extraction. This builds on earlier pointer-generator models [18] and training-free
methods like SummQA [19].</p>
      <p>Knowledge Distillation has been efective for model compression [ 20, 21], with [22] proposing a
foundational step-by-step method that uses LLMs-generated rationales to supervise small models. Liu
et al.extended this approach to medical summarization with concept-level supervision [17]. However,
both eforts focus on encoder-decoder architectures and leave the challenges of domain adaptation and
decoder-only models underexplored.</p>
      <p>Encoder-decoder models have been widely adopted for summarization tasks due to their strong
sequence-to-sequence performance [23]. T5 [6] unified various NLP tasks into a text-to-text framework,
while LongT5 [10] extended this architecture with sparse attention to better handle long sequences.
mT5 [7] further scaled the T5 framework to cover over 100 languages, enabling multilingual
summarization. These models have been applied to medical summarization benchmarks such as PubMed [2],
MultiMedQA [24], and MTS-Dialogue [3]. Despite these advances, such models often face challenges in
clinical summarization scenarios, where inputs are lengthy, domain-specific, and multilingual. Their
ifxed-length encoders and tokenizer limitations hinder generalization across diverse note types and
clinical terminologies [23]. Besides T5-series, the other existing alternatives designed for long-input
summarization, such as BigBird [25] and LED [8], require specialized pretraining and remain
computationally intensive, making them less practical for real-world multilingual clinical applications.</p>
      <p>In contrast, decoder-only models generate text sequentially, conditioning each token on the previously
generated context. Recent work has explored applying these models to summarization tasks through
prompting and fine-tuning, particularly in settings where large-scale instruction data is available [ 26, 27].
ChatGPT [9] and LLaMA-2-Chat [13] have been used as instruction-tuned models for abstractive
summarization in zero-shot or few-shot settings, where the models are prompted with summarization
tasks without further supervised training. These systems have shown competitive results on both
general-domain and biomedical summarization benchmarks, including BioASQ [14] and MEDIQA [19].
In the context of knowledge distillation, decoder-only architectures have been leveraged as both teachers
and students, allowing smaller generative models to learn the reasoning steps and instruction mapping
behaviour demonstrated by larger models [28, 29, 30]. These studies show that decoder-only student
models can benefit from rationale-augmented supervision and multi-task distillation [ 17], enabling
efective transfer of reasoning ability and task generalization. Furthermore, such models ofer practical
advantages in handling longer input sequences and multilingual instruction formats, motivating recent
extensions [16, 12], which support extended context lengths and multilingual tokenization.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        To enhance both faithfulness and medical relevance, we propose two-stage distillation-based
framework, as illustrated in Figure 1. The framework takes as input: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) training collection, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) large-scale
instruction-tuned teacher model, (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) small-scaled decoder-only student model. The result of the
framework application is a fine-tuned student model capable of generating more accurate summaries.
      </p>
      <p>
        Stage 1 refers to a process of teacher model application for clinical key information extraction
from training collection. Stage 2 refers to student model fine-tuning process with dual supervision:
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) supervision from the reference summary to the original clinical case report, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) supervision
from the extracted clinical key information to the one obtained from the teacher model. The design of
the proposed system relies on models that support role-based dialogue modelling notation, and ChatML
format1 in particular. Specifically, the input is structured as a sequence of , , and 
roles.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Stage 1: Knowledge Extraction by the Teacher Model</title>
        <p>Given (i) extraction prompt and (ii) Training Collection we adopt Teacher Model (Figure 1) in a zero-shot
setting to infer textual responses (treated as Clinical Key Information). The format of the input data for
Teacher Model of each clinical case report from Training Collection is as follows:</p>
        <p>( system: extraction prompt, user: clinical case report, assistant: ∅ )</p>
        <p>Where «∅» refers to the absence of the  role. The results of this stage represent a Clinical
Key Information collection (see Figure 1), which we use for the Stage 2.</p>
        <sec id="sec-3-1-1">
          <title>1https://platform.openai.com/docs/guides/text</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Stage 2: Fine-tuning Student Model</title>
        <p>Given (i) reports from Training Collection and (ii) Clinical Key Information collection (obtained from
Stage 1, Section 3.1), we adopt Student Model in the distillation tuning process. Further in this section
we declare methodology for evaluating the alignment of the student model towards the expected output,
followed by construction of the combined loss function.</p>
        <p>Our methodology of student model fine-tuning assumes a hard alignment of the output towards raw
textual content with no additional span annotations. We use strict position-wise cross-entropy (hard
alignment) in the fine-tuning process. If the two sequences difer in length, comparison continues
position-wise until either sequence ends, with any remaining positions scored against the
end-ofsequence symbol. Let the input formatted sequence as x, ground truth answer as y = (1, . . . ,  ),
inferred text from student model as y^ = (^1, . . . , ^^ ) ( and ^ denotes the total number of tokens
for y and y^ respectively). For the hidden state of the student model ( ), and step  (index of generated
token), we define strict position-wise loss calculation () as follows:</p>
        <p>(y, y^, , x) = − log  (︀ ^ =  | ^&lt;, x)︀</p>
        <p>We use the formula above in separate calculations of extraction loss and summarization loss (see
Figure 1).</p>
        <p>For the extraction supervision, the input is structured as:</p>
        <p>( system: extraction prompt, user: clinical case report, assistant: clinical key information )
Given the extraction-formatted input sequence (x), output from student model y^e, clinical key
information y, we compute extraction loss (ℒext) as sum from step start that marks the first token of
the assistant segment x to the final token position :
ℒext =

∑︁ (y, y^, , x)
=start</p>
        <sec id="sec-3-2-1">
          <title>For the summarization supervision, the input is constructed as:</title>
          <p>( system: summarization prompt, user: clinical case report, assistant: summarized case report )
Given the summarization-formatted input sequence (x), output from student model y^s, summarized
case report y, we compute summarization loss (ℒsum) as sum from step start that marks the first token
of the assistant segment x to the final token position :
ℒext =

∑︁ (y, y^, , x)
=start
ℒ =  ℒsum + (1 −  ) ℒext</p>
          <p>Finally, we calculate the combined loss (ℒ) as superposition of the losses with the decay coeficient
( ):</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>Data preparation. The complete set of the available annotated data represent texts with summaries
written in four diferent languages: English, Portuguese, Spanish, and French. For each language, the
originally provided reports with their summaries portioned into two groups [4]: small (≈ 500 texts per
language), and large (≈ 25000 texts per language). In this work we utilize only small groups in our
experiments, majorly due to both (i) limitation of computational resources and (ii) time required for
experiments organization. In further, we refer to a small part of the collection as MultiClinSumsmall.
Table 1 illustrates the statistics of clinical case reports and summaries for MultiClinSumsmall.</p>
      <p>Knowledge Extraction by the Teacher Model (Stage 1). As for the teacher model we adopt
Qwen-2.5-72B-instruct2 for extracting clinical key information from clinical case reports. Given clinical
case report () from Training Dataset (MultiClinSumsmall), we use the following extraction prompt3:
“Extract the key information from clinical text: ”. We denote the result dataset composed with the
selected teacher model as ClinicalKeyInfosmall. Table 2 illustrates the statistic of the ClinicalKeyInfosmall,
separately for reports written in each language.</p>
      <p>Data-split. All the reports were divided into three train/valid/test subsets with the following
proportion of 80%, 1%4, and 19% respectively. For the test, the 456 of original reports were chosen (19%
of MultiClinSumsmall).</p>
      <p>
        Fine-tuning. In these studies we consider fine-tuning a single model instance for all languages /
subtasks. We use GoogleColab service and publish Jupyter-Notebook at the project repository. We rent
a single instance with NVidia A100 40GB VRAM with 80GB RAM. To accomplish this goal, we consider
model instruction Qwen-2.5-0.5B which both (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) covers a set of languages utilized in MultiClinSumsmall
and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) support long context input. Towards the setup of the model parameters for the fine-tuning
process. We majorly refer to the initial list of parameters proposed for Qwen-2.5-VL5. In particular, we
use bf16 mode precision for model weight representation in memory. We set  (see Section 3.2) to 0.8
according to the earlier organized studies [17]. We limit6 the amount of input tokens to 3078, among
which 2566 were assigned altogether for system and user roles and remaining 512 for the assistant role.
For the output, we set a limit of 768 tokens in accordance to the mean-length statistic for summaries,
mentioned in Table 1. The statistics of the MultiClinSumsmall dataset train and valid subsets, cropped
by max threshold presented in Table 3. Towards the formatting of the input data, for the given clinical
report () we use the following text summarization prompt7: “Summarize clinical text: ”. We pad
input data by maxim length calculated across all the formatted texts at the dataset tokenization stage.
      </p>
      <p>Models Evaluation. We assess the performance of the fine-tuning models every 250 optimization
steps by using texts from a valid set (see Table 1). As for the metrics, we assess rouge-1, rouge-2, rouge-L,</p>
      <sec id="sec-4-1">
        <title>2https://huggingface.co/Qwen/Qwen2.5-72B-Instruct</title>
        <p>3In this work we limit our analysis on relevant types of information in depth.
4Such a small amount majorly due to computational issues caused by OOM Trainer Exception using Huggingface library
5https://github.com/QwenLM/Qwen2.5-VL
6Limitation of the utilized computational resources
7The prompt for the explanation is similar to the one utilized at the Stage 1 (see Section 3)
rouge-Lsum. We adopt the policy of keeping the best performing instance of the model throughout the
whole fine-tuning process.</p>
        <p>Inference: We use T4 GPU with 16GB VRAM hosted by GoogleColab to infer the results of the
following versions of the Qwen2.5-0.5B-Instruct8. Since the Qwen2.5 models input context window size
significantly exceed the assigned amount of tokens for input, we increase this threshold for up to 16384
tokens. We follow a similar policy of the restrictions towards the amount of output tokens. We set
the max amount of output generated tokens to 1024 which surpasses the mean amount of characters
in average per summary (see Table 1). We use localized summarization prompts to align output
with the source language utilized in original report. The following templates for summarization of
non-English clinical case report () were used: "Resumir texto clínico: " (Portuguese), "Resumir el texto
clínico: " (Spanish), "Résumer le texte clinique: " (French).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Result Analysis and Discussion</title>
      <p>Following the fine-tuning procedure organization described in Section 4, we prepare models:
• Qwen2.5-0.5B: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
• Qwen2.5-0.5Bstandard: https://huggingface.co/nicolay-r/qwen25-05b-multiclinsum-standard
• Qwen2.5-0.5Bdistil: https://huggingface.co/nicolay-r/qwen25-05b-multiclinsum-distil</p>
      <p>To fit in the 40GB VRAM limitation, we set BatchSize=2 for the Qwen2.5-0.5Bstandard version9 and
BatchSize=1 for the Qwen2.5-0.5Bdistil. Figure 2 illustrates the analysis of variation of the results
obtained on MultiClinSumsmall (valid) for 3 individual fine-tuning runs and 10 evaluation steps. According</p>
      <sec id="sec-5-1">
        <title>8https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct</title>
        <p>9We noticed that attempting to fine-tune Qwen2.5-0.5B standard with smaller BatchSize of 1 results in worse performing model
among all the rouge metrics (except Rouge-1).
to the related analysis, using Qwen2.5-0.5Bdistil results in 2.4%-4% of the improved performance in
comparison with the conventional fine-tuning (Qwen2.5-0.5B standard).</p>
        <p>To obtain the results we follow the inference setup mentioned in Section 4. We infer our models on
non-oficial test subset (testnon-oficial ) and on oficially provided test sets ( testoficial ). Due to limited
amount of time, for testoficial we reduced the total number of generated token by half (up to
512) from the initially defined limit (see inference details, Section 4) to gain inference performance.
We use bulk-chain10 library to perform inference with the custom implementation of the Qwen2.5
provider based on the HuggingFace pipelines API11. We manually implement evaluation for testnon-oficial
which difers from one utilized by competition organizers for testoficial . In particular, to evaluate
results on testnon-oficial we adopt Rouge metrics (see Evaluation, Section 4) and BertScore based on
DistilBERTbase-uncased [31].</p>
        <p>Non-oficial Evaluation Results. Table 4 provides the results on testnon-oficial for baseline,
Qwen2.5-0.5Bstandard, Qwen2.5-0.5Bdistil models. According to the obtained results, we first noticed
the gap in the results on English subtask in comparison with all the other languages. Towards the
results of the particular models, using fine-tuning techniques outperform the Qwen2.5-0.5B approach
by ≈ 1% (BertScoreF1) and ≈ 8%-20% in average for Rouge results. Towards the individual subtasks,
using Qwen2.5-0.5Bdistil for summarization clinical reports in English results in ≈ 1-2% (Rouge) over
Qwen2.5-0.5Bstandard. In the case of all the other non-English subtask, both fine-tuned version of the
student model (Qwen2.5-0.5Bstandard and Qwen2.5-0.5Bdistil) illustrate relatively similar performance. As
for the content comparison, we noticed that Qwen2.5-0.5Bdistil model tend contains a high proportion of
words that are semantically similar to words in the reference sentence (high BertScore precision) unlike
other models and in exchange of the lowered recall. We believe that such an efect is due to enhanced
alignment of the composed report summarization to the ground truth texts in MultiClinSumsmall while
at during Stage 2 (Section 3.2).</p>
        <p>Oficial Submission Results. In the case of oficial evaluations, organisers utilise BertScore and
rouge-score that involve evaluation of precision, recall, and F1-measure. Due to limited amount of time
during the test stage, it was decided to submit the results for Qwen2.5-0.5Bdistil model. The results
evaluation on testoficial for the summaries obtained by Qwen2.5-0.5Bdistil model illustrated in Table 5.
Similar to the results on testnon-oficial , we noticed significant gap in the results performance for the
reports written in English comparing with the results obtained for reports written in the other languages.</p>
        <p>
          Limitations Discussion and Further Works According to the obtained results, we believe that
application of the proposed methodology could be enhanced in following directions: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) enlarging of
10https://github.com/nicolay-r/bulk-chain
11https://huggingface.co/docs/transformers/main_classes/pipelines
the training data, (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) highlighting relevant features for clinical key information, (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) overcoming hard
alignment in student model fine-tuning (Section 3.2). In particular, we see no technical limitations in
adaptation of larger dataset. According to the MultiClinSum statistic mentioned in data preparation
of Section 4, we believe that switching from MultiClinSumsmall to MultiClinSumlarge result in 5-times
longer process (excluding the time required for evaluation steps). However, we believe that addressing
limitations for other directions (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) and (
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) is crucial for employing MultiClinSumlarge. With the existing
extraction prompt, observations regarding the most relevant features mention in extracted clinical key
information are considered out of scope. The extraction of such relevant features from outputs of student
model could also address on hard-alignment limitation in model fine-tuning process (Section 3.2).
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper we propose a system for automated clinical case report summarization within the scope of
the MultiClinSum challenge. Our approach exploits distillation framework for fine-tuning small-scaled
(student) decoder-based models by relying on clinical key information derived reports via large-scaled
model (teacher model). Unlike previously existing work on distillation technique adaptation for Seq2seq
architectures, our system is dedicated for decoder-based models that support role-based dialogue
modelling notation. We assess our approach on MultiClinSum reports written in English, Portuguese,
French and Spanish. According to the related analysis, the use of the proposed distillation framework for
Qwen-2.5 model series results in a 2.4%-4% better performing model (Qwen2.5-0.5Bdistil) on validation
data compared to the one fine-tuned with the conventional approach (Qwen2.5-0.5B standard). From
our final evaluation on test data, we conclude that the Qwen2.5-0.5B distil model surpasses
Qwen2.50.5Bstandard by ≈ 1-2% in the summarization clinical reports written in English.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>AI tools were used for: rephrasing sentence and paragraphs to enhance reading quality (abstract and
introduction), grammar correction and spell check (all sections).
[4] M. Rodríguez-Ortega, E. Rodríguez-Lopez, S. Lima-López, C. Escolano, M. Melero, L. Pratesi,
L. Vigil-Gimenez, L. Fernandez, E. Farré-Maduell, M. Krallinger, Overview of multiclinsum task
at bioasq 2025: evaluation of clinical case summarization strategies for multiple languages: data,
evaluation, resources and results., in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), CLEF 2025
Working Notes, 2025.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, Advances in neural information processing systems 30 (2017).
[6] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning
research 21 (2020) 1–67.
[7] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Rafel, mT5: A
massively multilingual pre-trained text-to-text transformer, in: K. Toutanova, A. Rumshisky,
L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou
(Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Association for Computational
Linguistics, Online, 2021, pp. 483–498. URL: https://aclanthology.org/2021.naacl-main.41/. doi:10.
18653/v1/2021.naacl-main.41.
[8] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint
arXiv:2004.05150 (2020).
[9] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot
learners, in: Proceedings of the 34th International Conference on Neural Information Processing
Systems, NIPS ’20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[10] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, LongT5: Eficient
textto-text transformer for long sequences, in: M. Carpuat, M.-C. de Marnefe, I. V. Meza Ruiz
(Eds.), Findings of the Association for Computational Linguistics: NAACL 2022, Association for
Computational Linguistics, Seattle, United States, 2022, pp. 724–736. URL: https://aclanthology.
org/2022.findings-naacl.55/. doi: 10.18653/v1/2022.findings-naacl.55.
[11] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,</p>
      <p>S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[12] A. Yang, B. Yang, B. Zhang, B. H. et. al., Qwen2.5 technical report, 2025. URL: https://arxiv.org/
abs/2412.15115. arXiv:2412.15115.
[13] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P.
Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint
arXiv:2307.09288 (2023).
[14] A. Nentidis, G. Katsimpras, A. Krithara, M. Krallinger, M. Rodríguez-Ortega, E. Rodriguez-López,
N. Loukachevitch, A. Sakhovskiy, E. Tutubalina, D. Dimitriadis, G. Tsoumakas, G. Giannakoulas,
A. Bekiaridou, A. Samaras, G. Maria Di Nunzio, N. Ferro, S. Marchesin, M. Martinelli, G. Silvello,
G. Paliouras, Overview of bioasq 2025: The thirteenth bioasq challenge on large-scale biomedical
semantic indexing and question answering, in: C.-d.-A. Jorge, G. Julio, P. Laura, G. S. d. H.
Alba, M. Josiane, P. Florina, R. Paolo, S. Damiano, F. Guglielmo, F. Nicola (Eds.), Experimental IR
Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International
Conference of the CLEF Association (CLEF 2025), 2025.
[15] Y. Zhang, S. Sun, M. Galley, Y.-C. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, B. Dolan, Dialogpt:
Large-scale generative pre-training for conversational response generation, in: ACL, system
demonstration, 2020.
[16] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur,</p>
      <p>A. Schelten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).
[17] X. Liu, M. Huang, N. Rusnachenko, J. Ive, J. Chang, J. J. Zhang, Enhancing medical dialogue
summarization: A mediextract distillation framework, in: 2024 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), IEEE, 2024, pp. 6466–6473.
[18] A. Joshi, N. Katariya, X. Amatriain, A. Kannan, Dr. summarize: Global summarization of medical
dialogue by exploiting local structures., in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association
for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online,
2020, pp. 3755–3763. URL: https://aclanthology.org/2020.findings-emnlp.335/. doi: 10.18653/v1/
2020.findings-emnlp.335.
[19] Y. Mathur, S. Rangreji, R. Kapoor, M. Palavalli, A. Bertsch, M. R. Gormley, Summqa at mediqa-chat
2023: In-context learning with gpt-4 for medical summarization, in: Clinical Natural Language
Processing Workshop, 2023. URL: https://api.semanticscholar.org/CorpusID:259309155.
[20] A. Alkhulaifi, F. Alsahli, I. Ahmad, Knowledge distillation in deep learning and its applications,</p>
      <p>PeerJ Computer Science 7 (2020). URL: https://api.semanticscholar.org/CorpusID:220632998.
[21] L. Wang, K.-J. Yoon, Knowledge distillation and student-teacher learning for visual intelligence: A
review and new outlooks, IEEE Transactions on Pattern Analysis and Machine Intelligence 44
(2020) 3048–3068. URL: https://api.semanticscholar.org/CorpusID:215745611.
[22] C.-Y. Hsieh, C.-L. Li, C.-k. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, T. Pfister,
Distilling step-by-step! outperforming larger language models with less training data and smaller
model sizes, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for
Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada,
2023, pp. 8003–8017. URL: https://aclanthology.org/2023.findings-acl.507/. doi: 10.18653/v1/
2023.findings-acl.507.
[23] N. Rusnachenko, N. D. Nguyen, et al., Pre-training longt5 for vietnamese mass-media
multidocument summarization, Journal of Mathematical Sciences 285 (2024) 88–99.
[24] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl,
H. Cole-Lewis, et al., Toward expert-level medical question answering with large language models,
Nature Medicine (2025) 1–8.
[25] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula,
Q. Wang, L. Yang, et al., Big bird: Transformers for longer sequences, Advances in neural
information processing systems 33 (2020) 17283–17297.
[26] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
A. Ray, et al., Training language models to follow instructions with human feedback, Advances in
neural information processing systems 35 (2022) 27730–27744.
[27] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Hajishirzi, Self-instruct: Aligning
language models with self-generated instructions, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 13484–13508.</p>
      <p>URL: https://aclanthology.org/2023.acl-long.754/. doi:10.18653/v1/2023.acl-long.754.
[28] N. Calderon, S. Mukherjee, R. Reichart, A. Kantor, A systematic study of knowledge distillation
for natural language generation with pseudo-target training, in: A. Rogers, J. Boyd-Graber,
N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada,
2023, pp. 14632–14659. URL: https://aclanthology.org/2023.acl-long.818/. doi:10.18653/v1/2023.
acl-long.818.
[29] K. Shridhar, A. Stolfo, M. Sachan, Distilling reasoning capabilities into smaller language models,</p>
      <p>Findings of the Association for Computational Linguistics: ACL 2023 (2023) 7059–7073.
[30] J. Ko, T. Chen, S. Kim, T. Ding, L. Liang, I. Zharkov, S.-Y. Yun, Distillm-2: A contrastive approach
boosts the distillation of llms, arXiv preprint arXiv:2503.07067 (2025).
[31] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, ArXiv abs/1910.01108 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Snider</surname>
          </string-name>
          , G. Adams,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Yetisgen, Overview of the mediqa-sum task at imageclef 2023: Summarization and classification of doctor-patient conversations</article-title>
          ,
          <source>in: CLEF 2023 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domainspecific language model pretraining for biomedical natural language processing</article-title>
          ,
          <source>ACM Transactions on Computing for Healthcare (HEALTH) 3</source>
          (
          <issue>2021</issue>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          , W.-w. Yim,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>An empirical study of clinical note generation from doctor-patient encounters, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>2291</fpage>
          -
          <lpage>2302</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .eacl-main.
          <volume>168</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>