<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3390/app14135809</article-id>
      <title-group>
        <article-title>ExtraSum @ MultiClinSum: Extractive Summarization of English, Spanish, French and Portuguese Clinical Case Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soukaina Rhazzafe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Colreavy-Donnelly</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikola S. Nikolov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Systems, University of Limerick</institution>
          ,
          <addr-line>V94 T9PX Limerick</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents an extractive summarization approach for multilingual clinical case reports submitted to the MultiClinSum 2025 shared task. We focused on selecting the ten most important sentences from each report while preserving the original text to ensure factual consistency. Our method compares four extractive techniques: graph based, concept based, topic based and clustering based summarization, tested on English, Spanish, French and Portuguese. Our experiments show that the clustering based summarization using multilingual BERT consistently outperforms the other methods in all languages, with the strongest semantic similarity seen in English. This suggests that multilingual BERT embeddings are effective at capturing the central meaning of clinical texts across different languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Extractive summarization</kwd>
        <kwd>Clinical case reports</kwd>
        <kwd>Clinical text summarization</kwd>
        <kwd>Multilingual text</kwd>
        <kwd>Sentence selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Clinical case reports are a valuable source of detailed medical knowledge, as they provide insights into
the diagnosis, treatment and outcomes of individual patients. They typically include critical information
such as patient demographics, clinical presentation, diagnostic process, treatment and follow-ups
[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. These reports not only support clinical education and research but also form a foundation
for developing automated tools for clinical language processing. However, given their length and
complexity, summarizing these documents effectively and accurately remains a significant challenge,
especially when semantic plausibility, relevance and factual accuracy are required for clinical utility
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>
        The MultiClinSum 2025 shared task [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ] addresses this issue by providing resources and benchmarks
for multilingual summarization of clinical documents, focusing on four languages: English, Spanish,
French and Portuguese. As part of the BioASQ Workshop [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], participants are invited to generate
summaries of clinical case reports using any approach, with evaluations conducted independently per
language. The task highlights the complexity of generating high quality summaries in this domain,
where even human-generated summaries can vary significantly in content and clarity. This complexity
is further compounded by the lack of clear summarization objectives in the literature [7], making the
design and evaluation of summarization systems especially difficult.
      </p>
      <p>
        Clinical case reports resemble discharge summaries in structure and content, making them particularly
suitable for developing summarization models using publicly available data [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ]. However, many prior
works seem to fail to clearly define their summarization objectives, leading to difficulties in evaluating
the clinical relevance of the outputs [7]. Some studies describe their goals in vague terms such as
"significant impressions" or "critical diagnoses," with mismatches between intended audiences and
generated outputs. To address these gaps, there is a growing need for models that produce factually
consistent, targeted and clinically relevant summaries [7].
      </p>
      <p>Extractive summarization has traditionally been favored over abstractive methods in the clinical
domain because it preserves the original phrasing and reduces the risk of hallucination or factual errors
[8]. In our previous work [9], we proposed a hybrid summarization approach for Electronic Health
Records (EHRs) that combined extractive and abstractive techniques to summarize ICU progress notes
and predict patients’ length of stay. Our concept based extractive strategy, in combination with a
T5 model, showed promising results in capturing clinically relevant information while maintaining
coherence.</p>
      <p>Building on this foundation, in the current study we focus exclusively on extractive summarization
techniques for the MultiClinSum shared task. Our approach aims to identify the 10 most important
sentences from each clinical case report in a way that retains factual integrity and relevance. We define
importance based on how central a sentence is in the overall text and whether it covers key clinical
concepts. To achieve this, we implement and compare four extractive methods: a graph based method
based on semantic similarity and sentence ranking using PageRank, a concept based method using
QuickUMLS for medical term extraction and ranking, a topic based method using TF-IDF to identify
topic-salient sentences and, finally, a clustering based method using multilingual BERT embeddings to
extract centroid representative sentences.</p>
      <p>By choosing purely extractive methods, we mitigate the risks of generating factually incorrect or
misleading summaries, an especially critical concern in the clinical domain. Furthermore, our decision
to retain sentences without altering their original text aligns with the need for high transparency,
interpretability and clinical fidelity. This approach also bypasses the common limitations in previous
work, such as unclear objectives and audience misalignment [7].</p>
      <p>The rest of the paper is organized as follows: Section 2 describes our summarization pipeline and
techniques, Section 3 presents evaluation results, Section 4 discusses the results further and finally
Section 5 concludes with key findings and directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Extractive Summarization Techniques</title>
        <p>Extractive summarization is a technique that generates summaries by selecting and concatenating the
most important sentences from the input document. The number of selected sentences is typically
limited by a compression rate, a length cutoff or a predefined threshold [10].</p>
        <p>As shown in Figure 1, the extractive summarization pipeline begins with pre-processing, which
depends on the task and may involve several text cleaning techniques [10, 11]. Common pre-processing
steps include sentence boundary detection, typically based on punctuation such as periods; stop word
removal, which eliminates frequent but non-informative words; and stemming, which reduces words to
their root form to emphasize meaning [12]. Another method includes replacing specific values or terms with
placeholders to improve sentence comparability.</p>
        <p>The processing step involves selecting key sentences from the input text. First, a representation of
the text is created. Then, scores are assigned to each sentence based on features derived from that
representation. Finally, the top ranked sentences are selected according to the summary size constraint
and concatenated to form the summary [10]. This step, including the text representation models, is
introduced in detail in the remainder of this section.</p>
        <p>In the post-processing step, the selected sentences may be reordered to match their original sequence
in the input text. Placeholders introduced during pre-processing are also replaced with the original
content [10, 12].</p>
        <p>There are various extractive summarization techniques available [10]. These methods differ in how
they represent the input text and in how they score and rank sentences for selection. To address the
MultiClinSum challenge, we implemented four such techniques. Some of these methods were adapted
from our previous work [9], where they were applied in a different clinical summarization context.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Graph Based</title>
          <p>Graph based extractive summarization techniques represent sentences as nodes in a graph, where
edges between nodes capture the similarity between pairs of sentences [10]. This method measures
sentence-to-sentence similarity to identify central content. The input text is first converted into a
numerical representation, which is used to calculate sentence similarities and build the weighted graph.
A ranking algorithm is then applied to the graph to identify the most important sentences for the
summary.</p>
          <p>To implement this technique, we represented sentences using Term Frequency-Inverse Document
Frequency (TF-IDF) vectors [13], which assign weights to words based on their frequency and rarity in
the text. The TF-IDF formula is shown in Equation (1), where TF_ij is the Term Frequency of the i-th
word in the j-th document and IDF_i is the Inverse Document Frequency of the i-th word. TF-IDF_ij
is the TF-IDF value of the i-th word in the j-th document; n_ij is the number of occurrences of the i-th
word in the j-th document, while df_i is the number of documents containing the i-th word, and N is
the total number of documents.</p>
          <p>TF-IDF_ij = TF_ij · IDF_i = (n_ij / Σ_k n_kj) · log(N / df_i) (1)</p>
          <p>Pairwise cosine similarity between the TF-IDF vectors was then calculated to create a similarity
matrix, forming the weighted edges of the graph. Cosine similarity was used because it measures
similarity between two vectors and is well suited for comparing TF-IDF representations, which are
continuous and sparse [14]. The PageRank algorithm [15] was then applied to rank sentences by their
centrality within this graph. Finally, the top-ranked sentences were selected and ordered according to
their original sequence in the document to create the summary.</p>
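          <p>The graph based pipeline described above can be sketched in a few lines. This is a minimal illustration using scikit-learn and NetworkX; the function and variable names are ours for illustration and are not taken from the shared-task submission.</p>

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_based_summary(sentences, num_sentences=10):
    """Rank sentences by PageRank centrality in a cosine-similarity graph."""
    # Represent each sentence as a TF-IDF vector (cf. Equation (1)).
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Pairwise cosine similarities form the weighted adjacency matrix.
    similarity = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(similarity)
    # PageRank scores each sentence's centrality in the similarity graph.
    scores = nx.pagerank(graph)
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    # Restore the selected sentences to their original document order.
    return [sentences[i] for i in sorted(top)]
```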
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Concept Based</title>
          <p>Concept based summarization techniques focus on representing sentences using clinical concepts
extracted from an external knowledge base, rather than relying solely on the words themselves. Sentences
are represented by the set of clinical concepts they contain and their importance is determined by
measuring the similarity between these concept sets [10, 16].</p>
          <p>To implement this technique, we used QuickUMLS [17], an open-source tool that extracts clinical
concepts from text by mapping them to the Unified Medical Language System (UMLS) metathesaurus
[18]. For each sentence, clinical concepts were identified and represented as sets. We then computed
pairwise sentence similarities using the Jaccard similarity coefficient [19] over these concept sets,
creating a similarity matrix. Jaccard similarity was selected because it measures the similarity between
two sets and is appropriate for comparing sentences represented as sets of extracted clinical concepts
[14]. This matrix was used to build a weighted graph, where sentences are nodes connected by edges
weighted by their concept similarity. Finally, the PageRank algorithm [15] was applied to rank sentences
based on their importance in the graph structure, with the top-ranked sentences selected to form the
summary.</p>
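          <p>The concept based scoring step can be sketched as follows. Because QuickUMLS requires a local UMLS installation, a toy list of pre-extracted concept sets stands in for its output here; all names are illustrative rather than taken from our implementation.</p>

```python
import networkx as nx
import numpy as np

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def concept_based_summary(sentences, concept_sets, num_sentences=10):
    """Rank sentences by PageRank over a Jaccard-similarity graph of concept sets."""
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = jaccard(concept_sets[i], concept_sets[j])
    # Sentences are nodes; edges are weighted by concept-set similarity.
    scores = nx.pagerank(nx.from_numpy_array(sim))
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```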
          <p>However, the coverage of clinical terminologies across languages is uneven. While QuickUMLS
provides support for several languages, its resources are significantly skewed toward English. Previous
studies have reported that non-English counterparts of the UMLS lack between 65% and 94% of the term
coverage available in English [20]. This limitation can reduce the number and consistency of extracted
concepts in non-English texts, which may impact the effectiveness of this approach in multilingual
settings.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Topic Based</title>
          <p>Topic based summarization techniques focus on identifying the main themes or topics of the document
by measuring how relevant each sentence is to the overall content [10]. Unlike the graph based method,
which relies on sentence-to-sentence similarity, this approach measures sentence-to-document topical
salience.</p>
          <p>To implement this technique, we used TF-IDF vectors [13], as calculated by the formula in Equation (1),
to represent sentences. Sentences with higher overall TF-IDF weights were considered more important
for capturing the main topics. The top-scoring sentences were selected to form the summary.</p>
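          <p>A minimal sketch of this scorer: a sentence's importance is the sum of its TF-IDF weights, so sentences dense in document-salient terms rank highest. The function name is illustrative only.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def topic_based_summary(sentences, num_sentences=10):
    """Select the sentences with the highest overall TF-IDF weight."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Row sums give each sentence's total TF-IDF weight.
    scores = tfidf.sum(axis=1).A1
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top sentences in their original document order.
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```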
        </sec>
        <sec id="sec-2-1-4">
          <title>2.1.4. Clustering Based</title>
          <p>This technique identifies the most central and relevant sentences in a cluster [10]. Sentence centrality
is based on the distance between a sentence and the centroid of the document cluster in vector space,
with sentences closest to the centroid considered the most important.</p>
          <p>To implement this technique, we used the BERT Summarizer [21] from the Python module
"bert-extractive-summarizer". Particularly, we used the "bert-base-multilingual-cased"
variant that has been trained on a corpus of raw Wikipedia texts in 104 languages [22]. This method
applies BERT (Bidirectional Encoder Representations from Transformers) [23] to encode sentences
into contextual embeddings that capture semantic meaning. Then, the embeddings are clustered using
K-means and for each cluster the sentence nearest to the centroid is selected as a summary candidate.
This process enables the extraction of sentences that collectively represent the document’s main content
while preserving the original wording. The number of sentences included in the final summary can be
specified as a parameter, allowing flexible control over the summary length while requiring minimal
manual tuning [24].</p>
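          <p>The centroid-selection step can be sketched as below. Our implementation uses "bert-base-multilingual-cased" embeddings via bert-extractive-summarizer; to keep this sketch self-contained and runnable, TF-IDF vectors stand in for the BERT sentence embeddings, and all names are illustrative.</p>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def clustering_based_summary(sentences, num_sentences=10):
    """Cluster sentence vectors with K-means; keep the sentence nearest each centroid."""
    k = min(num_sentences, len(sentences))
    # Stand-in embeddings; the paper uses multilingual BERT encodings here.
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    chosen = []
    for c in range(k):
        # Among this cluster's members, pick the sentence closest to the centroid.
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    # Selected sentences are returned in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```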
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation Methods</title>
        <p>To evaluate the text summarization systems, both ROUGE and BERTScore metrics were employed.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. ROUGE Scores</title>
          <p>ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation) [25] are a set of metrics used to
evaluate automatic text summarization systems. They compare the generated summaries to
human-written reference summaries by counting overlapping elements, such as n-grams, word sequences or
word pairs, between the system-generated summary and the reference texts. The more overlap, the
higher the score, indicating better alignment with the human-written reference.</p>
          <p>Specifically, the ROUGE-Lsum variant [26] was used, which interprets newline characters as sentence
boundaries and computes the union of the Longest Common Subsequences (LCS) across sentence pairs.
This variant, commonly used in neural summarization research, is well suited for evaluating the overall
structure and content fidelity of full summaries rather than individual sentence matches.</p>
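          <p>To make the metric concrete, a minimal LCS-based ROUGE-L F1 between a candidate and a reference can be computed as below; the official rouge-score package additionally applies stemming and the sentence-level LCS union that defines ROUGE-Lsum.</p>

```python
def lcs_len(a, b):
    """Length of the Longest Common Subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall over tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```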
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. BERTScore</title>
          <p>BERTScore [27] is an evaluation metric that measures how similar a generated text is to a reference
text based on meaning, not just word overlap [28]. Unlike ROUGE, which relies on exact word matches,
BERTScore uses contextual word embeddings from a pre-trained BERT model [27].</p>
          <p>To calculate it, BERTScore first encodes both the candidate summary and the reference summary
using BERT. Then, it compares the words in both texts by measuring how similar their embeddings
are using cosine similarity. These alignments are used to calculate precision, recall and a modified F1
score, which is weighted using inverse document frequency (IDF) to reduce the impact of very common
words [27].</p>
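          <p>The matching step can be sketched with toy vectors standing in for real BERT token embeddings; the optional IDF weighting of the full metric is omitted here, and all names are illustrative.</p>

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Greedy cosine matching: precision matches each candidate token to the
    reference, recall matches each reference token to the candidate."""
    # Normalize rows so dot products equal cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # token-by-token cosine similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)
```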
          <p>BERTScore has been shown to correlate better with human judgments, especially in tasks like
translation and paraphrasing, because it can recognize similar meanings even when different words are
used [29].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <p>Before applying the summarization methods, the clinical text provided in the challenge was
preprocessed. This included converting all characters to lowercase, removing unnecessary spaces and
replacing decimal points with temporary placeholders to avoid confusion with sentence boundaries.
Specifically, to prevent splitting sentences at decimal points in numbers (e.g., "5.3"), all floats, detected
with a regular expression, were temporarily replaced by the same number with "DOT" instead of
the period (for example, "5DOT3"). Additional cleaning steps included normalizing whitespace and
standardizing certain phrases, such as replacing "and/or" with "and or".</p>
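      <p>The decimal-point placeholder step described above can be sketched with two small regular-expression helpers; the function names are ours for illustration.</p>

```python
import re

def protect_decimals(text):
    """Rewrite floats like "5.3" as "5DOT3" so splitting sentences on periods is safe."""
    return re.sub(r"(\d+)\.(\d+)", r"\1DOT\2", text)

def restore_decimals(text):
    """Undo the placeholder after sentence splitting."""
    return re.sub(r"(\d+)DOT(\d+)", r"\1.\2", text)
```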
      <p>The experiments were conducted on the gold-standard training and test datasets provided in the
challenge in four languages: English, Spanish, French and Portuguese [30]. For each language, the
training set consisted of 592 full-text clinical case reports paired with human-written summaries. The
test set contained between 3,396 and 3,469 full-text reports per language (English: 3,396, Spanish: 3,406,
French: 3,469, Portuguese: 3,442).</p>
      <p>
        For each language, the previously described extractive summarization methods were applied to
extract the 10 most important sentences in the text to form a summary. The resulting summaries were
then submitted to the challenge platform [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where they were evaluated using BERTScore and ROUGE
metrics.
      </p>
      <p>We chose to select exactly 10 sentences for each summary to balance coverage and conciseness. The
training texts varied widely in length, ranging from a minimum of 4 sentences to a maximum of 197,
with an average of approximately 27 sentences. Choosing 10 sentences provides a consistent summary
length that is sufficient to capture key information while maintaining brevity.</p>
      <sec id="sec-3-1">
        <title>3.1. English</title>
        <p>The results for the English clinical text are summarized in Table 1. The clustering based approach
performed best overall. It achieved the highest BERTScore F1 (84.91%), indicating that the selected
sentences were semantically close to the human written summaries. It also had the highest ROUGE F1
(24.67%), suggesting better surface level overlap.</p>
        <p>Interestingly, concept based summarization tied with graph based in BERTScore (84.46%), but
outperformed it in ROUGE. This likely reflects the advantage of using clinical concepts extracted
with QuickUMLS, which helped better match medical terms in the reference summaries. Topic based
summarization was also competitive, with a slightly higher BERTScore than the graph and concept
based, likely because TF-IDF helped surface more globally relevant sentences.</p>
        <p>The graph based method had the lowest ROUGE score, which could be because it relies heavily on
sentence-to-sentence similarity. If key clinical terms are not repeated or well connected, this method
might miss important isolated information.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Spanish</title>
        <p>Again, in the case of Spanish, clustering based summarization had the strongest performance in both
metrics. This suggests that sentence embeddings generated from multilingual BERT were effective in
identifying central sentences in Spanish clinical texts. Results are shown in Table 2.</p>
        <p>While concept based summarization performed well in ROUGE (23.98%), its slightly lower BERTScore
indicates that although it picked sentences with overlapping terms, the semantic meaning might have
been less aligned. This might be due to QuickUMLS’s Spanish support being more limited compared to
English, where only 189,563 medical concepts in Spanish were in the database as opposed to 585,453 in
English, affecting concept extraction quality.</p>
        <p>The topic based method did slightly better than the concept based in BERTScore, possibly because
TF-IDF can still capture general topical content even when concept extraction is weaker.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. French</title>
        <p>For French, only the topic based and clustering based methods were applied; results are shown in
Table 3. Similar to the previous languages, clustering based had the strongest performance. This confirms
that encoding-based representations, such as those from BERT, work well across different languages and
domains, including French clinical text. Due to time constraints and a focus on benchmarking strong
baseline methods, we limited our submissions for French and Portuguese to the two most promising
techniques identified during early development: clustering based and topic based summarization.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Portuguese</title>
        <p>Finally, in Portuguese, the clustering based summarizer outperformed the topic based method. The
margin here was smaller than in other languages, but still consistent with the overall trend, as the
clustering based summarization gave the most balanced and accurate summaries across languages.
Results for the Portuguese extractive summarization are shown in Table 4.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Across all four languages, the clustering based extractive summarization method achieved the best
performance in both semantic similarity (BERTScore) and lexical overlap (ROUGE). This suggests
that sentence embeddings from multilingual BERT can effectively capture important content across
different languages. Interestingly, even though multilingual BERT was not trained with an explicit
cross-lingual objective, it still provides strong multilingual representations. This supports previous
findings [22] that multilingual BERT generalizes well across languages for various downstream tasks,
despite being trained only on monolingual Wikipedia data without alignment between languages.</p>
      <p>However, while Multilingual BERT showed consistent performance across languages, English
summaries scored noticeably higher in BERTScore, with an average F1 of 84.60%, compared to 71.50% in
Spanish, 71.90% in French and 71.31% in Portuguese. This indicates stronger semantic similarity between
the generated and reference summaries in English, likely due to better sentence representations.</p>
      <p>Interestingly, Spanish achieved the highest average ROUGE F1 score (24.13%), even surpassing English
(23.09%). This suggests that extractive methods in Spanish may be better at capturing lexical overlap
with human-written summaries.</p>
      <p>Another interesting observation is that the topic based method performed better in Spanish than in
the other languages. This suggests that TF-IDF tokenization may be more effective in Spanish for
this task. One reason for this could be the difference in subword tokenization coverage. Generally,
the tokenization performance of each language is correlated with the language’s subword dictionary
coverage. Spanish has broader coverage in this regard compared to English, which may have contributed
to the stronger performance of TF-IDF in Spanish texts [31].</p>
      <p>
        Despite these insights, the general BERTScore and ROUGE values across all languages remain
relatively low. This highlights the inherent difficulty of extractive summarization in the clinical domain,
where preserving factual accuracy, handling domain specific language and ensuring multilingual
consistency are all critical and challenging tasks [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we compared four extractive summarization methods on clinical case reports in English,
Spanish, French, and Portuguese. Our results showed that the clustering based summarization, using
multilingual BERT embeddings, consistently achieved the best performance in both BERTScore and
ROUGE metrics across all languages. English summaries had the highest BERTScore values, probably due
to stronger sentence representations, while Spanish had the highest ROUGE, which may reflect better
lexical overlap in Spanish, possibly helped by more effective tokenization.</p>
      <p>However, the overall scores were relatively low, especially in ROUGE. Moving forward, we could
explore combining extractive and abstractive summarization. Abstractive methods may offer more fluent
and shorter summaries but should be carefully designed to avoid clinical inaccuracies and hallucinations.
A hybrid approach might give the best of both worlds, accuracy from the extractive summarization and
readability from the abstractive summarization.</p>
      <p>In addition, the decision to retain exactly ten sentences per summary was a heuristic choice aimed at
providing consistent coverage across documents. This parameter was not optimized, and future work
will include ablation studies and more systematic analysis of length sensitivity. In particular, we intend
to investigate how varying summary length affects evaluation metrics in clinical contexts, to better
understand the trade-offs between brevity and content coverage.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This publication has emanated from research conducted with the financial support of Taighde Éireann
– Research Ireland under Grant No. 18/CRT/6223.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>In preparing this manuscript, the authors utilized OpenAI’s ChatGPT to assist with grammar and
spelling corrections, enhance writing clarity and rephrase text for readability. All AI-assisted content
was carefully reviewed and revised by the authors, who assume full responsibility for the final content.</p>
      <p>[25] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL:
https://aclanthology.org/W04-1013/.
[26] Google LLC, rouge-score: A native Python implementation of ROUGE, https://pypi.org/project/rouge-score/,
2022. Version 0.1.2, licensed under Apache-2.0. Implements ROUGE-N, ROUGE-L, rougeLsum,
bootstrap confidence intervals and Porter stemming.
[27] A. Chen, G. Stanovsky, S. Singh, M. Gardner, Evaluating question answering evaluation, in: A. Fisch,
A. Talmor, R. Jia, M. Seo, E. Choi, D. Chen (Eds.), Proceedings of the 2nd Workshop on Machine
Reading for Question Answering, Association for Computational Linguistics, Hong Kong, China,
2019, pp. 119–124. URL: https://aclanthology.org/D19-5817/. doi:10.18653/v1/D19-5817.
[28] M. Hanna, O. Bojar, A fine-grained analysis of BERTScore, in: L. Barrault, O. Bojar, F. Bougares,
R. Chatterjee, M. R. Costa-jussa, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham,
R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, T. Kocmi, A. Martins,
M. Morishita, C. Monz (Eds.), Proceedings of the Sixth Conference on Machine Translation,
Association for Computational Linguistics, Online, 2021, pp. 507–517. URL: https://aclanthology.org/2021.wmt-1.59/.
[29] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation
with BERT, in: International Conference on Learning Representations, OpenReview.net, 2020. URL:
https://openreview.net/forum?id=SkeHuCVFDr.
[30] M. Rodríguez-Ortega, E. Rodríguez-López, S. Lima-López, C. Escolano, M. Melero, L. Pratesi,
L. Vigil-Giménez, L. Fernández, E. Farré-Maduell, M. Krallinger, Multilingual clinical summarization
(MultiClinSum) challenge datasets, 2025. URL: https://zenodo.org/records/15546018. doi:10.5281/zenodo.15546018.
[31] A. Wangperawong, Multilingual search with subword TF-IDF, arXiv preprint arXiv:2209.14281 (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] H. Cohen, How to write a patient case report, American Journal of Health-System Pharmacy 63 (2006) 1888–1892. URL: https://doi.org/10.2146/ajhp060182. doi:10.2146/ajhp060182.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Barcelona Supercomputing Center (BSC), MultiClinSum: Multilingual clinical summarization challenge - task information, https://temu.bsc.es/multiclinsum/task-info/, 2025. Accessed: 2025-06-13.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. Wallace, S. Saha, F. Soboczenski, I. Marshall, Generating (factual?) narrative summaries of RCTs: Experiments with neural multi-document summarization, AMIA Annual Symposium Proceedings 2021 (2021) 605–614.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Elhady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <article-title>Improving factuality in clinical abstractive multi-document summarization by guided continued pre-training</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Duh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          (Volume
          <volume>2</volume>
          : Short Papers),
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>755</fpage>
          -
          <lpage>761</lpage>
          . URL: https://aclanthology.org/2024.naacl-short.66/. doi:10.18653/v1/2024.naacl-short.66.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodríguez-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lima-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Escolano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pratesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vigil-Gimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré-Maduell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Overview of MultiClinSum task at BioASQ 2025: evaluation of clinical case summarization strategies for multiple languages: data, evaluation, resources and results</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <article-title>Overview of BioASQ 2025: the thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>C. de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>