<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of ElCardioCC Task on Clinical Coding in Cardiology at BioASQ 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dimitris Dimitriadis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasiliki Patsiou</string-name>
          <email>spatsiou19@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleonora Stoikopoulou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achilleas Toumpas</string-name>
          <email>toumpasaxilleas@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alkis Kipouros</string-name>
          <email>kipourosalkis@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandra Bekiaridou</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos Barmpagiannos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anthi Vasilopoulou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonios Barmpagiannos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Athanasios Samaras</string-name>
          <email>ath.samaras.as@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitrios Papadopoulos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Giannakoulas</string-name>
          <email>g.giannakoulas@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigorios Tsoumakas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Archimedes, Athena Research Center</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Aristotle University of Thessaloniki</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Elmezzi Graduate School of Molecular Medicine, Northwell Health</institution>
          ,
          <addr-line>Manhasset, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automated clinical coding converts unstructured medical narratives into standardized formats such as ICD-10, supporting research, data analysis, and healthcare management. While much progress has been made for English texts, languages such as Greek remain underexplored, limiting the applicability of such tools in non-English clinical settings. To address this gap, we introduce the ELCardioCC task, part of the BioASQ 2025 challenge. ELCardioCC is a shared task on automated ICD-10 coding of Greek cardiology discharge letters and the extraction of supporting text spans. It comprises three sub-tasks (named entity recognition, entity linking, and multi-label learning with explainable AI) to promote research in underrepresented languages and support the development of medical natural language processing tools beyond English. The task focuses on (i) assigning cardiology-related ICD-10 codes to Greek hospital discharge letters, and (ii) extracting the exact text spans corresponding to each code. ELCardioCC attracted five participating teams with multiple system submissions. A Greek-language clinical dataset of 1,500 de-identified cardiology discharge letters was created and annotated with ICD-10 codes to support these sub-tasks. The dataset includes both structured metadata and unstructured clinical narratives, with annotations performed by medical professionals using a standardized process. Results indicate that state-of-the-art models can be effectively adapted to Greek clinical texts, demonstrating their practical potential for multilingual medical coding; however, further improvements are necessary to achieve optimal performance and fully address the task's challenges.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical coding</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>entity linking</kwd>
        <kwd>multi-label learning</kwd>
        <kwd>explainable AI</kwd>
        <kwd>BioASQ shared task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, accounting for
approximately 32% of all global deaths, according to the World Health Organization (WHO). These
conditions comprise a heterogeneous group of pathologies—including ischemic heart disease, heart
failure, arrhythmias, and cerebrovascular events—that require long-term management and continuous
clinical monitoring. As a result, they generate substantial volumes of medical documentation, often
stored as unstructured free-text in electronic health records (EHRs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The ability to systematically
extract and organize clinically relevant information from these records is increasingly critical for
real-time surveillance, quality assurance, and translational research.
      </p>
      <p>
        A central approach to structuring unstructured clinical data is the assignment of standardized
diagnostic codes, such as those defined by the International Classification of Diseases, 10th Revision
(ICD-10). These codes support data interoperability, facilitate clinical audits and epidemiological
studies, and enable downstream applications including health resource allocation, risk stratification,
and outcome prediction. However, the manual coding process is labor-intensive, expensive, and
susceptible to significant intra- and inter-annotator variability. Prior studies have demonstrated that
even experienced coders frequently diverge in how they map free-text narratives to granular ICD-10
categories [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        To overcome these limitations, automated clinical coding has emerged as a promising solution. By
applying techniques from natural language processing (NLP) and machine learning (ML), it enables
the transformation of free-text medical documents into structured, machine-readable codes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
task presents several modeling challenges, such as recognizing domain-specific medical terminology,
resolving ambiguous or polysemous expressions, identifying context-sensitive entities, and managing
temporality and negation. In the cardiovascular domain, these complexities are further amplified by the
frequent presence of comorbidities—such as hypertension, diabetes, and chronic kidney disease—which
often interact and complicate the coding process [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Moreover, the high dimensionality of the ICD-10 code space—comprising over 70,000 distinct
codes—and the inherently multi-label nature of clinical documentation increase both computational
and methodological complexity [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These challenges are particularly acute in low-resource
environments where large, high-quality annotated corpora are scarce or unavailable. As healthcare systems
transition to more data-driven infrastructures, automated coding is increasingly viewed as foundational
for scalable, efficient, and equitable health information management. Its implementation is especially
critical in high-burden domains like cardiology, where precise and timely codification can influence not
only individual patient care, but also broader policy and funding decisions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Despite substantial advances in clinical NLP, existing automated coding systems are almost exclusively
developed for English-language corpora [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. This language bias presents a significant barrier to the
adoption of automated tools in multilingual healthcare systems. In Greece, for example, the scarcity of
annotated clinical corpora, language-adapted NLP tools, and reliable coding benchmarks has hindered
progress in clinical text mining and intelligent documentation. Consequently, Greek hospitals and
research institutions remain disadvantaged in their ability to adopt modern AI-driven documentation
systems, thereby limiting their participation in international data-sharing initiatives and slowing the
development of interoperable health infrastructure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Addressing these disparities is vital for equitable
access to medical AI, especially in specialties such as cardiology, where precise diagnosis and tracking
of comorbid conditions are critical [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Moreover, with the increasing digitization of health records in
Greece and other underrepresented countries, there is a timely need for language- and domain-adapted
benchmarks that can catalyze the development of accurate, explainable, and generalizable NLP models.
      </p>
      <p>
        To address this gap, we introduce ELCardioCC, a shared task and competitive benchmark that
is part of the BioASQ 2025 challenge [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The task focuses on two main objectives: (i) assigning
cardiology-related ICD-10 codes to discharge letters from Greek hospitals, and (ii) extracting specific
mentions of ICD-10 codes from the discharge letters. Designed as both a research task and a competition,
ELCardioCC invites participants to develop and evaluate automated systems on these objectives. To
structure the challenge, the task is divided into three sub-tasks. In the Named Entity Recognition (NER)
sub-task, participants identify clinical entity mentions in the text. In the Entity Linking (EL) sub-task,
those mentions must be linked to the appropriate ICD-10 codes. The third sub-task, Multi-label Learning
&amp; Explainable AI (MLC-X), addresses the same goals as NER and EL, but encourages the use of models
and methodologies from multi-label learning and explainable AI to enhance both performance and
interpretability.
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 provides an overview of the ELCardioCC
task, including its objectives, sub-tasks, evaluation metrics, and baseline systems. Section 3 describes the
dataset and presents key statistics. Section 4 outlines the methods, models, and algorithms submitted by
participating teams. Section 5 reports the results and performance analysis, while Section 6 presents the
related work. Finally, Section 7 concludes the paper and discusses potential directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview of the Shared Task</title>
      <p>In this section, we provide a detailed description of the ELCardioCC shared task. We begin with an
overview of the task objectives, the dataset, the submission and evaluation process. Next, we define the
individual sub-tasks. We then outline the evaluation framework and, finally, present our baselines for
each sub-task.</p>
      <sec id="sec-2-1">
        <title>2.1. Description</title>
        <p>Participants in the ELCardioCC task were tasked with developing NER, EL and MLC-X systems using
a specialized corpus of discharge letters from the cardiac department of a Greek hospital. These
discharge letters, which were written in Greek, contained valuable medical information about patients’
conditions, treatments, and outcomes. The corpus was annotated with the positions of mentions and
their corresponding ICD-10 codes.</p>
        <p>The provided datasets were divided into a development dataset and an unseen test dataset. The
development dataset consists of letters that came with gold-standard annotations, indicating the
positions of mentions in the text and mapping them to their respective ICD-10 codes. In contrast, the
test dataset only contained the raw discharge letters without any annotations, leaving participants to
apply their models to identify mentions and link them to the appropriate ICD-10 codes.</p>
        <p>For submission, participants were required to prepare and submit JSON files containing their results
for both tasks. The NER submissions included a list of start and end positions for the mentions
identified within the text, while the EL submissions consisted of the identified mentions along with
the ICD-10 codes they were linked to. For MLC-X submissions, each letter was accompanied by a list
of corresponding ICD-10 codes, as well as a list of start and end positions for terms identified by an
explainable AI model as contributing most significantly to the determination of these ICD-10 codes.
The Explainable AI sub-task was optional. Each team was allowed to submit up to five different runs for
each task, enabling them to explore different strategies or model configurations for better performance.</p>
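        <p>To make the expected submission structure concrete, the following minimal sketch builds illustrative JSON payloads for the three sub-tasks. The field names and example values are assumptions for illustration only, not the official schema distributed to participants.</p>

```python
import json

# Hypothetical submission payloads; field names and values are
# illustrative assumptions, not the official ELCardioCC schema.
ner_run = {
    "document_id": "doc_001",
    "mentions": [{"start": 120, "end": 145}],  # character offsets of spans
}
el_run = {
    "document_id": "doc_001",
    "mentions": [{"start": 120, "end": 145, "icd10": "I10"}],
}
mlcx_run = {
    "document_id": "doc_001",
    "codes": ["I10", "I48.0"],
    # optional span-level explanations for the Explainable AI step
    "explanations": [{"icd10": "I10", "start": 120, "end": 145}],
}

submission = json.dumps({"ner": ner_run, "el": el_run, "mlcx": mlcx_run})
```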
        <p>The evaluation process compared the predictions made by the participating teams against manual
annotations provided by clinical experts. Details of the evaluation process and metrics can be found in
Section 2.3.</p>
        <p>Figure 1 provides an overview of the workflow for the ELCardioCC shared task.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sub-Tasks</title>
        <p>The task was divided into three main sub-tasks. In the first sub-task, participants focused on NER, where
they were required to identify all mentions present in discharge letters, along with their corresponding
start and end positions. This phase involved detecting 5 types of mentions: chief complaint, diagnosis,
prior medical history, drugs and cardiac echo, which are often complex and context-dependent, especially
when working with medical language in Greek.</p>
        <p>The second sub-task, EL, involved mapping the recognized mentions to their corresponding ICD-10
codes. Once the mentions were detected either in the first phase or by using any other approach for
mention identification, participants had to generate a list of candidate ICD-10 codes for each entity.
This process required a deep understanding of both the clinical context and the ICD-10 classification
system. The generated ICD-10 codes were expected to accurately represent the identified mentions and
provide a standardized classification for the mentions.</p>
        <p>In the MLC-X sub-task, participants’ systems were tasked with identifying all relevant ICD-10 codes
contained within each discharge letter. Unlike the EL sub-task, where participants first needed to
recognize specific mentions within the text and subsequently map them to their corresponding
ICD10 codes, the MLC-X sub-task allowed for a more direct approach. Participants were not required
to explicitly identify mentions in the text; instead, they could employ alternative techniques. This
distinction enabled participants to focus on optimizing classification strategies tailored to the complex,
multi-label nature of the problem, emphasizing precision and recall in capturing all applicable codes
from the unstructured medical text. In a subsequent optional step, the systems identify the mentions
corresponding to the ICD-10 codes using explainable AI techniques. The terms highlighted by these
techniques can be regarded as contributing significantly to the identification of the ICD-10 codes.</p>
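        <p>The document-level part of MLC-X can be viewed as standard multi-label classification over a fixed code vocabulary. A minimal sketch of the target representation, with a hypothetical four-code label space and hypothetical documents:</p>

```python
# Sketch: representing document-level ICD-10 assignments as binary
# label vectors for multi-label learning. The code vocabulary and
# documents below are illustrative, not drawn from the actual dataset.
label_space = ["I10", "I48.0", "E11.9", "N18.9"]

def binarize(codes, label_space):
    """Map a set of assigned ICD-10 codes to a 0/1 indicator vector."""
    assigned = set(codes)
    return [1 if code in assigned else 0 for code in label_space]

doc_codes = {"doc_001": ["I10", "E11.9"], "doc_002": ["I48.0"]}
vectors = {doc: binarize(codes, label_space) for doc, codes in doc_codes.items()}
# vectors["doc_001"] == [1, 0, 1, 0]
```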
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Evaluation</title>
        <p>During the evaluation, participants’ submissions were compared against the ground-truth annotations
extracted by clinical experts, which were not provided to the participants. A strict exact match criterion,
tailored to each task, was applied to compare ground truth and predicted outputs without any further
preprocessing.</p>
        <p>Specifically, for the NER sub-task, participants submitted lists of the start and end positions for
mentions found in the discharge letters. A prediction was considered correct only if both the start and
end positions exactly matched the gold standard annotations. In the EL sub-task, participants were
required to both identify text spans and assign ICD-10 codes. Similarly, a prediction was considered
correct only if both the span boundaries and the assigned code exactly matched the gold annotations,
with no partial matches accepted.</p>
        <p>The evaluation was based on these exact matches and was measured, in both tasks, using Precision,
Recall, and F1 Score, defined as follows:</p>
        <p>Precision = True Positives / (True Positives + False Positives)</p>
        <p>Recall = True Positives / (True Positives + False Negatives)</p>
        <p>F1 Score = 2 × (Precision × Recall) / (Precision + Recall)</p>
        <p>The first part of the MLC-X task did not involve predicting text spans but focused on identifying
all relevant ICD-10 codes per document. For evaluation, the predicted and gold codes were treated
as sets and compared for each document to determine true positives (correctly predicted codes), false
positives (codes predicted but not in the gold data) and false negatives (gold codes not predicted). The
precision, recall and F1 score were then calculated based on the aggregated counts across all documents
as follows:</p>
        <p>Precision = ∑ᵢ |Yᵢ ∩ Ŷᵢ| / ∑ᵢ |Ŷᵢ|</p>
        <p>Recall = ∑ᵢ |Yᵢ ∩ Ŷᵢ| / ∑ᵢ |Yᵢ|</p>
        <p>where the sums run over all N documents, Ŷᵢ is the set of predicted codes for document i, and Yᵢ is the
corresponding gold set. The F1 score was calculated as before.</p>
        <p>For teams providing explanations, the evaluation followed the same criterion as the Entity Linking
task and was computed independently as an additional metric. If a team did not provide span-level
explanations, no valid span data was available, and their metrics were recorded as zero. The
document-level sub-task was designated as the primary basis for team ranking, while the optional span-level
sub-task provided an additional challenge for participants and was excluded from the final rankings.</p>
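        <p>These micro-averaged metrics can be computed with one short helper that covers all three settings, depending on what the per-document sets contain: (start, end) tuples for NER, (start, end, code) triples for EL, or bare ICD-10 codes for document-level MLC-X. The example data below are hypothetical:</p>

```python
def micro_prf(gold_sets, pred_sets):
    """Micro-averaged precision/recall/F1 over per-document sets.

    Set items may be (start, end) span tuples for NER, (start, end, code)
    triples for EL, or bare ICD-10 codes for the document-level MLC-X
    evaluation; exact match means exact set membership.
    """
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    fp = sum(len(p - g) for g, p in zip(gold_sets, pred_sets))
    fn = sum(len(g - p) for g, p in zip(gold_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted code sets for two documents.
gold = [{"I10", "I48.0"}, {"E11.9"}]
pred = [{"I10"}, {"E11.9", "N18.9"}]
p, r, f1 = micro_prf(gold, pred)  # p = 2/3, r = 2/3
```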
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Baseline Systems</title>
        <p>For each of the sub-tasks, we developed one or more baseline systems to provide a clear reference point
for evaluating the performance of current approaches on our benchmark dataset. These baselines serve
as essential comparisons, helping participants and researchers understand the strengths and limitations
of new methods relative to established techniques. By offering well-defined reference results, we aim to
support fair, transparent, and reproducible assessment within the challenge.</p>
        <p>
          For the NER sub-task, the baseline system [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is built on the cased multilingual BERT-base (mBERT)
architecture (https://huggingface.co/google-bert/bert-base-multilingual-cased) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The model was fine-tuned with a token-level classification head to perform NER
under the BIO2 tagging scheme.
        </p>
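        <p>A token-level BIO2 tagger must ultimately be converted into the character-level spans required by the exact-match evaluation. The sketch below shows one way to decode word-level BIO2 tags plus per-word character offsets into (start, end) spans; the tag names and offsets are illustrative assumptions, not the baseline's actual post-processing code:</p>

```python
def bio_to_spans(tags, offsets):
    """Decode BIO2 tags and per-token (start, end) character offsets
    into mention spans. One tag and one offset pair per word-level token."""
    spans, current = [], None
    for tag, (start, end) in zip(tags, offsets):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (start, end)          # open a new span
        elif tag.startswith("I-") and current:
            current = (current[0], end)     # extend the open span
        else:                               # "O" closes any open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

tags = ["O", "B-DIAG", "I-DIAG", "O"]
offsets = [(0, 5), (6, 13), (14, 22), (23, 27)]
# bio_to_spans(tags, offsets) → [(6, 22)]
```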
        <p>Documents were first segmented into sections using a keyword-driven method that exploits the
semi-structured format and consistent headers of the discharge letters. Within each section, sentence
segmentation was performed using Stanza, chosen for its support of Greek and its robustness in handling
clinical abbreviations and irregular formatting. Each sentence was tokenized using the model’s native
tokenizer, with sequences padded or truncated to a maximum length of 384 tokens. Only a few samples
exceeded this length, making it a reasonable cutoff to reduce computational cost.</p>
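        <p>The keyword-driven segmentation step can be sketched as a simple header-matching pass. The header names below are hypothetical English stand-ins for the Greek discharge-letter headers; the baseline's actual keyword list is not reproduced here:</p>

```python
import re

# Hypothetical section headers; stand-ins for the Greek originals.
HEADERS = ["HISTORY", "DIAGNOSIS", "MEDICATION"]
pattern = re.compile(r"^(%s):" % "|".join(HEADERS), re.MULTILINE)

def split_sections(text):
    """Split a discharge letter into {header: body} using header keywords."""
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections

letter = "HISTORY: known hypertension.\nDIAGNOSIS: atrial fibrillation.\n"
sections = split_sections(letter)
```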
        <p>Training was performed using the AdamW optimizer and a standard cross-entropy loss function
for token-level classification. The model was fine-tuned over five epochs with a batch size of 4 and a
learning rate of 8 × 10⁻⁶, selected for stable convergence in low-resource settings. A weight decay
of 0.01 was applied to mitigate overfitting, alongside gradient clipping with a maximum norm of 3.
Dropout with a probability of 0.1, as set by the default BERT configuration, was used. No learning rate
scheduler or advanced regularization techniques were employed, preserving the baseline's simplicity.</p>
        <p>Subword-level predictions produced by mBERT were aggregated to the word level by selecting the
highest-scoring tag across each word's constituent subword tokens, without any further post-processing.</p>
        <p>
          For the EL task, the baseline system [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is also built on the multilingual BERT-base architecture,
augmented to reflect the hierarchical structure of the ICD-10 taxonomy. It adopts a hierarchical
classification framework with two parallel classification heads: one for coarse-grained (block-level)
predictions and another for fine-grained (code-level) predictions.
        </p>
        <p>Mentions are tokenized using a custom scheme that includes five tokens on each side of the mention
span. To maintain input consistency, sequences are padded or truncated to a fixed length of 128 tokens.
The mention-level inputs are processed by mBERT, with contextualized representations fed into both
classification heads. Special mention markers and masks are employed to ensure the model attends
appropriately to the mention span during encoding.</p>
        <p>Training is guided by a hierarchical loss function that combines the cross-entropy losses from the
parent and child classifiers. A tunable weighting parameter balances the contribution of each
level to the overall loss. This dual-level supervision encourages the model to learn both broad and
specific label associations, improving performance on fine-grained classes while maintaining coherence
with the ICD-10 taxonomy.</p>
        <p>The model is optimized using the Adam optimizer, with learning rates tuned independently for each
classification head. In particular, a learning rate of 1 × 10⁻⁴ is used for the parent classifier and 1 × 10⁻⁵
for the child classifier, while the base model uses a lower learning rate of 3 × 10⁻⁶. A moderate weight
decay of 0.05 is applied for regularization, alongside dropout. Although only fine-grained (child-level)
predictions are used for evaluation, the model is designed to support hierarchical supervision. In
practice, training was conducted with a strong emphasis on the child-level loss, and the parent loss was
down-weighted in later stages to prioritize fine-grained classification.</p>
        <p>
          Finally, for the MLC-X task, the baseline approach [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] utilized the Greek-BERT transformer (https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ],
adapted and fine-tuned for multi-label classification. The model was trained with 40 output
heads, each representing one of the 40 most frequent ICD-10 codes in the training set, namely those
with at least 30 occurrences. Expanding the output layer to cover rarer codes led to similar or worse
performance. Under this architecture, the resulting system is limited to predicting only these common
ICD-10 codes within discharge letters, while rarer codes or codes absent from the training set cannot
be detected.
        </p>
        <p>The "MLCX1_baseline" system uses the transformer model to perform document-level prediction of
ICD-10 codes for each discharge letter in the test set. Due to the length of the discharge letters, each
document was first segmented into sections, with predictions then aggregated to the document level. Using
a Greek pre-trained language model capable of predicting the most frequent ICD-10 codes provides a
solid baseline for evaluation using micro-averaged metrics. However, more advanced systems capable
of understanding and predicting a broader range of codes are expected to outperform it.</p>
        <p>The "MLCX2_baseline" system extends the first system by adding a rule-based component after
prediction, in order to transform document-level predictions into span-level mentions. For each
predicted code within a letter, the system searches the document for common predefined terms and
abbreviations associated with that ICD-10 code. If relevant mentions are found, they are included as
explanations for the prediction. If no relevant mention is detected, the corresponding code is removed
from the output. As is typical of rule-based methods in text processing, the "MLCX2_baseline" system is
expected to achieve high precision but lower recall, making it a strong baseline for evaluating explainable
AI techniques.</p>
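        <p>The rule-based component described for "MLCX2_baseline" can be illustrated with a small dictionary lookup; the term lexicon below is a hypothetical example, not the resource actually used by the organizers:</p>

```python
# Hypothetical code-to-term lexicon for span-level explanations.
LEXICON = {
    "I10": ["hypertension", "arterial hypertension"],
    "I48.0": ["paroxysmal atrial fibrillation", "paroxysmal AF"],
}

def explain(text, predicted_codes, lexicon=LEXICON):
    """For each predicted code, search the letter for known terms and
    return span-level explanations; codes with no match are dropped."""
    lowered = text.lower()
    explanations = {}
    for code in predicted_codes:
        spans = []
        for term in lexicon.get(code, []):
            start = lowered.find(term.lower())
            if start != -1:
                spans.append((start, start + len(term)))
        if spans:  # codes without textual evidence are removed
            explanations[code] = spans
    return explanations

text = "Patient with arterial hypertension, admitted for chest pain."
out = explain(text, ["I10", "I48.0"])
```

As in the baseline, precision is favored: a code survives only when supporting text is found, so unsupported predictions are filtered out at the cost of recall.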
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>We introduce a novel dataset derived from Greek EHRs, specifically discharge letters from the cardiology
department of a Greek hospital. The dataset addresses the limited availability of Greek-language clinical
corpora, providing a valuable resource for advancing NLP tasks in biomedical applications. It is designed
to support NER, EL, and clinical coding, with a focus on mapping medical entities to ICD-10 codes.</p>
      <p>The dataset was curated from discharge letters, which capture critical aspects of patient care, including
diagnoses, symptoms, medical procedures, and treatment plans. These documents were chosen for
their detailed narrative structure, which represents a rich source of unstructured medical data. The
dataset reflects the complexities of Greek clinical texts, featuring diverse sentence structures, specialized
terminology, and abbreviations typical of healthcare documentation. It is designed to support the
development and evaluation of multilingual and monolingual transformer-based models for tasks
such as weakly supervised clinical entity recognition and automated clinical coding. The inclusion of
commonly observed ICD-10 codes ensures relevance to practical healthcare applications, while the use
of Greek-language texts fills a critical gap in existing resources.</p>
      <p>
        A systematic and ethical approach was taken in constructing the dataset. First, all documents
underwent de-identification to remove personally identifiable information, ensuring compliance with
data protection regulations. Subsequently, a team of four medical professionals, divided into two groups,
annotated the texts with ICD-10 codes, with each group annotating the same set of documents, linking
clinical entities to standardized classifications. An annotation tool, Doccano [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], was employed
to streamline the process, enabling precise identification of mentions and their relationships. The
annotation process included all cardiology-related information from the discharge letter, encompassing
details from both the current hospitalization and the patient’s past medical history. In particular, every
disease and symptom, either current or past, was codified using a precise ICD-10 code. In addition,
all medical procedures, such as cardiac catheterization or pacemaker insertion, that occurred during
hospitalization were linked to their specific ICD-10 code. Medical recommendations and scheduled
procedures that did not take place during the hospitalization were excluded from annotation. Moreover,
the findings of diagnostic tests, such as echocardiographic findings, ECGs and x-rays, were annotated in
detail whenever applicable. ECG findings were coded with the greatest possible accuracy at admission
and discharge, as well as during hospitalization. X-ray reports were precisely annotated despite the fact
that they were often missing or not relevant to cardiology in general. Of particular interest was the
echocardiographic findings annotation process due to linguistic difficulties. Specifically, annotation
of valvular heart disease was difficult due to the distance between the name of the valve and the
pathological terms, such as stenosis or regurgitation, within the text. To address this issue, it was
decided to annotate the full length of the phrase containing the valve name (first word to be annotated)
and the pathological term (last word to be annotated). Furthermore, the annotation process was unable
to characterize the severity of valvular heart disease, even though such data were available, due to the
lack of specific ICD-10 codes.
      </p>
      <p>The gold standard annotations for each discharge letter were created by merging the annotations
provided by the two medical professionals, through the following conflict-resolution process:
• Spans annotated by only one annotator were included in the gold set, by following the assumption
that the span simply being missed by the second annotator is much more likely than an irrelevant
span being included. Excluding these annotations would thus lead to additional False Positive
predictions by the participants.
• Spans annotated identically by both annotators were directly included. In cases of overlapping
spans, the annotations were merged into a single span, by selecting the largest span. This
approach ensures that additional context is preserved, which can be critical for accurate ICD-10
code assignment. Examples where the larger span provides a more accurate description of the
diagnosis include: "coronary heart disease" versus "positive family history of coronary heart
disease", "aortic valve stenosis" versus "severe aortic valve stenosis", "stroke” versus "ischemic
stroke", and "STEMI" versus "STEMI of inferior wall".
• In instances where the same span was annotated using two diferent ICD codes, a third medical
professional was consulted to determine the most appropriate ICD-10 label, after inspecting the
relevant span and context.</p>
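        <p>The conflict-resolution policy above can be sketched programmatically. In this sketch, overlapping spans are collapsed into their envelope, which for nested spans such as "aortic valve stenosis" inside "severe aortic valve stenosis" coincides with the larger span; the offsets are hypothetical:</p>

```python
def merge_annotations(spans_a, spans_b):
    """Merge two annotators' (start, end) span lists: singly-annotated
    spans are kept, and overlapping spans are collapsed into their
    envelope (for nested spans, the larger span)."""
    merged = []
    used_b = set()
    for a in spans_a:
        overlap = [b for b in spans_b if b[0] < a[1] and a[0] < b[1]]
        if overlap:
            used_b.update(overlap)
            starts = [a[0]] + [b[0] for b in overlap]
            ends = [a[1]] + [b[1] for b in overlap]
            merged.append((min(starts), max(ends)))
        else:
            merged.append(a)                  # missed by annotator B: keep
    merged.extend(b for b in spans_b if b not in used_b)
    return sorted(set(merged))

# Annotator A marked "aortic valve stenosis"; annotator B marked the
# wider "severe aortic valve stenosis" plus one extra span.
a = [(10, 31)]
b = [(3, 31), (50, 56)]
# merge_annotations(a, b) → [(3, 31), (50, 56)]
```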
      <p>Additionally, to further improve the quality of the gold annotations, the test set discharge letters
underwent a second annotation round by the medical professionals.</p>
      <p>The final dataset that resulted from the above annotation process was divided into two sets: a training
set containing 1,000 annotated discharge letters made available to participants to develop their systems,
and a test set containing 500 annotated discharge letters, with the annotations kept hidden from the
participants.</p>
      <sec id="sec-3-1">
        <title>3.1. Key Statistics</title>
        <p>The dataset was constructed with 1,000 documents for training and 500 for testing, creating a balanced
split suitable for model development and evaluation. The training set contains 10,168 annotations,
averaging 10.17 per document (ranging from 1 to 33), and the test set contains 5,696 annotations with a
slightly higher average of 11.39 per document (ranging from 2 to 44). Despite these totals, the number of
unique mentions is considerably lower, with 2,418 in the training set and 1,320 in the test set, indicating
frequent repetition of the same mentions both within and across documents.</p>
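        <p>The per-set figures above (totals, per-document averages and ranges, unique mentions) reduce to simple aggregations. A minimal sketch, assuming a dataset represented as a list of per-document mention lists (an illustrative format, not the released file schema):</p>

```python
from statistics import mean

def annotation_stats(documents):
    """Compute the dataset statistics reported above: total annotations,
    average and min/max per document, and the number of unique surface
    mentions. `documents` is a list of lists of mention strings."""
    counts = [len(doc) for doc in documents]
    return {
        "total": sum(counts),
        "avg_per_doc": round(mean(counts), 2),
        "range": (min(counts), max(counts)),
        "unique_mentions": len({m for doc in documents for m in doc}),
    }
```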
        <p>This pattern is more clearly seen in the code distribution. The top 10 most frequent codes account
for a substantial portion of annotations in both sets (Figure 2(a)). Table 1 presents a detailed breakdown
of these codes and their exact counts. In the training set, they represent 5,885 annotations, which is
over half of the total. In the test set, the top 10 codes account for 3,255 out of 5,696. Such concentration
reflects the dominance of a small subset of clinical concepts throughout the dataset. Additionally, the
normalized start positions of annotations reveal clear clustering, with annotations in the first 30% of the
documents occurring at roughly twice the frequency compared to the rest (Figure 2(b)). Such clustering
shows clinical mentions mainly appear in the early sections, while later parts mostly contain discharge
instructions, scheduling, and raw test results.
(Figure 2: (a) cumulative distribution of ICD-10 codes in the train and test sets; (b) distribution of normalized annotation start positions across the training and test sets.)</p>
        <p>Beyond their location in the text, the annotations are also generally short. The average length is 14.31
characters in the training set and 13.48 characters in the test set. These lengths suggest that annotations
typically consist of brief clinical mentions of about two to three words. Overall, about 7% of the text in
both training and test documents consists of annotated clinical mentions, which is expected given the
nature of clinical data and the typical annotation practices in this field. The annotations are based on a
predefined label set of 324 unique ICD-10 codes, deemed sufficient to describe the clinical concepts in
the dataset.</p>
        <p>From this set, 144 codes appear in the training data and 106 in the test data. Among these, 95 codes are
shared between both sets, while 49 are exclusive to training and 11 unique to testing. This distribution
indicates a substantial overlap, ensuring that the test set is representative, while also containing some
unique codes to assess model generalization.</p>
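        <p>The overlap figures above follow from plain set arithmetic over the two code inventories, sketched below (code values are illustrative):</p>

```python
def code_overlap(train_codes, test_codes):
    """Break a train/test code inventory into shared and exclusive subsets,
    mirroring the 144 / 106 / 95 / 49 / 11 breakdown reported above."""
    train, test = set(train_codes), set(test_codes)
    shared = train & test
    return {
        "train": len(train),
        "test": len(test),
        "shared": len(shared),
        "train_only": len(train - test),  # e.g. 144 - 95 = 49
        "test_only": len(test - train),   # e.g. 106 - 95 = 11
    }
```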
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodologies</title>
      <p>This section outlines the methods and models used by participating systems in the ELCardioCC task.
For each team, we summarize their approaches to the sub-tasks they participated in. Further details can
be found in their respective papers.</p>
      <p>The droidlyx team from Fudan University [18] used a BERT-based sequence labeling model
(bertbase-greek-uncased-v1) for NER, feeding token embeddings into a two-layer MLP classifier for BIO
tagging. They fine-tuned the model and used a sliding window for inference. For EL, they translated
text to English with LibreTranslate, applied SapBERT using the [CLS] token as the entity representation,
and classified entities via a two-layer MLP into ICD-10 codes, with additional data enrichment. In
MLC-X Phase A, they reused EL predictions for each letter.</p>
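      <p>Sliding-window inference of the kind used here splits a long letter into overlapping fixed-length chunks so that an encoder with a bounded input size can cover the whole document. A minimal sketch; the window and stride sizes are illustrative assumptions, not the team's reported settings:</p>

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a long token sequence into overlapping windows so a
    fixed-length encoder (e.g. a 512-token BERT) can cover the whole
    discharge letter; the overlap lets window-level BIO predictions
    be reconciled afterwards. Returns (start_offset, chunk) pairs."""
    windows = []
    start = 0
    while start < len(tokens):
        windows.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
        start += stride
    return windows
```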
      <p>The bhuang team from the University of Padova [19] used multilingual LLMs (Gemma-3, Phi-4,
Gemini) with zero-shot prompting to extract clinical mentions from Greek discharge letters, translating
them into English with descriptions. A BERT bi-encoder filtered irrelevant mentions. For EL, a two-stage
retrieval approach was used: BM25 to narrow candidates, followed by a MedCPT cross-encoder to select
the best ICD-10 code. The MLC-X task was addressed by aggregating codes from EL. Various ensemble
strategies combined outputs from different prompts and processing methods to improve recall and
capture long-tail/nested entities.</p>
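      <p>The retrieve-then-rerank pattern used here can be sketched as follows. This is a toy illustration: a token-overlap retriever stands in for BM25, and the `rerank_score` callable stands in for the MedCPT cross-encoder; none of these names come from the team's code.</p>

```python
def two_stage_link(mention, code_descriptions, rerank_score, k=5):
    """Two-stage entity linking: a cheap lexical retriever narrows the
    ICD-10 candidates, then a (stand-in) cross-encoder reranks the top-k.
    `code_descriptions` maps code -> description; `rerank_score(mention,
    description)` is assumed to return a relevance score."""
    query_tokens = set(mention.lower().split())

    def lexical(desc):  # token-overlap stand-in for BM25 candidate scoring
        return len(query_tokens & set(desc.lower().split()))

    candidates = sorted(code_descriptions,
                        key=lambda c: lexical(code_descriptions[c]),
                        reverse=True)[:k]
    # stage 2: pick the candidate the (more expensive) reranker prefers
    return max(candidates, key=lambda c: rerank_score(mention, code_descriptions[c]))
```

      <p>The design rationale is the usual one: the first stage keeps the expensive cross-encoder calls bounded to k candidates per mention.</p>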
      <p>The enigma team from Sofia University and Graphwise [ 20] approached NER using fine-tuned
BERT-based models (Greek BERT and XLM-RoBERTa) with BIO tagging. For EL, they first used a
dictionary-based method, then a bi-encoder model (BGE-M3 variants) for semantic matching between
mentions and ICD-10 codes, fine-tuned with ranking loss. A cross-encoder reranker was tested but
not used. In MLC-X, they applied a simple multi-label classification using BGE-M3 to process full
documents.</p>
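      <p>A dictionary-first linker with a semantic fallback, as in this hybrid approach, can be sketched as below. The `embed` function and precomputed `code_vectors` are assumed stand-ins for a bi-encoder such as BGE-M3; the exact-match dictionary and all names are illustrative.</p>

```python
def link_mention(mention, exact_dict, embed, code_vectors):
    """Dictionary-first entity linking with a semantic fallback:
    exact lookup first, then nearest neighbour in embedding space."""
    key = mention.lower().strip()
    if key in exact_dict:            # stage 1: dictionary match
        return exact_dict[key]

    def cosine(a, b):                # plain cosine similarity
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    vec = embed(key)                 # stage 2: semantic match
    return max(code_vectors, key=lambda c: cosine(vec, code_vectors[c]))
```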
      <p>Finally, the pjmathematician team from Netaji Subhas University of Technology [21] used
Qwenbased LLMs for all subtasks. For NER, they applied both base and LoRA fine-tuned Qwen models with
prompts to translate Greek text and extract entities. For EL, they linked entities to ICD-10 codes using
semantic similarity via a multilingual sentence transformer. In MLC-X, they used Qwen-72B to predict
ICD-10 codes and, in one configuration, also extracted supporting Greek mentions. Inference was
performed using LMDeploy across all tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We present the results of the participating systems alongside our baseline models across for NER (Table
2), EL (Table 3) and MLC-X (Table 4) sub-tasks. Notably, the droidlyx team achieved the highest F1
scores in each sub-task, indicating strong overall performance. Interestingly, although the enigma
team also leveraged Greek variants of BERT, the differences in model, pre-processing, and sub-task
implementation choices seem to have led to the superior performance by the droidlyx team. Another
factor likely contributing to the strong performance of the droidlyx team is their fine-tuning strategy. By
carefully fine-tuning the BERT model for the specific sub-tasks, they likely enhanced both adaptability
and task-specific understanding, leading to their top-tier results.</p>
      <p>A particularly surprising observation is the performance of the bhuang team, which utilized LLMs.
Despite their potential, these models underperformed in the NER sub-task. However, in the MLC-X
sub-task (Phase A), bhuang achieved the highest recall score (0.8576), suggesting that their system was
effective at capturing relevant labels, albeit at the cost of precision.</p>
      <p>The pjmathematician team consistently showed the lowest performance across all tasks, indicating
potential issues in either their modeling approach or implementation. In contrast, our baseline models
demonstrated competitive results, particularly in the MLC-X sub-task, where our system achieved
the highest precision score (0.9531). This underscores the robustness of our baseline configuration,
providing a solid benchmark for participants to attempt to surpass.</p>
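      <p>The precision, recall, and F1 figures discussed above are conventionally computed over exactly matched spans. A sketch of strict span-level scoring, under the assumption that a prediction counts only when both boundaries and the assigned code match a gold annotation exactly (the exact matching convention used in the task is not restated here):</p>

```python
def span_prf(gold, pred):
    """Strict span-level precision/recall/F1 over (start, end, code)
    annotations: a true positive requires an exact match of both the
    span boundaries and the ICD-10 code."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```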
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <p>The Conference and Labs of the Evaluation Forum (CLEF) eHealth Lab in 2020 [22] introduced an
Information Extraction (IE) task focused on automatic clinical coding. This task aimed to assign ICD-10
diagnosis and procedure codes to Spanish clinical case documents, along with identifying relevant
evidence text snippets supporting the coded information.</p>
      <p>In a related effort, the 2023 MedProcNER Task [23] introduced three sub-tasks centered on clinical
procedures in Spanish texts. The first was Clinical Procedure Recognition, a named entity recognition
(NER) task for identifying mentions of clinical procedures. The second, Clinical Procedure Normalization,
required mapping these mentions to SNOMED CT codes through entity linking (EL). The third sub-task,
Clinical Procedure-based Document Indexing, involved assigning SNOMED CT codes directly to full
clinical reports for semantic indexing, independently of the other sub-tasks.</p>
      <p>Similarly, the 2024 MultiCardioNER challenge [24] focused on domain-specific clinical NER and
coding, particularly in cardiology. It featured two sub-tracks: one for disease recognition in Spanish,
and another for multilingual medication extraction across Spanish, English, and Italian. Participants
utilized resources such as DisTEMIST, DrugTEMIST, and the CardioCCC corpus, which contains
cardiology-specific annotations.</p>
      <p>In addition to these shared tasks, earlier works have also emphasized multilingual ICD-10 coding
challenges. The 2018 CLEF eHealth Multilingual Information Extraction Task [25] addressed ICD-10
coding of death certificates in French (11,932 records), Hungarian (21,176 records) and Italian (3,618
records). This task focused on mapping causes of death from medical narratives to ICD-10 codes, using
datasets provided by French CépiDc, Hungarian KSH and Italian ISTAT. The CLEF eHealth Lab in
2019 [26] introduced another multilingual coding task, which targeted the multi-label classification
of German non-technical summaries of animal experiments. Participants predicted ICD-10 codes for
descriptions of benefits, harms, and pressures affecting animals in biomedical research projects, with
data annotated using the German ICD-10 ontology.</p>
      <p>Further independent studies have also expanded ICD-10 coding to non-English datasets. Reys et
al. [27] focused on Brazilian-Portuguese clinical notes, assigning diagnostic ICD-10 codes to 77,005
free-text hospital discharge summaries sourced from a Brazilian hospital. Sammani et al. [28] tackled
multilabel ICD-10 coding for Dutch cardiology discharge letters, using a dataset of 10,637 records with
domain-specific cardiology diagnoses. Another notable contribution is the MKE-Coder study [29],
which addresses automatic ICD coding for Chinese electronic medical records. This study leverages
a large-scale dataset of 87,797 records from multiple hospitals and emphasizes the tailored nature
of Chinese clinical texts with shorter diagnostic descriptions and discharge summaries compared to
English counterparts like MIMIC-III. Similarly, Tchouka et al. [30] investigated ICD-10 coding for 56,014
unstructured French clinical texts from the Nord Franche-Comté Hospital, comprising documents
such as discharge letters, operating reports, and clinical notes, with each record associated with
multiple ICD-10 codes.</p>
      <p>In this context, the ELCardioCC task aligns with the general objectives of the above efforts but
introduces two novel aspects: it is conducted in the Greek language, and incorporates both multi-label
learning and explainable AI for automatic clinical coding.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>This study presented ELCardioCC, a novel shared task developed under the BioASQ 2025 framework,
targeting the automation of clinical coding in a low-resource language setting. The task focused on two
primary objectives: (i) the assignment of ICD-10 codes to Greek cardiology discharge letters, and (ii)
the extraction of specific mentions of ICD-10 codes from the discharge letters. With participation from
five research teams, the challenge highlighted the effectiveness of transformer-based architectures and
language-specific pretraining for both Named Entity Recognition (NER) and Entity Linking (EL).</p>
      <p>Despite promising results, considerable performance variation across sub-tasks suggests that
automated clinical coding remains far from a solved problem—particularly in linguistically constrained
environments. The underrepresentation of submissions in the Multi-label Learning and Explainable AI
sub-task further underscores the need to prioritize model interpretability alongside performance. This
is especially crucial in clinical settings, where explainable decision pathways are essential for ensuring
safety, trust, and regulatory compliance.</p>
      <p>To advance the state of the field, future iterations of ELCardioCC should pursue several directions.
First, expanding the dataset to include more diverse clinical specialties and larger volumes of annotated
text will support deeper model generalization. Second, incorporating multilingual corpora and
cross-lingual evaluation protocols may enable more robust transfer learning and better scalability to other
under-resourced health systems. Third, the evaluation framework should evolve to include clinically
meaningful error metrics, such as critical code omission and false attribution, which directly affect
downstream decision-making. Finally, stronger emphasis should be placed on the development and
benchmarking of explainable coding systems, including rationale extraction, clinician-facing
justifications, and human-in-the-loop workflows.</p>
      <p>ELCardioCC represents an initial yet important step toward equitable, interpretable, and
language-inclusive clinical NLP. By enabling systematic evaluation and fostering open collaboration, it contributes
a foundational initiative for advancing automated medical coding in real-world health information
systems. We strongly encourage further research in this direction to advance transparency and
interpretability in automatic clinical coding systems.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: (1) check grammar and
spelling, (2) paraphrase and reword, and (3) improve writing style.</p>
      <p>https://github.com/doccano/doccano.</p>
      <p>[18] Y. Liu, LYX_DMIIP_FDU at BioASQ 2025: Utilizing BERT embeddings for biomedical text mining,
in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), CLEF 2025 Working Notes, 2025.
[19] B. Huang, Clinical entity recognition and linking in Greek discharge letters using a
multilingual-LLM-based multi-stage system, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), CLEF 2025 Working
Notes, 2025.
[20] B. Velichkov, A. Datseris, S. Vassileva, S. Boytcheva, Enigma @ ElCardioCC: Bridging NER and
ICD-10 Entity Linking - A Hybrid Method for Greek Clinical Narratives, in: G. Faggioli, N. Ferro,
P. Rosso, D. Spina (Eds.), CLEF 2025 Working Notes, 2025.
[21] P. Vachharajani, Multilingual embedding and prompt-driven approaches for named entity
recognition, entity linking, and clinical code prediction in Greek discharge summaries, in: G. Faggioli,
N. Ferro, P. Rosso, D. Spina (Eds.), CLEF 2025 Working Notes, 2025.
[22] L. Goeuriot, H. Suominen, L. Kelly, A. Miranda-Escalada, M. Krallinger, Z. Liu, G. Pasi,
G. Gonzalez Saez, M. Viviani, C. Xu, Overview of the CLEF eHealth evaluation lab 2020, in: International
Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020, pp.
255–271.
[23] S. Lima-López, E. Farré-Maduell, L. Gascó, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras,
M. Krallinger, Overview of MedProcNER task on medical procedure detection and entity linking at
BioASQ 2023, in: CLEF (Working Notes), 2023, pp. 1–18.
[24] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz,
G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, et al., Overview of MultiCardioNER task at BioASQ 2024
on medical speciality and language adaptation of clinical NER systems for Spanish, English and
Italian, CLEF Working Notes (2024).
[25] A. Névéol, A. Robert, F. Grippo, C. Morgand, C. Orsi, L. Pelikan, L. Ramadier, G. Rey,
P. Zweigenbaum, CLEF eHealth 2018 multilingual information extraction task overview: ICD-10 coding of death
certificates in French, Hungarian and Italian, in: CLEF (Working Notes), CEUR-WS, 2018, pp. 1–18.
[26] M. Sänger, L. Weber, M. Kittner, U. Leser, Classifying German animal experiment summaries with
multi-lingual BERT at CLEF eHealth 2019 task 1, in: CLEF (Working Notes), 2019.
[27] A. D. Reys, D. Silva, D. Severo, S. Pedro, M. M. de Sousa e Sá, G. A. Salgado, Predicting multiple
ICD-10 codes from Brazilian-Portuguese clinical notes, in: Brazilian Conference on Intelligent
Systems, Springer, 2020, pp. 566–580.
[28] A. Sammani, A. Bagheri, P. G. van der Heijden, A. S. Te Riele, A. F. Baas, C. Oosters, D. Oberski,
F. W. Asselbergs, Automatic multilabel detection of ICD10 codes in Dutch cardiology discharge
letters using neural networks, NPJ Digital Medicine 4 (2021) 37.
[29] X. You, X. Liu, X. Yang, Z. Wang, J. Wu, MKE-Coder: Multi-axial knowledge with evidence
verification in ICD coding for Chinese EMRs, arXiv preprint arXiv:2502.14916 (2025).
[30] Y. Tchouka, J.-F. Couchot, D. Laiymani, P. Selles, A. Rahmani, Automatic ICD-10 code association:
A challenging task on French clinical texts, in: 2023 IEEE 36th International Symposium on
Computer-Based Medical Systems (CBMS), IEEE, 2023, pp. 91–96.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sedlakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Daniore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Horn</given-names>
            <surname>Wintsch</surname>
          </string-name>
          , et al.,
          <article-title>Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review</article-title>
          ,
          <source>PLOS Digital Health</source>
          <volume>2</volume>
          (
          <year>2023</year>
          )
          <article-title>e0000347</article-title>
          . URL: https://doi.org/10.1371/journal.pdig.0000347. doi:10.1371/journal.pdig.0000347.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tonkin</surname>
          </string-name>
          , et al.,
          <article-title>A scoping review of automatic coding of clinical narratives with ICD-10</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>28</volume>
          (
          <year>2021</year>
          )
          <fpage>1000</fpage>
          -
          <lpage>1010</lpage>
          . doi:10.1093/jamia/ocab028.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Falis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Whiteley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Alex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          , H. Wu,
          <article-title>Automated clinical coding: what, why, and where we are?</article-title>
          ,
          <source>NPJ digital medicine 5</source>
          (
          <year>2022</year>
          )
          <fpage>159</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchioe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Volodarskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Wright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Tcheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Slotwiner</surname>
          </string-name>
          ,
          <article-title>Systematic review of current natural language processing methods and applications in cardiology</article-title>
          ,
          <source>Heart</source>
          <volume>108</volume>
          (
          <year>2022</year>
          )
          <fpage>909</fpage>
          -
          <lpage>916</lpage>
          . URL: https://doi.org/10.1136/heartjnl-2021-319769. doi:10.1136/heartjnl-2021-319769.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          , et al.,
          <article-title>Autonomous international classification of diseases coding using pretrained language models and advanced prompt learning techniques</article-title>
          ,
          <source>JMIR Medical Informatics</source>
          <volume>13</volume>
          (
          <year>2025</year>
          )
          <article-title>e63020</article-title>
          . doi:10.2196/63020.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kang</surname>
          </string-name>
          , et al.,
          <article-title>Medcoder: A generative ai assistant for medical coding</article-title>
          ,
          <source>in: NAACL Industry Track</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dalianis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Velupillai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Savova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          ,
          <article-title>Clinical natural language processing in languages other than english: opportunities and challenges</article-title>
          ,
          <source>Journal of biomedical semantics 9</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sheikhalishahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Dudley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Osmani</surname>
          </string-name>
          ,
          <article-title>Natural language processing of clinical notes on chronic diseases: Systematic review</article-title>
          ,
          <source>JMIR Medical Informatics</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <article-title>e12239</article-title>
          . URL: https://medinform.jmir.org/2019/2/e12239. doi:10.2196/12239.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bracken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feeley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sheehan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Merghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Feeley</surname>
          </string-name>
          ,
          <article-title>Artificial Intelligence (AI)-Powered Documentation Systems in Healthcare: A Systematic Review</article-title>
          ,
          <source>Journal of Medical Systems</source>
          <volume>49</volume>
          (
          <year>2025</year>
          )
          <article-title>28</article-title>
          . doi:10.1007/s10916-025-02157-4.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Boyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kascenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>O'Neil</surname>
          </string-name>
          ,
          <article-title>Automated clinical coding using off-the-shelf large language models</article-title>
          , arXiv preprint (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025)</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stoikopoulou</surname>
          </string-name>
          ,
          <article-title>Weakly Supervised NER for Cardiology Using Multilingual Transformers</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schlinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          ,
          <article-title>How multilingual is multilingual BERT?</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Traum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>4996</fpage>
          -
          <lpage>5001</lpage>
          . URL: https://aclanthology.org/P19-1493. doi:10.18653/v1/P19-1493.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kipouros</surname>
          </string-name>
          ,
          <article-title>Investigating Entity Linking in Greek Electronic Health Records: Leveraging Hierarchical Structures and Bi-Encoder Architectures</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Toumpas</surname>
          </string-name>
          ,
          <article-title>Transferring Labels from the Document Level to the Mention Level in Clinical Coding by Interpreting Transformer-based Classifiers</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Koutsikakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>GREEK-BERT: The Greeks visiting Sesame Street</article-title>
          , in:
          <source>11th Hellenic Conference on Artificial Intelligence (SETN 2020)</source>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>110</fpage>
          -
          <lpage>117</lpage>
          . URL: http://dx.doi.org/10.1145/3411408.3411440. doi:10.1145/3411408.3411440.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nakayama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kubo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Taniguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>doccano: Text annotation tool for human</article-title>
          ,
          <year>2018</year>
          . Software available from https://github.com/doccano/doccano.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>