=Paper=
{{Paper
|id=Vol-3740/paper-02
|storemode=property
|title=Overview of MultiCardioNER task at BioASQ 2024 on Medical Specialty and Language Adaptation
of Clinical NER Systems for Spanish, English and Italian
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-02.pdf
|volume=Vol-3740
|authors=Salvador Lima-López,Eulalia Farre-Maduell,Jan Rodríguez-Miret,Miguel Rodríguez-Ortega,Livia Lilli,Jacopo Lenkowicz,Giovanna Ceroni,Jonathan Kossof,Anoop Shah,Anastasios Nentidis,Anastasia Krithara,Georgios Katsimpras,Georgios Paliouras,Martin Krallinger
|dblpUrl=https://dblp.org/rec/conf/clef/Lima-LopezFRRLL24
}}
==Overview of MultiCardioNER task at BioASQ 2024 on Medical Specialty and Language Adaptation of Clinical NER Systems for Spanish, English and Italian==
Salvador Lima-López1,*, Eulàlia Farré-Maduell1, Jan Rodríguez-Miret1,
Miguel Rodríguez-Ortega1, Livia Lilli2,3, Jacopo Lenkowicz2, Giovanna Ceroni4,5,
Jonathan Kossoff5, Anoop Shah4,5, Anastasios Nentidis6,7, Anastasia Krithara5,
Georgios Katsimpras5, Georgios Paliouras5 and Martin Krallinger1
1 Barcelona Supercomputing Center, Plaça Eusebi Güell, 1-3, 08034 Barcelona, Spain
2 Real World Data Facility, Gemelli Generator, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, 00168, Italy
3 Catholic University of the Sacred Heart, Rome, 00168, Italy
4 University College London, UK
5 University College London Hospitals NHS Foundation Trust, UK
6 National Center for Scientific Research “Demokritos”, Athens, Greece
7 Aristotle University of Thessaloniki, Thessaloniki, Greece
Abstract
Transformers and large language models (LLMs) are increasingly used for clinical data analysis, mostly in English,
but also in many other languages used within medical care systems. To comply with clinical standards, it is
critical to evaluate the generated results by means of benchmarking efforts based on high-quality manually
annotated corpora. To foster the adaptation of general clinical natural language processing (NLP) components
to the characteristics of medical specialties, as well as to explore cross-language adaptation techniques, we
propose the MultiCardioNER task at BioASQ 2024. MultiCardioNER focuses on the adaptation of named entity
recognition (NER) systems trained on multispecialty clinical case reports to cardiology, since cardiovascular
diseases are the leading cause of death globally. The MultiCardioNER task covered two entity types (diseases and
medications) in case reports written in three languages (Spanish, English and Italian). To generate a comparable
Gold Standard clinical NER corpus, we used neural translation, annotation projection and manual annotation
correction by domain experts. Top scoring teams reached very competitive results for disease (F1-score 0.8199)
and medication mentions (0.9277) in Spanish and also obtained very competitive scores for English (F1-score
0.9223) and Italian (F1-score 0.8842). These results suggest that adaptation of general clinical NLP components to
a specific clinical specialty can improve the overall results and that cross-language adaptation of clinical NLP
components using neural translation and expert-in-the-loop annotation might speed up the implementation of
clinical entity extraction systems. The MultiCardioNER corpora, as well as a silver standard made up of predictions
of participating systems over the background set, are available at: https://zenodo.org/records/11368861.
Keywords
named entity recognition, cardiology, subdomain adaptation, multilingual, clinical NLP
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: salvador.limalopez@bsc.es (S. Lima-López); eulalia.farre@bsc.es (E. Farré-Maduell); jan.rodriguez@bsc.es
(J. Rodríguez-Miret); livia.lilli@policlinicogemelli.it (L. Lilli); jacopo.lenkowicz@policlinicogemelli.it (J. Lenkowicz);
g.ceroni@ucl.ac.uk (G. Ceroni); j.kossoff@nhs.net (J. Kossoff); a.shah@ucl.ac.uk (A. Shah); tasosnent@iit.demokritos.gr
(A. Nentidis); akrithara@iit.demokritos.gr (A. Krithara); gkatsibras@iit.demokritos.gr (G. Katsimpras);
paliourg@iit.demokritos.gr (G. Paliouras); mkrallin@bsc.es (M. Krallinger)
ORCID: 0000-0002-7384-1877 (S. Lima-López); 0009-0000-0793-981X (J. Rodríguez-Miret); 0009-0000-0188-079X
(M. Rodríguez-Ortega); 0009-0005-3319-7211 (L. Lilli); 0000-0002-8366-1474 (J. Lenkowicz); 0000-0002-8907-5724 (A. Shah);
0000-0002-3782-4412 (A. Nentidis); 0000-0003-0491-4507 (A. Krithara); 0000-0003-3697-941X (G. Katsimpras);
0000-0001-9629-2367 (G. Paliouras); 0000-0002-2646-8782 (M. Krallinger)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction
Cardiovascular diseases (CVDs) represent a leading cause of death and morbidity worldwide and,
therefore, are responsible for considerable disability costs every year. Cardiology is a medical field with
its own concepts and expressions of great relevance, as cardiovascular diseases are highly prevalent
worldwide. Analysis of unstructured medical data, such as clinical notes or medical publications, may
provide an opportunity to improve the characterization of cardiac pathologies. The extraction of clinical
variables from medical content is key to enabling healthcare data analytics, improving patient care
and advancing precision medicine. Information contained only as free text within Electronic Medical
Records is currently mostly unused due to the difficulty of extracting relevant data from very diversely
written data sources. Additionally, the distinctive language of each medical specialty calls for more
specialized automatic semantic annotation resources in English and any language used in clinical
services.
Due to the importance of cardiology and the need to improve the extraction, use, and ultimately exploitation of
data from patients suffering from cardiovascular conditions, efforts have been made to implement natural
language processing (NLP) solutions to classify or extract key variables from cardiology clinical content.
Using results generated by NLP technologies might contribute to improving outcomes and understanding
disease in cardiology. In order to account for the diversity and heterogeneity of NLP research applied to
cardiology, several review articles have tried to systematically characterize the various NLP application
scenarios adapted to handle cardiovascular disease medical documents [1]. These included applications
related to heart failure [2], coronary artery disease, general cardiology or valvular heart disease [3, 4].
Efforts were also made to extract, by means of NLP tools, symptoms [5, 6], vital signs [7], heart function
measurements [8], risk factors [4], or cardiovascular comorbidities [9] as well as social risk factors [10]
or diagnostic codes of common cardiovascular diseases [11]. Some other attempts were also made to
explore the use of NLP approaches to extract Framingham criteria [12] or New York Heart Association
classifications from unstructured clinical notes [13, 14, 15].
General clinical domain pre-training does not necessarily transfer well to all medical sub-specialties
or disciplines because of the use of highly specialized medical language, as encountered in cardiology
clinical case reports or cardiology clinical notes. Domain adaptation strategies may have a great potential
to improve NLP solutions for practical settings, real-world scenarios and industrial applications [16].
Also, adaptation of clinical NLP solutions across languages other than English is necessary and requires
collaboration between researchers to accelerate progress in non-English clinical NLP [17].
In the case of Spanish, general clinical NLP datasets and resources, such as the DisTEMIST,
SympTEMIST, PharmaCoNER, and MedProcNER corpora and systems, have been released. However,
(a) the interplay and complementarity of multi-label entity extraction approaches were neither
targeted nor evaluated, and (b) how such approaches could be adapted to handle multiple languages
was not tested.
To address these issues and promote the development of comparable clinical NLP components adapted
to a specific clinical domain across several languages, we have organized the MultiCardioNER shared
task. This paper presents an overview of the data, methodologies and results of MultiCardioNER. It
is structured as follows: Section 2 introduces the shared task, including its sub-tasks and evaluation
methods. Next, Section 3 describes the different corpora used as part of MultiCardioNER, namely
DisTEMIST, DrugTEMIST and CardioCCC, as well as other associated resources, while Section 4
presents the participation results and proposed methodologies. Finally, Section 5 concludes the paper
with a discussion of some of the most interesting aspects, learned lessons, future work and more.
2. Task description
2.1. Shared task description
The MultiCardioNER task participants were asked to implement named entity recognition (NER)
systems using a general clinical corpus annotated with disease and medication mentions. They were
then required to adapt these NER systems to a particular medical specialty, namely cardiology. In
addition, the MultiCardioNER task also explored the creation of clinical multilingual NER components
or cross-language adaptation of these systems for three languages: Spanish, English and Italian.
The MultiCardioNER task relied on a previous resource exploited in a past shared task, called
DisTEMIST [18]. The DisTEMIST corpus is a collection of 1,000 clinical case reports covering a wide
range of specialties annotated for diseases by clinical experts. Furthermore, a previously unreleased
corpus called DrugTEMIST was published as part of the task. The DrugTEMIST corpus provides drug
or medication mention annotations for the same collection of clinical case reports as used for the
DisTEMIST dataset.
For the adaptation to cardiology, we have constructed the CardioCCC corpus, a new dataset that
consists of manually selected cardiology clinical case reports showing characteristics similar to those of
cardiology discharge summaries. This resource was provided to participants to enable the exploration of
different clinical subdomain/specialty adaptation strategies and to benchmark the resulting systems.
To foster the generation of multilingual clinical NER corpora, DrugTEMIST and CardioCCC were
automatically translated from Spanish into both English and Italian. The Gold Standard drug mention
annotations were then mapped into both target languages and validated manually by clinical experts
(native speakers of English and Italian). The three corpora and the underlying annotation projection
process are described in more detail in Section 3.
The evaluation process relied on the comparison of participating team predictions against the manual
annotations previously done by the clinical experts. Each team was allowed to submit up to 5 runs for
each subtrack and language. The evaluation process and metrics are reported in Section 2.3.
2.2. Subtracks
MultiCardioNER was structured into two different subtracks:
• Subtrack 1 (CardioDis). This track focuses on the adaptation of disease recognition systems to
the cardiology specialty in Spanish. Participants could use the DisTEMIST corpus [18] as a base
training set, together with a new collection of cardiology-specific clinical case reports annotated
with diseases (CardioCCC) that could be used to fine-tune or adapt their systems to cardiology
case reports.
• Subtrack 2 (MultiDrug). This subtrack focuses on the multilingual or cross-language adaptation
(Spanish, English and Italian) of medication recognition systems, specifically for cardiology
clinical case reports. For this track, participants could use the DrugTEMIST dataset as NER
training resource. This corpus can be seen as a complementary dataset to the previously-released
DisTEMIST, ProcTEMIST and SympTEMIST corpora, as it incorporates annotations of medications
for the same document collection. To enable adaptation to cardiology, the CardioCCC corpus was
annotated with medication mentions and divided into development and test subsets. While the
original versions of both datasets were created using Spanish texts, a machine-translated version
in English and Italian was revised by hand and annotated by clinical experts.
2.3. Evaluation
The task was divided into two distinct phases: training and test set prediction (evaluation). During the
training phase, participants were provided with the DisTEMIST and DrugTEMIST datasets, as well as
a subset of the CardioCCC corpus made up of 258 documents. The second batch of the CardioCCC
collection was used as test set and released together with a larger background set to make sure that no
manual post-editing was carried out by the teams and that the submitted systems could scale up to
process larger data collections. These collections (test and background set) were released approximately
one month after the start of the training phase. Participants were given two weeks to generate predictions
for all documents. They were then evaluated using the Gold Standard annotations of the CardioCCC test
set, reserving the predictions for the background set to create a participants’ Silver Standard (discussed
in Section 3.4). It was not mandatory to submit results for all three languages.
Both MultiCardioNER subtracks were evaluated using micro-averaged precision, recall and F1-score.
These metrics are calculated as follows:
$$\text{Precision } (P) = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
$$\text{Recall } (R) = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
$$\text{F1 score } (F1) = \frac{2 \cdot (P \cdot R)}{P + R}$$
Table 1
Results of the baseline system (vocabulary transfer) for the two MultiCardioNER subtracks
Subtrack | Language | System Name | Precision | Recall | F1
CardioDis | Spanish | DisTEMIST vocabulary transfer | 0.5178 | 0.3681 | 0.4303
MultiDrug | Spanish | DrugTEMIST vocabulary transfer | 0.6366 | 0.7148 | 0.6734
MultiDrug | English | DrugTEMIST vocabulary transfer | 0.3317 | 0.7269 | 0.4556
MultiDrug | Italian | DrugTEMIST vocabulary transfer | 0.3320 | 0.6844 | 0.4471
As part of the task, an official MultiCardioNER evaluation library was released and is available on
GitHub1 . After the task results were released, the test set Gold Standard annotations were shared with
participating teams to enable them to perform extra experiments and facilitate error analysis of their
systems.
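As an illustration of the micro-averaged metrics defined above, the following is a minimal sketch and not the official evaluation library. It assumes that a prediction counts as a true positive only when its filename, character offsets and label exactly match a Gold Standard annotation; this matching criterion, the function name `micro_prf` and the `DISEASE` label are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

# An annotation is identified by (filename, start offset, end offset, label).
Annotation = Tuple[str, int, int, str]


def micro_prf(gold: Iterable[Annotation], pred: Iterable[Annotation]) -> Dict[str, float]:
    """Micro-averaged precision, recall and F1 over all documents pooled together.

    Assumption: exact match on filename, offsets and label defines a true positive.
    """
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # predictions that match a gold annotation
    fp = len(pred_set - gold_set)   # predictions with no gold counterpart
    fn = len(gold_set - pred_set)   # gold annotations that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: two exact matches, one boundary error and one spurious prediction.
gold = [("doc1", 0, 5, "DISEASE"), ("doc1", 10, 20, "DISEASE"), ("doc2", 3, 9, "DISEASE")]
pred = [("doc1", 0, 5, "DISEASE"), ("doc1", 10, 18, "DISEASE"),
        ("doc2", 3, 9, "DISEASE"), ("doc2", 30, 35, "DISEASE")]
scores = micro_prf(gold, pred)
```

Because counts are pooled across documents before dividing, long documents with many mentions weigh more than short ones, which is the usual behaviour of micro-averaging.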
2.4. Baseline
To provide a baseline system for comparison, we used a simple vocabulary transfer approach that relied
on generating a gazetteer of entities from the training sets (DisTEMIST/DrugTEMIST corpora), and
carrying out dictionary look-up of these terms in the test set. Specifically, the system is a lexical lookup
approach that tries to find the annotated strings in both corpora within the cardiology test set. The
baseline results are shown in Table 1.
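A vocabulary transfer baseline of this kind can be sketched as follows. This is a hedged illustration rather than the organizers' code: the lowercasing, the whole-word matching and the longest-match preference are assumptions, and the function names `build_gazetteer` and `lookup` are hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def build_gazetteer(train_mentions: Iterable[str]) -> List[str]:
    """Collect distinct lowercased mention strings, longest first."""
    terms = {m.strip().lower() for m in train_mentions if m.strip()}
    return sorted(terms, key=len, reverse=True)


def lookup(text: str, gazetteer: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched text) for each whole-word gazetteer hit.

    Assumption: matching is case-insensitive and, because longer terms come
    first in the alternation, the longest term wins at a given position.
    """
    if not gazetteer:
        return []
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in gazetteer) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Toy example with made-up training mentions and test text.
gazetteer = build_gazetteer(["aspirin", "Acetylsalicylic acid", "acid"])
test_text = "The patient received acetylsalicylic acid and aspirin."
hits = lookup(test_text, gazetteer)
```

A look-up of this type cannot find mentions that never occurred in the training data, which is one plausible reason for the modest recall on diseases in Table 1.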
3. Corpus and resources
The MultiCardioNER task leverages an already-existing corpus, DisTEMIST, as well as two new releases,
DrugTEMIST and CardioCCC. DisTEMIST and DrugTEMIST share the same document collection,
which consists of clinical case reports from various clinical specialties such as oncology, infectious
diseases, urology and psychiatry. This collection of texts has also been used for the procedures corpus
MedProcNER/ProcTEMIST [19] and the signs and symptoms corpus SympTEMIST [20]. These corpora
could be considered complementary since they have been annotated by the same clinical experts using
the same methodology, which includes the creation of dedicated annotation guidelines. They were
released as part of previous shared tasks in an effort to promote the development and accessibility of
annotated resources for clinical information extraction in Spanish validated by clinical experts. Other
resources resulting from this initiative include PharmaCoNER [21], LivingNER [22], MEDDOPROF [23]
or MEDDOPLACE [24].
The CardioCCC corpus consists of cardiology-specific clinical case reports. It includes annotations
for diseases and drugs created using the same guidelines as DisTEMIST and DrugTEMIST. Although
all three corpora were created originally in Spanish, the texts and annotations related to drugs were
translated into English and Italian and released for this task. Table 2 provides some statistics for the
different datasets that make up MultiCardioNER, which are explained in detail in this section. All
datasets described in this section are openly available on Zenodo2 .
1 https://github.com/nlp4bia-bsc/multicardioner_evaluation_library
2 https://zenodo.org/doi/10.5281/zenodo.10948354
Table 2
Statistics for the datasets provided for MultiCardioNER. “Annot.” stands for “annotations”, while “Chars”
stands for “characters”. Unique annotations refer to the number of distinct annotated strings after converting
all annotations to lowercase. The number of tokens has been calculated using the following spaCy models:
“es_core_news_sm”, “en_core_web_sm” and “it_core_news_sm”.
Dataset | Lang. | Entity | Docs | Tokens | Chars | Annot. | Unique Annot. | Mean Annot. Tokens | Mean Annot. Chars
DisTEMIST | ES | Diseases | 1,000 | 406,137 | 2,335,968 | 10,664 | 6,739 | 3.20 ± 2.98 | 24.76 ± 18.89
DrugTEMIST | ES | Drugs | 1,000 | 406,137 | 2,335,968 | 2,778 | 925 | 1.19 ± 0.56 | 11.34 ± 4.46
DrugTEMIST | EN | Drugs | 1,000 | 404,194 | 2,230,631 | 2,814 | 875 | 1.25 ± 0.66 | 11.26 ± 0.52
DrugTEMIST | IT | Drugs | 1,000 | 421,251 | 2,393,002 | 2,808 | 893 | 1.25 ± 0.69 | 11.49 ± 4.73
CardioCCC | ES | Diseases | 508 | 568,297 | 3,215,774 | 18,232 | 7,692 | 3.32 ± 2.84 | 26.28 ± 19.06
CardioCCC | ES | Drugs | 508 | 568,297 | 3,215,774 | 4,227 | 755 | 1.19 ± 0.71 | 11.60 ± 5.25
CardioCCC | EN | Drugs | 508 | 576,772 | 3,114,833 | 4,231 | 734 | 1.21 ± 0.64 | 11.37 ± 4.74
CardioCCC | IT | Drugs | 508 | 595,332 | 3,345,466 | 4,385 | 752 | 1.23 ± 0.72 | 11.85 ± 5.25
Figure 1: Excerpt from the DisTEMIST corpus with various annotated diseases. Translation with annotated
entities in italics: “A 37-year-old woman diagnosed with AML (acute myeloblastic leukemia) in 2003 following a
spontaneous right hemopneumothorax that required surgery with evacuation of the hemothorax and resection of
bullous dystrophy. She was followed up on an outpatient basis without incident until 2009 when she presented
with chylous ascites and a large retroperitoneal cystic lymphangioma was detected on an abdominal computed
tomography (CT) scan. In February 2011 she was admitted for exertional dyspnea and extensive right pleural
effusion. Pleural fluid showed characteristics of chylothorax: [...]”.
3.1. DisTEMIST
DisTEMIST is a Gold Standard manually annotated corpus of disease mentions in Spanish clinical case
documents normalized or mapped to SNOMED CT concept identifiers. It consists of 1,000 clinical case
reports written in Spanish from miscellaneous medical specialties. Figure 1 shows an example of an
annotated document.
The texts in the corpus were derived from SciELO (Scientific Electronic Library Online)3 , an electronic
library that contains publications from scientific journals. The texts were manually selected by clinical
experts so that their structure and content were clinically relevant and representative. The texts were
then pre-processed to extract the appropriate sections of the clinical cases and to remove embedded
figure references and citations to be as close as possible to real medical records. These texts were
originally released under the name SpaCCC (Spanish Clinical Case Corpus). As shown in Table 2, the
text collection includes a total of 406,137 tokens and 2,335,968 characters. In terms of annotations, the
corpus includes a total of 10,664 entities, out of which 6,739 are unique after converting them into
lowercase.
3 http://www.scielo.org
The DisTEMIST corpus was annotated and standardized by two clinical experts from a Spanish
tertiary hospital. The annotated mentions and their normalization were post-processed and revised
afterwards by a third physician. The annotations were created using the brat tool [25]. Annotation and
normalization guidelines were created specifically for this task. The annotation involved discussions
between physicians, particularly regarding complex mentions. This, together with multiple rounds of
inter-annotator agreement (IAA) through parallel annotation of a section of the corpus (around 20%),
resulted in an iterative refinement of the guidelines. After several rounds, a total IAA score of 82.3
(computed as the pairwise agreement between two independent annotators) for the disease mentions
was achieved.
The result of this process is the DisTEMIST guidelines, openly available on Zenodo4 . The document
contains a total of 28 pages describing how to annotate diseases in clinical texts. There are a total of 52
rules divided into various types, such as general, positive or negative. There is also a set of rules specific
to oncology mentions that was added as the language used in clinical cases related to this specialty
proved to be more specific and harder to annotate. These are partially based on the CANTEMIST corpus
[26]. The guidelines also include a discussion of the task’s importance, a corpus characterization, basic
information about the task and the annotation process, as well as indications and resources for the
annotators. It is noteworthy that the DisTEMIST guidelines have been adapted to other domains, such
as social media [27].
The DisTEMIST text documents are in plain text format with UTF-8 encoding. The annotations are
presented in two different stand-off versions. The first version includes the original annotation files
as outputted by brat [25]. These are .ann files, one for each text file, where each line represents an
annotation, including its label, its start and end position and its associated text. The second version is a
single tab-separated file (.tsv) which includes all annotations in the corpus. Similarly to the .ann files,
this version includes one annotation per row with an additional field for the corresponding filename.
For MultiCardioNER, all 1,000 documents in the corpus are presented together. For anyone who wishes
to use the original train/test split of the corpus (consisting of 750 and 250 documents, respectively), we
advise downloading the original DisTEMIST Gold Standard5 to retrieve the list of filenames belonging to
each split. The original repository also includes the SNOMED CT mappings for the annotated mentions,
as well as some additional data, such as a background set of related clinical documents and a Silver
Standard of the corpus in 6 languages (English, Portuguese, Catalan, Italian, French and Romanian),
created using annotation projection. The annotation projection methodology is described in Section 3.2,
as well as in the original paper [18].
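To make the two stand-off versions more concrete, the sketch below parses text-bound lines of a brat .ann file and flattens them into tab-separated rows with the filename added. The brat line layout (identifier, tab, label and offsets, tab, text) is the standard brat stand-off format; the column order of the output rows, the `DISEASE` label and the sample offsets are assumptions for illustration and may differ from the released .tsv file.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Mention(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(content: str) -> Iterator[Mention]:
    """Yield text-bound annotations (lines starting with 'T') from brat .ann content."""
    for line in content.splitlines():
        parts = line.split("\t")
        if not line.startswith("T") or len(parts) < 3:
            continue  # skip notes, relations and malformed lines
        label, offsets = parts[1].split(" ", 1)
        # Discontinuous mentions use ';' between fragments; keep the outer span.
        fragments = [frag.split() for frag in offsets.split(";")]
        yield Mention(parts[0], label, int(fragments[0][0]), int(fragments[-1][1]), parts[2])


def to_tsv(filename: str, mentions: Iterable[Mention]) -> str:
    """Serialize mentions as tab-separated rows (hypothetical column order)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for m in mentions:
        writer.writerow([filename, m.ann_id, m.label, m.start, m.end, m.text])
    return buffer.getvalue()


# Made-up sample content in brat stand-off format.
sample = (
    "T1\tDISEASE 27 30\tAML\n"
    "T2\tDISEASE 32 59\tacute myeloblastic leukemia\n"
    "#1\tAnnotatorNotes T1\texample note\n"
)
mentions = list(parse_ann(sample))
tsv_rows = to_tsv("doc.txt", mentions)
```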
3.2. DrugTEMIST
DrugTEMIST is a collection of 1,000 clinical case reports from various clinical specialties annotated
with mentions of medications. Figure 2 shows an excerpt of an annotated document from the corpus.
The corpus uses the same collection of texts as DisTEMIST, which is also shared by
MedProcNER/ProcTEMIST [19] and SympTEMIST [20]. Unlike those corpora, DrugTEMIST had not been previously
released and is one of the novelties of the MultiCardioNER task. Again, the corpus includes a total of
406,137 tokens and 2,335,968 characters, as well as 2,778 annotated entities (925 unique after converting
them to lowercase).
Similarly to the DisTEMIST corpus, dedicated annotation guidelines were written to define what
should be considered a medication and how to perform the annotations. These guidelines were created
and refined using the same methodology used for DisTEMIST, including thorough discussions between
physicians and the annotation of a sample of the corpus (around 20%). The final IAA of the corpus is
0.955. The DrugTEMIST annotation guidelines are also available in Zenodo6 . They contain 17 pages
and are quite similar to the DisTEMIST guidelines, with a total of 29 rules. The release format of the
corpus is the same as that of DisTEMIST.
4 https://zenodo.org/doi/10.5281/zenodo.6458078
5 https://zenodo.org/doi/10.5281/zenodo.6408476
6 https://zenodo.org/doi/10.5281/zenodo.11065432
Figure 2: Excerpt from the DrugTEMIST corpus with various annotated medications. Translation with annotated
entities in italics: “An 82-year-old woman with a history of breast neoplasia treated with surgery and hormone
therapy 20 years ago, hypertensive cardiomyopathy in sinus rhythm, hypercholesterolemia and moderate chronic
hyponatremia around 133 mmol/L. She was treated with torasemide 5 mg/24h, isosorbide mononitrate 50 mg/24h,
acetylsalicylic acid 100 mg/24h, pravastatin 20 mg/24h, candesartan 32 mg/24h, hydrochlorothiazide 12.5 mg/24h,
atenolol 50 mg/24h and spironolactone 25 mg/24h”.
The original Gold Standard of the corpus was created in Spanish. For the multilingual part of the
task, we created versions of the corpus in English and Italian using annotation projection techniques.
These two languages were chosen due to their relevance for other related projects and the availability of
clinical experts fluent in each language, who performed a manual revision of all documents to validate
the annotation and the quality of the translation. Specifically, the annotation projection methodology
consisted of the following steps:
1. An automatic translation of the Spanish documents was carried out (for the previous DisTEMIST
task) using high-quality commercial machine translation systems. In a separate step, the Gold
Standard annotations were translated without context (i.e. as a plain list of strings).
2. The translated annotations were next transferred into each document using a look-up system.
For each document, only the annotations that existed in the original Gold Standard were looked
up to prevent introducing false positives. The result of this step is an automatically annotated
version of the corpus in each language, which could be considered a Silver Standard.
3. In order for the corpus to be used as a Gold Standard, a manual revision was performed. Experts
compared the original Spanish version of the documents with the version in English and Italian
using brat’s side-by-side comparison mode. They were tasked with correcting existing and
adding new mentions if necessary to make the annotation as close as possible to the original.
Additionally, these experts were asked to provide alternative translations to annotated entities
that were incorrectly translated.
4. A post-processing step incorporated the alternative translations suggested by the annotators.
These translations replaced the original annotated entity both in the text and in the annotation
files.
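The look-up transfer in step 2 can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the pipeline actually used: the case-insensitive whole-word matching, the preference for longer mentions, the function name `project_annotations` and the `DRUG` label are all hypothetical choices.

```python
import re
from typing import Dict, List, Tuple


def project_annotations(
    translated_text: str,
    source_mentions: List[str],
    mention_translations: Dict[str, str],
    label: str,
) -> List[Tuple[int, int, str, str]]:
    """Project one document's annotations onto its translated text.

    Only mentions present in this document's original Gold Standard
    (source_mentions) are searched, to avoid introducing false positives.
    """
    candidates = {mention_translations[m] for m in source_mentions if m in mention_translations}
    projected: List[Tuple[int, int, str, str]] = []
    # Longer translations are placed first so shorter ones cannot split them.
    for target in sorted(candidates, key=len, reverse=True):
        pattern = re.compile(r"(?<!\w)" + re.escape(target) + r"(?!\w)", re.IGNORECASE)
        for match in pattern.finditer(translated_text):
            start, end = match.start(), match.end()
            if any(s < end and start < e for s, e, _, _ in projected):
                continue  # overlaps a mention that was already projected
            projected.append((start, end, label, match.group(0)))
    return sorted(projected)


# Made-up translations of annotated strings (translated without context).
translations = {
    "torasemida": "torasemide",
    "ácido acetilsalicílico": "acetylsalicylic acid",
    "atenolol": "atenolol",
}
english_text = "She was treated with torasemide 5 mg/24h and acetylsalicylic acid 100 mg/24h."
silver = project_annotations(
    english_text, ["torasemida", "ácido acetilsalicílico"], translations, "DRUG"
)
```

The output of such a step corresponds to the Silver Standard described above; the manual revision in steps 3 and 4 is what turns it into a Gold Standard.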
Statistics about the English and Italian versions of DrugTEMIST are also provided in Table 2. We
should underscore that the different versions of the corpus do not contain the exact same number
of annotations. This is mostly due to translation differences and errors introduced by the machine
translation system.
3.3. CardioCCC
CardioCCC (which stands for Cardiology Clinical Case Corpus) is a collection of 508 cardiology clinical
case reports. The documents were retrieved from open-access cardiology journals in Spanish. Within
these journals, we tried to manually locate clinical case reports that would have a similar structure to
real clinical health records. The candidates were then extracted and, in a similar fashion to the other
two corpora presented so far, pre-processed to keep only the relevant article sections and to remove
references to figures and tables. The cases were then revised by a clinical expert to confirm their validity.
Figure 3 shows a parallel example of the drug annotations in Spanish, English and Italian.
(a) Example in English.
(b) Example in Spanish.
(c) Example in Italian.
Figure 3: Excerpt from the CardioCCC drug annotations in all three languages taken from the same document.
Table 3
Statistics for the two splits of the CardioCCC corpus. CardioCCC_dev refers to the first batch of the corpus,
which participants were allowed to use freely during the training phase. CardioCCC_test refers to the held-out
test set used for evaluation. “Annot.” stands for “annotations”, while “Chars” stands for “characters”. Unique
annotations refer to the number of distinct annotated strings after converting all annotations to lowercase. The
number of tokens has been calculated using the following spaCy models: “es_core_news_sm”, “en_core_web_sm”
and “it_core_news_sm”.
Dataset | Lang. | Entity | Docs | Tokens | Chars | Annot. | Unique Annot. | Mean Annot. Tokens | Mean Annot. Chars
CardioCCC_dev | ES | Diseases | 258 | 315,047 | 1,777,279 | 10,348 | 4,749 | 3.34 ± 2.72 | 26.67 ± 18.47
CardioCCC_dev | ES | Drugs | 258 | 315,047 | 1,777,279 | 2,510 | 526 | 1.21 ± 0.74 | 11.67 ± 5.34
CardioCCC_dev | EN | Drugs | 258 | 319,289 | 1,718,585 | 2,510 | 513 | 1.22 ± 0.66 | 11.48 ± 4.85
CardioCCC_dev | IT | Drugs | 258 | 330,291 | 1,848,888 | 2,585 | 520 | 1.23 ± 0.73 | 11.82 ± 5.31
CardioCCC_test | ES | Diseases | 250 | 253,250 | 1,438,495 | 7,884 | 3,784 | 3.29 ± 3.00 | 25.76 ± 19.80
CardioCCC_test | ES | Drugs | 250 | 253,250 | 1,438,495 | 1,718 | 465 | 1.17 ± 0.66 | 11.51 ± 5.12
CardioCCC_test | EN | Drugs | 250 | 257,483 | 1,396,248 | 1,721 | 460 | 1.20 ± 0.61 | 11.21 ± 4.58
CardioCCC_test | IT | Drugs | 250 | 265,041 | 1,496,578 | 1,800 | 469 | 1.22 ± 0.71 | 11.89 ± 5.16
The corpus contains annotations for diseases and drugs, which were created following the same
guidelines used for DisTEMIST and DrugTEMIST. The main annotator for CardioCCC was the same
clinical expert who did the final annotation and revision step for the other two corpora, which was a
big asset in accelerating the corpus annotation process. As with DrugTEMIST, the corpus’s texts were
translated from Spanish into English and Italian using machine translation. The Gold Standard drug
annotations were also transferred into English and Italian via annotation projection and revised by
clinical experts who are native speakers of each language.
As explained in Section 2.3, CardioCCC was released in two batches: one for training/development
and another for evaluation. The statistics for these two parts are presented in Table 3, while Table 2
presents the statistics of the complete corpus. In terms of content, as shown by Table 2, the corpus
contains 568,297 tokens and 3,215,774 characters. Despite having about half as many documents as the
SpaCCC corpus (i.e. the texts in DisTEMIST/DrugTEMIST), CardioCCC contains over 150,000 more
tokens and roughly 880,000 more characters, meaning the documents are considerably longer. This is also reflected
in the number of annotations, with CardioCCC having around 8,000 more annotated diseases and 1,500
more drugs. Notably, despite the higher total number of annotations, CardioCCC contains fewer unique
drug mentions (which is calculated by converting all annotations to lowercase). This might be due to
the fact that in CardioCCC, drug mentions are usually more limited to cardiology-specific medications,
while in DrugTEMIST, there is a wider variety of medications mentioned due to the varied clinical
specialties it contains. As for the length of annotations themselves, all corpora seem to have a similar
distribution in terms of character and token length. The high standard deviation with respect to the
mean, especially for diseases, indicates that there are a number of long annotations in the datasets.
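The annotation-level statistics discussed above (total and unique annotations, mean length with standard deviation) could be computed along these lines. This is a simplified sketch: it splits mentions on whitespace instead of using the spaCy models named in the table captions, and it uses the population standard deviation, both of which are assumptions; the function name `annotation_stats` is hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def annotation_stats(mentions: List[str]) -> Dict[str, float]:
    """Summarize a list of annotated strings.

    Assumptions: whitespace tokenization (not spaCy) and population
    standard deviation; the reported figures may rely on other choices.
    """
    if not mentions:
        raise ValueError("at least one mention is required")
    char_lengths = [len(m) for m in mentions]
    token_lengths = [len(m.split()) for m in mentions]
    return {
        "annotations": len(mentions),
        # Unique annotations are counted after lowercasing, as in Tables 2 and 3.
        "unique_annotations": len({m.lower() for m in mentions}),
        "mean_tokens": mean(token_lengths),
        "std_tokens": pstdev(token_lengths),
        "mean_chars": mean(char_lengths),
        "std_chars": pstdev(char_lengths),
    }


# Toy example with made-up mentions.
stats = annotation_stats(["Aspirin", "aspirin", "acetylsalicylic acid"])
```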
3.4. Background set
In addition to the three annotated corpora, an additional dataset was released as a background set.
This dataset contains 7,625 text documents, both from the cardiology subdomain and other clinical
specialties. While most documents were originally written in Spanish, some of them were also originally
in English and Italian. All documents were translated to the other languages to have a comparable
background set in all three languages. Together with the background set, we release a tab-separated
values (.tsv) file that specifies the original language of each document and whether they belong to the
cardiology domain or not.
As part of the task’s evaluation period, participants were asked to create predictions for diseases
and drugs using their systems. Their predictions were then used to create a Silver Standard, which we
release in three different versions:
1. All mentions are kept, with the label name reflecting the team and run the prediction belongs to.
This version inevitably includes many incorrect and redundant annotations.
2. Only predictions that have some overlap with the predictions of a different run are used. The
overlapping annotations are then merged under a single annotation and a new label name. This
version should have a reduced number of incorrect annotations, although some of the “correct”
annotations might have extension problems, such as being too short or too long.
3. Only predictions that have a complete overlap with another prediction of a different run are used.
This should, in theory, contain the highest number of correct annotations.
Table 4 shows some basic statistics about the text documents included within the Silver Standard.
This new dataset can have multiple uses, such as bootstrapping manual annotations, system training
using semi-supervised learning techniques or error and data analysis, amongst others.
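The second and third Silver Standard versions described above can be sketched as follows. This is an illustration under stated assumptions, not the organisers' actual procedure: predictions are modelled as hypothetical `(run, start, end)` tuples with half-open character offsets for a single document and entity type, overlapping spans are merged by taking their union, and "complete overlap" is read as identical offsets.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (run, start, end) predictions share at least one character."""
    return a[1] < b[2] and b[1] < a[2]


def silver_partial_overlap(predictions):
    """Version 2 sketch: keep spans that overlap a span from a different run,
    then merge overlapping spans into one (assumption: union of offsets)."""
    kept = [p for p in predictions
            if any(q[0] != p[0] and overlaps(p, q) for q in predictions)]
    merged = []
    for _, start, end in sorted(kept, key=lambda p: (p[1], p[2])):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def silver_exact_overlap(predictions):
    """Version 3 sketch: keep spans predicted with identical offsets
    by at least two different runs."""
    runs_by_span = defaultdict(set)
    for run, start, end in predictions:
        runs_by_span[(start, end)].add(run)
    return sorted(span for span, runs in runs_by_span.items() if len(runs) >= 2)


# Hypothetical predictions; run names and offsets are invented.
predictions = [
    ("teamA-run1", 0, 10),
    ("teamB-run1", 0, 10),
    ("teamB-run2", 5, 14),
    ("teamA-run1", 30, 38),
    ("teamC-run1", 50, 57),
    ("teamC-run2", 52, 57),
]
partial = silver_partial_overlap(predictions)
exact = silver_exact_overlap(predictions)
```

In this toy input the span at offsets 30 to 38 is dropped from both versions because no other run supports it. The merged span from 0 to 14 also shows the boundary issue mentioned for the second version, since it is longer than any span the exact-match version keeps.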
Table 4
Statistics of the documents in the background set.
Language Documents Tokens Characters
Spanish 7,625 3,863,801 22,066,533
English 7,625 3,857,831 21,130,044
Italian 7,625 4,015,920 22,782,246
Table 5
Overview of the teams that participated in MultiCardioNER. In the Affiliation column, A/I stands for
academic or industry institution. In the Tasks column, C stands for the CardioDis subtrack and M for
the MultiDrug subtrack.
Team Name Affiliation Tasks Ref.
BIT.UA IEETA, University of Aveiro, Portugal [A] C [28]
DataScienceTUW Technische Universität Wien, Austria & Spanish National Research Council (CSIC), Spain [A] C/M [29]
Enigma OntoText, Bulgaria & Sofia University, Bulgaria [I/A] C/M [30]
ICUE University of Edinburgh, UK & Imperial College London, UK [A] M [31]
NOVALINCS NOVA School of Science And Technology, Portugal [A] C/M [32]
PICUSLab Università degli Studi di Napoli Federico II, Italy [A] C [33]
Siemens Siemens Advanta, Romania & Transilvania University of Brasov, Romania [I/A] C/M [34]
4. Results
4.1. Participation overview
A total of 31 teams registered for the MultiCardioNER task, out of which 7 teams submitted at least
one run of their predictions. The participating teams originate from 8 different countries (some
involve collaborations between groups from different countries) and, except for one group from
industry, all belong to academia. Table 5 shows the complete list of participating teams, along with
their affiliation and the reference to their task paper.
As for the participation in each subtrack, 6 teams participated in the CardioDis subtrack, while 5
teams participated in the MultiDrug subtrack (with one of those teams participating only in the Spanish
part). Overall, a total of 70 runs were submitted, with each team allowed up to 5 runs per subtrack and
language: 20 for the CardioDis subtrack, 18 for the Spanish MultiDrug, 16 for the English MultiDrug
and 16 for the Italian MultiDrug.
4.2. System results
All in all, the top scores for each subtrack were:
• Subtrack CardioDis. The team BIT.UA attained the top position with an ensemble of RoBERTa-
based models (roberta-es-clinical-trials-ner) that also uses a multi-head-CRF approach [35]. Their
runs integrated the provided datasets in different ways, with the highest scores achieved by the
models that use both the DisTEMIST and CardioCCC data. Their best run achieved an F1-score of
0.8199 and a recall of 0.8243. The team with the next best F1-score (0.8049) is Enigma, which uses a
CLIN-X-ES model also fine-tuned on the DisTEMIST and CardioCCC data. Interestingly, the team
PICUSLab achieves the best precision (0.8919) by a wide margin by combining the predictions of
multiple models trained on different parts of the data (including an augmented version of the
CardioCCC corpus) and then using string matching techniques to enhance the final predictions.
• Subtrack MultiDrug. In Spanish, the best F1-score is achieved by the ICUE team (0.9277), who
also achieved the best recall (0.9412). Meanwhile, in English and Italian, the winning team is
Enigma, with an F1-score of 0.9223 and 0.8842, respectively.
The results for the CardioDis subtrack are shown in Table 6, while the results for the MultiDrug
subtrack are presented in Table 7 for Spanish, Table 8 for English and Table 9 for Italian.
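As a quick consistency check on the figures reported in these tables, the F1-score can be recomputed from precision and recall. The sketch below assumes that the reported F1 is the harmonic mean of the reported precision and recall; the function name is ours, and the input values are copied from Table 6.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the top CardioDis run (BIT.UA, run1-all-full).
best_cardiodis_f1 = round(f1_score(0.8155, 0.8243), 4)

# Precision and recall of the most precise CardioDis run (PICUSLab, aug_fus_sm_sub2).
most_precise_f1 = round(f1_score(0.8919, 0.4897), 4)
```

Under that assumption the two values come out as 0.8199 and 0.6323, matching Table 6. The second case shows how a large gap between precision and recall pulls the F1-score down, even when precision is the highest in the subtrack.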
4.3. Methodologies
This section describes the methodologies used by each team, which are also summarized in Table 10.
Table 6
Results of the MultiCardioNER CardioDis subtrack, sorted by F1-score. The best result is bolded, and the
second-best is underlined.
Team Name Run name Precision Recall F1
BIT.UA run1-all-full 0.8155 0.8243 0.8199
BIT.UA run0-top5-full 0.811 0.8181 0.8145
Enigma 3-system-CLIN-X-ES-pretrained 0.8016 0.8082 0.8049
Enigma 2-system-CLIN-X-ES-14 0.8052 0.8007 0.803
PICUSLab aug_fus_sub2 0.7794 0.803 0.791
BIT.UA run4-all 0.7981 0.7827 0.7903
Enigma 1-system-CLIN-X-ES-12 0.7827 0.7938 0.7882
PICUSLab aug_fus_sub1 0.7346 0.7799 0.7566
BIT.UA run3-all-val 0.7544 0.7588 0.7566
BIT.UA run2-best-val 0.748 0.7542 0.7511
DataScienceTUW run4-roberta-dg 0.6565 0.7376 0.6947
DataScienceTUW run5-roberta-dg-windows 0.6546 0.7244 0.6877
Siemens run1_SDR 0.6758 0.6437 0.6593
PICUSLab aug_fus_sm_sub2 0.8919 0.4897 0.6323
DataScienceTUW run1_mdeberta-ct-mlm-dg 0.5928 0.6715 0.6297
PICUSLab aug_fus_sm_sub1 0.8886 0.4744 0.6185
DataScienceTUW run2-mdeberta-ct 0.5027 0.6884 0.581
DataScienceTUW run3_mdeberta-ct-dg 0.48 0.6773 0.5618
NOVALINCS 1_bsc-bio-ehr-es_distemist_4 0.8018 0.3525 0.4897
NOVALINCS 2_bsc-bio-ehr-es_distemist_1 0.8183 0.3398 0.4802
Table 7
Results of the MultiCardioNER MultiDrug subtrack in Spanish, sorted by F1-score. The best result is bolded, and
the second-best is underlined.
Team Name Run name Precision Recall F1
ICUE run2_single_pp 0.9146 0.9412 0.9277
ICUE run4_GPT_translation 0.9146 0.9412 0.9277
ICUE run5_GPT_translation_all 0.9146 0.9412 0.9277
Enigma 3-system-SpanishRoBERTa 0.913 0.9348 0.9238
Enigma 1-system-XLMR 0.904 0.9208 0.9123
Enigma 2-system-XLMR-filtering 0.9148 0.9005 0.9076
ICUE run3_single 0.8777 0.9272 0.9018
Siemens run1_SMR 0.8928 0.8778 0.8852
ICUE run1_multilingual_pp 0.8287 0.9348 0.8786
Enigma 5-system-XLMR-filtering-dict2 0.7654 0.8871 0.8218
NOVALINCS 3_bsc-bio-ehr-es_drugtemist_4 0.9242 0.4965 0.646
NOVALINCS 4_bsc-bio-ehr-es_drugtemist_1 0.9076 0.4919 0.638
DataScienceTUW run3_roberta-ct-multilingual 0.8705 0.4342 0.5794
Enigma 4-system-XLMR-filtering-dict1 0.4351 0.7899 0.5611
DataScienceTUW run5_roberta-ct-mlm 0.8421 0.3912 0.5342
DataScienceTUW run4_mdeberta_ct_mlm_dg 0.6815 0.3836 0.4909
DataScienceTUW run2_mdeberta-ct-multilingual 0.7647 0.3556 0.4855
DataScienceTUW run1_mdeberta-multilingual 0.3914 0.1531 0.2201
Table 8
Results of the MultiCardioNER MultiDrug subtrack in English, sorted by F1-score. The best result is bolded, and
the second-best is underlined.
Team Name Run name Precision Recall F1
Enigma 3-system-BioLinkBERT 0.8981 0.9477 0.9223
ICUE run2_single_pp 0.9086 0.9128 0.9107
ICUE run4_GPT_translation 0.9086 0.9128 0.9107
Enigma 1-system-XLMR 0.8823 0.9233 0.9023
Enigma 2-system-XLMR-filtering 0.9031 0.8989 0.901
Enigma 5-system-XLMR-filtering-dict2 0.8698 0.9047 0.8869
ICUE run3_single 0.8734 0.8977 0.8854
ICUE run1_multilingual_pp 0.8314 0.9343 0.8799
Siemens run1_EMR 0.8685 0.8791 0.8738
Enigma 4-system-XLMR-filtering-dict1 0.8298 0.921 0.873
ICUE run5_GPT_translation_all 0.8767 0.8635 0.87
DataScienceTUW run3_roberta-ct-multilingual 0.8632 0.4364 0.5797
DataScienceTUW run4-mdeberta-windows 0.7955 0.4317 0.5597
DataScienceTUW run5-biobert-mlm-windows 0.6771 0.441 0.5341
DataScienceTUW run2_mdeberta-ct-multilingual 0.8453 0.3777 0.5221
DataScienceTUW run1_mdeberta-multilingual 0.5648 0.2481 0.3448
Table 9
Results of the MultiCardioNER MultiDrug subtrack in Italian, sorted by F1-score. The best result is bolded, and
the second-best is underlined.
Team Name Run name Precision Recall F1
Enigma 1-system-XLMR 0.884 0.8844 0.8842
Enigma 3-system-Italian-Spanish-RoBERTa 0.8723 0.8956 0.8838
Enigma 2-system-XLMR-filtering 0.9016 0.8606 0.8806
Siemens run1_IMR 0.8891 0.8689 0.8789
ICUE run4_GPT_translation 0.9114 0.8461 0.8776
ICUE run5_GPT_translation_all 0.9114 0.8461 0.8776
ICUE run2_single_pp 0.8186 0.9 0.8574
ICUE run1_multilingual_pp 0.8139 0.8867 0.8487
ICUE run3_single 0.7879 0.8894 0.8356
Enigma 4-system-XLMR-filtering-dict1 0.5693 0.8578 0.6844
Enigma 5-system-XLMR-filtering-dict2 0.5707 0.845 0.6813
DataScienceTUW run3_roberta-ct-multilingual 0.8264 0.4206 0.5574
DataScienceTUW run4-mdeberta 0.7481 0.3928 0.5151
DataScienceTUW run5-biobit-mlm 0.7922 0.3517 0.4871
DataScienceTUW run2_mdeberta-ct-multilingual 0.7433 0.3394 0.4661
DataScienceTUW run1_mdeberta-multilingual 0.5074 0.2094 0.2965
• Team BIT.UA.
For subtrack CardioDis, this team builds on some of their previous work, namely the Multi-Head-CRF
approach [35], which introduces a Multi-Head Conditional Random Field (CRF) classifier
on top of a multi-class NER system. Starting from the “roberta-es-clinical-trials-ner” pre-trained
model (footnote 7), they present 5 runs of ensembled models, with some runs consisting of models
fine-tuned only with the DisTEMIST dataset and others with DisTEMIST plus CardioCCC. Their best
run is an ensemble of 17 systems trained on both corpora, which achieves the highest F1-score of the
subtrack (0.8199).
7 https://huggingface.co/lcampillos/roberta-es-clinical-trials-ner
Table 10
General overview of the approaches presented by participants for the MultiCardioNER task. “*TEMIST corpora”
refers to the joint version of the DisTEMIST, SympTEMIST, ProcTEMIST and DrugTEMIST corpora.
Team Task Approaches
BIT.UA CardioDis Ensemble of RoBERTa models with multi-head CRF and differences in the data used for training (only DisTEMIST or DisTEMIST + CardioCCC)
Data Science TUW CardioDis Transformer-based models with different pretraining settings, data augmentation and window sliding
Data Science TUW MultiDrug Multilingual and language-specific Transformers with different pretraining settings, data augmentation and window sliding
Enigma CardioDis CLIN-X-ES model fine-tuned on the entire task data + custom clinical dataset
Enigma MultiDrug Multilingual and language-specific Transformers fine-tuned on the entire task data + custom drug dictionary
ICUE MultiDrug Multilingual and language-specific BERT models with re-training, post-processing rules + GPT 3.5
NOVALINCS CardioDis RoBERTa model fine-tuned on the standalone DisTEMIST corpus vs. joint *TEMIST corpora
NOVALINCS MultiDrug RoBERTa model fine-tuned on the standalone DrugTEMIST corpus vs. joint *TEMIST corpora
PICUSLab CardioDis Ensemble of Transformer-based models trained on different datasets, including an augmented version of CardioCCC + post-processing via string matching
Siemens CardioDis Fine-tuned general domain BERT model
Siemens MultiDrug Fine-tuned language-specific general domain BERT models
• Team Data Science TUW.
This team uses four main strategies throughout their experiments for both subtracks: pre-training
via MLM (Masked Language Modelling), data augmentation, sliding windows with overlap and
additional pre-training on general diseases and drugs using other corpora. The pre-trained models
they use include the multilingual mDeBERTa [36, 37], the Spanish “roberta-es-clinical-trials-ner”,
the English “biobert_chemical_ner” (footnote 8) and the Italian BioBIT [38].
An important note about this team’s results is that they had some problems with their submission
that caused the overall low results. This is addressed in their system description, in which they
re-evaluate their models with much better results, comparable to some of the task’s best.
• Team Enigma.
For subtrack CardioDis, team Enigma fine-tuned a CLIN-X-ES model [39] on the DisTEMIST and
CardioCCC corpora for different numbers of epochs. One of their runs further pre-trains the
model using Spanish Wikipedia pages and datasets from different challenges, earning them a
spot in the subtrack’s top three F1-scores (0.8049).
For subtrack MultiDrug, the team uses a combination of different models, including a multilingual
XLM-RoBERTa [40] and language-specific models such as a Spanish RoBERTa [41] (which they also
use for Italian) and BioLinkBERT for English [42]. Their first run, which uses the multilingual
XLM-RoBERTa, pre-trains the model on a custom multilingual dataset (including biomedical challenge
data, European drug description data and Wikipedia) and then fine-tunes it for token classification
on all data for all languages. For Italian, this approach earns them the highest F1-score of the
Italian part of the subtrack (0.8842).
Their second run uses the same system but adds a classifier before it, which determines if there
are any drugs in the sentence. For Spanish and English, their best run is the third one, which uses
a language-specific model. This is not the case, however, for their third Italian run, which uses a
Spanish model pre-trained on Italian data. Another interesting contribution by this team is the
combination of neural systems and drug dictionaries obtained from resources such as DrugBank,
ATC, DrugCentral or the NIHS. The two runs that use this approach achieve very good results,
although not as good as their other ones.
8 https://huggingface.co/alvaroalon2/biobert_chemical_ner
• Team ICUE.
For the MultiDrug subtrack, this team compares the effectiveness of multilingual and monolingual
BERT models. They also experiment with the inclusion of post-processing rules (specifically for
composite drug mentions in Spanish), as well as with using Large Language Models (LLMs) such
as GPT-3.5 [43] to translate predictions in Spanish to the other two languages. Their methodology
achieves very good results, especially when they use monolingual models. In Spanish, they
achieve the best F1-score (0.9277). It is noteworthy that some of their runs in the results table are
repeated since they presented the same system with changes only for some languages.
Team ICUE also includes some additional experiments in their system description paper, such as
using GPT-3.5 and LLaMA [44] for entity recognition with competitive results.
• Team NOVALINCS.
For CardioDis, this team fine-tunes the “bsc-bio-ehr-es” pre-trained RoBERTa (footnote 9) using the
DisTEMIST corpus. They prepared two runs: one in which they only use the DisTEMIST annotations
and another in which they also incorporate the other 3 entities from the complementary corpora
(that is, procedures from MedProcNER/ProcTEMIST, symptoms from SympTEMIST and medications
from DrugTEMIST). For MultiDrug, they only participated in the Spanish part using the
same methodology, exchanging DisTEMIST with DrugTEMIST. Their overall results for both
tasks are notable for their high precision and low recall, which may indicate how difficult it is
for systems to adapt to the cardiology subdomain using only general clinical domain data.
• Team PICUSLab.
For the CardioDis subtrack, this team employs an ensemble transfer learning strategy. They train
different models on DisTEMIST, CardioCCC and an augmented version of CardioCCC (created
with the help of sentence similarity techniques and a gazetteer), and then fuse the predictions
of the different models. To further improve their predictions, they use string matching to post-
process them. Their best run earns them a spot in the subtrack’s top 5 with an F1-score of
0.791.
• Team Siemens.
This team participated in both CardioDis and MultiDrug with the same methodology. They
use general domain BERT models (“bert-spanish-cased-finetuned-ner”, “bert-base-NER” and
“bert-italian-finetuned-ner”; see footnotes 10, 11 and 12, respectively) and fine-tune them for
multi-label token classification using the different MultiCardioNER datasets. Despite not using
clinical models, their results are quite good, especially for the MultiDrug subtrack (e.g. 0.8789
F1-score in the Italian part). In their system description paper, they also perform additional
experiments that were not evaluated during the task’s evaluation phase.
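Of the strategies listed above, the sliding windows with overlap used by team Data Science TUW can be illustrated with a minimal sketch. It is not the team's implementation: the function name, the window size and the overlap are hypothetical, and the actual settings and tokenisation are described in their system description paper [29].

```python
def sliding_windows(tokens, window_size, overlap):
    """Split a token list into windows of at most `window_size` tokens,
    where consecutive windows share `overlap` tokens.

    Returns a list of (start_index, window_tokens) pairs so that predictions
    made on a window can be mapped back to positions in the full document.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    if not tokens:
        return []

    stride = window_size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return windows


# Hypothetical example: 10 tokens, windows of 4 tokens sharing 2 tokens.
example_tokens = [f"tok{i}" for i in range(10)]
example_windows = sliding_windows(example_tokens, window_size=4, overlap=2)
window_starts = [start for start, _ in example_windows]
```

The overlap means that an entity cut at the edge of one window appears whole in a neighbouring one, which matters for long documents such as those in CardioCCC that exceed a model's maximum input length.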
5. Discussion
Comparison with previous tasks.
MultiCardioNER is a novel task built upon the foundation of previous tasks and resources. In recent
years, tasks such as DisTEMIST [18], PharmaCoNER [21] or MedProcNER [19] have provided the
Spanish NLP community with a variety of corpora for the recognition (and normalization) of named
entities in clinical texts. These corpora have progressively become reference corpora for benchmarking
and model pre-training efforts [39, 45, 46, 47, 48, 49]. MultiCardioNER is different from these previous
tasks in that it uses data from a single clinical specialty, rather than a general medical dataset. The
CardioCCC corpus could become a reference for cardiology and subdomain adaptation in clinical NLP
in Spanish. The corpus is expected to expand with the addition of case reports, more entity types (such
as procedures and symptoms), and more languages.
9 https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es
10 https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner
11 https://huggingface.co/dslim/bert-base-NER
12 https://huggingface.co/nickprock/bert-italian-finetuned-ner
Subdomain adaptation is a major goal of MultiCardioNER. The task’s results indicate the importance
of using subdomain data to build systems with specific application fields. All top-performing systems
incorporate the released 258 documents from the CardioCCC corpus. In contrast, participants that only
use the DisTEMIST and DrugTEMIST corpora (consisting of clinical case reports from various specialties)
achieve high precision but low recall, thus obtaining a comparatively lower F1-score. This suggests
that, while most of the entities these systems retrieve are correct (i.e. high precision), they
fail to recover concepts specific to the cardiology subdomain (i.e. low recall). Furthermore, comparing
the results of the DisTEMIST shared task [18] with the CardioDis subtrack, the overall results are
somewhat better in the latter task: DisTEMIST’s winning team obtained an F1-score of 0.77, while the
winning team of MultiCardioNER obtained an F1-score of 0.82. This might point to the importance of
using specialty-specific data, even within very similar clinical domains.
We should underline that compared with DisTEMIST, this task offers a higher volume of training
data. While there seems to be a positive correlation with the use of subdomain-specific data, it
remains a question whether these improvements can actually be attributed to subdomain adaptation, to
differences in each of the tasks’ test sets, or to simply having more data.
Similarity between the general domain and the cardiology corpora.
Given the task’s focus on subdomain adaptation, and in order to further characterise the cardiology
and SpaCCC datasets (i.e. the DisTEMIST and DrugTEMIST texts) of the shared task, a comparison
analysis was conducted between these clinical case reports and documents belonging to other medical
disciplines. These documents consist of a collection of clinical cases categorised into 22 different
specialities with varying text structures and content, including oncology, COVID-specific reports,
primary health care, neurology, etc. The data for the other specialties was extracted using the same
methodology as for the CardioCCC (cardiology) corpus (explained in Section 3.3).
For the analysis, we built a vector representation of the documents of each specialty and then
visualised these representations in a two-dimensional space. To this end, the document
embeddings were extracted using the pre-trained language model “roberta-base-biomedical-clinical-es”
(RoBERTa-based and trained on a large Spanish biomedical corpus from different sources), resulting in
matrices of 𝑛 × 𝑚 dimensions, where 𝑛 is the number of sentences in the document and 𝑚 is the size
of the language model (768 for the RoBERTa model). Subsequently, a vector composition technique
was employed to process the extracted document embeddings, as described in the work of Amigó et al.
[50]. This involved the generalised composition function proposed by Amigó et al. [50] and
shown in Equation 1. In this expression, the first component determines the vector direction of the
sum of two vectors (𝑣⃗1 and 𝑣⃗2 ), while the second component represents its magnitude, which depends
on the norm of single vectors and their inner product. By applying this function to pairs of successive
sentences in a document and representing them as vectors, we are able to compute and represent each
document as a single vector (embedding).
In this study we implemented two different composition functions derived from Equation 1: the
summation (𝐹𝑠𝑢𝑚 ), obtained when the constants 𝜆 and 𝜇 are equal to 1 and −2 respectively, and 𝐹𝑖𝑛𝑑 ,
a particularization of Equation 1 in which 𝜆 is equal to 1 and 𝜇 to 0.
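$$
F_{\lambda,\mu}(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 + \vec{v}_2}{\lVert \vec{v}_1 + \vec{v}_2 \rVert} \cdot \sqrt{\lambda \left( \lVert \vec{v}_1 \rVert^2 + \lVert \vec{v}_2 \rVert^2 \right) - \mu \langle \vec{v}_1, \vec{v}_2 \rangle} \tag{1}
$$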
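Equation 1 and its two particular cases can be sketched in plain Python as follows. This is an illustration under assumptions rather than the code used for the analysis: the function names are ours, sentence vectors are plain lists of floats, and a document is reduced by folding the composition over its sentence vectors from left to right, which is one possible reading of "pairs of successive sentences".

```python
import math
from functools import reduce


def compose(v1, v2, lam, mu):
    """Generalised composition of Equation 1: direction of v1 + v2, with
    magnitude sqrt(lam * (|v1|^2 + |v2|^2) - mu * <v1, v2>)."""
    total = [a + b for a, b in zip(v1, v2)]
    total_norm = math.sqrt(sum(x * x for x in total))
    if total_norm == 0.0:
        # Direction is undefined when v1 = -v2; return the zero vector (assumption).
        return [0.0] * len(total)
    squared_norms = sum(a * a for a in v1) + sum(b * b for b in v2)
    inner = sum(a * b for a, b in zip(v1, v2))
    magnitude = math.sqrt(max(lam * squared_norms - mu * inner, 0.0))
    return [x / total_norm * magnitude for x in total]


def f_sum(v1, v2):
    """lam = 1, mu = -2: the magnitude equals |v1 + v2|, i.e. plain vector sum."""
    return compose(v1, v2, 1.0, -2.0)


def f_ind(v1, v2):
    """lam = 1, mu = 0: the magnitude ignores the inner product."""
    return compose(v1, v2, 1.0, 0.0)


def document_vector(sentence_vectors, composition):
    """Fold a composition function over a non-empty list of sentence vectors."""
    return reduce(composition, sentence_vectors)


# Toy two-dimensional sentence vectors, invented for illustration.
toy_sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_sum = document_vector(toy_sentences, f_sum)
doc_ind = document_vector(toy_sentences, f_ind)
```

With 𝜆 = 1 and 𝜇 = −2 the term under the square root expands to ‖𝑣⃗1 + 𝑣⃗2‖², so the composition reduces to the ordinary sum. With 𝜇 = 0 the direction is still that of the sum, but the magnitude no longer depends on how aligned the two vectors are.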
Once each document was represented as a single vector through the composition function, we applied
the t-Distributed Stochastic Neighbour Embedding (t-SNE) algorithm with a perplexity of 30
and a maximum of 800 iterations to reduce the dimensionality of the document embeddings.
This statistical method enables the visualisation of high-dimensional document embeddings in
lower-dimensional spaces, in this case, two dimensions.
Figures 4 and 5 illustrate the scatter plots generated by the applied methodology, utilising the two
composition functions previously mentioned, 𝐹𝑠𝑢𝑚 and 𝐹𝑖𝑛𝑑 respectively. Both figures reveal distinct
clustering patterns depending on the specialty. Documents belonging to specific specialties form a
well-defined cluster (see cardiology, i.e. CardioCCC, in black), highlighting the fact that each of them
possesses unique features in terms of content and structure. In contrast, documents from the SpaCCC
corpus (red points) are scattered across the plot, reflecting their diverse nature. This is due to the fact
that they cover a wide range of medical disciplines, such as cardiology (CardioCCC), oncology, urology,
pneumology or infectious diseases, among many others.
Figure 4: Document embeddings representation for each discipline after reduction of their dimensionality to
two dimensions by applying the t-SNE algorithm and using 𝐹𝑠𝑢𝑚 as the composition function.
Figure 5: Document embeddings representation for each discipline after reduction of their dimensionality to
two dimensions by applying the t-SNE algorithm and using 𝐹𝑖𝑛𝑑 as the composition function.
Future work and conclusions.
There is a pressing need to promote the development of annotated datasets to generate automatic
clinical concept detection tools, not only for a single language but for several languages, following
comparable annotation criteria and consistent results across multiple languages. Due to the complexity
and considerable workload associated with the manual corpus construction process of clinical content,
the use of creative solutions such as neural translation and annotation projection strategies might provide
an alternative solution to traditional corpus construction attempts. The results of the MultiCardioNER
task indicate that it is feasible to create multilingual clinical corpora and use them to train and generate
very competitive clinical NER systems with comparable results across several languages.
Moreover, an adaptation of clinical NLP components to specific medical specialties can improve the
quality of the resulting systems for real-world scenarios. Typically clinical NLP application scenarios
or use cases focus on content related to a particular medical discipline, disease or patient type. In this
regard, the MultiCardioNER task also provides useful insights on how to adapt general-purpose clinical
NLP systems to the characteristics of a medical specialty of interest.
We foresee that the results, resources, and strategies generated through the MultiCardioNER task
(both by organizers and participants) might also promote the creation of clinical NLP
resources beyond the three languages covered in this track. The MultiCardioNER silver standard
corpus of predictions for Spanish, English and Italian could also constitute a valuable resource for data
augmentation or corpus construction by manually validating the generated system predictions.
The presented annotation projection strategy obviously relies on the sufficient quality of the used
medical translation systems. Therefore, systematic efforts to evaluate the quality of neural
medical machine translation systems are critical. Initiatives like the Workshop on Machine Translation
(WMT) Biomedical Translation shared task have provided insights on the quality and potential of neural
translation technologies adapted to translate healthcare documents [51, 52].
Acknowledgments
The MultiCardioNER track was funded by Spanish and European projects such as DataTools4Heart
(Grant Agreement No. 101057849), AI4HF (Grant Agreement No. 101080430), BARITONE (Proyec-
tos de Transición Ecológica y Transición Digital 2021. Expediente Nº TED2021-129974B-C21) and
AI4ProfHealth (PID2020-119266RA-I00 MICIU/AEI/10.13039/501100011033).
Google was a proud sponsor of the BioASQ Challenge in 2023. Ovid is also sponsoring this edition
of BioASQ. The twelfth edition of BioASQ is also sponsored by Elsevier. Atypon Systems Inc. is also
sponsoring this edition of BioASQ.
References
[1] A. Tariq, T. Santos, I. Banerjee, Natural language processing for cardiovascular applications, in:
Artificial Intelligence in Cardiothoracic Imaging, Springer, 2022, pp. 231–243.
[2] T. Nagamine, B. Gillette, A. Pakhomov, J. Kahoun, H. Mayer, R. Burghaus, J. Lippert, M. Saxena,
Multiscale classification of heart failure phenotypes by unsupervised clustering of unstructured
electronic medical record data, Scientific reports 10 (2020) 21340.
[3] M. R. Turchioe, A. Volodarskiy, J. Pathak, D. N. Wright, J. E. Tcheng, D. Slotwiner, Systematic
review of current natural language processing methods and applications in cardiology, Heart 108
(2022) 909–916.
[4] M. J. Boonstra, D. Weissenbacher, J. H. Moore, G. Gonzalez-Hernandez, F. W. Asselbergs, Artificial
intelligence: revolutionizing cardiology with large language models, European Heart Journal 45
(2024) 332–345.
[5] R. Vijayakrishnan, S. R. Steinhubl, K. Ng, J. Sun, R. J. Byrd, Z. Daar, B. A. Williams, C. Defilippi,
S. Ebadollahi, W. F. Stewart, Prevalence of heart failure signs and symptoms in a large primary
care population identified through the use of text and data mining of the electronic health record,
Journal of cardiac failure 20 (2014) 459–464.
[6] S. Chae, J. Song, M. Ojo, M. Topaz, Identifying heart failure symptoms and poor self-management
in home healthcare: a natural language processing study, in: Nurses and Midwives in the Digital
Age, IOS Press, 2021, pp. 15–19.
[7] S. Khurshid, C. Reeder, L. X. Harrington, P. Singh, G. Sarma, S. F. Friedman, P. Di Achille, N. Dia-
mant, J. W. Cunningham, A. C. Turner, et al., Cohort design and natural language processing to
reduce bias in electronic health records research, Npj Digital Medicine 5 (2022) 47.
[8] O. V. Patterson, M. S. Freiberg, M. Skanderson, S. J. Fodeh, C. A. Brandt, S. L. DuVall, Unlocking
echocardiogram measurements for heart disease research through natural language processing,
BMC cardiovascular disorders 17 (2017) 1–11.
[9] A. N. Berman, D. W. Biery, C. Ginder, O. L. Hulme, D. Marcusa, O. Leiva, W. Y. Wu, N. Cardin,
J. Hainer, D. L. Bhatt, et al., Natural language processing for the assessment of cardiovascular
disease comorbidities: The cardio-canary comorbidity project, Clinical Cardiology 44 (2021)
1296–1304.
[10] J. R. Brown, I. M. Ricket, R. M. Reeves, R. U. Shah, C. A. Goodrich, G. Gobbel, M. E. Stabler, A. M.
Perkins, F. Minter, K. C. Cox, et al., Information extraction from electronic health records to
predict readmission following acute myocardial infarction: does natural language processing using
clinical notes improve prediction of readmission?, Journal of the American Heart Association 11
(2022) e024198.
[11] X. Zhan, M. Humbert-Droz, P. Mukherjee, O. Gevaert, Structuring clinical text with ai: Old versus
new natural language processing techniques evaluated on eight common cardiovascular diseases,
Patterns 2 (2021).
[12] H. A. Alhakimi, T. E. Magzoub, Applications of natural language processing in cardiology using
text clinical data: A systematic review, Advances in Clinical and Experimental Medicine 10 (2023).
[13] R. Zhang, S. Ma, L. Shanahan, J. Munroe, S. Horn, S. Speedie, Automatic methods to extract New
York Heart Association classification from clinical notes, in: 2017 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), IEEE, 2017, pp. 1296–1299.
[14] R. Zhang, S. Ma, L. Shanahan, J. Munroe, S. Horn, S. Speedie, Discovering and identifying New
York Heart Association classification from electronic health records, BMC medical informatics and
decision making 18 (2018) 5–13.
[15] P. Adejumo, P. Thangaraj, L. S. Dhingra, A. Aminorroaya, X. Zhou, C. Brandt, H. Xu, H. M.
Krumholz, R. Khera, A deep learning approach for automated extraction of functional status and
New York Heart Association class for heart failure patients during clinical encounters, medRxiv
(2024).
[16] P. Singhal, R. Walambe, S. Ramanna, K. Kotecha, Domain adaptation: challenges, methods, datasets,
and applications, IEEE access 11 (2023) 6973–7020.
[17] E. Laparra, A. Mascio, S. Velupillai, T. Miller, A review of recent work in transfer learning and
domain adaptation for natural language processing of electronic health records, Yearbook of
medical informatics 30 (2021) 239–244.
[18] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara,
G. Katsimpras, G. Paliouras, M. Krallinger, Overview of DisTEMIST at BioASQ: Automatic detection
and normalization of diseases from clinical texts: results, methods, evaluation and multilingual
resources (2022).
[19] S. Lima-López, E. Farré-Maduell, L. Gascó, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras,
M. Krallinger, Overview of MedProcNER task on medical procedure detection and entity linking at
BioASQ 2023, in: Working Notes of CLEF 2023, 2023.
[20] S. Lima-López, E. Farré-Maduell, L. Gasco-Sánchez, J. Rodríguez-Miret, M. Krallinger, Overview of
SympTEMIST at BioCreative VIII: Corpus, Guidelines and Evaluation of Systems for the Detection
and Normalization of Symptoms, Signs and Findings from Text, in: Proceedings of the BioCreative
VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 2023.
[21] A. Gonzalez-Agirre, M. Marimon, A. Intxaurrondo, O. Rabal, M. Villegas, M. Krallinger,
PharmaCoNER: Pharmacological substances, compounds and proteins named entity recognition track, in:
Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 2019, pp. 1–10.
[22] A. Miranda-Escalada, E. Farré-Maduell, S. Lima-López, D. Estrada, L. Gascó, M. Krallinger, Mention
detection, normalization & classification of species, pathogens, humans and food in clinical
documents: Overview of livingner shared task and resources, Procesamiento del Lenguaje Natural
(2022).
[23] S. Lima-López, E. Farré-Maduell, A. Miranda-Escalada, V. Brivá-Iglesias, M. Krallinger, Nlp
applied to occupational health: Meddoprof shared task at iberlef 2021 on automatic recognition,
classification and normalization of professions and occupations from medical texts, Procesamiento
del Lenguaje Natural 67 (2021) 243–256. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/
article/view/6393.
[24] S. Lima-López, E. Farré-Maduell, V. Brivá-Escalada, L. Gascó, M. Krallinger, MEDDOPLACE Shared
Task overview: recognition, normalization and classification of locations and patient movement in
clinical texts, Procesamiento del Lenguaje Natural 71 (2023).
[25] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, Brat: a web-based tool for
nlp-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the
European Chapter of the Association for Computational Linguistics, 2012, pp. 102–107.
[26] A. Miranda-Escalada, E. Farré-Maduell, M. Krallinger, Named entity recognition, concept
normalization and clinical coding: Overview of the cantemist track for cancer text mining in spanish,
corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages Evaluation
Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[27] L. Gasco Sánchez, D. Estrada Zavala, E. Farré-Maduell, S. Lima-López, A. Miranda-Escalada,
M. Krallinger, The SocialDisNER shared task on detection of disease mentions in health-relevant
content from social media: methods, evaluation, guidelines and corpora, in: Proceedings of The
Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task,
Association for Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 182–189. URL:
https://aclanthology.org/2022.smm4h-1.48.
[28] R. Jonker, T. Almeida, S. Matos, BIT.UA at MultiCardioNER: Adapting a Multi-head CRF for
Cardiology, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF
Working Notes, 2024.
[29] P. Styll, L. Campillos-Llanos, W. Kusa, A. Hanbury, Cross-Linguistic Disease and Drug Detection
in Cardiology Clinical Texts: Methods and Outcomes, in: G. Faggioli, N. Ferro, P. Galuščáková,
A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[30] A. Aksenova, A. Datseris, S. Vassileva, S. Boytcheva, Transformer-Based Disease and Drug Named
Entity Recognition in Multilingual Clinical Texts: MultiCardioNER challenge, in: G. Faggioli,
N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[31] C. Lee, T. I. Simpson, J. M. Posma, A. D. Lain, Comparative Analyses of Multilingual Drug
Entity Recognition Systems for Clinical Case Reports In Cardiology, in: G. Faggioli, N. Ferro,
P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[32] R. Gonçalves, A. Lamúrias, Team NOVA LINCS @ BIOASQ12 MultiCardioNER Track: Entity
Recognition with Additional Entity Types, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García
Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[33] A. Romano, G. Riccio, M. Postiglione, V. Moscato, Identifying Cardiological Disorders in Spanish
via Data Augmentation and Fine-Tuned Language Models, in: G. Faggioli, N. Ferro, P. Galuščáková,
A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[34] M. D. Danu, V. G. Marica, C. Suciu, L. M. Itu, O. Farri, Multilingual Clinical NER for Diseases and
Medications Recognition in Cardiology Texts using BERT Embeddings, in: G. Faggioli, N. Ferro,
P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[35] R. A. A. Jonker, T. Almeida, R. Antunes, J. R. Almeida, S. Matos, Multi-head CRF classifier for
biomedical multi-class named entity recognition on Spanish clinical notes, Database (2024).
[36] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, in:
International Conference on Learning Representations, 2021. URL:
https://openreview.net/forum?id=XPZIaotutsD.
[37] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing, 2021. arXiv:2111.09543.
[38] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localizing in-domain adaptation
of transformer-based biomedical language models, Journal of Biomedical Informatics 144 (2023)
104431.
[39] L. Lange, H. Adel, J. Strötgen, D. Klakow, Clin-x: pre-trained language models and a study
on cross-task transfer for concept extraction in the clinical domain, Bioinformatics 38 (2022)
3267–3274.
[40] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR
abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[41] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo,
A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained Biomedical Language Models for
Clinical NLP in Spanish, in: Proceedings of the 21st Workshop on Biomedical Language
Processing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL:
https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19.
[42] M. Yasunaga, J. Leskovec, P. Liang, LinkBERT: Pretraining Language Models with Document
Links, in: Association for Computational Linguistics (ACL), 2022.
[43] OpenAI, GPT-3.5 model, https://www.openai.com, 2023.
[44] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient
foundation language models, 2023. arXiv:2302.13971.
[45] G. García Subies, Á. Barbero Jiménez, P. Martínez Fernández, A comparative analysis of spanish
clinical encoder-based models on ner and classification tasks, Journal of the American Medical
Informatics Association (2024) ocae054.
[46] A. J. Tamayo Herrera, D. A. Burgos, A. Gelbukh, Clinical text mining in spanish enhanced by
negation detection and named entity recognition, Computación y Sistemas 27 (2023) 1169–1181.
[47] H. Verma, S. Bergler, N. Tahaei, Comparing and combining some popular ner approaches on
biomedical tasks, arXiv preprint arXiv:2305.19120 (2023).
[48] A. V. Serrano, G. G. Subies, H. M. Zamorano, N. A. Garcia, D. Samy, D. B. Sanchez, A. M. Sandoval,
M. G. Nieto, A. B. Jimenez, Rigoberta: a state-of-the-art language model for spanish, arXiv preprint
arXiv:2205.10233 (2022).
[49] F. Gallego, G. López-García, L. Gasco-Sánchez, M. Krallinger, F. J. Veredas, Clinlinker: Medical entity
linking of clinical concept mentions in spanish, in: International Conference on Computational
Science, Springer, 2024, pp. 266–280.
[50] E. Amigó, A. Ariza-Casabona, V. Fresno, M. A. Martí, Information Theory–based Compositional
Distributional Semantics, Computational Linguistics 48 (2022) 907–948. URL:
https://doi.org/10.1162/coli_a_00454. doi:10.1162/coli_a_00454.
[51] R. Bawden, K. B. Cohen, C. Grozea, A. J. Yepes, M. Kittner, M. Krallinger, N. Mah, A. Neveol,
M. Neves, F. Soares, et al., Findings of the wmt 2019 biomedical translation shared task: Evaluation
for medline abstracts and biomedical terminologies, in: ACL 2019 Fourth Conference on Machine
Translation, Association for Computational Linguistics, 2019, pp. 29–53.
[52] M. Neves, A. J. Yepes, A. Siu, R. Roller, P. Thomas, M. V. Navarro, L. Yeganova, D. Wiemann,
G. M. Di Nunzio, F. Vezzani, et al., Findings of the WMT 2022 biomedical translation shared task:
Monolingual clinical case reports, in: WMT22-Seventh Conference on Machine Translation, 2022,
pp. 694–723.