Multilingual Clinical NER for Diseases and Medications
                         Recognition in Cardiology Texts using BERT Embeddings
                         Manuela Daniela Danu (1,2,*), George Marica (1), Constantin Suciu (1,2), Lucian Mihai Itu (1,2) and
                         Oladimeji Farri (3)
                         (1) Advanta, Siemens SRL, 15 Noiembrie Bvd, 500097 Brasov, Romania
                         (2) Automation and Information Technology, Transilvania University of Brasov, 5 Mihai Viteazul Street, 500174 Brasov, Romania
                         (3) Digital Technology and Innovation, Siemens Healthineers, 755 College Rd E, 08540 Princeton, NJ, United States


                                     Abstract
                                     The rapidly increasing volume of electronic health record (EHR) data underscores a pressing need to unlock
                                     biomedical knowledge from unstructured clinical texts to support advancements in data-driven clinical systems,
                                     including patient diagnosis, disease progression monitoring, treatment effects assessment, prediction of future
                                     clinical events, etc. While contextualized language models have demonstrated impressive performance
                                     improvements for named entity recognition (NER) systems in English corpora, there remains a scarcity of research focused
                                     on clinical texts in low-resource languages. To bridge this gap, our study aims to develop multiple deep contextual
                                     embedding models to enhance clinical NER in the cardiology domain, as part of the BioASQ MultiCardioNER
                                     shared task. We explore the effectiveness of different monolingual and multilingual BERT-based models, trained
                                     on general domain text, for extracting disease and medication mentions from clinical case reports written in
                                     English, Spanish, and Italian. We achieved an F1-score of 77.88% on Spanish Diseases Recognition (SDR), 92.09%
                                     on Spanish Medications Recognition (SMR), 91.74% on English Medications Recognition (EMR), and 88.9% on
                                     Italian Medications Recognition (IMR). These results outperform the mean and median F1 scores in the test
                                     leaderboard across all subtasks, with the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for
                                     SMR, 89.2%/88.96% for EMR, and 82.8%/87.76% for IMR.

                                     Keywords
                                     MultiCardioNER, BioASQ, Cardiology, Named Entity Recognition, NER, unstructured data, BERT, Multilingual,
                                     English, Spanish, Italian


                         1. Introduction
                         With the increasing amount of available electronic health record (EHR) data, clinical natural language
                         processing (NLP) tasks have become significantly important for extracting valuable information from
                         unstructured clinical texts [1]. Named Entity Recognition (NER) is a key NLP task used to identify
                         meaningful entities within these texts, such as anatomical structures, diseases and disorders, signs
                         and symptoms, procedures, and medications [1, 2]. Consequently, this facilitates various data analysis
                         applications, ranging from predicting future clinical events [3] to summarization [4] and relation
                         extraction between entities (e.g., drug-to-drug interactions [5], symptom-disease relationship [6],
                         patient-procedure association [7], etc.).
                            Despite recent advances in deep learning methods for NER [8, 9], extracting structured information
                         from the vast amounts of unstructured and often noisy clinical documents in EHR systems remains
                         challenging due to the highly specialized medical language, which varies considerably across different
                         medical specialties, as well as due to the prevalence of misspellings, abbreviations, and use of synonyms
                         to express clinical concepts [1].
                            While contextualized language models have recently improved the performance of NER systems for
                         English corpora [8, 10, 11], there is a notable lack of research focused on clinical texts in low-resource
                         languages. To address this gap, our study aims to develop multiple deep contextual embedding models
                         for English, Spanish, and Italian to enhance clinical NER in the cardiology domain, as part of the
                         MultiCardioNER shared task [12, 13, 14]. The MultiCardioNER task is part of the twelfth edition of
                         the large-scale biomedical semantic indexing and question answering challenge (BioASQ) [13, 14], a
                         long-standing initiative aiming to advance research by developing methods and tools that leverage
                         the vast amount of online information to meet the needs of biomedical researchers and practitioners.
                         This initiative seeks to provide efficient and rapid access to the continuously expanding resources and
                         knowledge in the biomedical field.

                         CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                         Disclaimer: The concepts and information presented in this paper are based on research results that are not commercially
                         available.
                         * Corresponding author.
                         Email: manuela.voinea@siemens.com (M. D. Danu); george.marica@siemens.com (G. Marica); constantin.suciu@siemens.com
                         (C. Suciu); lucian.itu@siemens.com (L. M. Itu); oladimeji.farri@siemens-healthineers.com (O. Farri)
                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                         CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   MultiCardioNER [12, 13, 14] is a shared task that aims to automatically identify two key clinical
concepts in medical documents pertaining to cardiology, namely diseases and medications. This task
focuses on adapting clinical NER systems to effectively work across multiple languages - primarily
Spanish, English, and Italian - for two different subtasks: (1) diseases recognition in Spanish cardiology
texts, and (2) medications recognition in cardiology texts written in Spanish, English, and Italian. Both
subtasks involve reading and analyzing clinical texts to identify the clinical entities mentioned in the
text and using the BRAT format to mark the starting and ending positions of these entities.
   In this paper, we created four different monolingual models: (1) Spanish Diseases Recognition (SDR),
(2) Spanish Medications Recognition (SMR), (3) English Medications Recognition (EMR), and (4) Italian
Medications Recognition (IMR). Additionally, we developed two multilingual models: one specialized
for Spanish Diseases Recognition (Multi-SDR) and another for Medications Recognition across all three
targeted languages (Multi-MMR). We applied transfer learning techniques by fine-tuning BERT-based
[15] contextual embeddings, originally trained on general domain text in each of the three languages,
for the biomedical domain to extract diseases and medications from clinical reports.


2. Related Work
In clinical and biomedical NER, recent studies have explored various methodologies to enhance performance.
A key model in this domain is multilingual BERT (M-BERT) [15], trained on Wikipedia text in 104
languages, which excels in various tasks without explicit cross-lingual alignment [16], outperforming
models based on cross-lingual embeddings [17].
   [18] improved biomedical NER by incorporating syntactic information, enhancing recognition of
complex entity relationships. [19] focused on de-identifying Spanish medical texts via NER and
entity randomization, achieving high recall rates on radiology reports and MEDDOCAN [20] challenge
data. [9] developed BioELECTRA, a biomedical text encoder using discriminators, which outperformed
several baselines on multiple biomedical NER benchmarks by leveraging ELECTRA’s efficiency and
accuracy in text encoding.
   [21] developed a scalable NER system for large biomedical datasets, emphasizing real-time processing
and high accuracy. [22] focused on pre-trained biomedical language models for clinical NLP in Spanish,
addressing the need for multilingual capabilities in biomedical NER. [23] optimized a bi-encoder for
NER using contrastive learning, introducing dynamic thresholding to improve accuracy, especially for
nested entities, with significant gains on datasets like ACE [24] and GENIA [25]. [26] used a novel
schema with distant supervision to enhance NER accuracy, showing that domain-specific schema can
supplement limited annotated data effectively.
   [27] used ChatGPT [28] for zero-shot clinical entity recognition with prompt engineering, showing
it outperforms GPT-3 [29] but trails behind fine-tuned BioClinicalBERT [10] models. [30] leveraged
transfer learning and asymmetric tri-training, combining labeled and pseudo-labeled data to boost NER
performance across biomedical datasets.
   To advance the development of medical NER systems, the BioASQ challenge proposed multiple
clinical NER tasks to be solved over time, such as automatic detection and normalization of disease
mentions from clinical texts (DisTEMIST) [31] or medical procedure detection and entity linking
(MedProcNER) [32]. Most participating teams employed Transformer-based and large language models
in their approaches.


3. Methods
3.1. Datasets
With a focus on adapting general medical NER systems for diseases and medications across multiple
languages, the MultiCardioNER [12, 33] task leverages several datasets. Specifically, it utilizes a training
collection of 1000 general clinical case reports in Spanish, covering various medical specialties such as
oncology, urology, ophthalmology, dentistry, pediatrics, primary care, allergology, radiology, psychiatry,
and more [33, 31]. These reports were annotated with diseases and medications, resulting in two
distinct corpora, namely DisTEMIST [33, 31, 34] and DrugTEMIST [33]. The DrugTEMIST [33] corpus
was also released in English and Italian. Since the original 1000 clinical case reports belong to the
Spanish Clinical Case Corpus (SPACCC) [35], the multilingual DrugTEMIST [33] dataset was originally
created in Spanish and then transferred into English and Italian using machine translation and lexical
annotation projection. The result of this process was revised and validated by clinical experts who are
native speakers of each language.
   For the domain adaptation part of the task, MultiCardioNER [33] leverages a collection of 508
annotated cardiology clinical case reports (CardioCCC), divided into 258 for development and 250 for
testing. The annotation process followed the same guidelines as the DisTEMIST [36] and DrugTEMIST
[37] corpora, with the medication part also released in Spanish, English and Italian. In addition to the
test set, an auxiliary collection of multilingual clinical case reports, referred to as the background set, is
provided to facilitate the creation of a silver standard corpus and ensure the developed systems can
effectively scale up to larger content collections.
   All datasets were manually annotated by clinical experts using the BRAT annotation tool [38],
following annotation guidelines [36, 37] that were established after several cycles of quality control and
annotation consistency analysis.

3.2. Experiments
In this work, we treated the automatic named entity recognition (NER) of diseases and medications
in clinical case reports as a multi-label token classification task. To accomplish this, we employed
pre-existing BERT models [15] for NER in the general domain for each of the three languages (Spanish,
English, and Italian), as well as a multilingual model, and further fine-tuned them for the biomedical
domain using the MultiCardioNER dataset [33].
   We experimented with the following BERT-based models, specifically trained to perform NER:
    • bert-spanish-cased-finetuned-ner [39]: a Spanish BERT cased model based on BETO [40].
      Originally fine-tuned on the Spanish dataset of the CoNLL-2002 Shared Task [41], BETO was
      further fine-tuned on the Catalan and Basque subsets of the CoNLL-2007 dataset [42], resulting
      in the bert-spanish-cased-finetuned-ner model, which focuses on recognizing persons (PER),
      organizations (ORG), locations (LOC), and miscellaneous (MISC) entities within Spanish text
      documents.
    • bert-base-NER [43]: a BERT cased model fine-tuned on the English version of the standard
      CoNLL-2003 dataset [44]. It was trained to recognize four types of entities, namely locations
      (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).
    • bert-italian-finetuned-ner [45]: an Italian BERT cased model fine-tuned on the WikiANN
      dataset [46], which consists of Wikipedia articles annotated with LOC (location), PER (person),
      and ORG (organisation) tags.
    • bert-base-multilingual-cased-ner-hrl [47]: a named entity recognition model for 10 high-resourced
      languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese
      and Chinese) based on a fine-tuned multilingual cased BERT model. It has been trained
      to recognize three types of entities: locations (LOC), organizations (ORG), and persons (PER).

Figure 1: Overview of the prediction pipeline. As a pre-processing step, the clinical case reports are split
into sentences, and further segmented into word-level tokens. By leveraging the available BRAT annotations,
the word-level tokens are encoded using the BIO format and used to fine-tune BERT-based models on the
MultiCardioNER dataset. The output from the BERT models, also in BIO format, is then post-processed to
comply with BRAT format.

  All these BERT-based models utilize the standard Beginning-Inside-Outside (BIO) format [48] for
tagging entities. This format is crucial as it allows NER to be approached as a multi-label classification
task, where words are labelled B if they represent the beginning of an entity, I if they are inside an
entity, and O if they are outside any entity. This labeling method effectively distinguishes between the
beginning and continuation of an entity, thereby simplifying the task of identifying entity boundaries.
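
As an illustration of the BIO scheme, the following minimal sketch encodes one sentence from character-level entity spans. The regular-expression tokenizer, the helper names, and the example sentence are assumptions made for this illustration; they are not the pre-processing code used in our experiments.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]   # (surface form, start offset, end offset)
Span = Tuple[int, int, str]    # (start offset, end offset, entity label)


def tokenize_with_offsets(text: str) -> List[Token]:
    """Split text into word-level tokens while keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def bio_encode(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Assign a B-/I-/O tag to every token given the annotated spans."""
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity if it lies fully inside the span.
            if tok_start >= span_start and tok_end <= span_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


# Hypothetical sentence with one disease mention covering characters 13..35.
text = "Paciente con insuficiencia cardiaca tratado con furosemida."
spans = [(13, 35, "ENFERMEDAD")]
tokens = tokenize_with_offsets(text)
tags = bio_encode(tokens, spans)
for (word, _, _), tag in zip(tokens, tags):
    print(f"{word}\t{tag}")
```

Here the first token of the mention receives the B- tag and the following token the I- tag, while all remaining tokens are tagged O.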
   Before performing the clinical domain adaptation of the general domain BERT-based models, the
medical reports undergo a pre-processing step which involves splitting them into sentences to ensure a
sequence length of less than or equal to 256. These sentences are then further segmented into word-level
tokens while preserving their start and end offsets with respect to the original report. The word-level
tokens are encoded in BIO format and used to fine-tune BERT-based models on the MultiCardioNER
dataset. The output from the BERT models is then post-processed to comply with BRAT format. Figure 1
provides an overview of the prediction pipeline.
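
The post-processing step can be sketched as follows: consecutive B-/I- tags are merged into character spans, which are then written as BRAT text-bound annotation lines. This is a simplified illustration under our own assumptions (the function name, the handling of an orphan I- tag as the start of a new entity, and the example sentence are hypothetical), not the exact post-processing code of the pipeline.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (surface form, start offset, end offset)


def bio_to_brat(text: str, tokens: List[Token], tags: List[str]) -> List[str]:
    """Merge BIO-tagged word tokens into BRAT text-bound annotation lines."""
    spans: List[list] = []  # each item: [label, start, end]
    current = None
    for (_, start, end), tag in zip(tokens, tags):
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        if label is None:
            current = None                  # O or special tag closes any open entity
        elif tag.startswith("I-") and current is not None and current[0] == label:
            current[2] = end                # extend the currently open entity
        else:
            current = [label, start, end]   # B- tag (or orphan I- tag) opens an entity
            spans.append(current)
    return [f"T{i}\t{label} {start} {end}\t{text[start:end]}"
            for i, (label, start, end) in enumerate(spans, start=1)]


# Hypothetical sentence and model output, for illustration only.
text = "Se inicia tratamiento con furosemida y ácido acetilsalicílico."
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"\w+|[^\w\s]", text)]
tags = ["O", "O", "O", "O", "B-FARMACO", "O", "B-FARMACO", "I-FARMACO", "O"]
ann_lines = bio_to_brat(text, tokens, tags)
print("\n".join(ann_lines))
```

Because the tokens keep their offsets with respect to the original text, the entity text in each annotation line is recovered by slicing the original report rather than by re-joining tokens.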
   Details regarding the label lists used for each subtask, as well as the hyper-parameters configuration
employed in the experiments, are provided in sections 3.2.1 and 3.2.2, respectively.

3.2.1. Subtask 1: Diseases Recognition in Spanish Cardiology Texts
For the subtask aiming to address the recognition of diseases in Spanish cardiology texts, we leveraged
the pre-trained bert-spanish-cased-finetuned-ner and bert-base-multilingual-cased-ner-hrl models and
further fine-tuned them on the MultiCardioNER dataset. Specifically, we employed the DisTEMIST
corpora as the training set for the general clinical domain adaptation part of the task and used the
disease-annotated version of the Spanish CardioCCC clinical cases as the development set to identify the
best-performing models in the cardiology domain, resulting in the Clinical-SDR and MultiClinical-SDR
models. We additionally experimented with fine-tuning these models on the CardioCCC development
set, leading to the creation of the cardiology-specialized Cardio-SDR and MultiCardio-SDR models.
   Following the standard BIO format [48], we defined our label list as follows: B-ENFERMEDAD,
I-ENFERMEDAD, O, [CLS], and [SEP]. B-ENFERMEDAD and I-ENFERMEDAD denote the beginning and
continuation of disease mentions within text sequences, whereas the label O corresponds to word-level
tokens outside any recognized entity. Additionally, the [CLS] token indicates the commencement of
a sentence, while the [SEP] token marks its termination. Notably, the [CLS] token also serves as a
placeholder for the [PAD] token within the text sequences.
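
A minimal sketch of how such a label list could be turned into fixed-length label sequences is given below. The index assigned to each label, the attention mask, and the truncation rule are assumptions of this example, and sub-word alignment is deliberately omitted.

```python
from typing import List, Tuple

LABELS = ["B-ENFERMEDAD", "I-ENFERMEDAD", "O", "[CLS]", "[SEP]"]
LABEL2ID = {label: idx for idx, label in enumerate(LABELS)}  # assumed ordering
MAX_SEQ_LENGTH = 256


def build_label_ids(word_tags: List[str],
                    max_len: int = MAX_SEQ_LENGTH) -> Tuple[List[int], List[int]]:
    """Wrap word-level tags with [CLS]/[SEP] and pad to a fixed length."""
    # Truncate so that the two special labels always fit.
    tags = ["[CLS]"] + word_tags[: max_len - 2] + ["[SEP]"]
    attention_mask = [1] * len(tags) + [0] * (max_len - len(tags))
    # Padded positions reuse the [CLS] label as a placeholder for [PAD].
    tags = tags + ["[CLS]"] * (max_len - len(tags))
    return [LABEL2ID[tag] for tag in tags], attention_mask


label_ids, attention_mask = build_label_ids(["O", "B-ENFERMEDAD", "I-ENFERMEDAD"])
print(label_ids[:6], sum(attention_mask))
```

In this sketch, padded positions are distinguished from the real sentence-initial [CLS] position only through the attention mask.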
   The Spanish Diseases Recognition (SDR) models were fine-tuned on an NVIDIA GeForce RTX 3090
(24GB) GPU for 10 epochs. The hyper-parameter configuration includes a maximum sequence
length of 256, a batch size of 8, and a learning rate of 9e-6. Predictions were generated for both the
test and background sets, but the evaluation exclusively considered the predictions achieved on the
test set. In addition to the test set results, we also reported the development set results to identify any
discrepancies between them and, hence, detect potential overfitting or any issues related to the data
split distributions.
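
For reference, the reported hyper-parameters can be collected in a small configuration object, as sketched below. The class name, field names, and the helper that counts optimizer steps are hypothetical, and the number of training sequences in the usage line is an arbitrary example rather than the size of our training set.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyper-parameters reported for fine-tuning (field names are illustrative)."""
    epochs: int = 10
    max_seq_length: int = 256
    batch_size: int = 8
    learning_rate: float = 9e-6

    def total_steps(self, num_sequences: int) -> int:
        """Optimizer steps for a given number of training sequences,
        assuming one step per batch and no gradient accumulation."""
        return math.ceil(num_sequences / self.batch_size) * self.epochs


config = FineTuneConfig()
# Arbitrary example: 1000 training sequences.
print(config, config.total_steps(1000))
```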

3.2.2. Subtask 2: Multilingual (Spanish, English and Italian) Medications Recognition in
       Cardiology Texts
For the second subtask, which focuses on the recognition of medications in cardiology texts written
in Spanish, English, and Italian, we employed three monolingual pre-trained models
(bert-spanish-cased-finetuned-ner, bert-base-NER, and bert-italian-finetuned-ner), each specialized for one of the three
languages, as well as a multilingual model (bert-base-multilingual-cased-ner-hrl), and subsequently
fine-tuned them on the MultiCardioNER dataset. Therefore, we leveraged the DrugTEMIST corpora in
each of the three languages as the training sets for the general clinical domain adaptation part of the
task and used the medication-annotated version of the CardioCCC clinical cases in Spanish, English,
and Italian as the development sets to identify the best performing models in the cardiology domain,
resulting in the Clinical-SMR, Clinical-EMR, Clinical-IMR, and MultiClinical-MMR models. We again
conducted additional experiments by fine-tuning these models on the CardioCCC development sets,
thereby yielding the cardiology-specialized Cardio-SMR, Cardio-EMR, Cardio-IMR, and
MultiCardio-MMR models. It is worth noting that the multilingual model was trained on an aggregated dataset
encompassing all three languages, but separately evaluated for each language to assess its performance
across different linguistic contexts.
   In accordance with the standard BIO format [48], we defined the label list for this subtask as
B-FARMACO, I-FARMACO, O, [CLS], and [SEP]. The tags B-FARMACO and I-FARMACO denote the
beginning and continuation of medication mentions within text sequences, while the label O marks the
word-level tokens not associated with any recognized entity. Additionally, as in Subtask 1, the [CLS]
token indicates the beginning of a sentence, while the [SEP] token marks its end. In this context, the
[CLS] token also serves as a placeholder for the [PAD] token within the text sequences.
   The Spanish Medications Recognition (SMR), English Medications Recognition (EMR), Italian
Medications Recognition (IMR), and Multilingual Medications Recognition (MMR) models were independently
fine-tuned on an NVIDIA GeForce RTX 3090 (24GB) GPU for 10 epochs. The hyper-parameter
configuration is identical to that employed for Subtask 1 and consists of a maximum sequence length
of 256, a batch size of 8, and a learning rate of 9e-6. Predictions were generated for both the test and
background sets. However, the evaluation exclusively considered the predictions obtained on the test
set. In addition to the test set results, we also reported the development set results to identify any
mismatch between them, which could indicate overfitting or issues related to the data split distributions.


4. Results
In this work, we evaluated the developed systems using a flat evaluation approach [49] by comparing
the automatically generated results with those obtained by domain experts through manual annotation.
The primary focus was on identifying and classifying clinical mentions of diseases and medications
in cardiology reports. The performance metrics employed for flat evaluation include micro-averaged
precision, recall, and F1-score (MiF). These metrics were computed based on the exact matches of the
predicted entities and the annotated ground-truth. Table 1 summarises the evaluation results obtained
on the development and test sets using the officially released evaluation library for the MultiCardioNER
task. In the test set evaluation, we achieved the following F1-scores: 77.88% for Spanish Diseases
Recognition (SDR), 92.09% for Spanish Medications Recognition (SMR), 91.74% for English Medications
Recognition (EMR), and 88.9% for Italian Medications Recognition (IMR).
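
The exact-match, micro-averaged metrics can be sketched as follows. This is an illustrative re-implementation under our own assumptions (entities are represented as (start, end, label) tuples grouped by document, and the toy annotations are invented); it is not the official evaluation library.

```python
from typing import Dict, Set, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, label)


def micro_prf(gold: Dict[str, Set[Entity]],
              pred: Dict[str, Set[Entity]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        gold_entities = gold.get(doc_id, set())
        pred_entities = pred.get(doc_id, set())
        tp += len(gold_entities & pred_entities)   # identical offsets and label
        fp += len(pred_entities - gold_entities)   # predicted but not annotated
        fn += len(gold_entities - pred_entities)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one prediction in doc1 has a wrong end offset, so it counts
# both as a false positive and as a false negative.
gold = {"doc1": {(13, 35, "ENFERMEDAD"), (60, 72, "ENFERMEDAD")},
        "doc2": {(5, 15, "ENFERMEDAD")}}
pred = {"doc1": {(13, 35, "ENFERMEDAD"), (60, 70, "ENFERMEDAD")},
        "doc2": {(5, 15, "ENFERMEDAD")}}
precision, recall, f1 = micro_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under exact matching, a partially overlapping prediction is penalized twice, which is one reason why incomplete disease mentions weigh heavily on the scores.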
    Table 1
    Evaluation results on the development and test sets for the MultiCardioNER task. The best results on
    the test sets are highlighted in bold. The experiments marked with an (*) were conducted after the
    MultiCardioNER evaluation period and are not included in the official leaderboard.
Subtask         Model               Fine-tuning   Dev Precision   Dev Recall   Dev F1-score   Test Precision   Test Recall   Test F1-score
Track1 (ES)     Clinical-SDR        No            0.6674          0.6243       0.6451         0.6758           0.6437        0.6593
Track1 (ES) *   Cardio-SDR          Yes           0.9713          0.9535       0.9623         0.7739           0.7837        0.7788
Track1 (ES) *   MultiClinical-SDR   No            0.6355          0.6118       0.6234         0.6387           0.6268        0.6327
Track1 (ES) *   MultiCardio-SDR     Yes           0.9406          0.9360       0.9383         0.7717           0.7788        0.7753
Track2 (ES)     Clinical-SMR        No            0.9019          0.8753       0.8884         0.8928           0.8778        0.8852
Track2 (ES) *   Cardio-SMR          Yes           0.9804          0.9562       0.9681         0.9289           0.9045        0.9165
Track2 (ES) *   MultiClinical-MMR   No            0.8783          0.8681       0.8732         0.8974           0.8807        0.8890
Track2 (ES) *   MultiCardio-MMR     Yes           0.9790          0.9482       0.9634         0.9341           0.9080        0.9209
Track2 (EN)     Clinical-EMR        No            0.8866          0.8625       0.8744         0.8685           0.8791        0.8738
Track2 (EN) *   Cardio-EMR          Yes           0.9575          0.9155       0.9360         0.9277           0.9018        0.9146
Track2 (EN) *   MultiClinical-MMR   No            0.8833          0.8594       0.8712         0.8920           0.8826        0.8873
Track2 (EN) *   MultiCardio-MMR     Yes           0.9681          0.9550       0.9615         0.9121           0.9227        0.9174
Track2 (IT)     Clinical-IMR        No            0.9122          0.8801       0.8958         0.8891           0.8689        0.8789
Track2 (IT) *   Cardio-IMR          Yes           0.9518          0.9250       0.9382         0.8994           0.8789        0.8890
Track2 (IT) *   MultiClinical-MMR   No            0.8868          0.8603       0.8734         0.8747           0.8378        0.8558
Track2 (IT) *   MultiCardio-MMR     Yes           0.9772          0.9455       0.9611         0.9046           0.8694        0.8867


   These results surpass the mean and median F1 scores in the test leaderboard across all subtasks, with
the mean/median values being: 69.61%/75.66% for SDR, 81.22%/90.18% for SMR, 89.2%/88.96% for EMR,
and 82.8%/87.76% for IMR.
   The experiments marked with an (*) in Table 1 were conducted after the MultiCardioNER evaluation
period and are not included in the official leaderboard. However, these supplementary experiments
provide further insights beyond the primary evaluation results. For instance, fine-tuning on the
CardioCCC development set considerably enhances test performance across all developed systems.
Additionally, employing a multilingual model proves beneficial in certain subtasks, such as Spanish
Medications Recognition (SMR) and English Medications Recognition (EMR), improving the F1-score
from 91.65% (achieved by the next-best-performing model) to 92.09%, and from 91.46% to 91.74%, respectively.
   By comparing the results from the development and test sets, we can assess potential discrepancies
between these data splits and identify issues such as overfitting or distributional disparities. These
insights are crucial for enhancing model robustness and generalization, which are essential for successfully
utilizing the developed systems in real-world clinical scenarios. As illustrated in Table 1, the models
that were not fine-tuned on the CardioCCC development set exhibit similar evaluation metrics on both the
development and test sets. For these models, the development set was solely used to select the
best-performing model across different checkpoints. This consistency suggests that the two data splits
originate from the same distribution. In contrast, the fine-tuned models, which were trained on the
development set, demonstrate a performance gap between the two sets. While some degree of performance
difference is expected due to the model’s exposure to the development data during training, excessively
large gaps suggest overfitting. This is the case of the Spanish Diseases Recognition (SDR) models, where
the F1-score gap between the development and test sets is 18.35 percentage points for Cardio-SDR and
16.3 percentage points for MultiCardio-SDR. For all other fine-tuned models, the F1-score on the
development set is only moderately higher than that computed on the test set, with differences ranging
from 2.14 to 7.44 percentage points according to the F1-scores in Table 1. Although these differences may
indicate some overfitting, they do not reach a severe extent. One plausible explanation for overfitting in these cases could be that the
model is too complex for the limited diversity of cardiology-specific entities present in the development
set. As a result, the model may capture specific patterns from the training data but struggle to generalize
to new data.
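
The development-test gaps discussed above follow directly from the F1-scores in Table 1. The short sketch below reproduces the arithmetic for three of the fine-tuned models; the dictionary layout and function name are our own, and the values are copied from Table 1.

```python
from typing import Dict, Tuple

# (development F1, test F1) copied from Table 1 for three fine-tuned models.
TABLE1_F1: Dict[str, Tuple[float, float]] = {
    "Cardio-SDR": (0.9623, 0.7788),
    "MultiCardio-SDR": (0.9383, 0.7753),
    "MultiCardio-MMR (IT)": (0.9611, 0.8867),
}


def dev_test_gaps(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Development-minus-test F1 gap, expressed in percentage points."""
    return {model: round((dev - test) * 100, 2)
            for model, (dev, test) in scores.items()}


gaps = dev_test_gaps(TABLE1_F1)
for model, gap in gaps.items():
    print(f"{model}: {gap:.2f} percentage points")
```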
Figure 2: Prediction example for the Spanish Diseases Recognition (SDR) subtask, obtained using the best
performing model in terms of F1-score. Green represents correctly identified mentions along with their spans.
Red represents mentions that are not present in the ground-truth but predicted by the model. Yellow refers to
mentions that are incompletely predicted by the model, while orange marks the full mention as present in the
ground-truth.


Figure 3: Prediction example for the Spanish Medications Recognition (SMR) subtask, obtained using the best
performing model in terms of F1-score. Green represents correctly identified mentions along with their spans. In
this particular example, there were no missed, incomplete, or incorrect predictions.


  In addition to this performance analysis, we conducted a qualitative evaluation of the top-performing
models across all subtasks. The qualitative analysis complements the quantitative metrics by showing
how the developed models behave on individual clinical case reports. The examples in Figure 2, Figure 3,
Figure 4, and Figure 5 indicate that the models perform well in identifying medications within clinical
texts across all three targeted languages. However, the Spanish Diseases Recognition (SDR) model exhibits
room for improvement, as it occasionally produces incomplete or incorrect predictions.


Figure 4: Prediction example for the English Medications Recognition (EMR) subtask, obtained using the best
performing model in terms of F1-score. Green represents correctly identified mentions along with their spans. In
this particular example, there were no missed, incomplete, or incorrect predictions.


Figure 5: Prediction example for the Italian Medications Recognition (IMR) subtask, obtained using the best
performing model in terms of F1-score. Green represents correctly identified mentions along with their spans. In
this particular example, there were no missed, incomplete, or incorrect predictions.


5. Conclusions
In this paper, we investigated the utilization of BERT-based contextual embeddings, trained on general
domain texts, for extracting mentions of diseases and medications from clinical case reports written in
English, Spanish, and Italian. We developed four distinct monolingual models: (1) Spanish Diseases
Recognition (SDR), (2) Spanish Medications Recognition (SMR), (3) English Medications Recognition
(EMR), and (4) Italian Medications Recognition (IMR). Additionally, we created two multilingual models:
one specialized for Spanish Diseases Recognition (Multi-SDR) and another for Medications Recognition
across all three targeted languages (Multi-MMR). While the results show promising performance in
identifying medications within clinical texts across all three languages, the models are not flawless.
Some weaknesses arise in diseases recognition, where they occasionally produce incomplete or incorrect
predictions. To address these issues, we aim to explore the capabilities of recent large language models
(LLMs).


Acknowledgments
This work received funding from the European Union’s Horizon Europe research and innovation
programme under Grant Agreement No. 101057849 (DataTools4Heart project).


References
 [1] E. T. Rubel Schneider, J. V. Andrioli de Souza, J. Knafou, L. E. Oliveira, Y. B. Gumiel, L. F. de Oliveira,
     D. Teodoro, E. C. Paraiso, C. Moro, et al., BioBERTpt: a Portuguese neural language model for
     clinical named entity recognition, in: Proceedings of the 3rd Clinical Natural Language Processing
     Workshop, 19 November 2020, 2020.
 [2] S. R. Kundeti, J. Vijayananda, S. Mujjiga, M. Kalyan, Clinical named entity recognition: Challenges
     and opportunities, in: 2016 IEEE International Conference on Big Data (Big Data), IEEE, 2016, pp.
     1937–1945.
 [3] M. Jin, M. T. Bahadori, A. Colak, P. Bhatia, B. Celikkaya, R. Bhakta, S. Senthivel, M. Khalilia,
     D. Navarro, B. Zhang, et al., Improving hospital mortality prediction with medical named entities
     and multimodal learning, arXiv preprint arXiv:1811.12276 (2018).
 [4] G. Riccio, A. Romano, A. Korsun, M. Cirillo, M. Postiglione, V. La Gatta, A. Ferraro, A. Galli,
     V. Moscato, Healthcare data summarization via medical entity recognition and generative AI
     (2023).
 [5] D. Zaikis, I. Vlahavas, Drug-drug interaction classification using attention based neural networks,
     in: 11th Hellenic conference on artificial intelligence, 2020, pp. 34–40.
 [6] M. Abulaish, M. A. Parwez, et al., Disease: A biomedical text analytics system for disease symptom
     extraction and characterization, Journal of Biomedical Informatics 100 (2019) 103324.
 [7] B. Rink, S. Harabagiu, K. Roberts, Automatic extraction of relations between medical concepts in
     clinical texts, Journal of the American Medical Informatics Association 18 (2011) 594–600.
 [8] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language
     representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
 [9] K. R. Kanakarajan, B. Kundumani, M. Sankarasubbu, BioELECTRA: pretrained biomedical text encoder
     using discriminators, in: Proceedings of the 20th Workshop on Biomedical Language Processing,
     2021, pp. 143–154.
[10] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly
     available clinical BERT embeddings, arXiv preprint arXiv:1904.03323 (2019).
[11] F. Li, Y. Jin, W. Liu, B. P. S. Rawat, P. Cai, H. Yu, et al., Fine-tuning bidirectional encoder
     representations from transformers (BERT)-based models on large-scale electronic health record notes: an
     empirical study, JMIR Medical Informatics 7 (2019) e14830.
[12] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz,
     G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger,
     Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation
     of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková,
     A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the
     Evaluation Forum, 2024.
[13] A. Nentidis, A. Krithara, G. Paliouras, M. Krallinger, L. G. Sanchez, S. Lima, E. Farre,
     N. Loukachevitch, V. Davydova, E. Tutubalina, Bioasq at clef2024: The twelfth edition of the large-
     scale biomedical semantic indexing and question answering challenge, in: European Conference
     on Information Retrieval, Springer, 2024, pp. 490–497.
[14] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger,
     N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The
     twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering,
     in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková,
     A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality,
     Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF
     Association (CLEF 2024), 2024.
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
     for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[16] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: A. Korhonen,
     D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for
     Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp.
     4996–5001. URL: https://aclanthology.org/P19-1493. doi:10.18653/v1/P19-1493.
[17] S. Wu, M. Dredze, Beto, bentz, becas: The surprising cross-lingual effectiveness of bert, arXiv
     preprint arXiv:1904.09077 (2019).
[18] Y. Tian, W. Shen, Y. Song, F. Xia, M. He, K. Li, Improving biomedical named entity recognition
     with syntactic information, BMC bioinformatics 21 (2020) 1–17.
[19] I. Pérez-Díez, R. Pérez-Moraga, A. López-Cerdán, J.-M. Salinas-Serrano, M. d. la Iglesia-Vayá, De-
     identifying spanish medical texts-named entity recognition applied to radiology reports, Journal
     of Biomedical Semantics 12 (2021) 1–13.
[20] M. Marimon, A. Gonzalez-Agirre, A. Intxaurrondo, H. Rodríguez, J. A. Lopez Martin, M. Vil-
     legas, M. Krallinger, MEDDOCAN corpus: gold standard annotations for Medical Document
     Anonymization on Spanish clinical case reports, 2020. URL: https://doi.org/10.5281/zenodo.4279323.
     doi:10.5281/zenodo.4279323.
[21] V. Kocaman, D. Talby, Biomedical named entity recognition at scale, in: Pattern Recognition. ICPR
     International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part I,
     Springer, 2021, pp. 635–646.
[22] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo,
     A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained biomedical language models for clinical
     nlp in spanish, in: Proceedings of the 21st Workshop on Biomedical Language Processing, 2022,
     pp. 193–199.
[23] S. Zhang, H. Cheng, J. Gao, H. Poon, Optimizing bi-encoder for named entity recognition via
     contrastive learning, arXiv preprint arXiv:2208.14565 (2022).
[24] C. Walker, L. D. Consortium, ACE 2005 Multilingual Training Corpus, LDC corpora, Linguistic
     Data Consortium, 2005. URL: https://books.google.at/books?id=SbjjuQEACAAJ.
[25] T. Ohta, Y. Tateisi, J.-D. Kim, The genia corpus: an annotated research abstract corpus in molecular
     biology domain, in: Proceedings of the Second International Conference on Human Language
     Technology Research, HLT ’02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002,
     p. 82–86.
[26] A. Khandelwal, A. Kar, V. R. Chikka, K. Karlapalem, Biomedical ner using novel schema and
     distant supervision, in: Proceedings of the 21st Workshop on Biomedical Language Processing,
     2022, pp. 155–160.
[27] Y. Hu, I. Ameer, X. Zuo, X. Peng, Y. Zhou, Z. Li, Y. Li, J. Li, X. Jiang, H. Xu, Zero-shot clinical entity
     recognition using chatgpt, arXiv preprint arXiv:2303.16416 (2023).
[28] OpenAI, Chatgpt, 2022. URL: https://chat.openai.com, accessed: 2024-06-10.
[29] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
     G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
     D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
     C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot
     learners, 2020. arXiv:2005.14165.
[30] M. Bhattacharya, S. Bhat, S. Tripathy, A. Bansal, M. Choudhary, Improving biomedical named
     entity recognition through transfer learning and asymmetric tri-training, Procedia Computer
     Science 218 (2023) 2723–2733.
[31] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara,
     G. Katsimpras, G. Paliouras, M. Krallinger, Overview of distemist at bioasq: Automatic detection
     and normalization of diseases from clinical texts: results, methods, evaluation and multilingual
     resources., in: CLEF (Working Notes), 2022, pp. 179–203.
[32] S. Lima-López, E. Farré-Maduell, L. Gascó, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras,
     M. Krallinger, Overview of medprocner task on medical procedure detection and entity linking at
     bioasq 2023., in: CLEF (Working Notes), 2023, pp. 1–18.
[33] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Krallinger, MultiCardioNER Corpus:
     Multilingual Adaptation of Clinical NER Systems to the Cardiology Domain, 2024. URL: https:
     //doi.org/10.5281/zenodo.11368861. doi:10.5281/zenodo.11368861.
[34] A. Miranda-Escalada, E. Farré, L. Gasco, S. Lima, M. Krallinger, DisTEMIST corpus: detection and
     normalization of disease mentions in spanish clinical cases, 2023. URL: https://doi.org/10.5281/
     zenodo.7614764. doi:10.5281/zenodo.7614764.
[35] A. Intxaurrondo, M. Krallinger, Spaccc, 2019. URL: https://doi.org/10.5281/zenodo.2560316. doi:10.
     5281/zenodo.2560316.
[36] E. Farré-Maduell, L. Gascó, S. Lima, A. Miranda-Escalada, M. Krallinger, DisTEMIST Guidelines:
     detection and normalization of disease mentions in spanish clinical cases, 2022. URL: https://doi.
     org/10.5281/zenodo.6477407. doi:10.5281/zenodo.6477407.
[37] S. Lima-López, E. Farré-Maduell, M. Krallinger, DrugTEMIST Guidelines: Annotation of Medica-
     tion in Medical Documents, 2024. URL: https://doi.org/10.5281/zenodo.11065433. doi:10.5281/
     zenodo.11065433.
[38] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, Brat: a web-based tool for
     nlp-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the
     European Chapter of the Association for Computational Linguistics, 2012, pp. 102–107.
[39] M. Romero, bert-spanish-cased-finetuned-ner, 2020. URL: https://huggingface.co/mrm8488/
     bert-spanish-cased-finetuned-ner, accessed: 2024-06-07.
[40] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained bert model and
     evaluation data, in: PML4DC at ICLR 2020, 2020.
[41] E. F. Tjong Kim Sang, Introduction to the CoNLL-2002 shared task: Language-independent named
     entity recognition, in: COLING-02: The 6th Conference on Natural Language Learning 2002
     (CoNLL-2002), 2002. URL: https://aclanthology.org/W02-2024.
[42] J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, D. Yuret, The conll 2007 shared task
     on dependency parsing, in: Proceedings of the 2007 Joint Conference on Empirical Methods in
     Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL),
     2007, pp. 915–932.
[43] D. S. Lim, bert-base-ner, 2020. URL: https://huggingface.co/dslim/bert-base-NER, accessed: 2024-
     06-07.
[44] E. F. Tjong Kim Sang, F. De Meulder, Introduction to the CoNLL-2003 shared task: Language-
     independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Lan-
     guage Learning at HLT-NAACL 2003, 2003, pp. 142–147. URL: https://www.aclweb.org/anthology/
     W03-0419.
[45] N. Procopio, bert-italian-finetuned-ner, 2023. URL: https://huggingface.co/nickprock/
     bert-italian-finetuned-ner, accessed: 2024-06-07.
[46] X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, H. Ji, Cross-lingual name tagging and linking for
     282 languages, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the
     Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational
     Linguistics, Vancouver, Canada, 2017, pp. 1946–1958. URL: https://aclanthology.org/P17-1178.
     doi:10.18653/v1/P17-1178.
[47] D. Adelani, bert-base-multilingual-cased-ner-hrl, 2021. URL: https://huggingface.co/Davlan/
     bert-base-multilingual-cased-ner-hrl, accessed: 2024-06-10.
[48] L. A. Ramshaw, M. P. Marcus, Text chunking using transformation-based learning, in: Natural
     language processing using very large corpora, Springer, 1999, pp. 157–176.
[49] A. Kosmopoulos, I. Partalas, E. Gaussier, G. Paliouras, I. Androutsopoulos, Evaluation measures
     for hierarchical classification: a unified view and novel approaches, Data Mining and Knowledge
     Discovery 29 (2015) 820–865.