Comparative Analyses of Multilingual Drug Entity Recognition Systems for Clinical Case Reports in Cardiology
Notebook for the ICUE@MultiCardioNER submission at CLEF 2024

Chaeeun Lee1, T. Ian Simpson1, Joram M. Posma2 and Antoine D. Lain2,*
1 School of Informatics, University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, UK
2 Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion, and Reproduction, Faculty of Medicine, Imperial College London, London W12 0NN, United Kingdom

Abstract
Performance disparities exist in Named Entity Recognition (NER) systems across languages due to variations in available human-annotated data. We participated in the MultiDrug subtask of MultiCardioNER, a shared task focusing on multilingual NER for cardiology, to compare the effectiveness of fine-tuning BERT-based monolingual and multilingual language models and of prompting Large Language Models (LLMs) for drug entity recognition across multiple languages. Our findings demonstrate that monolingual BERT models pretrained on biomedical corpora generally outperform their multilingual counterparts. However, for languages lacking access to a broader range of pretrained models, combining the translation capability of LLMs [1, 2, 3, 4] with the best-performing pretrained monolingual BERT model yielded superior results. This approach effectively reduces the resource disparity while leveraging the domain-specific knowledge captured by the monolingual BERT model. Our best systems in the MultiCardioNER track yielded F1-scores of 0.9277 for Spanish, 0.9107 for English, and 0.8776 for Italian. We highlight the comparative advantages of domain-specific fine-tuning and LLM-powered language translation for multilingual drug NER.

Keywords
Natural Language Processing, Multilingual, Named Entity Recognition, Cardiology, BERT

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
chaeeun.lee@ed.ac.uk (C. Lee); ian.simpson@ed.ac.uk (T. I. Simpson); jmp111@ic.ac.uk (J. M. Posma); a.lain@imperial.ac.uk (A. D. Lain)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
Named entity recognition (NER) is one of the central tasks in natural language processing (NLP), particularly in specialised domains such as biomedicine [5, 6, 7, 8]. Accurately identifying specific types of entities, such as diseases and medications, in text is crucial for extracting relevant information from healthcare-related text data. However, clinical NER systems often rely on the availability and quality of human-annotated resources, which can vary significantly across languages.
Recognising the challenges in clinical entity recognition across different languages, the MultiCardioNER shared task [9] focuses on the recognition of disease and medication mentions in cardiological clinical case documents in English, Spanish, and Italian. As part of this initiative, we participated in the MultiDrug subtask, specifically targeting mentions of medications. Our participation was motivated by the goal of adapting and comparatively analysing machine-learning-based NER systems in the cardiology domain. Our contributions are outlined as follows:
1. Fine-tuning Monolingual Language Models: We explored fine-tuning BERT-based monolingual language models for drug NER in each target language individually.
2. Multilingual Model Capabilities: We explored fine-tuning multilingual models on combined datasets of Spanish, English, and Italian and compared the results with the monolingual approach.
3. Large Language Model Integration: We developed a drug NER system using a generative Large Language Model (LLM), providing insights into the potential use of LLMs in tasks that conventionally rely on fine-tuning approaches.
4. LLM as Translation Module: We utilised the translation capability of an LLM to enhance the cross-lingual applicability of the best-performing monolingual models. This approach allowed us to test the effectiveness of LLMs in bridging the gaps among different languages in clinical entity recognition, providing a way for monolingual systems to be applied to multilingual tasks.

2. Data Description
For the MultiDrug subtask [9], the training and development datasets were drawn from two primary sources: the DisTEMIST [10, 11] and DrugTEMIST corpora, and the CardioCCC dataset.

2.1. DisTEMIST and DrugTEMIST Corpora
The DisTEMIST [10] and newly introduced DrugTEMIST [9] corpora constituted the training datasets, comprising 1,000 Spanish clinical case documents across various medical fields such as oncology, pediatrics, and psychiatry, among others. The source documents were drawn from the SPACCC [12] corpus. The original Spanish documents were manually annotated by clinical experts [13] for medication mentions and transferred into English and Italian, ultimately forming a multilingual dataset.

2.2. CardioCCC
As a domain-specific dataset for drug NER, the CardioCCC dataset comprises 508 cardiology clinical case reports. The original Spanish documents were annotated under the same guidelines used for the DisTEMIST [10]/DrugTEMIST datasets, and likewise transferred to English and Italian. The dataset was divided into development and test splits of 258 and 250 documents, respectively.

2.3. Conversion to BIO2 Format for Fine-Tuning
For systems based on fine-tuning BERT-based language models, we transformed the datasets into BIO2 (Begin, Inside, Outside) format [14]. After tokenisation, the original annotations indicating the start and end offsets of drug mentions were converted so that each token was labelled with one of the following tags:
• B (Begin): start of a named entity.
• I (Inside): tokens within the entity.
• O (Outside): tokens that are not part of any entity.
BIO2 format is particularly effective for encoder-based language models like BERT [15], which predict a classification for each token individually [16]. This allows for accurate identification of the boundaries of each drug entity.
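To make the conversion concrete, the following is a minimal sketch of mapping character-offset annotations to BIO2 token labels. It assumes whitespace tokenisation and annotations given as (start, end) character offsets; our actual pipeline used each model's own subword tokeniser, so this is illustrative rather than a reproduction of our code.

```python
import re

def to_bio2(text, spans):
    """Convert character-offset drug annotations to BIO2 token labels.

    text:  a clinical case document.
    spans: list of (start, end) character offsets of drug mentions.
    Returns parallel lists of tokens and BIO2 tags.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):  # whitespace tokenisation
        tok_start, tok_end = match.start(), match.end()
        tag = "O"
        for ent_start, ent_end in spans:
            if tok_start >= ent_start and tok_end <= ent_end:
                # First token of an entity gets B; later tokens get I.
                tag = "B" if tok_start == ent_start else "I"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# Hypothetical example sentence with two single-token drug mentions.
text = "tratamiento con hidroclorotiazida y olmesartan"
spans = [(16, 33), (36, 46)]
print(list(zip(*to_bio2(text, spans))))
# [('tratamiento', 'O'), ('con', 'O'), ('hidroclorotiazida', 'B'),
#  ('y', 'O'), ('olmesartan', 'B')]
```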
3. System Description
Our strategy for the MultiDrug subtask was structured around two key dimensions: assessing the comparative effectiveness of fine-tuning encoder-based language models versus prompting a generative LLM for NER, and exploring different multilingual processing approaches. For the latter, we compared the results from a multilingual BERT model with those achieved by translating NER results from one language to others.

Table 1
Overview of submitted systems. Post-processing rules were applied to systems marked with "+ pp".

System   | Fine-tuned model, Spanish (ES)         | Fine-tuned model, English (EN)     | Fine-tuned model, Italian (IT)        | LLM Translation
System 1 | bert-base-multilingual-cased [15] + pp | bert-base-multilingual-cased + pp  | bert-base-multilingual-cased + pp     | -
System 2 | bsc-bio-ehr-es [17] + pp               | scibert_scivocab_cased [18] + pp   | bert-base-italian-xxl-cased [19] + pp | -
System 3 | bsc-bio-ehr-es                         | scibert_scivocab_cased             | bert-base-italian-xxl-cased           | -
System 4 | bsc-bio-ehr-es                         | scibert_scivocab_cased             | -                                     | ES → IT
System 5 | bsc-bio-ehr-es                         | -                                  | -                                     | ES → EN, IT

3.1. Fine-tuning BERT-based models
Our main strategy involved fine-tuning BERT-based language models [15, 18, 20] specifically for drug NER. We explored a range of models, testing both general-purpose and domain-specific pretrained language models of various parameter sizes. Only the training set from the DrugTEMIST corpora was used for fine-tuning. We then selected the best-performing models based on their performance on the development set derived from the CardioCCC corpus.
As a baseline approach, we utilised pretrained monolingual BERT-based models in Spanish, English, and Italian. This method ensures that each model is attuned to the linguistic nuances of a specific language, but has limitations in addressing resource disparities across languages [20]. Each of these models was trained on the respective monolingual training set. As an alternative approach, we fine-tuned a multilingual BERT model on a combined training set of all three languages.

3.1.1. Multilingual BERT
In System 1, in view of the multilingual nature of the given dataset, we opted for a pretrained multilingual BERT-based language model, specifically "google-bert/bert-base-multilingual-cased" [15]. To enhance its performance for drug entity recognition, we further fine-tuned the model on the BIO2-formatted English, Italian, and Spanish data provided by the organisers. Each document was split into individual sentences, and a batch size of 32, a learning rate of 8e-5, a weight decay of 1e-5, and 5 epochs were employed during fine-tuning. Hyperparameter optimisation was not performed.

3.1.2. Monolingual BERT
Given the limited range of available multilingual BERT models, particularly the lack of biomedical domain-specific pretrained models, we investigated the effectiveness of fine-tuning independent monolingual models in Systems 2 and 3. This approach aimed to improve performance by leveraging language-specific pretrained biomedical models. However, the availability of pretrained models differed across languages. While a wide range of options was available for English and Spanish, only a limited range of pretrained models was available for Italian. Consequently, for Italian, we opted for "dbmdz/bert-base-italian-xxl-cased" [19]. To evaluate the efficacy of both general and domain-specific models available for the other two languages, we used the SeqEval library. This resulted in selecting "allenai/scibert_scivocab_cased" [18] for English and "PlanTL-GOB-ES/bsc-bio-ehr-es" [17] for Spanish. Hyperparameter search was not performed. Following the approach used for multilingual BERT, we applied a sentence-level split of the data, a batch size of 32, a learning rate of 8e-5, a weight decay of 1e-5, and 5 epochs for all monolingual models.
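For illustration, below is a minimal fine-tuning sketch using the Hugging Face Transformers Trainer with the hyperparameters reported above (sentence-split input, batch size 32, learning rate 8e-5, weight decay 1e-5, 5 epochs). The `build_trainer` helper and its `bio2_dataset` argument are assumptions for exposition (a DatasetDict of sentence-split examples with subword-aligned label ids), not our exact training script.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B", "I"]  # BIO2 tags for the single FARMACO entity type

def build_trainer(bio2_dataset, model_name="PlanTL-GOB-ES/bsc-bio-ehr-es"):
    """Assemble a token-classification Trainer for drug NER.

    bio2_dataset: DatasetDict with 'train' and 'dev' splits whose examples
    already contain input_ids and subword-aligned 'labels' (pad/special
    positions set to -100 so they are ignored by the loss).
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(LABELS))
    args = TrainingArguments(
        output_dir="drug-ner",
        per_device_train_batch_size=32,  # batch size 32
        learning_rate=8e-5,              # learning rate 8e-5
        weight_decay=1e-5,               # weight decay 1e-5
        num_train_epochs=5,              # 5 epochs, no hyperparameter search
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=bio2_dataset["train"],
        eval_dataset=bio2_dataset["dev"],
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
```

Model selection then compares entity-level precision, recall, and F1 on the CardioCCC development set, which the SeqEval library computes directly from predicted and gold BIO2 tag sequences.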
3.2. Large Language Model Integration
In light of recent advancements in general-purpose LLMs, we explored their potential in biomedical multilingual NER. We employed two distinct strategies. The first strategy involved directly using the LLM for NER, bypassing the fine-tuning process. This approach aimed to assess the effectiveness of prompting an LLM for multilingual NER tasks without domain-specific adaptation. The second strategy leveraged the best-performing monolingual BERT-based model for NER, while utilising an LLM as a translation module to convert the NER results in a given language into the target language. This approach investigated the potential of LLMs to bridge the resource gaps among different languages while capitalising on the domain-specific knowledge captured by pretrained monolingual BERT models.

3.2.1. Drug NER with LLMs
In addition to fine-tuning BERT-based models for drug NER, we also experimented with directly using a generative LLM for the NER task. This approach involved prompting the LLM to produce drug annotations from clinical case reports without the conventional fine-tuning process. While this method takes advantage of the contextual understanding and embedded domain knowledge of LLMs [21, 22], existing literature recognises that LLMs often fall short of comparable performance on specific non-generative tasks such as NER [23].
Figure 1 illustrates an example of the prompts used for LLM drug entity recognition. Our experiments were conducted in a zero-shot setting on the English development set, where we supplied the complete clinical text and tasked the LLM with identifying all drug mentions. We included an instruction in the prompt to output the predictions in JSON format for efficient post-processing.
We experimented with three LLMs: Meta-Llama-2-7B, Meta-Llama-3-8B, and gpt-3.5-turbo. Llama-2-7B produced results that were significantly lower than the average scores of the fine-tuned models, with a precision of 0.6689, recall of 0.1964, and F1 score of 0.3037. This outcome was expected given the limited parameter size and the zero-shot setting. On the other hand, GPT-3.5 and Llama-3-8B demonstrated performance comparable to the average results of the fine-tuned models (avg. precision: 0.8373; avg. recall: 0.8779; avg. F1: 0.8564). Comparing the two LLMs, GPT-3.5 achieved better recall (precision: 0.8236; recall: 0.8538; F1: 0.8384), while Llama-3-8B showed higher precision and F1 values (precision: 0.8767; recall: 0.8303; F1: 0.8529). Llama-3-8B achieved higher precision (0.8767) than the average precision of the fine-tuned models (0.8373), which is noteworthy given that it has significantly fewer parameters than GPT-3.5 and was used in a zero-shot setting. Further studies with a broader range of LLMs, as well as exploration of few-shot settings and additional prompting methods, will help better understand the potential of LLMs in drug NER.

[
  {"role": "system", "content": "Your job is to review a clinical note that potentially contains mentions of drug names."},
  {"role": "user", "content": "Find all mentions of drug names in the following clinical note. Output your response in JSON format with keys 'drug 1', 'drug 2', and so on. Clinical Note: ANAMNESIS 46-year-old Spanish woman. Married, with an 18-year-old daughter. Good family support. ..."}
]

Figure 1 – Example LLM Prompt for NER
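A minimal sketch of this zero-shot setup is shown below, using the OpenAI chat-completions API with the prompt structure of Figure 1. Because the challenge scoring requires start/end offsets, each predicted surface form must be mapped back to positions in the source text; the `locate_spans` helper is a hypothetical illustration of that step, not necessarily the procedure we used.

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_drugs(note: str) -> list[str]:
    """Zero-shot drug NER by prompting (prompt structure as in Figure 1)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Your job is to review a clinical "
             "note that potentially contains mentions of drug names."},
            {"role": "user", "content": "Find all mentions of drug names in "
             "the following clinical note. Output your response in JSON "
             "format with keys 'drug 1', 'drug 2', and so on. "
             "Clinical Note: " + note},
        ],
    )
    # Assumes the model complied with the JSON-output instruction.
    return list(json.loads(response.choices[0].message.content).values())

def locate_spans(note: str, drugs: list[str]) -> list[tuple[int, int]]:
    """Map each predicted drug string back to (start, end) character offsets."""
    spans, lowered = [], note.lower()
    for drug in drugs:
        start = lowered.find(drug.lower())
        while start != -1:
            spans.append((start, start + len(drug)))
            start = lowered.find(drug.lower(), start + 1)
    return spans
```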
3.2.2. Entity Translation
In addition to comparing systems that used three monolingual models against a single multilingual model, we explored the use of an LLM as a translation module to enable monolingual models to perform multilingual NER tasks. Figure 2 shows an example of the prompts used for the LLM-powered translation. In System 4, we generated predictions for the Italian test set by translating the predictions from the best-performing Spanish monolingual model into Italian using GPT-3.5. For the English test set, we retained the predictions from the English monolingual model. In System 5, predictions for both the English and Italian test sets were derived from the Spanish predictions and translated into the respective languages using GPT-3.5. This approach tests the feasibility of using an LLM-based translation system for multilingual NER tasks, especially when there is a disparity in available resources for each language, as an alternative to directly applying an LLM for NER [1, 2, 3, 4].

[
  {"role": "system", "content": "Your job is to review a clinical note that potentially contains mentions of drug names."},
  {"role": "user", "content": "I have Spanish drug names 'fluoxetina', 'clonazepam'. Find the corresponding drugs in English in the following clinical note. Output your response in JSON format, where the keys are the given Spanish drug names ('fluoxetina', 'clonazepam'), and the values are the corresponding drug names in English found in the clinical note. Note that for every Spanish drug name, there is always at least one mention of the corresponding drug in English. Clinical Note: Francisca Valero, a 33-year-old stock market analyst, married with two children, was brought to the emergency department (ED) after 10 days of what her husband described as ..."}
]

Figure 2 – Example LLM Prompt for Translation
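The translation step can be wired up analogously, as in the sketch below: the prompt mirrors Figure 2, and the returned Spanish-to-English mapping is then used to locate the translated mentions (and hence their offsets) in the target-language document, for example with a helper like `locate_spans` above. As before, this is a hedged illustration of the approach rather than our exact code.

```python
import json
from openai import OpenAI

client = OpenAI()

def translate_entities(es_drugs: list[str], target_note: str) -> dict[str, str]:
    """Map Spanish drug predictions to their surface forms in the
    target-language note (prompt structure as in Figure 2)."""
    names = ", ".join(f"'{d}'" for d in es_drugs)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Your job is to review a clinical "
             "note that potentially contains mentions of drug names."},
            {"role": "user", "content": f"I have Spanish drug names {names}. "
             "Find the corresponding drugs in English in the following "
             "clinical note. Output your response in JSON format, where the "
             "keys are the given Spanish drug names and the values are the "
             "corresponding drug names in English found in the clinical "
             "note. Clinical Note: " + target_note},
        ],
    )
    return json.loads(response.choices[0].message.content)
```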
4. Results
We submitted five drug entity recognition systems for all three languages (English, Italian, and Spanish). Details of these systems and the underlying BERT models are provided in Table 1. An example of the prediction format is presented in Figure 3, adhering to the formatting guidelines set by the organisers. For each system, three distinct files were generated, one per language. These files included the following fields: filename, label, start span, end span, and text. However, during the system evaluation, only the filename, start span, and end span are used for comparison with the ground-truth labels.

filename                      | label   | start_span | end_span | text
casos_clinicos_cardiologia132 | FARMACO | 333        | 350      | hidroclorotiazida
casos_clinicos_cardiologia132 | FARMACO | 359        | 369      | olmersatan
casos_clinicos_cardiologia132 | FARMACO | 377        | 390      | atorvastatina
casos_clinicos_cardiologia132 | FARMACO | 399        | 408      | omeprazol
casos_clinicos_cardiologia132 | FARMACO | 2862       | 2875     | noradrenalina
casos_clinicos_cardiologia132 | FARMACO | 2894       | 2904     | dobutamina

Figure 3 – Example of the predictions on the test set

System selection for the final submission was based on performance on the development set (Table 2). The precision, recall, and F1 metrics are calculated based on strict matching with the ground truth: an exact match between the predicted and ground-truth intervals is required for a true positive. The relaxed F1 score is computed with a more lenient criterion: a true positive is counted if the start of the prediction falls between or coincides with the start and end offsets of a ground-truth span, or if a ground-truth start falls between or coincides with the start and end offsets of a prediction. This allows for some degree of imprecision in the predicted intervals while still considering them correct. The relaxed F1 aims to characterise the model predictions and to determine whether post-processing rules can be found to reduce the difference between the strict and relaxed scores.
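Stated as code, the two matching criteria might look like the sketch below, where spans are (start, end) character-offset pairs; this is our reading of the criterion described above, not the organisers' official scorer.

```python
def strict_match(pred: tuple[int, int], gold: tuple[int, int]) -> bool:
    """True positive only on exact boundary agreement."""
    return pred == gold

def relaxed_match(pred: tuple[int, int], gold: tuple[int, int]) -> bool:
    """True positive if either span's start falls within the other span."""
    return gold[0] <= pred[0] <= gold[1] or pred[0] <= gold[0] <= pred[1]
```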
• Spanish: Our monolingual BERT model achieved the top recall and F1 score (0.9277) among all participant submissions (mean F1: 0.6373; median F1: 0.8502).
• English: While our monolingual BERT model showed the highest precision, its F1 score (0.9107) was marginally lower than that of the best overall system (best F1: 0.9223; mean F1: 0.7101; median F1: 0.8768).
• Italian: For Italian, where model selection may be limited, we combined LLM translation with our Spanish monolingual system. This combined approach achieved the highest precision, while our monolingual Italian system showed the highest recall. Ultimately, the F1 score of our best Italian system (0.8776) was somewhat lower than that of the top-ranked submission (best F1: 0.8842; mean F1: 0.6506; median F1: 0.8421).

These results highlight the effectiveness of fine-tuning monolingual BERT models, particularly for high-resource languages with a large amount of available domain-specific training data. The combination of LLM translation and monolingual models from other languages shows promise for low-resource languages (i.e., Italian) and warrants further exploration. The results of our five submissions on the test set can be found in Table 3, alongside the best scores for precision, recall, and F1, and the mean and median F1.

Table 2
Development set performance for model selection. Relaxed F1 is computed based on the overlap between the ground-truth and predicted spans. Post-processing rules were applied to systems marked with "+ pp". Values in bold represent the highest value for a given metric and a given language.

Language | Model                                    | Precision | Recall | F1     | Relaxed F1
English  | BioBERT cased                            | 0.8327    | 0.8486 | 0.8406 | 0.9404
English  | SciBERT cased                            | 0.8617    | 0.8717 | 0.8667 | 0.9404
English  | SciBERT cased + pp                       | 0.9261    | 0.8992 | 0.9125 | -
English  | BERT base cased                          | 0.8140    | 0.8371 | 0.8254 | 0.9156
English  | BERT multilingual base model cased       | 0.7624    | 0.8884 | 0.8206 | 0.9041
English  | BERT multilingual base model cased + pp  | 0.8271    | 0.9227 | 0.8723 | -
Spanish  | bsc-bio-ehr-es cased                     | 0.7500    | 0.7665 | 0.7678 | 0.9680
Spanish  | bsc-bio-ehr-es cased + pp                | 0.9395    | 0.9406 | 0.9401 | -
Spanish  | bert-base-spanish-wwm-uncased            | 0.7004    | 0.7422 | 0.7207 | 0.9275
Spanish  | BERT multilingual base model cased       | 0.7785    | 0.9016 | 0.8355 | 0.9153
Spanish  | BERT multilingual base model cased + pp  | 0.8370    | 0.9291 | 0.8807 | -
Italian  | bert-base-italian-xxl-cased              | 0.7969    | 0.8816 | 0.8371 | 0.9227
Italian  | bert-base-italian-xxl-cased + pp         | 0.8665    | 0.9141 | 0.8897 | -
Italian  | BERT multilingual base model cased       | 0.7805    | 0.8778 | 0.8263 | 0.9067
Italian  | BERT multilingual base model cased + pp  | 0.8453    | 0.9091 | 0.8751 | -
All      | BERT multilingual base model cased       | 0.7742    | 0.8894 | 0.8278 | 0.9242
All      | BERT multilingual base model cased + pp  | 0.8362    | 0.9203 | 0.8762 | -

Table 3
Test set results communicated by the challenge organisers. Values in bold represent the highest value obtained across all submissions. Underlined values mark the second-best performance, comparing our submissions against the best submission across all participants. The mean and median are calculated from all the participants of the challenge.

System            | Spanish P / R / F1       | English P / R / F1       | Italian P / R / F1
System 1          | 0.8287 / 0.9348 / 0.8786 | 0.8314 / 0.9343 / 0.8799 | 0.8139 / 0.9114 / 0.8487
System 2          | 0.9146 / 0.9412 / 0.9277 | 0.9086 / 0.9128 / 0.9107 | 0.8186 / 0.9000 / 0.8574
System 3          | 0.8777 / 0.9272 / 0.9018 | 0.8734 / 0.8977 / 0.8854 | 0.7879 / 0.8894 / 0.8356
System 4          | 0.9146 / 0.9412 / 0.9277 | 0.9086 / 0.9128 / 0.9107 | 0.9114 / 0.8461 / 0.8776
System 5          | 0.9146 / 0.9412 / 0.9277 | 0.8767 / 0.8635 / 0.8700 | 0.9114 / 0.8461 / 0.8776
Best submission   | 0.9242 / 0.9412 / 0.9277 | 0.9086 / 0.9477 / 0.9223 | 0.9114 / 0.9000 / 0.8842
Mean submission   | - / - / 0.6373           | - / - / 0.7101           | - / - / 0.6506
Median submission | - / - / 0.8502           | - / - / 0.8768           | - / - / 0.8421

4.1. Error analysis
While the relaxed F1 scores in Table 2 are encouraging, we analysed the entities that overlapped only partially with the gold-standard annotations, as they suggest a mismatch between system predictions and the human-annotated ground truth. We observed that the annotation guidelines provided to the human annotators allowed consecutive drug entities separated by specific delimiters, such as '/', 'and' (corresponding to 'y' in Spanish and 'e' in Italian), 'of' (corresponding to 'de' in Spanish and 'di' in Italian), and '+', to be included in a single annotation. This specific format caused performance degradation for our system. Interestingly, combining entities separated by '/' and 'of' (and their Spanish and Italian equivalents) improved performance, while the impact of the other delimiters was negligible. Additionally, we implemented a rule to include the entire span of a multi-word entity even if our model predicted only a subset of its words. This adjustment aimed to further reduce the discrepancy between predicted and gold-standard annotations.
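A minimal sketch of these two post-processing rules is given below. The span representation ((start, end) offsets over the raw text) and the exact delimiter set are assumptions for illustration; the rules we actually applied were adapted per language.

```python
MERGE_DELIMS = {"/", "of", "de", "di"}  # delimiters whose merging helped

def merge_delimited(spans, text):
    """Rule 1: fuse consecutive predicted entities separated only by a
    delimiter such as '/' or 'of' (with Spanish/Italian equivalents)."""
    if not spans:
        return []
    spans = sorted(spans)
    merged = [spans[0]]
    for start, end in spans[1:]:
        prev_start, prev_end = merged[-1]
        if text[prev_end:start].strip() in MERGE_DELIMS:
            merged[-1] = (prev_start, end)  # combine into one annotation
        else:
            merged.append((start, end))
    return merged

def expand_to_word(span, text):
    """Rule 2: grow a partial prediction outwards to full word boundaries."""
    start, end = span
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return (start, end)
```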
It is worth noting that our strategy did not involve training on the validation set. Only the training split of the dataset, derived from the DrugTEMIST dataset, was used for fine-tuning, while the development and test sets were obtained from CardioCCC. There was no significant difference between the models' performance on the validation and test sets, suggesting that our systems likely did not overfit.
Our analysis highlights the importance of considering relaxed F1 scores when evaluating NER systems. It also suggests that incorporating post-processing rules specific to the annotation guidelines employed can improve performance, particularly when dealing with specific entity-formatting conventions.

5. Conclusion
In this study, we conducted a comparative analysis of multilingual drug NER in English, Italian, and Spanish. We evaluated the effectiveness of four distinct approaches: fine-tuning multilingual BERT, fine-tuning monolingual BERT models for each language, zero-shot LLM prompting, and a hybrid method that combines monolingual BERT with an LLM as a translation module. Our findings revealed that fine-tuning monolingual BERT models generally outperformed the other approaches. Specifically, our system for the Spanish test set, based on fine-tuning a BERT-based monolingual Spanish language model pretrained on a biomedical corpus, achieved the top ranking among all participant submissions. On the other hand, for Italian, where there is limited availability of domain-specific human-annotated data and consequently a smaller range of pretrained models, a hybrid approach combining the best-performing monolingual model's predictions with LLM translation demonstrated greater efficacy than a monolingual Italian BERT model alone. This suggests that LLMs can be used as translation modules to enhance the cross-lingual applicability of fine-tuned monolingual models. Promising results were also obtained with zero-shot LLM prompting, where Llama-3-8B achieved higher precision (0.8767) than the average precision of the fine-tuned models (0.8373) on the English development set. Future work will involve experimenting with a broader range of LLMs and prompting methods in few-shot settings.

Funding
C.L. was supported by United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. J.M.P. and A.D.L. are supported by the CoDiet project. The CoDiet project is funded by the European Union under Horizon Europe grant number 101084642 and supported by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 101084642].
References
[1] S. Kumar, A. Anastasopoulos, S. Wintner, Y. Tsvetkov, Machine translation into low-resource language varieties, arXiv preprint arXiv:2106.06797 (2021).
[2] S. Ranathunga, E.-S. A. Lee, M. Prifti Skenduli, R. Shekhar, M. Alam, R. Kaur, Neural machine translation for low-resource languages: A survey, ACM Computing Surveys 55 (2023) 1–37.
[3] R. Koshkin, K. Sudoh, S. Nakamura, TransLLaMa: LLM-based simultaneous translation system, arXiv preprint arXiv:2402.04636 (2024).
[4] H. Huang, S. Wu, X. Liang, B. Wang, Y. Shi, P. Wu, M. Yang, T. Zhao, Towards making the most of LLM for translation quality estimation, in: CCF International Conference on Natural Language Processing and Chinese Computing, Springer, 2023, pp. 375–386.
[5] B. Song, F. Li, Y. Liu, X. Zeng, Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison, Briefings in Bioinformatics 22 (2021) bbab282.
[6] P. N. Ahmad, A. M. Shah, K. Lee, A review on electronic health record text-mining for biomedical name entity recognition in healthcare domain, in: Healthcare, volume 11, MDPI, 2023, p. 1268.
[7] D. F. Navarro, K. Ijaz, D. Rezazadegan, H. Rahimi-Ardabili, M. Dras, E. Coiera, S. Berkovsky, Clinical named entity recognition and relation extraction using natural language processing of medical free text: A systematic review, International Journal of Medical Informatics (2023) 105122.
[8] N. S. Pagad, N. Pradeep, Clinical named entity recognition methods: an overview, in: International Conference on Innovative Computing and Communications: Proceedings of ICICC 2021, Volume 2, Springer, 2022, pp. 151–165.
[9] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Krallinger, MultiCardioNER Corpus: Multilingual Adaptation of Clinical NER Systems to the Cardiology Domain, 2024. URL: https://doi.org/10.5281/zenodo.11368861. doi:10.5281/zenodo.11368861.
[10] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources, in: CLEF (Working Notes), 2022, pp. 179–203.
[11] S. L. López, E. F. Maduell, L. G. Sánchez, M. Krallinger, MedProcNER Corpus: Gold Standard annotations for Clinical Procedures Information Extraction, 2023. URL: https://doi.org/10.5281/zenodo.8224056. doi:10.5281/zenodo.8224056.
[12] A. Intxaurrondo, M. Krallinger, SPACCC, 2019. URL: https://doi.org/10.5281/zenodo.2560316. doi:10.5281/zenodo.2560316.
[13] S. Lima-López, E. Farré-Maduell, M. Krallinger, DrugTEMIST Guidelines: Annotation of Medication in Medical Documents, 2024. URL: https://doi.org/10.5281/zenodo.11065433. doi:10.5281/zenodo.11065433.
[14] L. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: Third Workshop on Very Large Corpora, 1995. URL: https://aclanthology.org/W95-0107.
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[16] R. Sharma, D. Chauhan, R. Sharma, Named entity recognition system for the biomedical domain, in: 2022 17th Conference on Computer Science and Intelligence Systems (FedCSIS), IEEE, 2022, pp. 837–840.
[17] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained biomedical language models for clinical NLP in Spanish, in: Proceedings of the 21st Workshop on Biomedical Language Processing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL: https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19.
[18] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: EMNLP, Association for Computational Linguistics, 2019. URL: https://www.aclweb.org/anthology/D19-1371.
[19] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.
[20] K. Hakala, S. Pyysalo, Biomedical named entity recognition with multilingual BERT, in: K. Jin-Dong, N. Claire, B. Robert, D. Louise (Eds.), Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, Association for Computational Linguistics, Hong Kong, China, 2019, pp. 56–61. URL: https://aclanthology.org/D19-5709. doi:10.18653/v1/D19-5709.
[21] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, G. Wang, GPT-NER: Named entity recognition via large language models, arXiv preprint arXiv:2304.10428 (2023).
[22] D. Ashok, Z. C. Lipton, PromptNER: Prompting for named entity recognition, arXiv preprint arXiv:2305.15444 (2023).
[23] I. Jahan, M. T. R. Laskar, C. Peng, J. X. Huang, A comprehensive evaluation of large language models on benchmark biomedical text processing tasks, Computers in Biology and Medicine (2024) 108189.