Cross-Linguistic Disease and Drug Detection in Cardiology Clinical Texts: Methods and Outcomes
Notebook for the BioASQ Lab at CLEF 2024

Patrick Styll1,*, Leonardo Campillos-Llanos2, Wojciech Kusa1 and Allan Hanbury1
1 Data Science Research Unit (E194-04), Technische Universität Wien, Favoritenstraße 9-11, 1040 Vienna, Austria
2 Institute of Language, Literature and Anthropology, Spanish National Research Council (CSIC), c/Albasanz 26, 28037 Madrid, Spain

Abstract
This paper presents our approach to the MultiCardioNER lab at CLEF 2024, focusing on disease detection in Spanish texts and drug detection in Italian, Spanish, and English texts. We enhance model performance through several strategies: (1) fine-tuning on automatically translated TREC Clinical Trials admission notes using Masked Language Modeling (MLM); (2) data augmentation with translated MTSamples processed through a Spanish medical lexicon (MedLexSp) for accurate vocabulary matching; and (3) employing sliding windows with overlap to improve data capture. Additionally, we use transfer learning with a clinical trials corpus (CT-EMB-SP) to refine the outcomes. We further fine-tune several already established disease and drug extraction models to leverage their extensive vocabulary and compare their performance to models trained from scratch. Our methods and experiments demonstrate notable improvements in multilingual clinical NER, as evidenced by our track results.

Keywords
Clinical Named Entity Recognition, Transfer Learning, Data Augmentation, Cardiology

1. Introduction
The increasing volume of clinical text data presents both challenges and opportunities for the healthcare sector [1]. Extracting meaningful information from these texts, such as disease and drug mentions, is critical for applications such as patient care, clinical research and healthcare management [2]. In this context, the MultiCardioNER [3] task from the BioASQ [4] workshop at CLEF 2024 provides an important platform for evaluating and advancing clinical named entity recognition (NER) technologies, both in monolingual and multilingual settings. MultiCardioNER is organized by the Barcelona Supercomputing Center's Natural Language Processing (NLP) for Biomedical Information Analysis group and is promoted by Spanish and European projects such as DataTools4Heart, AI4HF, BARITONE, and AI4ProfHealth. This shared task focuses on the multilingual adaptation of clinical NER systems to the cardiology domain. It includes two key tasks: disease detection in Spanish texts and drug detection across Italian, Spanish, and English texts. Our work addresses these tasks through strategies designed to enhance model performance, which are detailed in this paper.

Section 2 provides the background and an overview of the proposed techniques, baseline models and evaluation metrics. In Section 3, we examine the practical effect of the introduced methodology through preliminary experiments and reflect on the results of the submitted runs in an extensive error analysis. Finally, in Section 4 we conclude and summarize our findings.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: patrick.styll@tuwien.ac.at (P. Styll); leonardo.campillos@csic.es (L. Campillos-Llanos); wojciech.kusa@tuwien.ac.at (W. Kusa); allan.hanbury@tuwien.ac.at (A. Hanbury)
URL: https://github.com/Padraig20 (P. Styll);
https://sites.google.com/view/lcampillos/index (L. Campillos-Llanos); https://wojciechkusa.github.io/ (W. Kusa); https://informatics.tuwien.ac.at/people/allan-hanbury (A. Hanbury)
ORCID: 0000-0003-3040-1756 (L. Campillos-Llanos); 0000-0003-4420-4147 (W. Kusa); 0000-0002-7149-5843 (A. Hanbury)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Methodology
In this section, we give the background to our methodology: we describe the techniques we propose to further enhance results, introduce the baseline models we used, and discuss how results and outputs are evaluated.

2.1. Proposed Techniques
• Fine-tuning via Masked Language Modeling
We proposed fine-tuning on 240 automatically translated admission notes of the TREC Clinical Trials track via Masked Language Modeling [5]. This should help the model develop a sense of what patient notes look like and enhance its understanding of them.
• Data Augmentation
We used the NickyNicky/medical_mtsamples dataset from HuggingFace as a means of data augmentation. We extracted cardiology diseases and drugs, automatically translated the texts to Spanish via the Google Translate API [6], and additionally processed the entities using a medical lexicon for Spanish (MedLexSp [7]) to ensure that only correct medical vocabulary was used.
• Sliding Windows with Overlap
We employed a sliding-window approach with overlap to handle long sequences of clinical text (a sketch is given after this list). This method has been effectively utilized in various Natural Language Processing (NLP) tasks to manage texts that exceed the input size limitations of standard models [5]. By breaking the text into smaller, overlapping segments, the model can better understand the context and connections between different sections of the document.
• Additional Fine-Tuning/Transfer Learning on General Diseases/Drugs
We fine-tuned several baseline models to detect diseases and drugs from the CT-EMB-SP corpus [8] with the goal of enhancing the model's vocabulary of specific medical data. This corpus is a collection of 1200 texts about clinical trials in Spanish (500 journal abstracts and 700 trial announcements). It was annotated with entities for four semantic groups of the Unified Medical Language System [9]: ANAT, CHEM, DISO and PROC. This resource facilitates machine learning experiments for information extraction on evidence-based medicine. For information on the models and training process, see Table 3 and Figures 6a and 6b in Appendix A.
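The sliding-window splitting can be sketched as follows. This is a minimal illustration, assuming a HuggingFace fast tokenizer that returns character offsets; window size and overlap are parameters (512 and 60 tokens are the values used in our runs, see Appendix D), and the model name is only an example.

```python
from transformers import AutoTokenizer

def sliding_windows(text, tokenizer, max_len=512, overlap=60):
    """Split a long note into overlapping windows of at most max_len tokens.

    Returns (window_text, char_start) pairs so that predicted entity offsets
    can later be mapped back to the original document.
    """
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    offsets = enc["offset_mapping"]
    windows, start, step = [], 0, max_len - overlap
    while start < len(offsets):
        chunk = offsets[start:start + max_len]
        char_start, char_end = chunk[0][0], chunk[-1][1]
        windows.append((text[char_start:char_end], char_start))
        if start + max_len >= len(offsets):
            break
        start += step
    return windows

# Example usage with one of the multilingual baselines from Section 2.2:
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
# windows = sliding_windows(long_patient_note, tokenizer)
```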
2.2. Baseline Models
This section introduces all pre-trained models used in our experiments and submissions. We explain how each was pre-trained, why it is potentially useful, and how we employed it in our research.
• google-bert/bert-base-multilingual-cased
This is a multilingual version of the BERT model [5], which served as our baseline. It is a small model that we fine-tuned and used in the preliminary evaluation of track 1.
• microsoft/mdeberta-v3-base [10] [11]
This large, multilingual general-domain model has recently gained recognition for its effectiveness in processing medical data. We fine-tuned it both from scratch and after intermediate fine-tuning on the CT-EMB-SP corpus [8] for increased vocabulary coverage. We used this model for every part of the track. Table 3 in Appendix A shows the parameters and performance of this model.
• lcampillos/roberta-es-clinical-trials-ner [8]
This model is based on the RoBERTa architecture and is specifically fine-tuned for named entity recognition in Spanish clinical trial texts. It is designed to effectively identify medical entities within the domain of clinical trials, enhancing the extraction of relevant information from these documents. On the evaluation set of its training data, it achieved a strong F1-score of 86.47%, demonstrating its effectiveness. We used it for every part of the track. This model seemed very promising, since in preliminary testing on the MultiCardioNER data it already achieved an F1-score of 45.52% for track 1 and 76.04% for track 2. This suggests that the gap between the general medical domain and the cardiology domain is larger for diseases than for pharmaceuticals.
• PlanTL-GOB-ES/bsc-bio-ehr-es [12]
This model is pre-trained on Spanish electronic health records (EHR), a large corpus of biomedical texts. It has outperformed other popular models on certain tasks, showcasing its performance. We used it for track 1 and the Spanish part of track 2.
• IVN-RIN/bioBIT [13]
BioBIT (Biomedical Bert for Italian) is a model tailored for the biomedical domain, pre-trained on an Italian biomedical corpus derived from machine-translated PubMed abstracts. Built on the BERT architecture, BioBIT utilizes Masked Language Modeling and Next Sentence Prediction for pretraining. It excels in multiple tasks, including Named Entity Recognition (NER), achieving high accuracy across several biomedical datasets. We used it for evaluating the Italian part of track 2.
• alvaroalon2/biobert_chemical_ner [14]
This BioBERT model is fine-tuned for named entity recognition (NER) tasks specifically targeting chemical entities. It has been trained on the BC5CDR-chemicals [15] and BC4CHEMD corpora [16], making it highly effective for identifying chemical mentions in biomedical texts. This model is a valuable tool for chemical NER in the biomedical domain, supporting advanced research and data extraction.

2.3. Metrics
For evaluating the models, we used entity-level evaluation metrics [17]. Since we are working with a highly imbalanced dataset, this provides a more accurate assessment of Named Entity Recognition (NER) performance.
The International Workshop on Semantic Evaluation (SemEval'13) introduced four ways to evaluate Named Entity Recognition (NER) performance: Strict, Exact, Partial, and Type. These methods consider various aspects of matches between system predictions and ground truth annotations. The evaluation schemas assess correctness, incorrectness, partial matches, missed entities, and spurious entities differently, impacting the calculated precision, recall, and F1-scores. Strict requires an exact boundary and type match, Exact requires only a boundary match, Partial accepts partially matching boundaries, and Type requires the correct type with some overlap. These metrics provide a comprehensive evaluation of Named Entity Recognition (NER) systems under different match criteria.
When assessing the performance of our models, we use the average of the four F1-scores of the evaluation metrics Strict, Exact, Partial, and Type. This average F1-score is calculated as follows:

F1_avg = (F1_Strict + F1_Exact + F1_Partial + F1_Type) / 4

This method allows us to effectively determine the most performant model by considering a balanced view of different evaluation criteria.
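To illustrate how the four schemas combine into the average F1-score, the following sketch computes precision, recall and F1 per schema from the five outcome counts used in the SemEval'13 formulation [17] and averages the resulting F1 values. All counts are hypothetical and only serve as an example.

```python
def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def schema_f1(cor, inc, par, mis, spu, partial_credit=False):
    """F1 of one SemEval'13 schema from its outcome counts:
    correct, incorrect, partial, missed, spurious."""
    hits = cor + (0.5 * par if partial_credit else 0.0)
    precision = hits / (cor + inc + par + spu)  # denominator: predicted entities
    recall = hits / (cor + inc + par + mis)     # denominator: gold entities
    return f1(precision, recall)

# Hypothetical counts: strict, exact and type classify the same predictions
# differently, hence the different numbers; partial grants half credit.
counts = {
    "strict":  dict(cor=70, inc=12, par=0, mis=15, spu=8),
    "exact":   dict(cor=74, inc=8,  par=0, mis=15, spu=8),
    "partial": dict(cor=74, inc=0,  par=8, mis=15, spu=8),
    "type":    dict(cor=78, inc=4,  par=0, mis=15, spu=8),
}
f1_avg = sum(schema_f1(**c, partial_credit=(name == "partial"))
             for name, c in counts.items()) / 4
print(round(f1_avg, 4))
```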
3. Experiments and Results

3.1. Preliminary Experiments
The performance of the models is evaluated by truncating excess tokens from each patient note, where each model has an input size of 512 tokens. If models are not trained via sliding windows, excess tokens are simply cut off during the tokenization process. Note that, due to time constraints, we ran the experiments only once for each model. A better methodology would be to initialize each model with different seed values and report the average and standard deviation over all runs, which would provide a more realistic overview of each model's performance. Please note that these preliminary experiments do not yet incorporate the findings of the error analysis described in Section 3.3. The absolute performance of the models depicted here is therefore not representative; however, the relative differences between the separate runs showcase the success of the proposed techniques and give some insight into how the models behave. For additional experiments in which the absolute performance of the models is representative, please see Appendix C.

3.1.1. Experiments for Track 1
Baseline. Baseline values are given by multilingual-bert and bsc-bio-ehr-es, which already achieved decent F1-scores on the development set, with bsc-bio-ehr-es at 81.48% and multilingual-bert at 76.50%. This already suggests a clear benefit of fine-tuning on domain-specific data for specific tasks.
Domain-Specific Model. For this purpose, the roberta-es-clinical-trials-ner model (as introduced in Section 2.2) was used as a baseline. Since it was fine-tuned on general diseases in the Spanish language (not only cardiovascular conditions), it already started with a relatively high F1-score in the first epoch (see Figure 1a). We can see that the data augmentation technique showed a promising influence when further fine-tuning the model to the cardiology domain. The sliding windows approach showed slightly worse results; however, the difference is not large enough to draw a conclusion. After some hyperparameter tuning, the model achieved a peak F1-score of 87.9%. As evident in Figure 1b, the Masked Language Modeling approach did not necessarily influence the results. This might be due to the lack of data for this kind of fine-tuning, which may add unnecessary bias. During hyperparameter tuning, we saw that a higher learning rate (i.e. 1e-4 instead of 2e-5) performed slightly better (an increase of approximately 4% in the evaluation metric). The same can be said for the batch size, where a larger size yielded better results (approximately 7% in F1-score). Unfortunately, experiments were rather limited here due to limited GPU memory.

(a) Domain-Specific Model. (b) Domain-Specific Model (tuned).
Figure 1: Domain-Specific Model Experiments. roberta-es-clinical-trials-ner-cutoff used the cutoff strategy, roberta-es-clinical-trials-ner-windows used the sliding windows technique, and roberta-es-clinical-trials-ner-cutoff-mlm was first fine-tuned via Masked Language Modeling (MLM) on admission notes and then fine-tuned on the cardiology domain via the cutoff strategy. For the tuned models in Figure 1b, data augmentation as well as the cutoff strategy was used for both runs, while roberta-es-clinical-trials-ner-cutoff-dataaugment-mlm-tuned was further fine-tuned via Masked Language Modeling (MLM) on admission notes.
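For reference, the following is a condensed sketch of the token-classification fine-tuning set-up used throughout these experiments, assuming BIO-tagged word lists and a fast tokenizer. The model name, label set and hyperparameters (e.g. the higher learning rate of 1e-4 and batch size 16 mentioned above) are illustrative; the tiny train_data example is hypothetical.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)

labels = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD"]            # track 1 label set
model_name = "lcampillos/roberta-es-clinical-trials-ner"  # any model from Section 2.2 works

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels), ignore_mismatched_sizes=True)

# Toy example; in practice the word/tag pairs come from the task annotations.
train_data = [{"words": ["Paciente", "con", "fibrilación", "auricular", "."],
               "tags":  [0, 0, 1, 2, 0]}]

def encode(batch):
    # Align word-level BIO tags with sub-word tokens; continuations are masked with -100.
    enc = tokenizer(batch["words"], is_split_into_words=True, truncation=True, max_length=512)
    enc["labels"] = []
    for i, tags in enumerate(batch["tags"]):
        prev, ids = None, []
        for w in enc.word_ids(batch_index=i):
            ids.append(-100 if w is None or w == prev else tags[w])
            prev = w
        enc["labels"].append(ids)
    return enc

train_ds = Dataset.from_list(train_data).map(encode, batched=True)
args = TrainingArguments("ner-cardio", learning_rate=1e-4, num_train_epochs=5,
                         per_device_train_batch_size=16, weight_decay=0.01)
Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer,
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```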
Multilingual Model. For this purpose, the mdeberta-v3-base model was used as a pre-trained model. We first fine-tuned the baseline model on the cardiology data provided by the shared task, which already showed promising results. It is also interesting to see that at the beginning the model's performance was much lower than that of models already fine-tuned on general diseases (see Figure 2). Fine-tuning the model on general diseases using the CT-EMB-SP corpus showed promising changes in performance. Adding data augmentation and Masked Language Modeling (MLM) as additional techniques only influenced the results slightly. In the end, it reached a peak F1-score of 87%.

Figure 2: Multilingual Model. baseline is the mdeberta-v3-base model without any special techniques. mdeberta-v3-ct-cutoff was fine-tuned on general diseases before being fine-tuned on the cardiology domain via the cutoff strategy. mdeberta-v3-ct-cutoff-dataaugment additionally used data augmentation, and mdeberta-v3-ct-cutoff-dataaugment-mlm was additionally pre-trained via Masked Language Modeling (MLM) on admission notes.

3.1.2. Experiments for Track 2
Domain-Specific Model (es). The roberta-es-clinical-trials-ner model was used as a baseline. Surprisingly, the scores were relatively low. This was unexpected, since we had already measured much better performance by this model on the MultiCardioNER data. Eventually, we obtained an F1-score of 59.79%.
Multilingual Model (es). As in previous experiments, the mdeberta-v3-base model was used as a baseline and fine-tuned on drugs from the CT-EMB-SP corpus (which did not happen for the other languages, where the base model was used). As can be seen in Figure 3b, the combination of Masked Language Modeling and data augmentation brought considerable benefit. In the end, the model achieved a peak F1-score of 70.87%.

(a) Domain-Specific Model. (b) Multilingual Model.
Figure 3: Spanish track performance. roberta-es-clinical-trials-ner-cutoff used the cutoff strategy when being fine-tuned on the cardiology domain, while roberta-es-clinical-trials-ner-cutoff-dataaugment additionally used data augmentation. On the other hand, roberta-es-clinical-trials-ner-windows used the sliding windows approach. mdeberta-ct-dg was first fine-tuned on the general domain before being fine-tuned on cardiology via the cutoff strategy and data augmentation. mdeberta-ct-mlm-dg further used Masked Language Modeling (MLM) on admission notes. On the other hand, mdeberta-ct-dg-windows used the sliding windows approach as well as data augmentation.

Insights from other languages (en, it). Testing the roberta-es-clinical-trials-ner model on both the English and Italian tracks provided us with interesting results. Looking at Figure 4a of the English track, we can see that the Spanish model outperformed both the multilingual general-domain model (i.e. mdeberta-v3-base) and the domain-specific model (i.e. BioBERT). The same cannot be said for the Italian track, where it was outperformed by every other run (see Figure 4b).
For the Italian track, the mdeberta-v3-base model won with an F1-score of 90.7%, while roberta-es-clinical-trials-ner achieved 80.45% on the English track. These results suggest a large multilingual overlap for pharmaceuticals.

(a) English Track. (b) Italian Track.
Figure 4: English and Italian track performance. In Figure 4a, biobert-mlm used the cutoff strategy as well as Masked Language Modeling (MLM) via admission notes. Both mdeberta-windows and roberta-es-clinical-trials-ner-windows-1e4 used the sliding windows approach. In Figure 4b, all models used the cutoff strategy, where biobit-mlm additionally used MLM via admission notes.

3.2. Official Submissions
The submission runs are described in Appendix D.

Table 1: Results of Submission Runs.
Track | Run name | P | R | F1
Track1 | run1_mdeberta-ct-mlm-dg | 59.28% | 67.15% | 62.97%
Track1 | run2-mdeberta-ct | 50.27% | 68.84% | 58.10%
Track1 | run3_mdeberta-ct-dg | 48.00% | 67.73% | 56.18%
Track1 | run4-roberta-dg | 65.65% | 73.76% | 69.47%
Track1 | run5-roberta-dg-windows | 65.46% | 72.44% | 68.77%
Track2_ES | run1_mdeberta-multilingual | 39.14% | 15.31% | 22.01%
Track2_ES | run2_mdeberta-ct-multilingual | 76.47% | 35.56% | 48.55%
Track2_ES | run3_roberta-ct-multilingual | 87.05% | 43.42% | 57.94%
Track2_ES | run4_mdeberta_ct_mlm_dg | 68.15% | 38.36% | 49.09%
Track2_ES | run5_roberta-ct-mlm | 84.21% | 39.12% | 53.42%
Track2_EN | run1_mdeberta-multilingual | 56.48% | 24.81% | 34.48%
Track2_EN | run2_mdeberta-ct-multilingual | 84.53% | 37.77% | 52.21%
Track2_EN | run3_roberta-ct-multilingual | 86.32% | 43.64% | 57.97%
Track2_EN | run4-mdeberta-windows | 79.55% | 43.17% | 55.97%
Track2_EN | run5-biobert-mlm-windows | 67.71% | 44.10% | 53.41%
Track2_IT | run1_mdeberta-multilingual | 50.74% | 20.94% | 29.65%
Track2_IT | run2_mdeberta-ct-multilingual | 74.33% | 33.94% | 46.61%
Track2_IT | run3_roberta-ct-multilingual | 82.64% | 42.06% | 55.74%
Track2_IT | run4-mdeberta | 74.81% | 39.28% | 51.51%
Track2_IT | run5-biobit-mlm | 79.22% | 35.17% | 48.71%

It is important to note that these results do not reflect the absolute performance of the models (see Section 3.3 for further details and Appendix C for additional experiments with representative performance); however, the relative differences between the separate runs showcase the success of the proposed techniques and give some insight into how the models behave.
• Fine-Tuning on General Diseases (Transfer Learning)
The first three runs (multilingual runs) of track 2 show evidence that fine-tuning models on general diseases before focusing on the cardiology domain yielded a great benefit. The first run, which was not fine-tuned on general diseases, showed worse performance than runs 2 and 3, which were both fine-tuned on the CT-EMB-SP corpus.
• Multilingual Overlap for Pharmaceuticals
Looking at runs 2 and 3 of track 2, we can see that the multilingual models performed similarly to the monolingual models. This suggests a large multilingual overlap for drugs in Spanish, English and Italian.
• Possible Noise from Data Augmentation due to Machine Translation
Several runs (e.g. runs 2 and 3 of track 1) showed slightly worse performance when data augmentation was used. This suggests possible additional noise in the training data. For data augmentation, we used the MTSamples dataset, where we used the keywords as entities and translated all text via the Google Translate API. On first inspection, these keywords may refer to laboratory tests, procedures or anatomical entities. Therefore, we processed the translated MTSamples with MedLexSp, output only the DISO and CHEM categories, and renamed them to ENFERMEDAD and FARMACO, respectively (a sketch of this filtering step follows below). Nonetheless, it is unclear whether these data contain general diseases or only cardiology diseases. Furthermore, on closer inspection, there are some issues with the machine translation, which are also visible in the automatic translation of the TREC admission notes used for Masked Language Modeling (MLM) fine-tuning. Either some words are not translated or new words are created, possibly due to the sub-word segmentation of neural translation models. An example is *leucitos en urino, which should be leucocitos en orina ('white cells in urine').
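The filtering step of the augmentation can be sketched as follows. This is only an illustration: translate_to_spanish stands in for the actual translation call and the lexicon lookup stands in for the MedLexSp resource, whose real file format is not reproduced here.

```python
# Keep only MTSamples keywords that a Spanish medical lexicon maps to a disease
# (DISO) or drug (CHEM) category, and rename those categories to the task labels.
CATEGORY_MAP = {"DISO": "ENFERMEDAD", "CHEM": "FARMACO"}

def augment_sample(keywords, text, lexicon, translate_to_spanish):
    """lexicon: dict mapping Spanish terms to UMLS semantic groups,
    e.g. {"insuficiencia cardíaca": "DISO"} (placeholder for MedLexSp)."""
    text_es = translate_to_spanish(text)
    entities = []
    for kw in keywords:
        kw_es = translate_to_spanish(kw).lower().strip()
        category = lexicon.get(kw_es)
        if category in CATEGORY_MAP and kw_es in text_es.lower():
            entities.append((kw_es, CATEGORY_MAP[category]))
    return text_es, entities
```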
3.3. Error Analysis

3.3.1. Data Reconstruction
Several problems arose while generating the runs, namely in reconstructing the output of the mdeberta-v3-base and roberta-es-clinical-trials-ner models.
mdeberta-v3-base. This model posed several problems, particularly in generating the correct index of the span. Often, the start of the span would be one or two tokens off, leading to a decrease in the F1-score for the runs. Additionally, tokens might have leading spaces or newline characters at the beginning or end. These extraneous characters need to be removed to ensure the entity text is clean and accurate, which also includes adjusting the start and end offsets to reflect the new positions of the cleaned tokens. The presence of punctuation at the end of tokens can create issues in entity recognition: special rules are required to handle exceptions such as units (mg.) or cases where brackets are involved. Unnecessary punctuation needs to be removed, but care must be taken to preserve punctuation that is part of the entity. Furthermore, the model would sometimes insert tabs instead of spaces into the extracted entity. When tokens are merged or cleaned, their character offsets in the text need to be recalculated. This ensures that the entities' positions in the text are accurately represented, which is crucial for tasks like text highlighting or linking entities back to their original context.
roberta-es-clinical-trials-ner. This model exhibits significant issues with handling sub-words, often treating them as separate entities. Specifically, leading sub-words are represented as individual entities with a preceding space. This requires special consideration during the reconstruction process to ensure accurate entities.
General Remarks. We analysed all errors made by the roberta-es-clinical-trials-ner model in run 4 of track 1, using a Python script to count each error type (a rough sketch of the bucketing is given after the list below). The model made several types of errors while generating the runs; examples from this run may be found in Table 2.
1. Scope Errors
a) Incompletely predicted entities: entities where the model predicted only a part of the actual entity, missing some crucial parts.
b) Entities where too many words were predicted: entities where the predicted span includes extra information not part of the actual entity.
c) Entities that would belong together: entities where the predicted spans should be combined to form a single coherent entity.
2. False Positives: entities that were incorrectly identified by the model, but not labeled in the ground truth.
3. False Negatives (missed entities): entities that were missed by the model, but were labeled in the ground truth.
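The counting script itself is not reproduced in the paper; the following is a rough, simplified sketch of how predicted and gold character spans can be bucketed into the categories above. The span values are hypothetical.

```python
from collections import Counter

def categorize(pred, gold):
    """Bucket predicted vs. gold (start, end) character spans of one document
    into the error types listed above (approximate heuristics)."""
    counts = Counter()

    def overlapping(span, spans):
        return [s for s in spans if span[0] < s[1] and s[0] < span[1]]

    for p in pred:
        hits = overlapping(p, gold)
        if not hits:
            counts["2 false positive"] += 1
        elif p in hits:
            counts["true positive"] += 1
        elif any(len(overlapping(g, pred)) > 1 for g in hits):
            counts["1c should be merged"] += 1          # gold span split over several predictions
        elif all(p[0] >= g[0] and p[1] <= g[1] for g in hits):
            counts["1a incomplete prediction"] += 1      # prediction lies inside the gold span
        else:
            counts["1b too many words predicted"] += 1   # prediction extends beyond the gold span
    counts["3 false negative"] += sum(1 for g in gold if not overlapping(g, pred))
    return counts

# Hypothetical spans for illustration:
print(categorize(pred=[(0, 7), (30, 38)], gold=[(0, 12), (50, 60)]))
```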
Looking into Figure 5, we can see that the high number of scope errors (i.e. entities that were detected, but with imperfect boundaries) indicates that the model generally has a strong baseline capability for recognizing entities. Despite the high accuracy in identifying entities, the model still exhibits noticeable precision and recall issues, as evidenced by the presence of false positives and false negatives. Furthermore, the high number of true positives amidst other errors implies that errors are not due to a fundamental flaw in the model but likely due to specific cases or contexts where the model's performance drops. Nonetheless, scope errors tend to cause less harm in the information extraction of clinical cases: not detecting an entity (false negative, case 3 in Table 2) is more severe, whereas detecting fumador instead of fumador activo (case 1a in Table 2) is a mild error. This becomes even clearer when looking into errors involving abbreviations. If the text contains diabetes mellitus (DM), the model extracts the full span diabetes mellitus (DM), while the test set annotates the full form and the abbreviation as separate entities. This happens very frequently, which makes the evaluation less suitable for measuring the model's practical performance. In the end, a more relaxed evaluation metric could have been more appropriate and would have yielded higher results.

Table 2: Examples of incorrectly extracted entities from the test dataset. T indicates the type of error the model made while generating the runs.
Filename | T | Prediction | Test Set
multicardioner_test+bg_7336 | 1a | fumador | fumador activo
multicardioner_test+bg_7845 | 1a | taquicardia de QRS estrecho arrítmica | taquicardia de QRS estrecho
multicardioner_test+bg_7845 | 1a | vía accesoria lateral | vía accesoria lateral izquierda
multicardioner_test+bg_7560 | 1b | dilatación de VI con función severamente deprimida | dilatación de VI
multicardioner_test+bg_7845 | 1b | taquicardia ventricular originada en el músculo papilar posterior | taquicardia ventricular
multicardioner_test+bg_7845 | 1b | Cardiopatía hipertensiva con disfunción diastólica | Cardiopatía hipertensiva
multicardioner_test+bg_75 | 1c | IM, jetivamente | IM subjetivamente
multicardioner_test+bg_7560 | 2 | a IECA | (none)
multicardioner_test+bg_7560 | 2 | hipertrabeculación | (none)
multicardioner_test+bg_543 | 3 | (none) | IAM
multicardioner_test+bg_7336 | 3 | (none) | carga trombótica

Figure 5: Normalized frequency of occurrences of the different types of errors. All errors of run 4 of track 1 (see Table 1) have been counted.

3.3.2. Data Capture
Several factors contributed to the models' underperformance in the submission runs (see Table 1). Further data analysis shows that the main issue was insufficient data capture during both training and evaluation; plots may be found in Appendix B. By data capture we refer to the degree to which the model is able to capture all information from the patient notes. After all, in the preliminary experiments and for generating the submission runs, we simply cut off excess tokens that did not fit into the model.
General Remarks. The relative distribution of entities over patient notes, as depicted in the density plots in Figure 8 (Appendix B), reveals several interesting insights. For track 1, the training set displays a prominent spike at the beginning, followed by a relatively uniform distribution throughout the rest of the notes. In contrast, the development set and test set exhibit two significant spikes: one at the beginning and a larger one at the end, with a notably lower density in the middle. This pattern suggests that most diseases are mentioned either at the beginning or the end of the patient notes. Turning to pharmaceuticals in track 2, we observe similar entity distributions across all plots. The training set again shows a more uniform distribution, whereas the development and test sets both feature two prominent spikes at the beginning and end, mirroring the pattern observed in track 1.
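The relative positions underlying the density plots in Figure 8 can be computed by normalising each entity's start offset by the document length. The brat-style parsing below is a simplified, illustrative sketch (discontinuous spans and other edge cases are ignored).

```python
import glob

def relative_positions(ann_dir):
    """Collect start_offset / document_length for every entity annotation."""
    positions = []
    for ann_path in glob.glob(f"{ann_dir}/*.ann"):
        text = open(ann_path.replace(".ann", ".txt"), encoding="utf-8").read()
        for line in open(ann_path, encoding="utf-8"):
            if line.startswith("T"):  # entity line: "T1<TAB>LABEL start end<TAB>mention"
                start = int(line.split("\t")[1].split()[1])
                positions.append(start / max(len(text), 1))
    return positions  # a histogram/KDE of these values gives the density plot
```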
Looking at the word counts in the boxplots of Figure 7, we can clearly see that the training set contains significantly fewer words than both the development set and the test set: about 75% of patient notes from the training set have fewer than 550 words, which applies to less than 25% of patient notes from the development and test sets.
The Venn diagrams in Figure 9 (Appendix B) are also worth mentioning. As we can see, track 1 has only a little overlap between the datasets, while track 2 has notably more overlap. This may imply that the model suffers from a few-shot learning problem, especially since results on track 2 are significantly better than those of track 1. Another factor may be the number of unique entities in the datasets, which is far larger in track 1 than in track 2, further complicating the task for the model.
Implications for Training and Evaluation. The previous training and evaluation strategies were significantly affected by these entity distribution patterns. During the previous evaluation, a cutoff strategy was used, where all excess tokens were trimmed to fit the model's input layer, which was uniformly set at 512 tokens. This meant that only approximately 60% of the data was fully utilized during training. Due to the high density of entities at the end of patient notes, this approach resulted in sub-optimal data capture. The situation was even worse during evaluation on the development set, where less than 25% of the data fit into the models without token cutoff. This led to the model being evaluated on a non-representative portion of the dataset, inflating the performance metrics. To improve data capture, we decided to split the patient notes into individual sentences using spaCy [18] for both training and evaluation (a short sketch is given at the end of this section). This change not only yielded better results, particularly for track 2, but also provided more reliable and representative metrics. Consequently, several experiments were re-conducted (refer to Appendix C). It is important to note that these new runs expand and confirm the trends observed in earlier experiments (see Section 3.1).
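The paper does not state which spaCy pipeline was used for sentence boundary detection; a minimal sketch using the rule-based sentencizer (a statistical model such as es_core_news_sm would work as well) looks as follows.

```python
import spacy

# Blank Spanish pipeline with the rule-based sentence boundary detector.
nlp = spacy.blank("es")
nlp.add_pipe("sentencizer")

def split_into_sentences(note: str):
    """Return (sentence_text, char_offset) pairs so entity offsets can be mapped
    between the full note and the sentence-level training examples."""
    doc = nlp(note)
    return [(sent.text, sent.start_char) for sent in doc.sents]
```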
4. Conclusion
We can see some interesting trends in the data, allowing us to draw conclusions both about our proposed strategies and about the provided data.
• Fine-tuning via Masked Language Modeling: This approach had very little influence on the model's results. This can be attributed to (i) the lack of sufficient data for this kind of fine-tuning, (ii) the fact that the patient notes are based on the general domain, and (iii) erroneous machine translation.
• Data Augmentation: The effects of data augmentation are still unclear. We have observed both positive and negative effects across different model architectures. More experiments with different models and types of data augmentation resources are necessary to draw definitive conclusions.
• Sliding Windows with Overlap: The impact of the sliding windows approach, as opposed to cutting off excess tokens, is also difficult to judge. Despite expecting better data capture, some experiments actually showed slightly worse results. This effect may be due to patient notes being split at arbitrary positions, resulting in incorrect grammar and split entities, which can disrupt the contextual information the model relies on. This issue becomes more evident when considering that processing patient notes at the sentence level improved results notably.
• Additional Fine-Tuning/Transfer Learning on General Diseases/Drugs: This approach significantly improved the model's performance. Various experiments demonstrated that adapting a general model to a specific domain requires less effort and yields promising results with relatively little training.
• Insufficient Data Capture: Because entities are densest at the beginning and end of the patient notes, the cutoff strategy performed poorly, as it missed the entities at the end of the notes.
• Overlap of Entities over Datasets: There are significantly fewer overlapping entities between the training, development and test datasets for track 1 than for track 2. This may explain the generally worse results for track 1, indicating that models may suffer from a few-shot learning problem.
• Multilingual Overlap for Pharmaceuticals: We have shown that there is a large multilingual overlap concerning pharmaceuticals in Spanish, Italian and English. This can be largely attributed to standardized pharmaceutical nomenclature, which suggests that a multilingual approach to drug entity extraction can leverage these similarities to enhance accuracy and consistency across different languages.

Acknowledgments
Leonardo Campillos-Llanos' work is conducted in the CLARA-MeD project (PID2020-116001RA-C33), funded by MICIU/AEI/10.13039/501100011033/, in call Proyectos I+D+i Retos Investigación.

References
[1] H. Dalianis, Clinical text mining: Secondary use of electronic patient records, Springer Nature, 2018.
[2] D. Demner-Fushman, N. Elhadad, C. Friedman, Natural language processing for health-related texts, in: Biomedical Informatics: Computer Applications in Health Care and Biomedicine, Springer, 2021, pp. 241–272.
[3] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz, G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[4] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[6] Google Inc., Google Translate API, https://cloud.google.com/translate, n.d. Accessed: 2024-05-21.
[7] L. Campillos-Llanos, MedLexSp – a medical lexicon for Spanish medical natural language processing, Journal of Biomedical Semantics 14 (2023) 2. URL: https://doi.org/10.1186/s13326-022-00281-5. doi:10.1186/s13326-022-00281-5.
[8] L. Campillos-Llanos, A. Valverde-Mateos, A. Capllonch-Carrión, A. Moreno-Sandoval, A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine, BMC Medical Informatics and Decision Making 21 (2021) 69. URL: https://doi.org/10.1186/s12911-021-01395-z. doi:10.1186/s12911-021-01395-z.
[9] O. Bodenreider, The unified medical language system (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270.
[10] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, 2021. arXiv:2111.09543.
[11] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[12] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained biomedical language models for clinical NLP in Spanish, in: Proceedings of the 21st Workshop on Biomedical Language Processing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL: https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19.
[13] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localizing in-domain adaptation of transformer-based biomedical language models, Journal of Biomedical Informatics 144 (2023) 104431. URL: https://www.sciencedirect.com/science/article/pii/S1532046423001521. doi:10.1016/j.jbi.2023.104431.
[14] Á. Alonso Casero, Named entity recognition and normalization in biomedical literature: a practical case in SARS-CoV-2 literature, 2021. URL: https://oa.upm.es/67933/, unpublished.
[15] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database J. Biol. Databases Curation 2016 (2016). URL: https://doi.org/10.1093/database/baw068. doi:10.1093/database/baw068.
[16] M. Krallinger, O. Rabal, F. Leitner, M. Vazquez, D. Salgado, Z. Lu, R. Leaman, Y. Lu, D. Ji, D. M. Lowe, R. A. Sayle, R. T. Batista-Navarro, R. Rak, T. Huber, T. Rocktäschel, S. Matos, D. Campos, B. Tang, H. Xu, T. Munkhdalai, K. H. Ryu, S. V. Ramanan, S. Nathan, S. Žitnik, M. Bajec, L. Weber, M. Irmer, S. A. Akhondi, J. A. Kors, S. Xu, X. An, U. K. Sikdar, A. Ekbal, M. Yoshioka, T. M. Dieb, M. Choi, K. Verspoor, M. Khabsa, C. L. Giles, H. Liu, K. E. Ravikumar, A. Lamurias, F. M. Couto, H.-J. Dai, R. T.-H. Tsai, C. Ata, T. Can, A. Usié, R. Alves, I. Segura-Bedmar, P. Martínez, J. Oyarzabal, A. Valencia, The CHEMDNER corpus of chemicals and drugs and its annotation principles, Journal of Cheminformatics 7 (2015) S2. URL: https://doi.org/10.1186/1758-2946-7-S1-S2. doi:10.1186/1758-2946-7-S1-S2.
[17] D. S. Batista, Named-entity evaluation metrics based on entity-level, 2018. URL: https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/, accessed: 2024-05-21.
[18] Explosion-AI, spaCy: Industrial-strength Natural Language Processing in Python, https://spacy.io/usage/linguistic-features#sbd, 2023. URL: https://spacy.io/, version 3.0.
A. CT-EMB-SP Fine-Tuning

Table 3: Parameters and performance for ENFERMEDAD and FARMACO using mdeberta-v3-base.
Parameter | ENFERMEDAD | FARMACO
Learning Rate | 2e-5 | 2e-5
Batch Size | 16 | 16
Epochs | 10 | 10
Input Size | 512 | 512
Weight Decay | 0.01 | 0.01
Optimizer | AdamW | AdamW
F1 Score | 94.45% | 93.89%

(a) Enfermedad. (b) Farmaco.
Figure 6: Model performance during fine-tuning.

B. Data Analysis
It is important to note that for track 2 (FARMACO), the density plots in Figure 8 and the boxplots in Figure 7 look the same across the three different languages, despite translation. Trivially, the boxplots in Figure 7 look the same for both track 1 and track 2, since the same documents were used.

(a) Train Set. (b) Dev Set. (c) Test Set.
Figure 7: Number of words per patient note (boxplots).

(a) Track 1 - Train Set. (b) Track 1 - Dev Set. (c) Track 1 - Test Set. (d) Track 2 - Train Set. (e) Track 2 - Dev Set. (f) Track 2 - Test Set.
Figure 8: Relative entity positions (density plots). Low values (close to 0) represent text positions at the beginning of the document; values close to 1, positions at the end of the document.

(a) Track 1 - ENFERMEDAD. (b) Track 2 - FARMACO.
Figure 9: Overlap of unique entities over the train, development and test sets.

C. Additional Experiments

C.1. Cardiology Domain Adaptation
This experiment serves to show how easily a general model, i.e. one trained on general pharmaceuticals, can be adapted to a special medical domain. The roberta-es-clinical-trials-ner model was fine-tuned on general drugs using the CT-EMB-SP corpus in Spanish, and it was then used as a base model and fine-tuned on the cardiology domain for pharmaceuticals. As previously mentioned in Section 2.2, it already achieved an F1-score of 76.04% before fine-tuning on cardiology data. As seen in Figure 10, epoch 1 already shows very strong performance on the development set. Evaluating the model from epoch 1 on the test set, we achieved a precision of 86.15%, a recall of 93.77% and an F1-score of 89.80%. Using a model trained for three epochs (the peak in Figure 10), we obtained a precision of 90.25%, a recall of 94.30% and an F1-score of 92.23%.

Figure 10: Fine-tuning roberta-es-clinical-trials-ner on the cardiology domain for Track 2 (FARMACO).
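A minimal sketch of the continued fine-tuning described in C.1: the general-domain clinical NER checkpoint is loaded and its classification head is re-initialised for the task label set before training on the cardiology data. The label names follow the task; everything else is illustrative and training then proceeds as in the sketch in Section 3.1.1.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-FARMACO", "I-FARMACO"]
# Re-use the weights of the general-domain clinical NER model and replace its
# classification head to match the task labels (hence ignore_mismatched_sizes).
model = AutoModelForTokenClassification.from_pretrained(
    "lcampillos/roberta-es-clinical-trials-ner",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
    ignore_mismatched_sizes=True,
)
tokenizer = AutoTokenizer.from_pretrained("lcampillos/roberta-es-clinical-trials-ner")
```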
C.2. Effect of Data Augmentation
When looking into the effects of our proposed data augmentation, we trained roberta-es-clinical-trials-ner with and without data augmentation (same setup, i.e. the same hyper-parameters). In Figure 11, we can see similar behaviour during training, but lower performance on the development set. When evaluating the model on the test set, we obtained a precision of 92.08%, a recall of 94.06% and an F1-score of 93.06%. Considering the model trained in Section C.1, data augmentation thus actually led to a slightly higher score.

Figure 11: Fine-tuning roberta-es-clinical-trials-ner on the cardiology domain for the Spanish part of Track 2 (FARMACO) with data augmentation.

The same can be said when training the same model on track 1, where the plain model achieved a precision of 62.86%, a recall of 65.42% and an F1-score of 64.11%. With data augmentation, we achieved a significant improvement, with a precision of 65.65%, a recall of 73.76% and an F1-score of 69.47%. Nonetheless, when reflecting on the results in Section 3.2, we discussed possible negative effects due to incorrect machine translation. These effects are clearly visible when using e.g. mdeberta-v3-base as a baseline architecture (see also Table 1), which is why we are not entirely capable of judging the effect of data augmentation. Although the outcomes of some models seem to support that it may help to adapt a general model to a specific domain, we would need to experiment with more models and test more types of resources for data augmentation.

C.3. Effect of Fine-Tuning on the General Domain (Transfer Learning)
In order to more precisely measure the benefit of fine-tuning a model on a general domain before fine-tuning it on a specialized domain, we conducted the following experiment on the Spanish part of track 2. mdeberta-v3-base was used as a baseline. We compared the plain mdeberta-v3-base model with one fine-tuned on the general medical domain via the CT-EMB-SP corpus. Looking at the graph in Figure 12, the general-domain model already showed greater performance than the base model in very early stages of training (validating once more what was seen with regard to adapting a general model to the medical domain in Section C.1). The base model started with a relatively low F1-score, but caught up in the last epochs.

Figure 12: Fine-tuning mdeberta-v3-base on the cardiology domain for the Spanish part of Track 2 (FARMACO). The model is either plain or first fine-tuned on general pharmaceuticals in Spanish.

In Table 4, we can see the notable increase in performance not only on the development set, but also on the test set.

Table 4: Performance metrics of the plain and fine-tuned models evaluated on the ground truth.
Model | Precision | Recall | F1 Score
Plain Model | 87.56% | 90.57% | 89.04%
Transfer Learning | 90.34% | 93.60% | 91.94%

C.4. Multilingual Capabilities
In order to check the assumption of a possible multilingual overlap for pharmaceuticals in Spanish, Italian and English, we trained and evaluated a multilingual general model (i.e. mdeberta-v3-base) on all three datasets simultaneously by concatenating the data of all three languages. As can be seen in Figure 13, training exhibits a noticeable drop in performance compared to the language-specific models (see Figure 10 in Section C.1). The same can be said when looking at the results obtained when evaluating the model on the test set for each language separately (Table 5). This can be explained by minor differences in specialised pharmaceutical words among the different languages, which may slightly add noise to the data.

Table 5: Performance metrics of the multilingual Named Entity Recognition (NER) model evaluated on the ground truth for each language.
Language | Precision | Recall | F1 Score
Spanish | 86.29% | 89.00% | 87.62%
English | 85.37% | 89.48% | 87.38%
Italian | 85.72% | 86.10% | 85.91%

Figure 13: Fine-tuning mdeberta-v3-base on the cardiology domain for all languages in Track 2 (FARMACO). The files for training and evaluation have simply been concatenated across all three languages.

The analysis of drug entity recognition across English, Spanish, and Italian (see Figure 14) demonstrates a significant overlap and similarity in the pharmaceutical terminology used in these languages. As seen in Table 6, many drug names exhibit minor variations that are primarily due to linguistic differences such as suffixes and spelling conventions. This overlap can be attributed to the standardized nature of pharmaceutical nomenclature and the widespread use of international nonproprietary names (INNs). These findings suggest that a multilingual approach to drug entity recognition can leverage these similarities to enhance accuracy and consistency across different languages.

Table 6: Examples of similar drug names across English, Spanish, and Italian.
Drug (Stemmed) | English Form | Spanish Form | Italian Form
lenalidomid | lenalidomide | lenalidomida | lenalidomide
caffein | caffeine | cafeína | caffeina
triamcinolone acetonid | triamcinolone acetonide | triamcinolona acetónido | triamcinolone acetonide
ampicillin | ampicillin | ampicilina | ampicillina
sulfacetamid | sulfacetamide | sulfacetamida | sulfacetamide

(a) No preprocessing. (b) Stemmed and Lemmatized.
Figure 14: Overlap of pharmaceuticals in the training data for Spanish, Italian and English. To make the data more representative for similarity, we also provide entities that have been pre-processed via stemming and lemmatization. NLTK was used for Italian and English, while MedLexSp was used for Spanish.

D. Original Submission Runs

D.1. Track 1
run1-mdeberta-ct-mlm-dg: The architecture of mdeberta-v3-base was used and fine-tuned on admission notes via Masked Language Modeling (MLM), with continued fine-tuning on general diseases. In order to further tune the model, we used data augmentation.
MLM • Epochs: 5 • Learning Rate: 5e-6 • Loss: 9.3413 • Perplexity: 11399.24
Cardiology Task • Learning Rate: 1e-4 • Epochs: 5 • F1 Avg: 87.24% • F1 Exact: 86.76% • F1 Partial: 86.76% • F1 Ent Type: 88.69% • F1 Strict: 86.76%
run2-mdeberta-ct: The architecture of mdeberta-v3-base was used and fine-tuned on general diseases.
• Learning Rate: 2e-5 • Epochs: 10 • F1 Avg: 86.66% • F1 Exact: 86.61% • F1 Partial: 86.61% • F1 Ent Type: 88.31% • F1 Strict: 86.10%
run3-mdeberta-ct-dg: The architecture of mdeberta-v3-base was used and fine-tuned on general diseases, this time including data augmentation via MTSamples.
• Learning Rate: 2e-5 • Epochs: 10 • F1 Avg: 85.08% • F1 Exact: 84.43% • F1 Partial: 84.43% • F1 Ent Type: 87.03% • F1 Strict: 84.43%
run4-roberta-dg: The architecture of lcampillos/roberta-es-clinical-trials-ner was used and fine-tuned on the task of diseases in cardiology. To further tune the model on identifying only cardiology diseases, we used data augmentation via MTSamples. The base model already has a solid understanding of diseases and reaches an F1-score of 45.52%.
• Learning Rate: 2e-4 • Epochs: 5 • F1 Avg: 87.91% • F1 Exact: 87.52% • F1 Partial: 87.52% • F1 Ent Type: 89.08% • F1 Strict: 87.52%
run5-roberta-dg-windows: The architecture of lcampillos/roberta-es-clinical-trials-ner was used and trained on the task of diseases in cardiology. To further tune the model on identifying only cardiology diseases, we used data augmentation via MTSamples. To further improve data capture, we used the proposed sliding windows technique.
• Learning Rate: 2e-4 • Epochs: 3 • Window Overlap: 60 tokens • F1 Avg: 86.07% • F1 Exact: 85.48% • F1 Partial: 85.48% • F1 Ent Type: 87.78% • F1 Strict: 85.94%
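The MLM fine-tuning referenced in the runs above (e.g. run1-mdeberta-ct-mlm-dg) can be set up roughly as follows. The file name "admission_notes_es.txt" is a placeholder for the automatically translated TREC admission notes, and the batch size is illustrative; only the epoch count and learning rate mirror the run1 values.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, TrainingArguments, Trainer)

model_name = "microsoft/mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One admission note per line (placeholder path).
notes = load_dataset("text", data_files={"train": "admission_notes_es.txt"})["train"]
notes = notes.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                  batched=True, remove_columns=["text"])

args = TrainingArguments("mlm-admission-notes", num_train_epochs=5,
                         learning_rate=5e-6, per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=notes,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
        ).train()
# The resulting checkpoint is then fine-tuned on the cardiology NER data as usual.
```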
D.2. Track 2

D.2.1. Multilingual Models
We propose three multilingual models, for which the data from all three languages were concatenated for training and evaluation.
run1-mdeberta-multilingual: The architecture of mdeberta-v3-base was used.
• Learning Rate: 2e-5 • Epochs: 5 • F1 Avg: 82.43% • F1 Exact: 82.06% • F1 Partial: 82.06% • F1 Ent Type: 83.54% • F1 Strict: 82.06%
run2-mdeberta-ct-multilingual: The architecture of mdeberta-v3-base was used and fine-tuned on general drugs in Spanish.
• Learning Rate: 2e-5 • Epochs: 5 • F1 Avg: 83.22% • F1 Exact: 82.92% • F1 Partial: 82.92% • F1 Ent Type: 84.14% • F1 Strict: 82.92%
run3-roberta-multilingual: The architecture of lcampillos/roberta-es-clinical-trials-ner was used and fine-tuned on the task of detecting cardiology drugs. We worked under the assumption that pharmaceuticals may have very similar or even the same names in Spanish, Italian, and English. The base model already has a solid understanding of general drugs and reaches an F1-score of 81.79% for exact matching and 76.04% for strict matching.
• Learning Rate: 8e-5 • Epochs: 10 • F1 Avg: 75.14% • F1 Exact: 74.86% • F1 Partial: 74.86% • F1 Ent Type: 76.01% • F1 Strict: 74.86%

D.2.2. Language-Specific Models
Each language has two language-specific runs. The purpose of these runs is to compare domain-specific models (i.e. models specially trained on the medical domain, using transfer learning to specialize the model on the cardiology domain) to large language-agnostic base models (i.e. mdeberta-v3-base). Run 4 contains the base model, while run 5 contains the domain-specific model.

es
run4-mdeberta-ct-mlm-dg: The architecture of mdeberta-v3-base was used and fine-tuned on general drugs in Spanish. Furthermore, it was fine-tuned on Spanish admission notes via Masked Language Modeling (MLM). Additional data via the automatically translated MTSamples dataset was used.
MLM • Epochs: 5 • Learning Rate: 8e-6 • Loss: 8.7417 • Perplexity: 10589.27
Cardiology Task • Learning Rate: 8e-5 • Epochs: 4 • F1 Avg: 70.03% • F1 Exact: 69.23% • F1 Partial: 69.23% • F1 Ent Type: 72.41% • F1 Strict: 69.23%
run5-roberta-ct-mlm: The architecture of lcampillos/roberta-es-clinical-trials-ner was used and fine-tuned on Spanish admission notes via Masked Language Modeling (MLM).
MLM • Epochs: 5 • Learning Rate: 1e-6 • Loss: 8.8788 • Perplexity: 7178.02
Cardiology Task • Learning Rate: 1e-4 • Epochs: 10 • F1 Avg: 59.59% • F1 Exact: 59.05% • F1 Partial: 59.05% • F1 Ent Type: 61.21% • F1 Strict: 59.05%

en
run4-mdeberta-windows: The architecture of mdeberta-v3-base was used, including the sliding windows approach to enhance data capture.
• Learning Rate: 1e-4 • Epochs: 10 • Window Overlap: 60 tokens • F1 Avg: 80.45% • F1 Exact: 80.22% • F1 Partial: 80.22% • F1 Ent Type: 81.15% • F1 Strict: 80.22%
run5-biobert-mlm-windows: The architecture of alvaroalon2/biobert_chemical_ner was used and fine-tuned on English (original) admission notes via Masked Language Modeling (MLM). Furthermore, we used the sliding windows approach to enhance data capture. It is worth mentioning that lcampillos/roberta-es-clinical-trials-ner (with the same specifications) actually achieved slightly better results than alvaroalon2/biobert_chemical_ner, i.e. an average F1-score of 79.00%.
MLM • Epochs: 5 • Learning Rate: 1e-6 • Loss: 8.65492 • Perplexity: 5738.31
Cardiology Task • Learning Rate: 1e-4 • Epochs: 5 • Window Overlap: 60 tokens • F1 Avg: 75.50% • F1 Exact: 74.94% • F1 Partial: 74.94% • F1 Ent Type: 75.54% • F1 Strict: 77.18%

it
run4-mdeberta: The architecture of mdeberta-v3-base was used, without any data-enhancing techniques.
• Learning Rate: 1e-4 • Epochs: 10 • F1 Avg: 90.70% • F1 Exact: 90.49% • F1 Partial: 90.49% • F1 Ent Type: 91.34% • F1 Strict: 90.49%
run5-biobit-mlm: The architecture of IVN-RIN/bioBIT was used and fine-tuned on Italian admission notes via Masked Language Modeling (MLM). It is worth mentioning that lcampillos/roberta-es-clinical-trials-ner (with the same specifications) achieved worse results than IVN-RIN/bioBIT this time, i.e. an average F1-score of 76.81%.
• Learning Rate: 1e-4 • Epochs: 10 • F1 Avg: 89.77% • F1 Exact: 89.56% • F1 Partial: 89.56% • F1 Ent Type: 90.40% • F1 Strict: 89.56%