<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Biomedical Semantics 14 (2023) 2. URL: https://doi.org/10.1186/s13326</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1186/s13326-022-00281-5</article-id>
      <title-group>
        <article-title>Cross-Linguistic Disease and Drug Detection in Cardiology Clinical Texts: Methods and Outcomes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Styll</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Campillos-Llanos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wojciech Kusa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Research Unit (E194-04), Technische Universität Wien</institution>
          ,
          <addr-line>Favoritenstraße 9-11, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Language, Literature and Anthropology, Spanish National Research Council (CSIC)</institution>
          ,
          <addr-line>c/Albasanz 26, 28037 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>4171</fpage>
      <lpage>4186</lpage>
      <abstract>
        <p>This paper presents our approach to the MultiCardioNER lab at CLEF2024, focusing on disease detection in Spanish texts and drug detection in Italian, Spanish, and English texts. We enhance model performance through several strategies: (1) fine-tuning on automatically translated TREC Clinical Trials admission notes using Masked Language Modeling (MLM); (2) data augmentation with translated MTSamples processed through a Spanish medical lexicon (MedLexSp) for accurate vocabulary matching; and (3) employing sliding windows with overlap to improve data capture. Additionally, we use transfer learning with a clinical trials corpus (CT-EMB-SP) to refine the outcomes. We further fine-tune several already established disease and drug extraction models to leverage their extensive vocabulary and compare their performance to models trained from scratch. Our methods and experiments demonstrate notable improvements in multilingual clinical NER, as evidenced by our track results.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical Named Entity Recognition</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Cardiology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this Section, we give the background to our methodology. We describe our proposed techniques to further enhance results and the baseline models we used, and we discuss how results and outputs are effectively evaluated.</p>
      <sec id="sec-2-1">
        <title>2.1. Proposed Techniques</title>
        <p>• Fine-tuning via Masked Language Modeling</p>
        <p>We proposed fine-tuning on 240 automatically translated admission notes of the TREC Clinical Trials track via Masked Language Modeling [5]. This should help the model develop a sense of what patient notes look like and enhance its understanding.</p>
        <p>• Data Augmentation</p>
        <p>We used the NickyNicky/medical_mtsamples dataset from HuggingFace as a means of data augmentation. We extracted cardiology diseases and drugs, automatically translated the texts to Spanish via the Google Translate API [6], and additionally processed the entities using a medical lexicon for Spanish (MedLexSp [7]) to ensure only correct medical vocabulary was used.</p>
        <p>• Sliding Windows with Overlap</p>
        <p>We employed a sliding-window attention approach with overlap to handle long sequences of clinical text. This method has been effectively utilized in various Natural Language Processing (NLP) tasks to manage texts that exceed the input size limitations of standard models [5]. By breaking the text into smaller, overlapping segments, the model can better understand the context and connections between different sections of the document.</p>
        <p>• Additional Fine-Tuning/Transfer Learning on general diseases/drugs</p>
        <p>We fine-tuned several baseline models to detect diseases and drugs from the CT-EMB-SP corpus [8] with the goal of enhancing the model’s vocabulary of specific medical data. This is a collection of 1,200 texts about clinical trials in Spanish (500 journal abstracts and 700 trial announcements). It was annotated with entities for four semantic groups of the Unified Medical Language System [9]: ANAT, CHEM, DISO and PROC. This resource facilitates machine learning experiments for information extraction on evidence-based medicine. For information on the models and training process, see Table 3 and Figures 6a and 6b in Appendix A.</p>
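        <p>For illustration, the overlapping-window chunking described above can be sketched in plain Python as follows; the window and stride values here are assumptions for the example, not the exact configuration used in the track:</p>

```python
def sliding_windows(tokens, window=512, stride=384):
    """Split a long token sequence into overlapping windows.

    With stride < window, consecutive windows share (window - stride)
    tokens, so entities near a window boundary still appear whole
    in at least one window.
    """
    if len(tokens) <= window:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window reaches the end of the note
        start += stride
    return windows

# A 1000-token note becomes three overlapping windows:
chunks = sliding_windows(list(range(1000)), window=512, stride=384)
```

In practice the same effect can be obtained from a HuggingFace tokenizer with overflow handling; the pure-Python version above only shows the chunking logic itself.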
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Baseline Models</title>
        <p>This Section introduces all pre-trained models that we used in both experiments and submissions. We explain how each model was pre-trained, why it is potentially useful, and how we employed it in our research.</p>
        <p>• google-bert/bert-base-multilingual-cased</p>
        <p>This is a multilingual version of the BERT model [5], which served as our baseline. It is a small model that we fine-tuned and used in the preliminary evaluation of track 1.</p>
        <p>• microsoft/mdeberta-v3-base [10] [11]</p>
        <p>This large, multilingual general-domain model has recently gained recognition for its effectiveness in processing medical data. We fine-tuned it both from scratch and on the CT-EMB-SP corpus [8] for increased vocabulary. We used this model for every part of the track. Table 3 in Appendix A shows the parameters and performance of this model.</p>
        <p>• lcampillos/roberta-es-clinical-trials-ner [8]</p>
        <p>This model is based on the RoBERTa architecture and is specifically fine-tuned for named entity recognition tasks in Spanish clinical trial texts. It is designed to effectively identify medical entities within the domain of clinical trials, enhancing the extraction of relevant information from these documents. On the evaluation set of its training data, it achieved a strong F1-score of 86.47%, demonstrating its effectiveness. We used it for every part of the track.</p>
        <p>This model seemed very promising, since on preliminary testing on the MultiCardioNER data it already achieved an F1-score of 45.52% for track 1 and 76.04% for track 2. This suggests that there is a greater difference between the general medical domain and the cardiology domain for diseases than for pharmaceuticals.</p>
        <p>• PlanTL-GOB-ES/bsc-bio-ehr-es [12]</p>
        <p>This model is pre-trained on Spanish electronic health records (EHR), a large corpus of biomedical texts. It has outperformed other popular models on certain tasks, showcasing its performance. We used it for track 1 and the Spanish part of track 2.</p>
        <p>• IVN-RIN/bioBIT [13]</p>
        <p>BioBIT (Biomedical BERT for Italian) is a model tailored for the biomedical domain, pre-trained on an Italian biomedical corpus derived from machine-translated PubMed abstracts. Built on the BERT architecture, BioBIT utilizes Masked Language Modeling and Next Sentence Prediction for pretraining. It excels in multiple tasks, including Named Entity Recognition (NER), achieving high accuracy across several biomedical datasets. We used it for evaluating the Italian part of track 2.</p>
        <p>• alvaroalon2/biobert_chemical_ner [14]</p>
        <p>This BioBERT model is fine-tuned for named entity recognition (NER) tasks specifically targeting chemical entities. It has been trained on the BC5CDR-chemicals [15] and BC4CHEMD [16] corpora, making it highly effective for identifying chemical mentions in biomedical texts. This model is a valuable tool for chemical NER in the biomedical domain, supporting advanced research and data extraction.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Metrics</title>
        <p>For evaluating the models, we used entity-level evaluation metrics [17]. Since we are working with a highly imbalanced dataset, this provides a more accurate assessment of Named Entity Recognition (NER) performance.</p>
        <p>The International Workshop on Semantic Evaluation (SemEval’13) introduced four ways to evaluate Named Entity Recognition (NER) performance: Strict, Exact, Partial, and Type. These methods consider various aspects of matches between system predictions and ground truth annotations. The evaluation schemas assess correctness, incorrectness, partial matches, missed entities, and spurious entities differently, impacting the calculated precision, recall, and F1-scores.</p>
        <p>Strict requires an exact boundary and type match, Exact requires just a boundary match, Partial accepts partial boundary overlap, and Type requires a type match with some overlap. These metrics provide a comprehensive evaluation of Named Entity Recognition (NER) systems under different match criteria.</p>
        <p>When assessing the performance of our models, we use the average of the four F1-scores of the evaluation metrics: Strict, Exact, Partial, and Type. This average F1-score (F1<sub>avg</sub>) is calculated as follows:</p>
        <p>F1<sub>avg</sub> = (F1<sub>strict</sub> + F1<sub>exact</sub> + F1<sub>partial</sub> + F1<sub>type</sub>) / 4</p>
        <p>This method allows us to effectively determine the most performant model by considering a balanced view of different evaluation criteria.</p>
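        <p>As a sketch, the four SemEval’13 schemas and the averaged F1-score can be expressed as follows; this is a minimal plain-Python illustration (the span-triple representation is our own simplification, not the official evaluation script):</p>

```python
def match_schemas(pred, gold):
    """Classify one prediction against one gold annotation under the
    four SemEval'13 schemas. Spans are (start, end, entity_type) triples
    with half-open character offsets."""
    same_bounds = pred[0] == gold[0] and pred[1] == gold[1]
    overlap = pred[0] < gold[1] and gold[0] < pred[1]
    same_type = pred[2] == gold[2]
    return {
        "strict": same_bounds and same_type,  # exact boundary and type
        "exact": same_bounds,                 # exact boundary, any type
        "partial": overlap,                   # any boundary overlap
        "type": overlap and same_type,        # overlapping span, right type
    }

def average_f1(strict, exact, partial, type_):
    """Average of the four schema-level F1-scores, as used in the paper."""
    return (strict + exact + partial + type_) / 4

# A prediction with the right type but a too-short span is only a
# partial/type match:
m = match_schemas((10, 25, "ENFERMEDAD"), (10, 30, "ENFERMEDAD"))
```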
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <sec id="sec-3-1">
        <title>3.1. Preliminary Experiments</title>
        <p>The performance of the models is evaluated by cutting off excessive tokens from each patient note, where each model has an input size of 512 tokens. If models are not trained via sliding windows, excess tokens are simply cut off during the tokenization process. Note that, due to time constraints, we ran the experiments just once for each model. However, a better methodology would be initializing each model with different seed values and reporting the average and standard deviation of all runs. This method would provide a more realistic overview of each model’s performance.</p>
        <p>Please note that these preliminary experiments do not yet include error analysis, as described in Section 3.3. The absolute performance of the models depicted here is not demonstrative, but the relative difference of the separate runs showcases the success of the proposed techniques and gives some insights into how the models behave. For additional experiments where the absolute performance of the models is depicted, please see Appendix C.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Experiments for Track 1</title>
          <p>Baseline: Baseline values are given by the multilingual-bert and bsc-bio-ehr-es, which already achieved decent F1-scores on the development set, with bsc-bio-ehr-es being at 81.48% and multilingual-bert at 76.50%. This already suggests a clear benefit of fine-tuning on domain-specific data for specific tasks. Domain-Specific Model: For this purpose, the roberta-es-clinical-trials-ner model (as introduced in Section 2.2) was used as a baseline. Since it was fine-tuned on general diseases in the Spanish language (not only cardiovascular conditions), it already started with a relatively high F1-score in the first epoch (see Figure 1a). We can see that the data-augmentation technique showed a promising influence in further fine-tuning the model to the cardiology domain. The sliding windows approach showed slightly worse results. However, the difference is not large enough for a conclusion.</p>
          <p>After some hyperparameter-tuning, the model achieved an 87.9% F1-score at its peak. As evident in Figure 1b, the Masked Language Modeling approach did not necessarily influence results. This might be due to the lack of data for this kind of fine-tuning, which may add unnecessary bias. During the process of hyperparameter-tuning, we saw that a higher learning rate (i.e. 1e-4 instead of 2e-5) performed slightly better (approximately an increase of 4% in the evaluation metric). The same can be said for the batch size, where a higher size yielded better results (approximately 7% in F1-score). Unfortunately, experiments were rather limited here due to the lack of GPU RAM.</p>
          <p>Multilingual Model For this purpose, the mdeberta-v3-base model was used as a pre-trained model.
We first fine-tuned the baseline model on the cardiology data provided by the shared task, which already
showed promising results. It is also interesting to see that in the beginning the model’s performance
started much lower than those models that were already fine-tuned on general diseases (see Figure 2).
Fine-tuning the model on general diseases using the CT-EMB-SP corpus showed promising changes
in performance. Adding data-augmentation and Masked Language Modeling (MLM) as additional
techniques only influenced the results slightly. In the end, it reached an F1-score of 87% at its peak.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Experiments for Track 2</title>
          <p>Domain-Specific Model (es) The roberta-es-clinical-trials-ner model was used as a baseline. Surprisingly, the scores were relatively low. This was unexpected, since we had already measured much better performance by this model on the MultiCardioNER data. Eventually, we obtained an F1-score of 59.79%.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Insights from other languages (en, it)</title>
          <p>Testing the roberta-es-clinical-trials-ner model on both the English and Italian track provided us with interesting results. Looking at Figure 4a of the English track,
we can actually see that the Spanish model outperformed both the multilingual general domain model
(i.e. mdeberta-v3-base) and the domain-specific model (i.e. BioBERT). The same cannot be said for the
Italian track, where it was outperformed by every other run (see Figure 4b). For the Italian track, the
mdeberta-v3-base model won with an F1-score of 90.7%, while the roberta-es-clinical-trials-ner achieved
80.45% on the English track. These results suggest a great multilingual overlap for pharmaceuticals.
Multilingual Model (es) As in previous experiments, the mdeberta-v3-base model was used as a
baseline and fine-tuned on drugs from the CT-EMB-SP corpus (which did not happen for the other
languages where the base model was used). As can be seen in Figure 3b, the combination of Masked
Language Modeling and data-augmentation actually brought much benefit. In the end, the model
achieved an F1-score of 70.87% at its peak.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Official Submissions</title>
        <p>The submission runs are described in Appendix D. It is important to note that these results do not reflect the absolute performance of the models (see Section 3.3 for further details and Appendix C for additional experiments with demonstrative performance), but the relative difference of the separate runs showcases the success of the proposed techniques and gives some insights into how the models behave.</p>
        <p>• Fine-Tuning on General Diseases (Transfer-Learning)</p>
        <p>The first three runs (multilingual runs) of track 2 show evidence that fine-tuning models on general diseases before focusing on the cardiology domain yielded great benefit. The first run, which was not fine-tuned on general diseases, showed worse performance than runs 2 and 3, which were both fine-tuned on the CT-EMB-SP corpus.</p>
        <p>• Multilingual Overlap for Pharmaceuticals</p>
        <p>Looking at run 2 and run 3 of track 2, we can see that the multilingual models performed similarly to the monolingual models. This suggests a big multilingual overlap for drugs in Spanish, English and Italian.</p>
        <p>• Possible Noise by Data Augmentation due to Machine Translation</p>
        <p>Several runs (e.g. run 2 and run 3 of track 1) showed slightly worse performance when data augmentation was used. This suggests possible additional noise in the training data. For data augmentation, we used the MTSamples dataset, where we used the keywords as entities and translated all text via the Google Translate API. On first inspection, these keywords may refer to laboratory tests, procedures or anatomical entities. Therefore, we processed the translated MTSamples with MedLexSp; we output only DISO and CHEM categories, and we renamed them to ENFERMEDAD and FARMACO, respectively. Nonetheless, it is unclear whether these data contain general diseases or only cardiology diseases. Furthermore, on closer inspection, there are some issues with the machine translation, which is also visible in the automatic translation of the TREC Admission Notes for Masked Language Modeling (MLM) fine-tuning. Either some words are not translated or new words are created, possibly due to the sub-words of neural models. An example would be *leucitos en urino, which should be leucocitos en orina (’white cells in urine’).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Error Analysis</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Data Reconstruction</title>
          <p>Several problems arose during generating the runs; namely, reconstructing the output of the mdeberta-v3-base and roberta-es-clinical-trials-ner models.
mdeberta-v3-base This model posed several problems, particularly in generating the correct index of the span. Often, the start of the span would be one or two tokens off, leading to a decrease in the F1-score for the runs. Additionally, tokens might have leading spaces or newline characters at the beginning or end. These extraneous characters need to be removed to ensure the entity text is clean and accurate. This also includes adjusting the start and end offsets to reflect the new positions of the cleaned tokens.</p>
          <p>The presence of punctuation at the end of tokens can create issues in entity recognition. Special rules are required to handle exceptions such as units (mg.) or cases where brackets are involved. Unnecessary punctuation needs to be removed, but care must be taken to preserve punctuation that is part of the entity. Furthermore, the model would sometimes add tabs instead of spaces into the extracted entity. When tokens are merged or cleaned, their character offsets in the text need to be recalculated. This ensures that the entities’ positions in the text are accurately represented, which is crucial for tasks like text highlighting or linking entities back to their original context.
roberta-es-clinical-trials-ner This model exhibits significant issues with handling sub-words, often treating them as separate entities. Specifically, leading sub-words are represented as individual entities with a preceding space. This requires special considerations during the reconstruction process to ensure accurate entities.</p>
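          <p>A minimal sketch of this kind of span clean-up, assuming a hypothetical helper and a simplified unit rule (the actual reconstruction scripts are more involved):</p>

```python
def clean_entity(text, start, end, strip_punct=".,;:"):
    """Trim whitespace/newlines and stray trailing punctuation from an
    extracted span, recalculating the character offsets so that
    text[start:end] still equals the cleaned entity string."""
    span = text[start:end]
    # Strip leading whitespace and shift the start offset accordingly.
    lead = len(span) - len(span.lstrip())
    start += lead
    span = span.lstrip()
    # Strip trailing whitespace and shift the end offset accordingly.
    trail = len(span) - len(span.rstrip())
    end -= trail
    span = span.rstrip()
    # Drop trailing punctuation unless it belongs to a unit such as "mg."
    # (a deliberately simplified exception rule for illustration).
    if span and span[-1] in strip_punct and not span.lower().endswith("mg."):
        span = span[:-1]
        end -= 1
    return span, start, end

doc = "Tratamiento con  amiodarona, 200 mg."
ent, s, e = clean_entity(doc, 15, 28)  # raw span "  amiodarona,"
```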
          <p>General Remarks We analysed all errors made by the roberta-es-clinical-trials-ner model in run 4 of
track 1; we used a Python script to count each type. There are several types of errors that the model
made while generating the runs. Examples of this run may be found in Table 2.</p>
          <p>1. Scope Errors</p>
          <p>a) Incompletely predicted entities</p>
          <p>These are entities where the model predicted only a part of the actual entity, missing some crucial parts.</p>
          <p>b) Entities where too many words were predicted</p>
          <p>Entities where the predicted span includes extra information not part of the actual entity.</p>
          <p>c) Entities that would belong together</p>
          <p>Entities where the predicted spans should be combined to form a single coherent entity.</p>
          <p>2. False Positives</p>
          <p>Entities that were incorrectly identified by the model, but not labeled in the ground truth.</p>
          <p>3. False Negatives (Missed entities)</p>
          <p>Entities that were missed by the model, but were labeled in the ground truth.</p>
          <p>Looking into Figure 5, we can see that the high number of scope errors (entities that were identified, but with imperfect spans) indicates that the model generally has a strong baseline capability for recognizing entities. Despite the high accuracy in identifying entities, the model still exhibits significant precision and recall issues, as evidenced by the presence of false positives and false negatives. Furthermore, the high number of true positives amidst other errors implies that errors are not due to a fundamental flaw in the model but likely due to specific cases or contexts where the model’s performance drops.</p>
          <p>Nonetheless, scope errors tend to cause less harm in the information extraction of clinical cases. Not detecting an entity (false negative, case 3 in Table 2) is more severe, whereas detecting fumador instead of fumador activo (case 1a in Table 2) is a mild error. This becomes even clearer when looking into errors where abbreviations were used. If the text reads diabetes mellitus (DM), the model would extract diabetes mellitus (DM) as one entity, while the test set would annotate both the whole text and the abbreviation separately. This happens very frequently, which makes the evaluation unsuitable for measuring the model’s practical performance. In the end, a more relaxed evaluation metric could have been more appropriate and yielded higher results.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Data Capture</title>
          <p>There are several factors that contributed to why the models underperformed in the submission runs (see Table 1). Further data analysis shows that the main issue was insufficient data capture during both training and evaluation. Plots may be found in Appendix B. In general, we use data capture to refer to the degree to which the model has difficulty capturing all information from the patient notes. After all, in preliminary experiments and for generating submission runs, we simply cut off excess tokens that did not fit into the model.</p>
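          <p>The notion of data capture under a token cutoff can be illustrated with a small sketch (a hypothetical helper for the example, not the analysis script behind the plots):</p>

```python
def capture_rate(entity_offsets, note_len, limit=512):
    """Fraction of annotated entities that survive truncating a note to
    `limit` tokens. entity_offsets are the token positions of entities."""
    if note_len <= limit:
        return 1.0  # the whole note fits, nothing is lost
    kept = sum(1 for idx in entity_offsets if idx < limit)
    return kept / len(entity_offsets)

# Entities clustered at the end of a long note are mostly lost
# under a 512-token cutoff:
rate = capture_rate([10, 40, 900, 950, 990], note_len=1000)
```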
          <p>General Remarks The relative distribution of entities over patient notes, as depicted in the density
plots in Figure 8 (Appendix B), reveals several interesting insights.</p>
          <p>For track 1, the training set displays a prominent spike at the beginning, followed by a relatively uniform
distribution throughout the rest of the notes. In contrast, the development set and test set exhibit two
significant spikes: one at the beginning and a larger one at the end, with a notably lower density in the
middle. This pattern suggests that most diseases are mentioned either at the beginning or the end of
the patient notes.</p>
          <p>Turning to pharmaceuticals in track 2, we observe similar entity distributions across all plots. The training set again shows a more uniform distribution, whereas the development and test sets both feature two prominent spikes at the beginning and end, mirroring the pattern observed in track 1. Looking at the word counts in the boxplots of Figure 7, we can clearly see that the training set exhibits significantly fewer words than both the development set and the test set. About 75% of patient notes from the training set have fewer than 550 words, which applies to less than 25% of patient notes from the development/test set.</p>
          <p>The Venn diagrams in Figure 9 (Appendix B) are also worth mentioning. As we can see, track 1 seems to have only little overlap between the datasets, while track 2 has notably more overlap. This may imply that the model suffers from a few-shot learning problem, especially since results on track 2 are significantly better in terms of performance than those of track 1. Another factor may be the number of unique entities in the datasets, which is much larger in track 1 than in track 2, further complicating the task for the model.</p>
          <p>Implications for Training and Evaluation The previous training and evaluation strategies were significantly affected by these entity distribution patterns. During previous evaluation, a cutoff strategy was used, where all excess tokens were trimmed to fit the model’s input layer, which was uniformly set at 512 tokens. This meant that only approximately 60% of the data was fully utilized during training. However, due to the high density of entities at the end of patient notes, this approach resulted in sub-optimal data capture. The situation was even worse during evaluation on the development set, where less than 25% of the data fit into the models without token cutoff. This led to the model being evaluated on a non-representative portion of the dataset, inflating the performance metrics. To improve data capture, we decided to split the patient notes into individual sentences using spaCy [18] for both training and evaluation. This change not only yielded better results, particularly for track 2, but also provided more reliable and representative metrics. Consequently, several experiments were re-conducted (refer to Appendix C). It is important to note that these new runs expand and confirm the trends observed in earlier experiments (see Section 3.1).</p>
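          <p>As an illustration of the sentence-level splitting step (we used spaCy in practice; the rule-based splitter below is only a simplified stand-in for it):</p>

```python
import re

def split_sentences(note):
    """Naive rule-based sentence splitter: breaks on sentence-final
    punctuation followed by whitespace. spaCy's sentencizer handles
    many more cases (abbreviations, decimals, etc.)."""
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [p for p in parts if p]

note = "Paciente de 67 anos. Acude por disnea. Sin alergias conocidas."
sents = split_sentences(note)  # each sentence then fits the model easily
```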
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We can see some interesting trends in the data, allowing us to draw conclusions about both our proposed strategies and the provided data.</p>
      <p>• Fine-tuning via Masked Language Modeling: This approach had very little influence on the model’s results. This can be attributed to (i) the lack of sufficient data for this kind of fine-tuning, (ii) the fact that the patient notes are based on the general domain, and (iii) erroneous machine translation.</p>
      <p>• Data Augmentation: The effects of data augmentation are still unclear. We have observed both positive and negative effects across different model architectures. More experiments with different models and types of data augmentation resources are necessary to draw definitive conclusions.</p>
      <p>• Sliding Windows with Overlap: The impact of the sliding windows approach, as opposed to cutting off excess tokens, is also difficult to judge. Despite expecting better data capture, some experiments actually showed slightly worse results. This effect may be due to patient notes being split at random positions, resulting in incorrect grammar and split entities, which can disrupt the contextual information the model relies on. This issue becomes more evident when considering that processing patient notes at the sentence level improved results notably.</p>
      <p>• Additional Fine-Tuning/Transfer-Learning on general diseases/drugs: This approach significantly improved the model’s performance. Various experiments demonstrated that adapting a general model to a specific domain requires less effort and yields promising results with relatively little training.</p>
      <p>• Insufficient Data Capture: Due to the high density of entities at the beginning and end of the patient notes, the cutoff strategy performed poorly, missing entities at the end of the notes.</p>
      <p>• Overlap of Entities over Datasets: There are significantly fewer overlapping entities between the training, development and testing datasets for track 1 than there are for track 2. This may explain the generally worse results for track 1, indicating that models may suffer from a few-shot learning problem.</p>
      <p>• Multilingual Overlap for Pharmaceuticals: We have shown that there is a big multilingual overlap concerning pharmaceuticals in Spanish, Italian and English. This can be largely attributed to the standardized pharmaceutical nomenclature, which suggests that a multilingual approach to drug entity extraction can leverage these similarities to enhance accuracy and consistency across different languages.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Leonardo Campillos-Llanos’ work is conducted in the CLARA-MeD project (PID2020-116001RA-C33),
funded by MICIU/AEI/10.13039/501100011033/, in call Proyectos I+D+i Retos Investigación.</p>
    </sec>
    <sec id="sec-6">
      <title>B. Data Analysis</title>
      <p>It is important to note that for track 2 (FARMACO), the density plots in Figure 8 and boxplots in Figure 7 look the same among the three different languages, despite translation. Trivially, the boxplots in Figure 7 look the same for both track 1 and track 2 since the same data was used.</p>
      <p>[Panel captions: (a) Train Set; (b) Dev Set; (c) Test Set.]</p>
      <p>[Panel captions: (a) Track 1 - Train Set; (b) Track 1 - Dev Set; (c) Track 1 - Test Set; (d) Track 2 - Train Set; (e) Track 2 - Dev Set; (f) Track 2 - Test Set.]</p>
      <p>[Panel captions: (a) Track 1 - ENFERMEDAD; (b) Track 2 - FARMACO.]</p>
    </sec>
    <sec id="sec-7">
      <title>C. Additional Experiments</title>
      <sec id="sec-7-1">
        <title>C.1. Cardiology Domain Adaptation</title>
        <p>This experiment serves to show how easily a general model, i.e. one trained on general pharmaceuticals, can be adapted to a special medical domain. The roberta-es-clinical-trials-ner model was first fine-tuned on general drugs using the CT-EMB-SP corpus in Spanish; it was then used as a base model and further fine-tuned on the cardiology domain for pharmaceuticals. As previously mentioned in Section 2.2, it already achieved an F1-score of 76.04% before fine-tuning on cardiology data.</p>
        <p>As seen in Figure 10, epoch 1 already shows an incredible performance on the development set.
Evaluating this model of epoch 1 on the test set, we achieved a precision of 86.15%, a recall of 93.77% and
an F1-score of 89.80%. Using a model trained on three epochs (which shows the peak in Figure 10) we
obtained a precision of 90.25%, a recall of 94.30% and an F1-score of 92.23%.</p>
        <p>The same can be said when training the same model on track 1, where the plain model achieved a precision of 62.86%, a recall of 65.42% and an F1-score of 64.11%. With data augmentation, we achieved a significant improvement with a precision of 65.65%, a recall of 73.76% and an F1-score of 69.47%. Nonetheless, when reflecting on the results in Section 3.2, we discussed possible negative effects due to incorrect machine translation. These effects are definitely visible when using e.g. mdeberta-v3-base as a baseline architecture (see also Table 1), which is why we are not entirely capable of judging the effect of data augmentation.</p>
      </sec>
      <sec id="sec-7-2">
        <title>C.2. Effect of Data Augmentation</title>
        <p>When looking into the effects of our proposed data augmentation, we trained roberta-es-clinical-trials-ner with and without data augmentation (same setup, i.e. same hyper-parameters). In Figure 11, we can see similar behaviour in training, but with less performance on the development set. When evaluating the model on the test set, we got a precision of 92.08%, a recall of 94.06% and an F1-score of 93.06%. Considering the model trained in Section C.1, data augmentation actually led to a slightly higher score. Although the outcomes of some models seem to support that it may help adapting a general model to a specific domain, we would need to experiment with more models and test more types of resources for data augmentation.</p>
      </sec>
      <sec id="sec-7-3">
        <title>C.3. Effect of Fine-Tuning on General Domain - Transfer-Learning</title>
        <p>In order to more precisely measure the benefit of fine-tuning a model on a general domain before
fine-tuning it on a specialized domain, we conducted the following experiment on the Spanish part of
track 2. mdeberta-v3-base was used as a baseline. We compared the plain mdeberta-v3-base model with one
fine-tuned on the general medical domain via the CT-EMB-SP corpus. Looking at the graph in Figure 12,
the general model already showed greater performance than the base model in the very early stages of
training (validating once more what was seen with regard to adapting a general model to the medical
domain in Section C.1). The base model started with a relatively low F1-score, but caught up in the last
epochs.</p>
        <p>In Table 4, we can see the notable increase in performance not only on the development set, but also on
the test set.</p>
      </sec>
      <sec id="sec-7-4">
        <title>C.4. Multilingual Capabilities</title>
        <p>In order to check the assumption of a possible multilingual overlap for pharmaceuticals in Spanish,
Italian and English, we trained and evaluated a multilingual general model (i.e. mdeberta-v3-base) on all
three datasets simultaneously by concatenating the data of all three languages.
As can be seen in Figure 13, training exhibits a significant drop in performance compared to the
language-specific models (see Figure 10 in Section C.1). The same can be said of the
results obtained when evaluating the model on the test set for each language separately (Table 5). This
can be explained by minor differences in specialized pharmaceutical terms among the different languages,
which may add slight noise to the data.</p>
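        <p>The concatenation setup described above can be sketched in a few lines; the dataset layout (token/label pairs per language) and the label names are illustrative assumptions, not the actual data format used in the paper:</p>
        <preformat>
```python
def build_multilingual(datasets):
    """Merge per-language NER corpora (es/en/it) into one training set,
    keeping a language tag so results can still be reported per language."""
    merged = []
    for lang, examples in datasets.items():
        for tokens, labels in examples:
            merged.append({"lang": lang, "tokens": tokens, "labels": labels})
    return merged

# Hypothetical toy corpora with BIO labels:
corpora = {
    "es": [(["se", "administra", "ibuprofeno"], ["O", "O", "B-FARMACO"])],
    "it": [(["somministrato", "ibuprofene"], ["O", "B-FARMACO"])],
    "en": [(["ibuprofen", "given"], ["B-FARMACO", "O"])],
}
train_set = build_multilingual(corpora)
```
        </preformat>
        <p>Keeping the language tag on each example is what later allows the per-language test-set evaluation reported in Table 5.</p>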
        <p>The analysis of drug entity recognition across English, Spanish, and Italian (see Figure 14) demonstrates
a significant overlap and similarity in the pharmaceutical terminology used in these languages. As seen
in Table 6, many drug names exhibit minor variations that are primarily due to linguistic differences
such as suffixes and spelling conventions. This overlap can be attributed to the standardized nature of
pharmaceutical nomenclature and the widespread use of international nonproprietary names (INNs).
These findings suggest that a multilingual approach to drug entity recognition can leverage these
similarities to enhance accuracy and consistency across different languages.</p>
        <p>[Table 6: example drug names in English (lenalidomide, cafeine, triamcinolone acetonide, ampicillin,
sulfacetamide), Spanish (lenalidomida, cafeína, triamcinolona acetónido, ampicilina, sulfacetamida) and
Italian (lenalidomide, cafeina, triamcinolone acetonide, ampicillina, sulfacetamide), shown (a) without
preprocessing and (b) stemmed and lemmatized (lenalidomid, cafein, triamcinolone acetonid, ampicillin,
sulfacetamid).]</p>
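        <p>To illustrate why stemming and lemmatization align these variants, a crude normalizer (an illustrative stand-in, not the preprocessing actually used in the paper) can already map the Table 6 forms to a common representation:</p>
        <preformat>
```python
import unicodedata

def normalize_drug(name: str) -> str:
    """Crude cross-lingual normalizer for INN drug names: lowercase,
    strip accents, collapse doubled consonants, drop a trailing vowel."""
    s = unicodedata.normalize("NFKD", name.lower())
    s = "".join(c for c in s if not unicodedata.combining(c))
    out = []
    for c in s:
        if out and out[-1] == c and c not in "aeiou":
            continue  # collapse doubled consonants: ampicillina -> ampicilina
        out.append(c)
    s = "".join(out)
    # drop a trailing vowel: lenalidomide/lenalidomida -> lenalidomid
    return s[:-1] if s and s[-1] in "aeo" else s
```
        </preformat>
        <p>On the Table 6 examples, the English, Spanish and Italian variants of ampicillin, lenalidomide and cafeine all collapse to a single form, which is the overlap a multilingual model can exploit.</p>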
      </sec>
    </sec>
    <sec id="sec-8">
      <title>D. Original Submission Runs</title>
      <p>D.1. Track 1</p>
      <p>run1-mdeberta-ct-mlm-dg The architecture of mdeberta-v3-base was used and fine-tuned on
admission notes via Masked Language Modeling (MLM), with continual fine-tuning on general diseases.
In order to further tune the model, we used data augmentation.</p>
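      <p>MLM fine-tuning trains the model to recover randomly masked tokens from the admission notes. A minimal sketch of the masking step (the 15% default rate is the conventional BERT setting, assumed here rather than taken from the paper):</p>
      <preformat>
```python
import random

def mask_tokens(tokens, mask_token="[MASK]", rate=0.15, seed=7):
    """Randomly replace a fraction of tokens with a mask token; the
    original tokens become the prediction targets (the MLM objective)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rate > rng.random():   # mask this position with probability `rate`
            masked.append(mask_token)
            targets.append(tok)   # model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # position not scored
    return masked, targets
```
      </preformat>
      <p>BERT-style training additionally replaces some selected tokens with random tokens or leaves them unchanged; that detail is omitted in this sketch.</p>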
      <p>MLM
• Epochs: 5
• Learning Rate: 5e-6
• Loss: 9.3413
• Perplexity: 11399.24</p>
      <p>run2-mdeberta-ct The architecture of mdeberta-v3-base was used and fine-tuned on general
diseases.</p>
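      <p>The MLM loss and perplexity reported for run1 are directly related: perplexity is the exponential of the mean cross-entropy loss, which can be checked in one line:</p>
      <preformat>
```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity of a language model is exp(mean cross-entropy loss)."""
    return math.exp(mean_ce_loss)

# run1 above reports loss 9.3413 and perplexity 11399.24:
print(round(perplexity(9.3413), 2))  # ≈ 11399.2
```
      </preformat>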
      <p>run3-mdeberta-ct-dg The architecture of mdeberta-v3-base was used and fine-tuned on general
diseases, this time including data augmentation via MTsamples.</p>
      <p>run4-roberta-dg The architecture of lcampillos/roberta-es-clinical-trials-ner was used and
fine-tuned on the task of diseases in cardiology. To further tune the model on identifying only cardiology
diseases, we used data augmentation via MTsamples. The base model already has a solid understanding
of diseases and reaches an F1-score of 45.52%.</p>
      <p>• Learning Rate: 2e-4
run5-roberta-dg-windows The architecture of lcampillos/roberta-es-clinical-trials-ner was used
and trained on the task of diseases in cardiology. To further tune the model on identifying only
cardiology diseases, we used data augmentation via MTsamples. To further data capture, we used the
proposed sliding windows technique.</p>
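      <p>The sliding windows technique can be sketched as follows; the 60-token overlap matches the run configuration below, while max_len=512 is an assumed model context size:</p>
      <preformat>
```python
def sliding_windows(tokens, max_len=512, overlap=60):
    """Split a long document into overlapping windows so that entities
    near a window boundary are fully contained in at least one window."""
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already covers the end of the document
    return windows
```
      </preformat>
      <p>Predictions from adjacent windows must afterwards be merged over the overlapping region, e.g. by preferring the window in which a token lies farther from the boundary.</p>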
      <p>• Learning Rate: 2e-4
• Epochs: 3
• Window Overlap: 60 tokens
• F1 Avg: 86.07%
• F1 Exact: 85.48%
• F1 Partial: 85.48%
• F1 Ent Type: 87.78%
• F1 Strict: 85.94%</p>
      <p>D.2. Track 2</p>
      <p>D.2.1. Multilingual Models</p>
      <p>We propose three types of multilingual models, where all data from all three languages are
concatenated for training and evaluation.</p>
      <p>run1-mdeberta-multilingual The architecture of mdeberta-v3-base was used.
• Learning Rate: 2e-5
• Epochs: 5
• F1 Avg: 82.43%
• F1 Exact: 82.06%
• F1 Partial: 82.06%
• F1 Ent Type: 83.54%
• F1 Strict: 82.06%</p>
      <p>run2-mdeberta-ct-multilingual The architecture of mdeberta-v3-base was used and fine-tuned
on general drugs in Spanish.
• Learning Rate: 2e-5
• Epochs: 5
• F1 Avg: 83.22%
• F1 Exact: 82.92%
• F1 Partial: 82.92%
• F1 Ent Type: 84.14%
• F1 Strict: 82.92%</p>
      <p>run3-roberta-multilingual The architecture of lcampillos/roberta-es-clinical-trials-ner was used
and fine-tuned on the task of detecting cardiology drugs. We worked under the assumption that
pharmaceuticals may have very similar or even the same names in Spanish, Italian, and English. The
base model already has a solid understanding of general drugs and reaches an F1-score of 81.79% for
exact matching and 76.04% for strict matching.</p>
      <p>• Learning Rate: 8e-5
• Epochs: 10
• F1 Avg: 75.14%
• F1 Exact: 74.86%
• F1 Partial: 74.86%
• F1 Ent Type: 76.01%
• F1 Strict: 74.86%</p>
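      <p>The Exact and Strict scores listed for each run compare predicted against gold entity spans. A minimal sketch of exact-match span F1 follows; it is a simplified stand-in for the official evaluation, which additionally reports Partial and Ent Type variants:</p>
      <preformat>
```python
def span_f1(gold, pred):
    """Exact-match F1 over (start, end, label) entity spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold.intersection(pred))  # boundaries and label both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two gold drug mentions, one found exactly.
gold = [(0, 9, "FARMACO"), (24, 33, "FARMACO")]
pred = [(0, 9, "FARMACO")]
```
      </preformat>
      <p>Under this metric the example gives precision 1.0 and recall 0.5; the Partial and Ent Type variants relax the boundary and label requirements, respectively.</p>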
      <sec id="sec-8-1">
        <title>D.2.2. Language Specific Models</title>
        <p>Each language has two language-specific runs. The purpose of these runs is to compare domain-specific
models (i.e. models specially trained on the medical domain that use transfer learning to specialize the
model on the cardiology domain) to large language-agnostic base models (i.e. mdeberta-v3-base). Run 4
contains the base model, while run 5 contains the domain-specific model.</p>
        <p>es</p>
        <p>run4-mdeberta-ct-mlm-dg The architecture of mdeberta-v3-base was used and fine-tuned on
general drugs in Spanish. Furthermore, it was fine-tuned on Spanish admission notes via Masked
Language Modeling (MLM). Additional data from the automatically translated MTsamples dataset was
used.</p>
        <p>MLM
• Epochs: 5
• Learning Rate: 8e-6
• Loss: 8.7417
• Perplexity: 10589.27
Cardiology Task
• Learning Rate: 8e-5
• Epochs: 4
• F1 Avg: 70.03%
• F1 Exact: 69.23%
• F1 Partial: 69.23%
• F1 Ent Type: 72.41%
• F1 Strict: 69.23%</p>
        <p>run5-roberta-ct-mlm The architecture of lcampillos/roberta-es-clinical-trials-ner was used and
fine-tuned on Spanish admission notes via Masked Language Modeling (MLM).</p>
        <p>MLM
• Epochs: 5
• Learning Rate: 1e-6
• Loss: 8.8788
• Perplexity: 7178.02
• Epochs: 5
• Learning Rate: 1e-6
• Loss: 8.65492
• Perplexity: 5738.31
Cardiology Task
• Learning Rate: 1e-4
• Epochs: 5
• Window Overlap: 60 tokens
• F1 Avg: 75.50%
• F1 Exact: 74.94%
• F1 Partial: 74.94%
• F1 Ent Type: 75.54%
• F1 Strict: 77.18%</p>
        <p>en</p>
        <p>run4-mdeberta-windows The architecture of mdeberta-v3-base was used, including the sliding
windows approach to enhance data capture.</p>
        <p>• Learning Rate: 1e-4
• Epochs: 10
• Window Overlap: 60 tokens
• F1 Avg: 80.45%
• F1 Exact: 80.22%
• F1 Partial: 80.22%
• F1 Ent Type: 81.15%
• F1 Strict: 80.22%</p>
        <p>run5-biobert-mlm-windows The architecture of alvaroalon2/biobert_chemical_ner was used
and fine-tuned on English (original) admission notes via Masked Language Modeling (MLM).
Furthermore, we used the sliding windows approach to enhance data capture. It is worth mentioning that
lcampillos/roberta-es-clinical-trials-ner (with the same specifications) actually achieved slightly better
results than alvaroalon2/biobert_chemical_ner, i.e. an average F1-score of 79.00%.</p>
        <p>it</p>
        <p>run4-mdeberta The architecture of mdeberta-v3-base was used, without any data-enhancing
techniques.</p>
        <p>• Learning Rate: 1e-4
• Epochs: 10
• F1 Avg: 89.77%
• F1 Exact: 89.56%
• F1 Partial: 89.56%
• F1 Ent Type: 90.40%
• F1 Strict: 89.56%</p>
        <p>run5-biobit-mlm The architecture of IVN-RIN/bioBIT was used and fine-tuned on Italian admission
notes via Masked Language Modeling (MLM). It is worth mentioning that
lcampillos/roberta-es-clinical-trials-ner (with the same specifications) achieved worse results than IVN-RIN/bioBIT this time, i.e. an
average F1-score of 76.81%.</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>