<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Biomedical Semantics 14 (2023) 2. URL: https://doi.org/10.1186/s13326</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1186/s13326-022-00281-5</article-id>
      <title-group>
        <article-title>Cross-Linguistic Disease and Drug Detection in Cardiology Clinical Texts: Methods and Outcomes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Styll</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Campillos-Llanos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wojciech Kusa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Research Unit (E194-04), Technische Universität Wien</institution>
          ,
          <addr-line>Favoritenstraße 9-11, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Language, Literature and Anthropology, Spanish National Research Council (CSIC)</institution>
          ,
          <addr-line>c/Albasanz 26, 28037 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>4171</fpage>
      <lpage>4186</lpage>
      <abstract>
        <p>This paper presents our approach to the MultiCardioNER lab at CLEF2024, focusing on disease detection in Spanish texts and drug detection in Italian, Spanish, and English texts. We enhance model performance through several strategies: (1) fine-tuning on automatically translated TREC Clinical Trials admission notes using Masked Language Modeling (MLM); (2) data augmentation with translated MTSamples processed through a Spanish medical lexicon (MedLexSp) for accurate vocabulary matching; and (3) employing sliding windows with overlap to improve data capture. Additionally, we use transfer learning with a clinical trials corpus (CT-EMB-SP) to refine the outcomes. We further fine-tune several already established disease and drug extraction models to leverage their extensive vocabulary and compare their performance to models trained from scratch. Our methods and experiments demonstrate notable improvements in multilingual clinical NER, as evidenced by our track results.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical Named Entity Recognition</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Cardiology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this Section, we give the background to our methodology. We describe our proposed techniques to further enhance results and the baseline models we used, and we discuss how results and outputs are effectively evaluated.</p>
      <sec id="sec-2-1">
        <title>2.1. Proposed Techniques</title>
        <p>• Fine-tuning via Masked Language Modeling</p>
        <p>We proposed fine-tuning on 240 automatically translated admission notes of the TREC Clinical Trials track via Masked Language Modeling [5]. This should help the model develop a sense of what patient notes look like and enhance its understanding.</p>
        <p>• Data Augmentation</p>
        <p>We used the NickyNicky/medical_mtsamples dataset from HuggingFace as a means of data augmentation. We extracted cardiology diseases and drugs, automatically translated the texts to Spanish via the Google Translate API [6], and additionally processed the entities using a medical lexicon for Spanish (MedLexSp [7]) to ensure only correct medical vocabulary was used.</p>
        <p>• Sliding Windows with Overlap</p>
        <p>We employed a sliding-window attention approach with overlap to handle long sequences of clinical text. This method has been effectively utilized in various Natural Language Processing (NLP) tasks to manage texts that exceed the input size limitations of standard models [5]. By breaking the text into smaller, overlapping segments, the model can better understand the context and connections between different sections of the document.</p>
        <p>• Additional Fine-Tuning/Transfer Learning on general diseases/drugs</p>
        <p>We fine-tuned several baseline models to detect diseases and drugs from the CT-EMB-SP corpus [8] with the goal of enhancing the model’s vocabulary of specific medical data. This is a collection of 1,200 texts about clinical trials in Spanish (500 journal abstracts and 700 trial announcements). It was annotated with entities for four semantic groups of the Unified Medical Language System [9]: ANAT, CHEM, DISO and PROC. This resource facilitates machine learning experiments for information extraction on evidence-based medicine. For information on the models and training process, see Table 3 and Figures 6a and 6b in Appendix A.</p>
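        <p>For illustration, the overlapping-window chunking described above can be sketched in plain Python as follows; the window and stride values here are assumptions for the example, not the exact configuration used in the track:</p>

```python
def sliding_windows(tokens, window=512, stride=384):
    """Split a long token sequence into overlapping windows.

    With stride < window, consecutive windows share (window - stride)
    tokens, so entities near a window boundary still appear whole
    in at least one window.
    """
    if len(tokens) <= window:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window reaches the end of the note
        start += stride
    return windows

# A 1000-token note becomes three overlapping windows:
chunks = sliding_windows(list(range(1000)), window=512, stride=384)
```

In practice the same effect can be obtained from a HuggingFace tokenizer with overflow handling; the pure-Python version above only shows the chunking logic itself.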
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Baseline Models</title>
        <p>This Section introduces all pre-trained models that we used in both experiments and submissions. We explain how each model was pre-trained, why it is potentially useful, and how we employed it in our research.</p>
        <p>• google-bert/bert-base-multilingual-cased</p>
        <p>This is a multilingual version of the BERT model [5], which served as our baseline. It is a small model that we fine-tuned and used in the preliminary evaluation of track 1.</p>
        <p>• microsoft/mdeberta-v3-base [10] [11]</p>
        <p>This large, multilingual general-domain model has recently gained recognition for its effectiveness in processing medical data. We fine-tuned it both from scratch and on the CT-EMB-SP corpus [8] for increased vocabulary. We used this model for every part of the track. Table 3 in Appendix A shows the parameters and performance of this model.</p>
        <p>• lcampillos/roberta-es-clinical-trials-ner [8]</p>
        <p>This model is based on the RoBERTa architecture and is specifically fine-tuned for named entity recognition tasks in Spanish clinical trial texts. It is designed to effectively identify medical entities within the domain of clinical trials, enhancing the extraction of relevant information from these documents. On the evaluation set of its training data, it achieved a strong F1-score of 86.47%, demonstrating its effectiveness. We used it for every part of the track.</p>
        <p>This model seemed very promising, since on preliminary testing on the MultiCardioNER data it already achieved an F1-score of 45.52% for track 1 and 76.04% for track 2. This suggests that there is a greater difference between the general medical domain and the cardiology domain for diseases than for pharmaceuticals.</p>
        <p>• PlanTL-GOB-ES/bsc-bio-ehr-es [12]</p>
        <p>This model is pre-trained on Spanish electronic health records (EHR), a large corpus of biomedical texts. It has outperformed other popular models on certain tasks, showcasing its performance. We used it for track 1 and the Spanish part of track 2.</p>
        <p>• IVN-RIN/bioBIT [13]</p>
        <p>BioBIT (Biomedical BERT for Italian) is a model tailored for the biomedical domain, pre-trained on an Italian biomedical corpus derived from machine-translated PubMed abstracts. Built on the BERT architecture, BioBIT utilizes Masked Language Modeling and Next Sentence Prediction for pretraining. It excels in multiple tasks, including Named Entity Recognition (NER), achieving high accuracy across several biomedical datasets. We used it for evaluating the Italian part of track 2.</p>
        <p>• alvaroalon2/biobert_chemical_ner [14]</p>
        <p>This BioBERT model is fine-tuned for named entity recognition (NER) tasks specifically targeting chemical entities. It has been trained on the BC5CDR-chemicals [15] and BC4CHEMD [16] corpora, making it highly effective for identifying chemical mentions in biomedical texts. This model is a valuable tool for chemical NER in the biomedical domain, supporting advanced research and data extraction.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Metrics</title>
        <p>For evaluating the models, we used entity-level evaluation metrics [17]. Since we are working with a highly imbalanced dataset, this provides a more accurate assessment of Named Entity Recognition (NER) performance.</p>
        <p>The International Workshop on Semantic Evaluation (SemEval’13) introduced four ways to evaluate Named Entity Recognition (NER) performance: Strict, Exact, Partial, and Type. These methods consider various aspects of matches between system predictions and ground truth annotations. The evaluation schemas assess correctness, incorrectness, partial matches, missed entities, and spurious entities differently, impacting the calculated precision, recall, and F1-scores.</p>
        <p>Strict requires an exact boundary and type match, Exact requires just a boundary match, Partial accepts partial boundary overlap, and Type requires a type match with some overlap. These metrics provide a comprehensive evaluation of Named Entity Recognition (NER) systems under different match criteria.</p>
        <p>When assessing the performance of our models, we use the average of the four F1-scores of the evaluation metrics: Strict, Exact, Partial, and Type. This average F1-score (F1<sub>avg</sub>) is calculated as follows:</p>
        <p>F1<sub>avg</sub> = (F1<sub>strict</sub> + F1<sub>exact</sub> + F1<sub>partial</sub> + F1<sub>type</sub>) / 4</p>
        <p>This method allows us to effectively determine the most performant model by considering a balanced view of different evaluation criteria.</p>
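        <p>As a sketch, the four SemEval’13 schemas and the averaged F1-score can be expressed as follows; this is a minimal plain-Python illustration (the span-triple representation is our own simplification, not the official evaluation script):</p>

```python
def match_schemas(pred, gold):
    """Classify one prediction against one gold annotation under the
    four SemEval'13 schemas. Spans are (start, end, entity_type) triples
    with half-open character offsets."""
    same_bounds = pred[0] == gold[0] and pred[1] == gold[1]
    overlap = pred[0] < gold[1] and gold[0] < pred[1]
    same_type = pred[2] == gold[2]
    return {
        "strict": same_bounds and same_type,  # exact boundary and type
        "exact": same_bounds,                 # exact boundary, any type
        "partial": overlap,                   # any boundary overlap
        "type": overlap and same_type,        # overlapping span, right type
    }

def average_f1(strict, exact, partial, type_):
    """Average of the four schema-level F1-scores, as used in the paper."""
    return (strict + exact + partial + type_) / 4

# A prediction with the right type but a too-short span is only a
# partial/type match:
m = match_schemas((10, 25, "ENFERMEDAD"), (10, 30, "ENFERMEDAD"))
```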
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <sec id="sec-3-1">
        <title>3.1. Preliminary Experiments</title>
        <p>The performance of the models is evaluated by cutting off excessive tokens from each patient note, where each model has an input size of 512 tokens. If models are not trained via sliding windows, excess tokens are simply cut off during the tokenization process. Note that, due to time constraints, we ran the experiments just once for each model. However, a better methodology would be initializing each model with different seed values and reporting the average and standard deviation of all runs. This method would provide a more realistic overview of each model’s performance.</p>
        <p>Please note that these preliminary experiments do not yet include error analysis, as described in Section 3.3. The absolute performance of the models depicted here is not demonstrative, but the relative difference of the separate runs showcases the success of the proposed techniques and gives some insights into how the models behave. For additional experiments where the absolute performance of the models is depicted, please see Appendix C.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Experiments for Track 1</title>
          <p>Baseline: Baseline values are given by the multilingual-bert and bsc-bio-ehr-es, which already achieved decent F1-scores on the development set, with bsc-bio-ehr-es being at 81.48% and multilingual-bert at 76.50%. This already suggests a clear benefit of fine-tuning on domain-specific data for specific tasks. Domain-Specific Model: For this purpose, the roberta-es-clinical-trials-ner model (as introduced in Section 2.2) was used as a baseline. Since it was fine-tuned on general diseases in the Spanish language (not only cardiovascular conditions), it already started with a relatively high F1-score in the first epoch (see Figure 1a). We can see that the data-augmentation technique showed a promising influence in further fine-tuning the model to the cardiology domain. The sliding windows approach showed slightly worse results. However, the difference is not large enough for a conclusion.</p>
          <p>After some hyperparameter-tuning, the model achieved an 87.9% F1-score at its peak. As evident in Figure 1b, the Masked Language Modeling approach did not necessarily influence results. This might be due to the lack of data for this kind of fine-tuning, which may add unnecessary bias. During the process of hyperparameter-tuning, we saw that a higher learning rate (i.e. 1e-4 instead of 2e-5) performed slightly better (approximately an increase of 4% in the evaluation metric). The same can be said for the batch size, where a higher size yielded better results (approximately 7% in F1-score). Unfortunately, experiments were rather limited here due to the lack of GPU RAM.</p>
          <p>Multilingual Model For this purpose, the mdeberta-v3-base model was used as a pre-trained model.
We first fine-tuned the baseline model on the cardiology data provided by the shared task, which already
showed promising results. It is also interesting to see that in the beginning the model’s performance
started much lower than those models that were already fine-tuned on general diseases (see Figure 2).
Fine-tuning the model on general diseases using the CT-EMB-SP corpus showed promising changes
in performance. Adding data-augmentation and Masked Language Modeling (MLM) as additional
techniques only influenced the results slightly. In the end, it reached an F1-score of 87% at its peak.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Experiments for Track 2</title>
          <p>Domain-Specific Model (es) The roberta-es-clinical-trials-ner model was used as a baseline. Surprisingly, the scores were relatively low. This was unexpected, since we had already measured much better performance by this model on the MultiCardioNER data. Eventually, we obtained an F1-score of 59.79%.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Insights from other languages (en, it)</title>
          <p>Testing the roberta-es-clinical-trials-ner model on both the English and Italian track provided us with interesting results. Looking at Figure 4a of the English track,
we can actually see that the Spanish model outperformed both the multilingual general domain model
(i.e. mdeberta-v3-base) and the domain-specific model (i.e. BioBERT). The same cannot be said for the
Italian track, where it was outperformed by every other run (see Figure 4b). For the Italian track, the
mdeberta-v3-base model won with an F1-score of 90.7%, while the roberta-es-clinical-trials-ner achieved
80.45% on the English track. These results suggest a great multilingual overlap for pharmaceuticals.
Multilingual Model (es) As in previous experiments, the mdeberta-v3-base model was used as a
baseline and fine-tuned on drugs from the CT-EMB-SP corpus (which did not happen for the other
languages where the base model was used). As can be seen in Figure 3b, the combination of Masked
Language Modeling and data-augmentation actually brought much benefit. In the end, the model
achieved an F1-score of 70.87% at its peak.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Official Submissions</title>
        <p>The submission runs are described in Appendix D. It is important to note that these results do not reflect the absolute performance of the models (see Section 3.3 for further details and Appendix C for additional experiments with demonstrative performance), but the relative difference of the separate runs showcases the success of the proposed techniques and gives some insights into how the models behave.</p>
        <p>• Fine-Tuning on General Diseases (Transfer-Learning)</p>
        <p>The first three runs (multilingual runs) of track 2 show evidence that fine-tuning models on general diseases before focusing on the cardiology domain yielded great benefit. The first run, which was not fine-tuned on general diseases, showed worse performance than runs 2 and 3, which were both fine-tuned on the CT-EMB-SP corpus.</p>
        <p>• Multilingual Overlap for Pharmaceuticals</p>
        <p>Looking at run 2 and run 3 of track 2, we can see that the multilingual models performed similarly to the monolingual models. This suggests a big multilingual overlap for drugs in Spanish, English and Italian.</p>
        <p>• Possible Noise by Data Augmentation due to Machine Translation</p>
        <p>Several runs (e.g. run 2 and run 3 of track 1) showed slightly worse performance when data augmentation was used. This suggests possible additional noise in the training data. For data augmentation, we used the MTSamples dataset, where we used the keywords as entities and translated all text via the Google Translate API. On first inspection, these keywords may refer to laboratory tests, procedures or anatomical entities. Therefore, we processed the translated MTSamples with MedLexSp; we output only DISO and CHEM categories, and we renamed them to ENFERMEDAD and FARMACO, respectively. Nonetheless, it is unclear whether these data contain general diseases or only cardiology diseases. Furthermore, on closer inspection, there are some issues with the machine translation, which is also visible in the automatic translation of the TREC Admission Notes for Masked Language Modeling (MLM) fine-tuning. Either some words are not translated or new words are created, possibly due to the sub-words of neural models. An example would be *leucitos en urino, which should be leucocitos en orina (’white cells in urine’).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Error Analysis</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Data Reconstruction</title>
          <p>Several problems arose during generating the runs; namely, reconstructing the output of the mdeberta-v3-base and roberta-es-clinical-trials-ner models.
mdeberta-v3-base This model posed several problems, particularly in generating the correct index of the span. Often, the start of the span would be one or two tokens off, leading to a decrease in the F1-score for the runs. Additionally, tokens might have leading spaces or newline characters at the beginning or end. These extraneous characters need to be removed to ensure the entity text is clean and accurate. This also includes adjusting the start and end offsets to reflect the new positions of the cleaned tokens.</p>
          <p>The presence of punctuation at the end of tokens can create issues in entity recognition. Special rules are required to handle exceptions such as units (mg.) or cases where brackets are involved. Unnecessary punctuation needs to be removed, but care must be taken to preserve punctuation that is part of the entity. Furthermore, the model would sometimes add tabs instead of spaces into the extracted entity. When tokens are merged or cleaned, their character offsets in the text need to be recalculated. This ensures that the entities’ positions in the text are accurately represented, which is crucial for tasks like text highlighting or linking entities back to their original context.
roberta-es-clinical-trials-ner This model exhibits significant issues with handling sub-words, often treating them as separate entities. Specifically, leading sub-words are represented as individual entities with a preceding space. This requires special considerations during the reconstruction process to ensure accurate entities.</p>
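          <p>A minimal sketch of this kind of span clean-up, assuming a hypothetical helper and a simplified unit rule (the actual reconstruction scripts are more involved):</p>

```python
def clean_entity(text, start, end, strip_punct=".,;:"):
    """Trim whitespace/newlines and stray trailing punctuation from an
    extracted span, recalculating the character offsets so that
    text[start:end] still equals the cleaned entity string."""
    span = text[start:end]
    # Strip leading whitespace and shift the start offset accordingly.
    lead = len(span) - len(span.lstrip())
    start += lead
    span = span.lstrip()
    # Strip trailing whitespace and shift the end offset accordingly.
    trail = len(span) - len(span.rstrip())
    end -= trail
    span = span.rstrip()
    # Drop trailing punctuation unless it belongs to a unit such as "mg."
    # (a deliberately simplified exception rule for illustration).
    if span and span[-1] in strip_punct and not span.lower().endswith("mg."):
        span = span[:-1]
        end -= 1
    return span, start, end

doc = "Tratamiento con  amiodarona, 200 mg."
ent, s, e = clean_entity(doc, 15, 28)  # raw span "  amiodarona,"
```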
          <p>General Remarks We analysed all errors made by the roberta-es-clinical-trials-ner model in run 4 of
track 1; we used a Python script to count each type. There are several types of errors that the model
made while generating the runs. Examples of this run may be found in Table 2.</p>
          <p>1. Scope Errors</p>
          <p>a) Incompletely predicted entities</p>
          <p>These are entities where the model predicted only a part of the actual entity, missing some crucial parts.</p>
          <p>b) Entities where too many words were predicted</p>
          <p>Entities where the predicted span includes extra information not part of the actual entity.</p>
          <p>c) Entities that would belong together</p>
          <p>Entities where the predicted spans should be combined to form a single coherent entity.</p>
          <p>2. False Positives</p>
          <p>Entities that were incorrectly identified by the model, but not labeled in the ground truth.</p>
          <p>3. False Negatives (Missed entities)</p>
          <p>Entities that were missed by the model, but were labeled in the ground truth.</p>
          <p>Looking into Figure 5, we can see that the high number of scope errors (entities that were identified, but with imperfect spans) indicates that the model generally has a strong baseline capability for recognizing entities. Despite the high accuracy in identifying entities, the model still exhibits significant precision and recall issues, as evidenced by the presence of false positives and false negatives. Furthermore, the high number of true positives amidst other errors implies that errors are not due to a fundamental flaw in the model but likely due to specific cases or contexts where the model’s performance drops.</p>
          <p>Nonetheless, scope errors tend to cause less harm in the information extraction of clinical cases. Not detecting an entity (false negative, case 3 in Table 2) is more severe, whereas detecting fumador instead of fumador activo (case 1a in Table 2) is a mild error. This becomes even clearer when looking into errors where abbreviations were used. If the text reads diabetes mellitus (DM), the model would extract diabetes mellitus (DM) as one entity, while the test set would annotate both the whole text and the abbreviation separately. This happens very frequently, which makes the evaluation unsuitable for measuring the model’s practical performance. In the end, a more relaxed evaluation metric could have been more appropriate and yielded higher results.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Data Capture</title>
          <p>There are several factors that contributed to why the models underperformed in the submission runs (see Table 1). Further data analysis shows that the main issue was insufficient data capture during both training and evaluation. Plots may be found in Appendix B. In general, we use data capture to refer to the degree to which the model has difficulty capturing all information from the patient notes. After all, in preliminary experiments and for generating submission runs, we simply cut off excess tokens that did not fit into the model.</p>
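          <p>The notion of data capture under a token cutoff can be illustrated with a small sketch (a hypothetical helper for the example, not the analysis script behind the plots):</p>

```python
def capture_rate(entity_offsets, note_len, limit=512):
    """Fraction of annotated entities that survive truncating a note to
    `limit` tokens. entity_offsets are the token positions of entities."""
    if note_len <= limit:
        return 1.0  # the whole note fits, nothing is lost
    kept = sum(1 for idx in entity_offsets if idx < limit)
    return kept / len(entity_offsets)

# Entities clustered at the end of a long note are mostly lost
# under a 512-token cutoff:
rate = capture_rate([10, 40, 900, 950, 990], note_len=1000)
```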
          <p>General Remarks The relative distribution of entities over patient notes, as depicted in the density
plots in Figure 8 (Appendix B), reveals several interesting insights.</p>
          <p>For track 1, the training set displays a prominent spike at the beginning, followed by a relatively uniform
distribution throughout the rest of the notes. In contrast, the development set and test set exhibit two
significant spikes: one at the beginning and a larger one at the end, with a notably lower density in the
middle. This pattern suggests that most diseases are mentioned either at the beginning or the end of
the patient notes.</p>
          <p>Turning to pharmaceuticals in track 2, we observe similar entity distributions across all plots. The training set again shows a more uniform distribution, whereas the development and test sets both feature two prominent spikes at the beginning and end, mirroring the pattern observed in track 1. Looking at the word counts in the boxplots of Figure 7, we can clearly see that the training set exhibits significantly fewer words than both the development set and the test set. About 75% of patient notes from the training set have fewer than 550 words, which applies to less than 25% of patient notes from the development/test set.</p>
          <p>The Venn diagrams in Figure 9 (Appendix B) are also worth mentioning. As we can see, track 1 seems to have only little overlap between the datasets, while track 2 has notably more overlap. This may imply that the model suffers from a few-shot learning problem, especially since results on track 2 are significantly better in terms of performance than those of track 1. Another factor may be the number of unique entities in the datasets, which is much larger in track 1 than in track 2, further complicating the task for the model.</p>
          <p>Implications for Training and Evaluation The previous training and evaluation strategies were significantly affected by these entity distribution patterns. During previous evaluation, a cutoff strategy was used, where all excess tokens were trimmed to fit the model’s input layer, which was uniformly set at 512 tokens. This meant that only approximately 60% of the data was fully utilized during training. However, due to the high density of entities at the end of patient notes, this approach resulted in sub-optimal data capture. The situation was even worse during evaluation on the development set, where less than 25% of the data fit into the models without token cutoff. This led to the model being evaluated on a non-representative portion of the dataset, inflating the performance metrics. To improve data capture, we decided to split the patient notes into individual sentences using spaCy [18] for both training and evaluation. This change not only yielded better results, particularly for track 2, but also provided more reliable and representative metrics. Consequently, several experiments were re-conducted (refer to Appendix C). It is important to note that these new runs expand and confirm the trends observed in earlier experiments (see Section 3.1).</p>
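          <p>As an illustration of the sentence-level splitting step (we used spaCy in practice; the rule-based splitter below is only a simplified stand-in for it):</p>

```python
import re

def split_sentences(note):
    """Naive rule-based sentence splitter: breaks on sentence-final
    punctuation followed by whitespace. spaCy's sentencizer handles
    many more cases (abbreviations, decimals, etc.)."""
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [p for p in parts if p]

note = "Paciente de 67 anos. Acude por disnea. Sin alergias conocidas."
sents = split_sentences(note)  # each sentence then fits the model easily
```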
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We can see some interesting trends in the data, allowing us to draw conclusions about both our proposed strategies and the provided data.</p>
      <p>• Fine-tuning via Masked Language Modeling: This approach had very little influence on the model’s results. This can be attributed to (i) the lack of sufficient data for this kind of fine-tuning, (ii) the fact that the patient notes are based on the general domain, and (iii) erroneous machine translation.</p>
      <p>• Data Augmentation: The effects of data augmentation are still unclear. We have observed both positive and negative effects across different model architectures. More experiments with different models and types of data augmentation resources are necessary to draw definitive conclusions.</p>
      <p>• Sliding Windows with Overlap: The impact of the sliding windows approach, as opposed to cutting off excess tokens, is also difficult to judge. Despite expecting better data capture, some experiments actually showed slightly worse results. This effect may be due to patient notes being split at random positions, resulting in incorrect grammar and split entities, which can disrupt the contextual information the model relies on. This issue becomes more evident when considering that processing patient notes at the sentence level improved results notably.</p>
      <p>• Additional Fine-Tuning/Transfer-Learning on general diseases/drugs: This approach significantly improved the model’s performance. Various experiments demonstrated that adapting a general model to a specific domain requires less effort and yields promising results with relatively little training.</p>
      <p>• Insufficient Data Capture: Due to the high density of entities at the beginning and end of the patient notes, the cutoff strategy performed poorly, missing entities at the end of the notes.</p>
      <p>• Overlap of Entities over Datasets: There are significantly fewer overlapping entities between the training, development and testing datasets for track 1 than there are for track 2. This may explain the generally worse results for track 1, indicating that models may suffer from a few-shot learning problem.</p>
      <p>• Multilingual Overlap for Pharmaceuticals: We have shown that there is a big multilingual overlap concerning pharmaceuticals in Spanish, Italian and English. This can be largely attributed to the standardized pharmaceutical nomenclature, which suggests that a multilingual approach to drug entity extraction can leverage these similarities to enhance accuracy and consistency across different languages.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Leonardo Campillos-Llanos’ work is conducted in the CLARA-MeD project (PID2020-116001RA-C33),
funded by MICIU/AEI/10.13039/501100011033/, in call Proyectos I+D+i Retos Investigación.</p>
    </sec>
    <sec id="sec-6">
      <title>B. Data Analysis</title>
      <p>It is important to note that for track 2 (FARMACO), the density plots in Figure 8 and boxplots in Figure 7 look the same among the three different languages, despite translation. Trivially, the boxplots in Figure 7 look the same for both track 1 and track 2 since the same data was used.</p>
      <p>[Panel captions: (a) Train Set; (b) Dev Set; (c) Test Set.]</p>
      <p>[Panel captions: (a) Track 1 - Train Set; (b) Track 1 - Dev Set; (c) Track 1 - Test Set; (d) Track 2 - Train Set; (e) Track 2 - Dev Set; (f) Track 2 - Test Set.]</p>
      <p>[Panel captions: (a) Track 1 - ENFERMEDAD; (b) Track 2 - FARMACO.]</p>
    </sec>
    <sec id="sec-7">
      <title>C. Additional Experiments</title>
      <sec id="sec-7-1">
        <title>C.1. Cardiology Domain Adaptation</title>
        <p>This experiment serves to show how easily a general model, i.e. one trained on general pharmaceuticals, can be adapted to a special medical domain. The roberta-es-clinical-trials-ner model was first fine-tuned on general drugs using the CT-EMB-SP corpus in Spanish; it was then used as a base model and further fine-tuned on the cardiology domain for pharmaceuticals. As previously mentioned in Section 2.2, it already achieved an F1-score of 76.04% before fine-tuning on cardiology data.</p>
        <p>As seen in Figure 10, epoch 1 already shows an incredible performance on the development set.
Evaluating this model of epoch 1 on the test set, we achieved a precision of 86.15%, a recall of 93.77% and
an F1-score of 89.80%. Using a model trained on three epochs (which shows the peak in Figure 10) we
obtained a precision of 90.25%, a recall of 94.30% and an F1-score of 92.23%.</p>
        <p>The same can be said when training the same model on track 1, where the plain model achieved a precision of 62.86%, a recall of 65.42% and an F1-score of 64.11%. With data augmentation, we achieved a significant improvement with a precision of 65.65%, a recall of 73.76% and an F1-score of 69.47%. Nonetheless, when reflecting on the results in Section 3.2, we discussed possible negative effects due to incorrect machine translation. These effects are definitely visible when using e.g. mdeberta-v3-base as a baseline architecture (see also Table 1), which is why we are not entirely capable of judging the effect of data augmentation.</p>
      </sec>
      <sec id="sec-7-2">
        <title>C.2. Effect of Data Augmentation</title>
        <p>When looking into the effects of our proposed data augmentation, we trained roberta-es-clinical-trials-ner with and without data augmentation (same setup, i.e. same hyper-parameters). In Figure 11, we can see similar behaviour in training, but with less performance on the development set. When evaluating the model on the test set, we got a precision of 92.08%, a recall of 94.06% and an F1-score of 93.06%. Considering the model trained in Section C.1, data augmentation actually led to a slightly higher score. Although the outcomes of some models seem to support that it may help adapting a general model to a specific domain, we would need to experiment with more models and test more types of resources for data augmentation.</p>
      </sec>
      <sec id="sec-7-3">
        <title>C.3. Effect of Fine-Tuning on General Domain - Transfer-Learning</title>
        <p>In order to more precisely measure the benefit of fine-tuning a model on a general domain before
fine-tuning it on a specialized domain, we conducted the following experiment on the Spanish part of
track 2. mdeberta-v3-base was used as a baseline. We compared the plain mdeberta-v3-base model with one
fine-tuned on the general medical domain via the CT-EMB-SP corpus. Looking at the graph in Figure 12,
the general model already showed greater performance than the base model in the very early stages of
training (validating once more what was seen with regard to adapting a general model to the medical
domain in Section C.1). The base model started with a relatively low F1-score, but caught up in the last
epochs.</p>
        <p>In Table 4, we can see the notable increase in performance not only on the development set, but also on
the test set.</p>
      </sec>
      <sec id="sec-7-4">
        <title>C.4. Multilingual Capabilities</title>
        <p>In order to check the assumption of a possible multilingual overlap for pharmaceuticals in Spanish,
Italian and English, we trained and evaluated a multilingual general model (i.e. mdeberta-v3-base) on all
three datasets simultaneously by concatenating the data of all three languages.
As can be seen in Figure 13, training exhibits a significant drop in performance compared to the
language-specific models (see Figure 10 in Section C.1). The same can be said of the
results obtained when evaluating the model on the test set for each language separately (Table 5). This
can be explained by minor differences in specialized pharmaceutical terms among the different languages,
which may add slight noise to the data.</p>
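        <p>The concatenation setup described above can be sketched in a few lines; the dataset layout (token/label pairs per language) and the label names are illustrative assumptions, not the actual data format used in the paper:</p>
        <preformat>
```python
def build_multilingual(datasets):
    """Merge per-language NER corpora (es/en/it) into one training set,
    keeping a language tag so results can still be reported per language."""
    merged = []
    for lang, examples in datasets.items():
        for tokens, labels in examples:
            merged.append({"lang": lang, "tokens": tokens, "labels": labels})
    return merged

# Hypothetical toy corpora with BIO labels:
corpora = {
    "es": [(["se", "administra", "ibuprofeno"], ["O", "O", "B-FARMACO"])],
    "it": [(["somministrato", "ibuprofene"], ["O", "B-FARMACO"])],
    "en": [(["ibuprofen", "given"], ["B-FARMACO", "O"])],
}
train_set = build_multilingual(corpora)
```
        </preformat>
        <p>Keeping the language tag on each example is what later allows the per-language test-set evaluation reported in Table 5.</p>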
        <p>The analysis of drug entity recognition across English, Spanish, and Italian (see Figure 14) demonstrates
a significant overlap and similarity in the pharmaceutical terminology used in these languages. As seen
in Table 6, many drug names exhibit minor variations that are primarily due to linguistic differences
such as suffixes and spelling conventions. This overlap can be attributed to the standardized nature of
pharmaceutical nomenclature and the widespread use of international nonproprietary names (INNs).
These findings suggest that a multilingual approach to drug entity recognition can leverage these
similarities to enhance accuracy and consistency across different languages.</p>
        <p>[Table 6: example drug names in English (lenalidomide, cafeine, triamcinolone acetonide, ampicillin,
sulfacetamide), Spanish (lenalidomida, cafeína, triamcinolona acetónido, ampicilina, sulfacetamida) and
Italian (lenalidomide, cafeina, triamcinolone acetonide, ampicillina, sulfacetamide), shown (a) without
preprocessing and (b) stemmed and lemmatized (lenalidomid, cafein, triamcinolone acetonid, ampicillin,
sulfacetamid).]</p>
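        <p>To illustrate why stemming and lemmatization align these variants, a crude normalizer (an illustrative stand-in, not the preprocessing actually used in the paper) can already map the Table 6 forms to a common representation:</p>
        <preformat>
```python
import unicodedata

def normalize_drug(name: str) -> str:
    """Crude cross-lingual normalizer for INN drug names: lowercase,
    strip accents, collapse doubled consonants, drop a trailing vowel."""
    s = unicodedata.normalize("NFKD", name.lower())
    s = "".join(c for c in s if not unicodedata.combining(c))
    out = []
    for c in s:
        if out and out[-1] == c and c not in "aeiou":
            continue  # collapse doubled consonants: ampicillina -> ampicilina
        out.append(c)
    s = "".join(out)
    # drop a trailing vowel: lenalidomide/lenalidomida -> lenalidomid
    return s[:-1] if s and s[-1] in "aeo" else s
```
        </preformat>
        <p>On the Table 6 examples, the English, Spanish and Italian variants of ampicillin, lenalidomide and cafeine all collapse to a single form, which is the overlap a multilingual model can exploit.</p>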
      </sec>
    </sec>
    <sec id="sec-8">
      <title>D. Original Submission Runs</title>
      <p>D.1. Track 1</p>
      <p>run1-mdeberta-ct-mlm-dg The architecture of mdeberta-v3-base was used and fine-tuned on
admission notes via Masked Language Modeling (MLM), with continual fine-tuning on general diseases.
In order to further tune the model, we used data augmentation.</p>
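      <p>MLM fine-tuning trains the model to recover randomly masked tokens from the admission notes. A minimal sketch of the masking step (the 15% default rate is the conventional BERT setting, assumed here rather than taken from the paper):</p>
      <preformat>
```python
import random

def mask_tokens(tokens, mask_token="[MASK]", rate=0.15, seed=7):
    """Randomly replace a fraction of tokens with a mask token; the
    original tokens become the prediction targets (the MLM objective)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rate > rng.random():   # mask this position with probability `rate`
            masked.append(mask_token)
            targets.append(tok)   # model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # position not scored
    return masked, targets
```
      </preformat>
      <p>BERT-style training additionally replaces some selected tokens with random tokens or leaves them unchanged; that detail is omitted in this sketch.</p>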
      <p>MLM
• Epochs: 5
• Learning Rate: 5e-6
• Loss: 9.3413
• Perplexity: 11399.24</p>
      <p>run2-mdeberta-ct The architecture of mdeberta-v3-base was used and fine-tuned on general
diseases.</p>
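      <p>The MLM loss and perplexity reported for run1 are directly related: perplexity is the exponential of the mean cross-entropy loss, which can be checked in one line:</p>
      <preformat>
```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity of a language model is exp(mean cross-entropy loss)."""
    return math.exp(mean_ce_loss)

# run1 above reports loss 9.3413 and perplexity 11399.24:
print(round(perplexity(9.3413), 2))  # ≈ 11399.2
```
      </preformat>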
      <p>run3-mdeberta-ct-dg The architecture of mdeberta-v3-base was used and fine-tuned on general
diseases, this time including data augmentation via MTsamples.</p>
      <p>run4-roberta-dg The architecture of lcampillos/roberta-es-clinical-trials-ner was used and
fine-tuned on the task of diseases in cardiology. To further tune the model on identifying only cardiology
diseases, we used data augmentation via MTsamples. The base model already has a solid understanding
of diseases and reaches an F1-score of 45.52%.</p>
      <p>• Learning Rate: 2e-4
run5-roberta-dg-windows The architecture of lcampillos/roberta-es-clinical-trials-ner was used
and trained on the task of diseases in cardiology. To further tune the model on identifying only
cardiology diseases, we used data augmentation via MTsamples. To further data capture, we used the
proposed sliding windows technique.</p>
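      <p>The sliding windows technique can be sketched as follows; the 60-token overlap matches the run configuration below, while max_len=512 is an assumed model context size:</p>
      <preformat>
```python
def sliding_windows(tokens, max_len=512, overlap=60):
    """Split a long document into overlapping windows so that entities
    near a window boundary are fully contained in at least one window."""
    step = max_len - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already covers the end of the document
    return windows
```
      </preformat>
      <p>Predictions from adjacent windows must afterwards be merged over the overlapping region, e.g. by preferring the window in which a token lies farther from the boundary.</p>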
      <p>• Learning Rate: 2e-4
• Epochs: 3
• Window Overlap: 60 tokens
• F1 Avg: 86.07%
• F1 Exact: 85.48%
• F1 Partial: 85.48%
• F1 Ent Type: 87.78%
• F1 Strict: 85.94%</p>
      <p>D.2. Track 2</p>
      <p>D.2.1. Multilingual Models</p>
      <p>We propose three types of multilingual models, where all data from all three languages are
concatenated for training and evaluation.</p>
      <p>run1-mdeberta-multilingual The architecture of mdeberta-v3-base was used.
• Learning Rate: 2e-5
• Epochs: 5
• F1 Avg: 82.43%
• F1 Exact: 82.06%
• F1 Partial: 82.06%
• F1 Ent Type: 83.54%
• F1 Strict: 82.06%</p>
      <p>run2-mdeberta-ct-multilingual The architecture of mdeberta-v3-base was used and fine-tuned
on general drugs in Spanish.
• Learning Rate: 2e-5
• Epochs: 5
• F1 Avg: 83.22%
• F1 Exact: 82.92%
• F1 Partial: 82.92%
• F1 Ent Type: 84.14%
• F1 Strict: 82.92%</p>
      <p>run3-roberta-multilingual The architecture of lcampillos/roberta-es-clinical-trials-ner was used
and fine-tuned on the task of detecting cardiology drugs. We worked under the assumption that
pharmaceuticals may have very similar or even the same names in Spanish, Italian, and English. The
base model already has a solid understanding of general drugs and reaches an F1-score of 81.79% for
exact matching and 76.04% for strict matching.</p>
      <p>• Learning Rate: 8e-5
• Epochs: 10
• F1 Avg: 75.14%
• F1 Exact: 74.86%
• F1 Partial: 74.86%
• F1 Ent Type: 76.01%
• F1 Strict: 74.86%</p>
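      <p>The Exact and Strict scores listed for each run compare predicted against gold entity spans. A minimal sketch of exact-match span F1 follows; it is a simplified stand-in for the official evaluation, which additionally reports Partial and Ent Type variants:</p>
      <preformat>
```python
def span_f1(gold, pred):
    """Exact-match F1 over (start, end, label) entity spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold.intersection(pred))  # boundaries and label both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two gold drug mentions, one found exactly.
gold = [(0, 9, "FARMACO"), (24, 33, "FARMACO")]
pred = [(0, 9, "FARMACO")]
```
      </preformat>
      <p>Under this metric the example gives precision 1.0 and recall 0.5; the Partial and Ent Type variants relax the boundary and label requirements, respectively.</p>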
      <sec id="sec-8-1">
        <title>D.2.2. Language Specific Models</title>
        <p>Each language has two language-specific runs. The purpose of these runs is to compare domain-specific
models (i.e. models specially trained on the medical domain that use transfer learning to specialize the
model on the cardiology domain) to large language-agnostic base models (i.e. mdeberta-v3-base). Run 4
contains the base model, while run 5 contains the domain-specific model.</p>
        <p>es</p>
        <p>run4-mdeberta-ct-mlm-dg The architecture of mdeberta-v3-base was used and fine-tuned on
general drugs in Spanish. Furthermore, it was fine-tuned on Spanish admission notes via Masked
Language Modeling (MLM). Additional data from the automatically translated MTsamples dataset was
used.</p>
        <p>MLM
• Epochs: 5
• Learning Rate: 8e-6
• Loss: 8.7417
• Perplexity: 10589.27
Cardiology Task
• Learning Rate: 8e-5
• Epochs: 4
• F1 Avg: 70.03%
• F1 Exact: 69.23%
• F1 Partial: 69.23%
• F1 Ent Type: 72.41%
• F1 Strict: 69.23%</p>
        <p>run5-roberta-ct-mlm The architecture of lcampillos/roberta-es-clinical-trials-ner was used and
fine-tuned on Spanish admission notes via Masked Language Modeling (MLM).</p>
        <p>MLM
• Epochs: 5
• Learning Rate: 1e-6
• Loss: 8.8788
• Perplexity: 7178.02
• Epochs: 5
• Learning Rate: 1e-6
• Loss: 8.65492
• Perplexity: 5738.31
Cardiology Task
• Learning Rate: 1e-4
• Epochs: 5
• Window Overlap: 60 tokens
• F1 Avg: 75.50%
• F1 Exact: 74.94%
• F1 Partial: 74.94%
• F1 Ent Type: 75.54%
• F1 Strict: 77.18%</p>
        <p>en</p>
        <p>run4-mdeberta-windows The architecture of mdeberta-v3-base was used, including the sliding
windows approach to enhance data capture.</p>
        <p>• Learning Rate: 1e-4
• Epochs: 10
• Window Overlap: 60 tokens
• F1 Avg: 80.45%
• F1 Exact: 80.22%
• F1 Partial: 80.22%
• F1 Ent Type: 81.15%
• F1 Strict: 80.22%</p>
        <p>run5-biobert-mlm-windows The architecture of alvaroalon2/biobert_chemical_ner was used
and fine-tuned on English (original) admission notes via Masked Language Modeling (MLM).
Furthermore, we used the sliding windows approach to enhance data capture. It is worth mentioning that
lcampillos/roberta-es-clinical-trials-ner (with the same specifications) actually achieved slightly better
results than alvaroalon2/biobert_chemical_ner, i.e. an average F1-score of 79.00%.</p>
        <p>it</p>
        <p>run4-mdeberta The architecture of mdeberta-v3-base was used, without any data-enhancing
techniques.</p>
        <p>• Learning Rate: 1e-4
• Epochs: 10
• F1 Avg: 89.77%
• F1 Exact: 89.56%
• F1 Partial: 89.56%
• F1 Ent Type: 90.40%
• F1 Strict: 89.56%</p>
        <p>run5-biobit-mlm The architecture of IVN-RIN/bioBIT was used and fine-tuned on Italian admission
notes via Masked Language Modeling (MLM). It is worth mentioning that
lcampillos/roberta-es-clinical-trials-ner (with the same specifications) achieved worse results than IVN-RIN/bioBIT this time, i.e. an
average F1-score of 76.81%.</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>