Identifying Cardiological Disorders in Spanish via Data Augmentation and Fine-Tuned Language Models

Antonio Romano1,2, Giuseppe Riccio1,2, Marco Postiglione1 and Vincenzo Moscato1,2
1 University of Naples Federico II, Department of Electrical Engineering and Information Technology (DIETI), Via Claudio, 21 - 80125 - Naples, Italy
2 Consorzio Interuniversitario Nazionale per l'Informatica (CINI) - ITEM National Lab, Complesso Universitario Monte S. Angelo, Naples, Italy

Abstract
This study presents a novel approach to Biomedical Named Entity Recognition (BioNER), specifically tailored to the cardiology domain. The challenge of adapting models to specific fields is addressed through the integration of cross-domain transfer learning and data augmentation techniques. The process begins with the fine-tuning of a compact Biomedical Transformer model on the DisTEMIST corpus, enabling the capture of general biomedical concepts. This model is then further trained on the CardioCCC corpus, a cardiology-specific dataset, enhancing its ability to identify and interpret cardiological entities. A data augmentation strategy is then employed, leveraging Context Similarity and K-Nearest Neighbors (KNN) to generate augmented datasets, which further improves the model's ability to recognize medical entities. The final step is a NER Fusion strategy, which combines the outputs of multiple BioNER taggers to bolster robustness and accuracy in entity recognition. Experimental results from the MultiCardioNER challenge demonstrate the effectiveness of the proposed approach: our framework surpasses the median F1 score of 0.7566 by approximately 4%, achieving a score of 0.791, only 2% lower than the top submission despite being based on much smaller language models.

Keywords
Biomedical Named Entity Recognition, Data Augmentation, Language Models, EHRs

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France
antonio.romano45@studenti.unina.it (A. Romano); giuseppe.riccio9@studenti.unina.it (G. Riccio); marco.postiglione@unina.it (M. Postiglione); vincenzo.moscato@unina.it (V. Moscato)
https://github.com/LaErre9 (A. Romano); https://github.com/giuseppericcio (G. Riccio); http://wpage.unina.it/vmoscato/ (V. Moscato)
ORCID: 0009-0000-5377-5051 (A. Romano); 0009-0002-8613-1126 (G. Riccio); 0000-0001-6092-940X (M. Postiglione); 0000-0002-0754-7696 (V. Moscato)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction
In recent years, given the increasing volume of clinical data generated by medical personnel and the evolution of Artificial Intelligence (AI) models, it has become necessary to adopt techniques for the automatic extraction of medical concepts, in order to support the development of personalized insights useful for patient health. Specializing pre-trained BioNER models from general medical domains to specific fields like cardiology presents significant challenges due to the limited availability of specialized data, as highlighted by Nguyen et al. [1] and Chen et al. [2]. Transfer learning is a pivotal method for enhancing model performance in specific domains, as shown by Sasikumar and Mantri [3] and Zhou et al. [4], who adapted pre-trained biomedical models to specialized areas. Nevertheless, this approach is often insufficient due to the complexity of domain-specific texts. Our approach addresses this problem by generating new data that increases the presence of less frequent medical entities, replacing them with similar medical entities. To identify novel medical entities within the specified domain, it is therefore necessary to establish a substitution strategy that, in contrast to other methodologies (Phan and Nguyen [5]; Ghosh et al. [6]), exploits the contextual similarity of the sentence in which the entity is to be augmented. In addition to data augmentation, the proposed annotation methodology includes a late fusion mechanism that leverages several pre-trained models in the medical domain, similar to the work proposed by Sun and Bhatia [7], fine-tuned with cardiology data. This mechanism aims to improve the robustness and coverage of the generated annotations: as these models are trained on heterogeneous data, their combination allows our system to recognize a greater number of medical entities.
We evaluated our approach within the first track of the MultiCardioNER [8] challenge (https://temu.bsc.es/multicardioner/), part of the BioASQ 2024 [9] workshop. Specifically, we employed a diverse range of pre-trained models, each fine-tuned on combinations of the DisTEMIST dataset [10] and a new dataset (CardioCCC) of cardiology clinical cases annotated using the same guidelines. Our method surpassed the median F1 score of 0.7566 by approximately 4%, achieving a score of 0.791. Interestingly, our score is close to the winning submission (only 2% lower) despite being based on much smaller transformer architectures.
The remainder of this paper is structured as follows: Section 2 discusses the existing literature related to our work; Section 3 outlines the scope and objectives of our study; Section 4 presents the datasets used in our framework and their main characteristics. Section 5 details our method, while experiments are presented in Section 6, where we discuss the results and their implications within the context of our research objectives. Finally, Section 7 summarizes the contributions of this paper and suggests avenues for future research.

2. Related Work
Adapting BioNER to specific medical domains, such as cardiology, presents significant challenges due to the complexity and variety of medical language.
Transfer learning has proven to be an essential method in cross-domain BioNER for enhancing model resilience with respect to the medical concepts of specific domains. For example, Sasikumar and Mantri [3] leverage models pre-trained on biomedical corpora and adapt them to specific medical domains. Similarly, Zhou et al. [4] utilized transfer learning to leverage pre-trained features of general-medicine models to improve the accuracy of specialized NER systems on clinical records. However, despite the effectiveness of transfer learning, significant challenges remain. One of these is the adaptation of general clinical concept recognition systems to cardiology, a domain of unique complexity and specificity. Transfer learning alone may not be sufficient to address the challenges of domain-specific NER, owing to the diversity and complexity of biomedical texts. To bridge this gap, our study proposes an approach that integrates transfer learning with data augmentation.
Extensive research has been conducted on strategies for augmenting text data in order to address the lack of manually annotated data in specific medical domains (e.g., the cardiological area). For example, Bartolini et al. [11] propose COSINER, which generates new augmented data through the contextual replacement of entities. Ghosh et al. [6] presented BioAug, which conditionally generates augmented data using the BART model to guarantee factual accuracy and diversity. Another entity-replacement approach, proposed by Phan and Nguyen [5], creates new sentences by substituting entities with semantically equivalent ones drawn from Gazetteer terms.
Following the cross-domain phase, to improve the robustness and coverage of the BioNER models, we merge the outputs of multiple BioNER taggers. Sun and Bhatia [7] proposed merging results to manage tag overlap and improve complete concept extraction.
Our approach takes inspiration from the above-mentioned methods, merging the results generated by different BioNER taggers to increase coverage and relying on these research results to ensure a more complete and accurate extraction of entities. In addition, merging taggers involves handling overlapping tags and conflicting results, ensuring that the final output is more precise and coherent.

Our contribution
In the proposed framework, we have adopted four pre-trained models. In particular, the base models are those of Carrino et al. [12] and Carrino et al. [13], which have demonstrated excellent results in BioNER for medical texts in Spanish. The other two models were trained from the previous ones on wider medical datasets, such as those of Cohen et al. [14] and Llanos et al. [15], thus enriching the knowledge of the base models and improving the recognition of a greater number of medical entities. Subsequently, we implemented a data augmentation phase on the cardiology-specific CardioCCC dataset in order to increase the number of medical concepts useful for cross-domain transfer learning, inspired by the entity replacement technique proposed by Phan and Nguyen [5]. Finally, our approach merges the predictions made by the different BioNER taggers, following a strategy for managing overlaps between the various annotations, inspired by the merging technique described by Sun and Bhatia [7].

3. Problem formulation
We start from a dataset of annotated sentences denoted as ANN_D = {(x_i, y_i) ∈ X × Y}, where:
• X is the collection of all sentences.
• i ∈ {1, ..., N}, where N is the total number of sentences in the dataset and i indexes the i-th sentence.
• Each sentence x_i is a sequence of tokens x_j ∈ x_i, where j ∈ {1, ..., H_i} and H_i is the length of the sentence.
• Y is the set of possible labels.
We use the IOB2 annotation scheme [16], thus Y = {B, I, O}, where B marks the beginning of an entity, I marks the inside of an entity, and O marks tokens outside any entity.
• y_i assigns each token x_j ∈ x_i its corresponding label y_j.
The objective of the BioNER model is to assign the appropriate tag from Y to each token of a given input sentence.

4. Materials
In our study, we utilized the data provided by the MultiCardioNER challenge, which encompasses several clinical-domain corpora, including DisTEMIST and the smaller CardioCCC corpus. The DisTEMIST corpus was manually annotated by clinical experts, following specific guidelines for annotating diseases in Spanish clinical cases, as outlined in the work of Miranda-Escalada et al. [10]. These guidelines were meticulously developed by clinical experts through multiple cycles of quality control and consistency analysis before the entire dataset was annotated. The training set for recognizing the designated entities within the DisTEMIST corpus comprises 1000 recorded clinical cases. A similar annotation procedure was applied to 508 documents of cardiological clinical cases within the CardioCCC corpus. To enhance the annotated dataset, we leveraged the DisTEMIST gazetteer, which contains key terms and synonyms for clinical entities. This resource significantly improves the coverage of terminological and semantic variations in cardiological clinical texts through similarity-based approaches, thereby enhancing the quality and accuracy of the annotations.

5. Methodology
Figure 1 shows an overview of the methodological flow of our solution for the MultiCardioNER track. The DisTEMIST (𝒟) corpus training set, annotated according to the guidelines explained previously, is preprocessed to build a new dataset containing the document identifier ID, the tokens x_j, and the tags y_j associated with each token, using the properly mapped BIO scheme.
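As a concrete illustration of the span-to-IOB2 mapping described above, here is a minimal sketch; it assumes whitespace tokenization and character-span annotations, and the example sentence and offsets are hypothetical, not taken from the corpora:

```python
# Minimal sketch: map character-span entity annotations to IOB2 token labels.
# Assumptions: whitespace tokenization; spans are (start, end) character offsets.
def to_iob2(sentence, spans):
    """Return one tag from {B, I, O} per whitespace token of `sentence`."""
    tags = []
    pos = 0
    for token in sentence.split():
        start = sentence.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for (s, e) in spans:
            if start == s:          # token opens an annotated entity
                tag = "B"
            elif s < start < e:     # token continues an annotated entity
                tag = "I"
        tags.append(tag)
    return tags

sentence = "Paciente con insuficiencia cardiaca aguda"
spans = [(13, 41)]  # hypothetical annotation: "insuficiencia cardiaca aguda"
print(to_iob2(sentence, spans))  # → ['O', 'O', 'B', 'I', 'I']
```

In practice the tokenization must match the subword tokenizer of the transformer backbone, with labels propagated to subword pieces; the whitespace variant above only illustrates the labeling scheme.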
The need to tokenize clinical text sentences x_i into single tokens x_j arises from the limited number of input tokens accepted by each model used. The same process is applied to prepare CardioCCC (𝒞) for the subsequent fine-tuning phase.

Figure 1: Overview of our NER solution for the MultiCardioNER track. The proposed architecture employs cross-domain transfer learning to enhance disease recognition in the cardiological domain. It begins with preprocessing the DisTEMIST corpus data 𝒟, followed by fine-tuning a Biomedical Transformer Backbone on the various corpora. Subsequently, an innovative Data Augmentation technique based on contextual similarity CS is used to enrich the CardioCCC corpus 𝒞. The model undergoes further fine-tuning using the augmented dataset 𝒟𝒜, ensuring improved accuracy and generalization capabilities in disease recognition. Finally, we merge the annotations generated by the various models using a NER Fusion (NER_FUSION) strategy, which removes overlapping entities by prioritizing annotations according to a predefined scheme.

5.1. Cross-domain transfer learning
We propose a cross-domain transfer learning solution to enhance disease recognition in cardiology. Our approach leverages a Biomedical Transformer Backbone, fine-tuned on the corpora provided by the challenge, to achieve superior predictive performance.

Initial fine-tuning on DisTEMIST. We start by fine-tuning the Biomedical Transformer Backbone on the DisTEMIST corpus 𝒟. This initial step tailors the model to the general biomedical language and the disease entities present in this dataset.
This corpus provides a broad foundation, enabling the model to capture essential biomedical concepts and terminology.

Transfer learning on CardioCCC. The model fine-tuned on DisTEMIST generates a first set of predictions; it is then further trained on the CardioCCC (𝒞) corpus, adapting its understanding specifically to cardiological contexts and terminology. Focusing on this specialized corpus ensures that the model can recognize and interpret data relevant to cardiology, and this further training produces a second set of predictions, ensuring that the model integrates cardiology-specific knowledge more deeply.

5.2. Data Augmentation with Context Similarity
Frequency Study. Prior to implementing data augmentation, a systematic Frequency Study was performed to identify underrepresented entities and contexts within the CardioCCC dataset (𝒞). This analysis involved examining the frequency of each medical mention in the dataset. A threshold was then determined, corresponding approximately to the knee of the curve that illustrates the distribution of word frequencies within the dataset (Figure 2). This threshold represents a balance point: words with frequencies above it are sufficiently common and do not necessitate augmentation, while those with frequencies below it are infrequent and can benefit from augmentation. This approach ensures that the augmentation process specifically targets these deficiencies, thereby optimizing the benefits of the additional data.

Figure 2: Frequency Study. The relative distribution of word occurrences. From this representation, a threshold of 12 was identified, roughly corresponding to the knee point of the curve; all words with more than 12 occurrences (beyond the red line) are excluded, as they are sufficiently represented in the CardioCCC dataset (𝒞).
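The frequency filtering just described can be sketched as follows; the threshold of 12 is the value reported in the paper, while the mention list is a hypothetical stand-in for the CardioCCC annotations:

```python
from collections import Counter

# Sketch of the Frequency Study: count entity-mention occurrences and keep
# the under-represented ones as augmentation targets. The threshold (12)
# corresponds to the knee point identified in Figure 2.
def augmentation_candidates(mentions, threshold=12):
    freq = Counter(mentions)
    # Mentions at or below the threshold are considered under-represented
    # and become targets for context-similarity replacement.
    return {m for m, c in freq.items() if c <= threshold}

# Hypothetical mention counts standing in for the CardioCCC annotations.
mentions = ["angina"] * 30 + ["miocardiopatia"] * 5 + ["endocarditis"] * 2
print(sorted(augmentation_candidates(mentions)))
# frequent term "angina" (30 > 12 occurrences) is excluded; rare terms are kept
```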
Through this analysis, we identified the entities to be replaced in order to enhance the diversity and comprehensiveness of the CardioCCC dataset (𝒞). This enhancement improves the model's generalization capabilities, enabling it to recognize a broader spectrum of disease entities and ultimately achieve higher accuracy.

Entity Replacement using Context Similarity with KNN. With the insights gained from the Frequency Study, we leverage the abundant information provided by the Gazetteer (𝒢) to fill gaps in the CardioCCC (𝒞) dataset, thereby enhancing its overall quality and utility for training the Biomedical Transformer Backbone fine-tuned on DisTEMIST (𝒟). To accomplish this, we employed a dataset augmentation technique based on Context Similarity with K-Nearest Neighbors (KNN), with K = 1, as illustrated in Figure 3. This approach calculates the similarity between the embeddings of the sentences in the CardioCCC dataset, Emb(x_i), and the embeddings of the entities in the Gazetteer, Emb(e_i). By targeting sentences in CardioCCC (𝒞) annotated with B and I tags of the BIO scheme, we identify the most contextually similar entities in the Gazetteer 𝒢 and replace the original entities in the CardioCCC sentences X with these similar entities, obtaining X̂.

To formalize the Data Augmentation phase, we denote the Gazetteer as 𝒢 = {(e_i, ŷ) ∈ E × Ŷ}, where E is the collection of all entities in the Gazetteer, e_i is the i-th entity with i ∈ {1, ..., N_E}, Ŷ = {B, I} is the set of labels assigned to e_i, and N_E is the total number of entities in the Gazetteer. We then compute the Context Similarity (CS) using the K-Nearest Neighbors (KNN) function between the embeddings Emb(x_i) = x_i and Emb(e_i) = e_i:

CS : KNN(x_i, e_i)   (1)

where CS yields the top-similar entities from 𝒢 that are candidates for augmenting sentences.

Figure 3: Overview of Data Augmentation. The embeddings of each sentence from CardioCCC (Emb(x_i)) and the embeddings of the entities in the DisTEMIST Gazetteer (Emb(e_i)) are extracted, and the K-Nearest Neighbors (KNN) algorithm is used to compute the similarities between these embeddings.

The augmented sentences X̂ are formulated as:

X̂ = {x̂_i = CS(x_i) | ∀ x_i ∈ X}   (2)

Consequently, the augmented dataset (𝒟𝒜) is expressed as:

𝒟𝒜 = {(x̂_i, y_i) ∪ (x_i, y_i) | ∀ x_i ∈ X, ∀ x̂_i ∈ X̂, ∀ y_i ∈ Y}   (3)

where X and Y denote the original sets of sentences and their corresponding labels, respectively. This method augments the dataset by merging the original sentences (X) with the contextually similar sentences (X̂), as shown in the data augmentation flow of Figure 4.

5.3. Transfer learning on the CardioCCC Augmented corpus
The Biomedical Transformer Backbone, developed on DisTEMIST (𝒟), is further trained on the CardioCCC Augmented (𝒟𝒜) dataset. This final training phase allows the model to generate a third set of predictions, benefiting from the increased diversity and richness of the data. Through this strategy, we enhance the model's ability to generalize and recognize diseases more accurately. By initially fine-tuning on the DisTEMIST (𝒟) corpus, the model gains a broad understanding of biomedical language, which is essential for accurate disease recognition. Training on the enriched CardioCCC (𝒞) corpus further refines the model to focus on cardiological data, ensuring that its predictions are contextually relevant and precise.

5.4. BioNER Fusion
To enhance the coverage of the entities extracted from clinical notes, we merge the annotations generated by the different Biomedical Transformer Backbones.
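Before turning to fusion, the context-similarity replacement of Section 5.2 can be sketched with a toy K = 1 nearest-neighbor lookup; the embedding vectors and gazetteer terms below are hypothetical stand-ins for the transformer-derived representations:

```python
import math

# Sketch of entity replacement via Context Similarity with K = 1 (as in the
# paper). The 3-dimensional vectors are toy stand-ins for sentence/entity
# embeddings produced by the transformer backbone.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_entity(sentence_emb, gazetteer):
    """gazetteer: dict mapping entity term -> embedding; returns the single
    most contextually similar term (KNN with K = 1)."""
    return max(gazetteer, key=lambda term: cosine(sentence_emb, gazetteer[term]))

# Hypothetical gazetteer entries with toy embeddings.
gazetteer = {
    "dolor toracico": [0.9, 0.1, 0.0],
    "electrocardiograma": [0.1, 0.8, 0.3],
}
sentence_emb = [0.85, 0.2, 0.05]  # toy embedding of the sentence context
print(nearest_entity(sentence_emb, gazetteer))  # → dolor toracico
```

The selected gazetteer term then replaces the original B/I-tagged mention in the sentence, and the augmented copy is added alongside the original, as in Eq. (3).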
During the BioNER Fusion (NER_FUSION) phase, however, it is essential to define merging strategies to handle any overlapping annotations. To achieve this, we establish a priority level based on the predictive performance of the models, allowing us to select the correct annotation in case of conflicts. Initially, we perform a fusion operation to remove duplicate extracted entities that overlap entirely.

Figure 4: Example of Data Augmentation with Context Similarity CS. Starting from the original sentence in CardioCCC (X), e.g. "The patient showed symptoms of angina and was advised to undergo an ECG", the most contextually similar entities are retrieved from the Gazetteer (for "angina": {"chest pain", "coronary artery disease", "myocardial infarction"}; for "ECG": {"electrocardiogram", "EKG", "heart monitor"}). Selecting the entity with the highest contextual similarity yields the augmented sentence (X̂), e.g. "The patient showed symptoms of chest pain and was advised to undergo an electrocardiogram", which is then added to the initial dataset (𝒞).

To manage the fusion of the NER taggers produced by cross-domain transfer learning, we detect overlaps with the following function:

OL : min(E_{x_i}, E_{x_i'}) − max(S_{x_i}, S_{x_i'}) ≥ 0   (4)

where E_{x_i} and S_{x_i} represent the end span and start span of entity x_i from the first model, while E_{x_i'} and S_{x_i'} refer to the second model. Sometimes the overlap is not complete: the start span and end span of one entity partially coincide with those of another (even if not identical), resulting in OL > 0. The BioNER Fusion strategy (NER_FUSION) is then defined as:

NER_FUSION = x_IMP         if OL ≥ 0
             (x_i, x_i')   if OL < 0   (5)

In such scenarios, an overlap is resolved by keeping the entity predicted by the model with the higher priority level, x_IMP.
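The overlap test of Eq. (4) and the priority rule of Eq. (5) can be sketched as follows; the span values are adapted from the Figure 5 example, and the dict-based annotation format is an illustrative assumption (lower priority number = higher priority, as described for the priority scheme):

```python
# Sketch of the NER Fusion overlap rule (Eq. 4): two spans overlap when
# min(end, end') - max(start, start') >= 0. Conflicts keep the annotation
# from the higher-priority model (lower priority number), per Eq. (5).
def overlaps(a, b):
    return min(a["end"], b["end"]) - max(a["start"], b["start"]) >= 0

def fuse(annotations):
    """annotations: list of dicts with keys start, end, text, priority."""
    kept = []
    for ann in sorted(annotations, key=lambda a: a["priority"]):
        # Keep an annotation only if it conflicts with nothing already kept.
        if not any(overlaps(ann, k) for k in kept):
            kept.append(ann)
    return sorted(kept, key=lambda a: a["start"])

anns = [  # spans adapted from the Figure 5 example
    {"start": 380, "end": 411, "text": "enfermedad por coronavirus 2019", "priority": 1},
    {"start": 380, "end": 421, "text": "enfermedad por coronavirus 2019 (COVID-19", "priority": 2},
    {"start": 380, "end": 406, "text": "enfermedad por coronavirus", "priority": 3},
    {"start": 413, "end": 421, "text": "COVID-19", "priority": 1},
]
print([a["text"] for a in fuse(anns)])
# → ['enfermedad por coronavirus 2019', 'COVID-19']
```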
The priority scheme assigned to the models is fixed according to the performance observed on the internal test set: for example, a model with priority level 1 takes precedence over a model with priority level 2. This approach ensures that all extracted entities are retained in the final clinical note, thereby enhancing the overall entity extraction process.

Figure 5: Example of NER Fusion (NER_FUSION). Three overlapping predictions of "enfermedad por coronavirus" variants are produced by models 1, 2, and 3 (spans 380-411, 380-421, and 380-406, respectively), together with a separate prediction "COVID-19" (span 413-421) by model 1. Following our BioNER Fusion strategy, the predictions of the higher-priority model (model 1) are chosen as the final annotations: "enfermedad por coronavirus 2019" (380-411) and "COVID-19" (413-421).

6. Experiments
The performance of the proposed approaches for BioNER was assessed by participating in the MultiCardioNER Shared Task2 as part of the BioASQ 2024 challenge. This section presents the results of our methodology on the final test set, along with preliminary experiments conducted on the training corpus provided by the challenge organizers.

6.1. Experimental Setup
6.1.1. Evaluation Metrics
The evaluation compares the automatically generated results with the manual annotations of experts. The primary evaluation metrics for track 1 are micro-averaged precision (MiP), recall (MiR), and F1 score (MiF1). Results were computed with the evaluation library3 released by the organizers.

6.1.2. Configuration
The BioNER system was implemented using the HuggingFace Transformers library (v4.40.2), exploiting the various Spanish biomedical Transformer networks available in its model repository.
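A minimal sketch of this fine-tuning setup is shown below, assuming the hyperparameters reported in Section 6.2 (batch size 4, learning rate 8e-5, weight decay 0.2, early stopping after five stagnant epochs); the output directory and epoch cap are illustrative assumptions, and dataset preparation is omitted:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

# One of the backbones listed in Table 1 (bsc-bio-ehr-es).
checkpoint = "PlanTL-GOB-ES/bsc-bio-ehr-es"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=3)  # B, I, O

args = TrainingArguments(
    output_dir="bioner-cardio",       # illustrative path
    per_device_train_batch_size=4,    # batch size found optimal in Section 6.2
    learning_rate=8e-5,               # best learning rate per Section 6.2
    weight_decay=0.2,                 # best weight decay per Section 6.2
    num_train_epochs=50,              # assumption: generous cap, bounded by early stopping
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
    # train_dataset / eval_dataset: tokenized DisTEMIST or CardioCCC splits
)
```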
Table 1: Spanish RoBERTa-based models for clinical biomedical language used for experimentation.

Model Name | Size | Document types | Biomedical Corpora | Year
r-b-bio-clinical-es4 [12] | 504 MB | Clinical Cases, Misc Clinical Docs and Reports | Medical Crawler [17], Scielo [18], BARR2 [19], Wikipedia Life Sciences, Patents, EMEA, mespen_Medline [20], Pubmed | 2021
r-es-clinical-trials-ner5 [15] | 496 MB | Clinical Trials, Studies | CT-EBM-SP corpus [21] | 2021
bsc-bio-ehr-es6 [13] | 499 MB | Clinical Cases, Misc Clinical Docs and Reports | Medical Crawler [17], Scielo [18], BARR2 [19], Wikipedia Life Sciences, Patents, EMEA, mespen_Medline [20], Pubmed | 2022
r-b-bio-clinical-es-FT7 [12] | 502 MB | Clinical Docs and Reports | The Colorado Richly Annotated Full-Text (CRAFT) Corpus [14] | 2022

2 MultiCardioNER challenge website: https://temu.bsc.es/multicardioner/
3 MultiCardioNER Evaluation Library: https://github.com/nlp4bia-bsc/multicardioner_evaluation_library
4 https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
5 https://huggingface.co/lcampillos/roberta-es-clinical-trials-ner
6 https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es
7 https://huggingface.co/StivenLancheros/roberta-base-biomedical-clinical-es-finetuned-ner-CRAFT

Table 1 shows the models selected for the experiments. These models were chosen not only for their effectiveness but also for their relatively small size, making them executable even on less powerful hardware and thus suitable for low-resource environments. We fine-tuned our models in a Google Colab environment, which provided a Tesla T4 GPU. In the phase prior to our submission, we studied the effects of various hyperparameters and the generalization error of our models by dividing the original corpus of clinical cases into three parts: (1) a training set (60% of the original corpus) used to train the model, (2) a validation set (20% of the original corpus) used to evaluate the effects of the hyperparameters, and (3) a test set (20% of the original
corpus) to assess the models' ability to generalize to unseen data (Internal Test Set Results).

6.2. Results
To conduct the experiments, we first analyzed how variations in the hyperparameters affected performance on our validation set. We adjusted the batch size, determining that an optimal size was 4; we then tuned the learning rate, finding that 8e-5 yielded the best results; finally, we modified the weight decay, concluding that a value of 0.2 was optimal. After identifying these hyperparameters, we increased the number of training epochs and implemented an early stopping criterion that halts training if performance on the validation set does not improve for five consecutive epochs. Further details on hyperparameter tuning are provided in Appendix A.

6.2.1. Cross-domain BioNER Evaluation
To construct the optimal cross-domain transfer learning model, we conducted an internal test to select the best combination of pre-trained models used at the various stages of the cross-domain process, as shown in Table 2. The best results are shown in bold; these configurations were submitted for evaluation on the MultiCardioNER test set released by the challenge organizers, with results shown in Table 3. The model bsc-bio-ehr-es, trained on CardioCCC (𝒞), achieved the best results on the external test with an F1 score of 0.7924, owing to the specificity of the CardioCCC dataset, which includes terminology and concepts related to cardiology. In comparison, models trained on DisTEMIST (𝒟), on the combination of DisTEMIST and CardioCCC (𝒟 ∪ 𝒞), and on DisTEMIST and CardioCCC Augmented (𝒟 ∪ 𝒟𝒜) showed lower performance. This is attributed to the more general nature of the DisTEMIST dataset, which fails to capture the specialized cardiology terms present in the test sets. This motivates the proposed model fusion approach.

Table 2: Internal Test Results. Results were obtained by splitting the CardioCCC data into 80% training and 20% test datasets.
The 'Dataset' column indicates the dataset on which the pre-trained model was fine-tuned.

Dataset | Pre-trained Model | MiP | MiR | MiF1 | Accuracy
DisTEMIST | bsc-bio-ehr-es | 0.6871 | 0.7313 | 0.7085 | 0.9628
DisTEMIST | r-b-bio-clinical-es | 0.6845 | 0.7203 | 0.7020 | 0.9598
DisTEMIST | r-b-bio-clinical-es-FT | 0.6962 | 0.7186 | 0.7073 | 0.9610
DisTEMIST | r-es-clinical-trials-ner | 0.7023 | 0.7284 | 0.7152 | 0.9615
CardioCCC | bsc-bio-ehr-es | 0.8000 | 0.8122 | 0.8061 | 0.9753
CardioCCC | r-base-biomedical-clinical-es | 0.7943 | 0.8089 | 0.8016 | 0.9731
CardioCCC | r-b-bio-clinical-es-FT | 0.8014 | 0.8098 | 0.8056 | 0.9741
CardioCCC | r-es-clinical-trials-ner | 0.7967 | 0.8052 | 0.8009 | 0.9752
DisTEMIST + CardioCCC | bsc-bio-ehr-es | 0.8082 | 0.8069 | 0.8075 | 0.9754
DisTEMIST + CardioCCC | r-base-biomedical-clinical-es | 0.7940 | 0.8152 | 0.8045 | 0.9743
DisTEMIST + CardioCCC | r-b-bio-clinical-es-FT | 0.7959 | 0.8102 | 0.8030 | 0.9735
DisTEMIST + CardioCCC | r-es-clinical-trials-ner | 0.7983 | 0.8184 | 0.8082 | 0.9751
DisTEMIST + CardioCCC Augmented | bsc-bio-ehr-es | 0.8951 | 0.9047 | 0.8998 | 0.9886
DisTEMIST + CardioCCC Augmented | r-base-biomedical-clinical-es | 0.8911 | 0.8930 | 0.8920 | 0.9875
DisTEMIST + CardioCCC Augmented | r-b-bio-clinical-es-FT | 0.9012 | 0.9080 | 0.9046 | 0.9880
DisTEMIST + CardioCCC Augmented | r-es-clinical-trials-ner | 0.9125 | 0.9038 | 0.9081 | 0.9888

Table 3: MultiCardioNER Test Results. Results obtained on the official test data. The 'Dataset' column indicates the dataset on which the pre-trained model was fine-tuned, with a comparison against the overall challenge results.

Combination | Dataset | Pre-trained Model | MiP | MiR | MiF1
φ_𝒟 | DisTEMIST | r-es-clinical-trials-ner | 0.7184 | 0.7275 | 0.723
φ_𝒞 | CardioCCC | bsc-bio-ehr-es | 0.8046 | 0.7804 | 0.7924
φ_𝒟∪𝒞 | DisTEMIST + CardioCCC | r-es-clinical-trials-ner | 0.7784 | 0.7744 | 0.7764
φ_𝒟∪𝒟𝒜 | DisTEMIST + CardioCCC Augmented | r-es-clinical-trials-ner | 0.7784 | 0.7744 | 0.7764
- | Median of the Challenge | - | - | - | 0.7566
- | Average of the Challenge | - | - | - | 0.6772
- | Highest Precision | - | 0.8919 | - | -
- | Highest Recall | - | - | 0.8243 | -
- | Highest F1 | - | - | - | 0.8199

6.2.2.
BioNER Fusion Evaluation
We evaluate the impact of fusion applied to the best combinations identified previously. Table 4 shows that the most promising results stem from our top submission to the competition. Specifically, the BioNER Fusion (NER_FUSION) of the combination of CardioCCC with the bsc-bio-ehr-es model (φ_𝒞), DisTEMIST + CardioCCC with r-es-clinical-trials-ner (φ_𝒟∪𝒞), and DisTEMIST + CardioCCC Augmented with r-es-clinical-trials-ner (φ_𝒟∪𝒟𝒜) exhibited superior predictive performance. This integration, facilitated by our fusion strategy, yielded an improvement in recall (MiR), showcasing the system's heightened ability to identify relevant entities. This outcome implies that fusion enabled the system to offset individual model deficiencies, contributing to an overall improvement in entity extraction effectiveness. In contrast, the fusion coupled with the String Matching Cutter exhibits high precision but low recall, indicating a conservative tendency to recognize only highly probable entities; conversely, the combination with the String Matching Adder is more inclusive, at the expense of lower precision.
In conclusion, examining the overall results of the challenge reveals that the leading models (e.g., mDeBERTa, XLM-RoBERTa, CLIN-X-ES) are at least 3-4 times larger than those employed in our approach. Despite this, the best result achieved by our system nearly matched the performance of the top models, with an F1 score of 0.791 compared to approximately 0.82. This demonstrates that our approach is not only effective but also more efficient in terms of computational resources, making it well suited to practical implementations under hardware constraints. Additionally, the fusion of models trained on various datasets (CardioCCC, DisTEMIST, and the augmented datasets) has demonstrated the system's capability to integrate and balance information from diverse sources, thereby enhancing overall performance and flexibility.

Table 4: BioNER Fusion results. "(φ_𝒞, φ_𝒟∪𝒞, φ_𝒟∪𝒟𝒜)" denotes the combination of the models trained on CardioCCC (𝒞), DisTEMIST + CardioCCC (𝒟 ∪ 𝒞), and DisTEMIST + CardioCCC Augmented (𝒟 ∪ 𝒟𝒜); "(φ_𝒟𝒜 ×4)" denotes the combination of the four models used during the experiments, each fine-tuned on the Augmented Dataset (𝒟𝒜); "+ String Matching Cutter" and "+ String Matching Adder" denote the application of String Matching using the Gazetteer: the Cutter removes entities not present in the Gazetteer, while the Adder adds entities present in the Gazetteer that were not extracted by the model combination.

Fusion | MiP | MiR | MiF1
NER_FUSION(φ_𝒞, φ_𝒟∪𝒞, φ_𝒟∪𝒟𝒜) | 0.7794 | 0.803 | 0.791
NER_FUSION(φ_𝒟𝒜 × 4) | 0.7346 | 0.7799 | 0.7566
NER_FUSION(φ_𝒞, φ_𝒟∪𝒞, φ_𝒟∪𝒟𝒜) + String Matching Cutter | 0.8919 | 0.4897 | 0.6323
NER_FUSION(φ_𝒟𝒜 × 4) + String Matching Cutter | 0.8886 | 0.4744 | 0.6185
NER_FUSION(φ_𝒞, φ_𝒟∪𝒞, φ_𝒟∪𝒟𝒜) + String Matching Adder | 0.7079 | 0.7775 | 0.7411
NER_FUSION(φ_𝒟𝒜 × 4) + String Matching Adder | 0.6568 | 0.6244 | 0.6402
Median of the Challenge | - | - | 0.7566
Average of the Challenge | - | - | 0.6772
Highest Precision | 0.8919 | - | -
Highest Recall | - | 0.8243 | -
Winner F1 | - | - | 0.8199

Table 5: Error Analysis, conducted on several BioNER Fusion strategies (total concepts in the ground truth: 7884).

Fusion | N. Extr. Ent. | Correct | CFP | CFN | RLOS
NER_FUSION(φ_𝒞, φ_𝒟∪𝒞, φ_𝒟𝒜) | 8123 | 6331 (80.30%) | 684 | 398 | 1122
NER_FUSION(φ_𝒟𝒜 × 4) | 8370 | 6149 (77.99%) | 934 | 294 | 1297
NER_FUSION(φ_𝒞, φ_𝒟∪𝒞, φ_𝒟𝒜) + String Matching Cutter | 4329 | 3861 (48.97%) | 206 | 3745 | 271
NER_FUSION(φ_𝒟𝒜 × 4) + String Matching Cutter | 4209 | 3740 (47.43%) | 229 | 3886 | 244
NER_FUSION(φ_𝒞, φ_𝒟∪𝒞, φ_𝒟𝒜) + String Matching Adder | 8660 | 6350 (80.54%) | 1187 | 367 | 1137
NER_FUSION(φ_𝒟𝒜 × 4) + String Matching Adder | 7495 | 5146 (65.27%) | 1044 | 1245 | 1315

6.3. Error Analysis
Inspired by Moscato et al.
[22], we conducted a detailed error analysis of the differences among the models by examining the number of correctly retrieved entity mentions (Correct). The errors fall into three distinct types:

• Complete False Positive (CFP): the model identifies an entity that is not annotated as a named entity.
• Complete False Negative (CFN): the model fails to identify an entity that is annotated as a named entity.
• Right Label Overlapping Span (RLOS): the model correctly detects the presence of an annotated named entity, but the predicted span is incorrect.

This categorization allowed us to better understand the strengths and weaknesses of our system. The results are shown in Table 5. The error analysis corroborates the previous evaluations: combining the fusion with string matching significantly reduces the number of false positives but drastically increases the number of false negatives, as it extracts only about half of the relevant entities. A likely cause of these results lies in the quality and size of the Gazetteer used. The best balance between precision and recall is once again achieved by 𝑁𝐸𝑅𝐹𝑈𝑆𝐼𝑂𝑁(𝜑𝒞, 𝜑𝒟∪𝒞, 𝜑𝒟∪𝒟𝒜).

7. Conclusion

In this study, we presented an innovative approach to address the challenge of BioNER in the biomedical domain, with a particular focus on cardiology. Our methodology integrates data augmentation techniques and data fusion mechanisms to enhance the robustness and coverage of the generated annotations. By utilizing models pre-trained on biomedical corpora and refining them with domain-specific cardiology data, we achieved significant results, overcoming the limitations related to the scarce availability of domain-specific data. However, there are potential disadvantages to our approach.
Data augmentation techniques, while increasing the diversity of the training data, might also introduce noise and potentially irrelevant information, which could hinder the model's performance. Additionally, the complexity of integrating multiple models through data fusion increases computational requirements and may pose challenges in real-time applications.

The results obtained in the MultiCardioNER competition, part of the BioASQ 2024 challenge, demonstrate the effectiveness of our approach. The key characteristics of our results are efficacy, computational efficiency, domain adaptation, flexibility, balance between precision and recall, robustness, and innovativeness. Together, these elements illustrate how our approach can be a valid and practical solution for entity extraction from biomedical texts, especially in contexts with limited computational resources. We exceeded the median F1 score by 4%, achieving a score of 0.791. This success highlights the potential of the proposed techniques in addressing BioNER challenges in specific biomedical contexts, paving the way for further improvements and applications in various clinical fields.

Acknowledgements

We acknowledge financial support from (1) the PNRR MUR project PE0000013-FAIR and (2) the Italian Ministry of Economic Development, via the ICARUS (Intelligent Contract Automation for Rethinking User Services) project (CUP: B69J23000270005).

References

[1] N. D. Nguyen, L. Du, W. L. Buntine, C. Chen, R. Beare, Hardness-guided domain adaptation to recognise biomedical named entities under low-resource scenarios, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Association for Computational Linguistics, 2022, pp. 4063–4071. URL: https://doi.org/10.18653/v1/2022.emnlp-main.271. doi:10.18653/V1/2022.EMNLP-MAIN.271.
[2] S. Chen, G. Aguilar, L.
Neves, T. Solorio, Data augmentation for cross-domain named entity recognition, in: M. Moens, X. Huang, L. Specia, S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Association for Computational Linguistics, 2021, pp. 5346–5356. URL: https://doi.org/10.18653/v1/2021.emnlp-main.434. doi:10.18653/V1/2021.EMNLP-MAIN.434.
[3] N. Sasikumar, K. S. I. Mantri, Transfer learning for low-resource clinical named entity recognition, in: T. Naumann, A. B. Abacha, S. Bethard, K. Roberts, A. Rumshisky (Eds.), Proceedings of the 5th Clinical Natural Language Processing Workshop, ClinicalNLP@ACL 2023, Toronto, Canada, July 14, 2023, Association for Computational Linguistics, 2023, pp. 514–518. URL: https://doi.org/10.18653/v1/2023.clinicalnlp-1.53. doi:10.18653/V1/2023.CLINICALNLP-1.53.
[4] M. Zhou, J. Tan, S. Yang, H. Wang, L. Wang, Z. Xiao, Ensemble transfer learning on augmented domain resources for oncological named entity recognition in chinese clinical records, IEEE Access 11 (2023) 80416–80428. URL: https://doi.org/10.1109/ACCESS.2023.3299824. doi:10.1109/ACCESS.2023.3299824.
[5] U. Phan, N. Nguyen, Simple semantic-based data augmentation for named entity recognition in biomedical texts, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 21st Workshop on Biomedical Language Processing, BioNLP@ACL 2022, Dublin, Ireland, May 26, 2022, Association for Computational Linguistics, 2022, pp. 123–129. URL: https://doi.org/10.18653/v1/2022.bionlp-1.12. doi:10.18653/V1/2022.BIONLP-1.12.
[6] S. Ghosh, U. Tyagi, S. Kumar, D. Manocha, Bioaug: Conditional generation based data augmentation for low-resource biomedical NER, in: H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B.
Poblete (Eds.), Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, ACM, 2023, pp. 1853–1858. URL: https://doi.org/10.1145/3539618.3591957. doi:10.1145/3539618.3591957.
[7] Q. Sun, P. Bhatia, Neural entity recognition with gazetteer based fusion, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, Association for Computational Linguistics, 2021, pp. 3291–3295. URL: https://doi.org/10.18653/v1/2021.findings-acl.291. doi:10.18653/V1/2021.FINDINGS-ACL.291.
[8] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz, G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[9] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[10] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M.
Krallinger, Overview of distemist at bioasq: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 179–203. URL: https://ceur-ws.org/Vol-3180/paper-11.pdf.
[11] I. Bartolini, V. Moscato, M. Postiglione, G. Sperlì, A. Vignali, COSINER: context similarity data augmentation for named entity recognition, in: T. Skopal, F. Falchi, J. Lokoc, M. L. Sapino, I. Bartolini, M. Patella (Eds.), Similarity Search and Applications - 15th International Conference, SISAP 2022, Bologna, Italy, October 5-7, 2022, Proceedings, volume 13590 of Lecture Notes in Computer Science, Springer, 2022, pp. 11–24. URL: https://doi.org/10.1007/978-3-031-17849-8_2. doi:10.1007/978-3-031-17849-8_2.
[12] C. P. Carrino, J. Armengol-Estapé, A. Gutiérrez-Fandiño, J. Llop-Palao, M. Pàmies, A. Gonzalez-Agirre, M. Villegas, Biomedical and clinical language models for spanish: On the benefits of domain-specific pretraining in a mid-resource scenario, CoRR abs/2109.03570 (2021). URL: https://arxiv.org/abs/2109.03570. arXiv:2109.03570.
[13] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained biomedical language models for clinical NLP in spanish, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 21st Workshop on Biomedical Language Processing, BioNLP@ACL 2022, Dublin, Ireland, May 26, 2022, Association for Computational Linguistics, 2022, pp. 193–199. URL: https://doi.org/10.18653/v1/2022.bionlp-1.19. doi:10.18653/V1/2022.BIONLP-1.19.
[14] K. B. Cohen, A. Lanfranchi, M. J. Choi, M. Bada, W. A. B. Jr., N. Panteleyeva, K. Verspoor, M.
Palmer, L. E. Hunter, Coreference annotation and resolution in the colorado richly annotated full text (CRAFT) corpus of biomedical journal articles, BMC Bioinform. 18 (2017) 372:1–372:14. URL: https://doi.org/10.1186/s12859-017-1775-9. doi:10.1186/S12859-017-1775-9.
[15] L. C. Llanos, A. Valverde-Mateos, A. Capllonch-Carrión, A. Moreno-Sandoval, A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine, BMC Medical Informatics Decis. Mak. 21 (2021) 69. URL: https://doi.org/10.1186/s12911-021-01395-z. doi:10.1186/S12911-021-01395-Z.
[16] L. A. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: D. Yarowsky, K. Church (Eds.), Third Workshop on Very Large Corpora, VLC@ACL 1995, Cambridge, Massachusetts, USA, June 30, 1995, 1995, pp. 82–94. URL: https://aclanthology.org/W95-0107/.
[17] C. P. Carrino, J. Silveira-Ocampo, A. Gonzalez-Agirre, A. Gutiérrez-Fandiño, M. Krallinger, M. Villegas, Spanish biomedical crawled corpus, 2022. URL: https://doi.org/10.5281/zenodo.5513237. doi:10.5281/zenodo.5513237.
[18] A. Intxaurrondo, Scielo-spain-crawler, 2019. URL: https://doi.org/10.5281/zenodo.2541681. doi:10.5281/zenodo.2541681.
[19] A. Intxaurrondo, M. Pérez-Pérez, G. P. Rodríguez, J. A. López-Martín, J. Santamaría, S. de la Peña, M. Villegas, S. A. Akhondi, A. Valencia, A. Lourenço, M. Krallinger, The biomedical abbreviation recognition and resolution (BARR) track: Benchmarking, evaluation and importance of abbreviation recognition systems applied to spanish biomedical abstracts, in: R. Martínez, J. Gonzalo, P. Rosso, S. Montalvo, J. C. de Albornoz (Eds.), Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) co-located with 33th Conference of the Spanish Society for Natural Language Processing (SEPLN 2017), Murcia, Spain, September 19, 2017, volume 1881 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 230–246.
URL: https://ceur-ws.org/Vol-1881/Overview1.pdf.
[20] M. Villegas, A. Intxaurrondo, A. Gonzalez-Agirre, M. Krallinger, Mespen_parallel-corpora, 2019. URL: https://doi.org/10.5281/zenodo.3562536. doi:10.5281/zenodo.3562536.
[21] L. Campillos-Llanos, A. Valverde-Mateos, A. Capllonch-Carrión, A. Moreno-Sandoval, CT-EBM-SP - Corpus of Clinical Trials for Evidence-Based-Medicine in Spanish, 2022. URL: https://doi.org/10.1186/s12911-021-01395-z. doi:10.1186/s12911-021-01395-z.
[22] V. Moscato, M. Postiglione, C. Sansone, G. Sperlí, Taughtnet: Learning multi-task biomedical named entity recognition from single-task teachers, IEEE J. Biomed. Health Informatics 27 (2023) 2512–2523. URL: https://doi.org/10.1109/JBHI.2023.3244044. doi:10.1109/JBHI.2023.3244044.

A. Hyperparameters Tuning

We analyzed how variations in hyperparameters influence performance on our validation set. Specifically, we experimented with each model's batch size, learning rate, and weight decay one at a time, examining how well the model performed in terms of precision, recall, and F1 score. We first varied the batch size over a set of candidate values and selected the value that yielded the best performance. With the batch size fixed, we then experimented with various learning rates and, again, selected the value that yielded the highest scores. Finally, with the learning rate and batch size set, we examined small variations in the weight decay and determined the ideal value following the same logic.

Batch size
During training, we varied the batch size, initially set at 16, and then adjusted it to 8, 4, and 2. The results, as shown in Table 6, indicate that the optimal batch size is 4.
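The tables below report micro-averaged precision (MiP), recall (MiR), and F1 (MiF1) over entity mentions. As a brief, generic illustration of how such entity-level micro scores are derived from true-positive, false-positive, and false-negative counts (a sketch assuming exact span-and-label matching, not the challenge's official evaluation script):

```python
# Micro-averaged precision/recall/F1 over entity mentions.
# Each document's entities are represented as a set of (start, end, label)
# tuples; exact-match comparison is assumed, which is one common convention.

def micro_prf(gold_sets, pred_sets):
    """gold_sets / pred_sets: one set of (start, end, label) tuples per document."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # predicted and annotated
        fp += len(pred - gold)   # predicted but not annotated
        fn += len(gold - pred)   # annotated but missed
    mip = tp / (tp + fp) if tp + fp else 0.0
    mir = tp / (tp + fn) if tp + fn else 0.0
    mif1 = 2 * mip * mir / (mip + mir) if mip + mir else 0.0
    return mip, mir, mif1
```

For example, one gold mention matched, one spurious prediction, and one miss give MiP = MiR = MiF1 = 0.5.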
Table 6
NER hyperparameter selection - varying batch size

Models                     Batch Size  MiP     MiR     MiF1
r-b-bio-clinical-es        16          0.7599  0.8003  0.7796
r-b-bio-clinical-es        8           0.7676  0.8047  0.7857
r-b-bio-clinical-es        4           0.7899  0.7963  0.7931
r-b-bio-clinical-es-FT     8           0.7690  0.7988  0.7836
r-b-bio-clinical-es-FT     16          0.7707  0.7992  0.7847
r-b-bio-clinical-es-FT     4           0.7798  0.7970  0.7883
bsc-bio-ehr-es             8           0.7534  0.7975  0.7749
bsc-bio-ehr-es             16          0.7514  0.7932  0.7717
bsc-bio-ehr-es             4           0.7865  0.7979  0.7922
r-es-clinical-trials-ner   8           0.7609  0.7997  0.7798
r-es-clinical-trials-ner   16          0.7602  0.7975  0.7784
r-es-clinical-trials-ner   4           0.7877  0.7979  0.7927

Learning rate
After determining the optimal batch size, we varied the initial learning rate used by the AdamW optimizer, setting it between 2e-5 and 8e-5. The best results, as indicated in Table 7, show that the optimal configuration involves a learning rate of 8e-5.

Table 7
NER hyperparameter selection - varying learning rate

Models                     Learning Rate  MiP     MiR     MiF1
r-b-bio-clinical-es        2e-5           0.6983  0.7768  0.7355
r-b-bio-clinical-es        8e-5           0.7899  0.7963  0.7931
r-b-bio-clinical-es-FT     2e-5           0.6924  0.7675  0.7280
r-b-bio-clinical-es-FT     8e-5           0.7798  0.7970  0.7883
bsc-bio-ehr-es             2e-5           0.6957  0.7733  0.7325
bsc-bio-ehr-es             8e-5           0.7865  0.7979  0.7922
r-es-clinical-trials-ner   2e-5           0.7018  0.7814  0.7395
r-es-clinical-trials-ner   8e-5           0.7877  0.7979  0.7927

Weight decay
Finally, we adjusted the weight decay applied to all layers except the bias and LayerNorm weights in the AdamW optimizer, starting with a value of 0.1 and then increasing it to 0.2, which proved to be the best solution, as reported in Table 8.
Table 8
NER hyperparameter selection - varying weight decay

Models                     Weight Decay  MiP     MiR     MiF1
r-b-bio-clinical-es        0.1           0.7384  0.7920  0.7643
r-b-bio-clinical-es        0.2           0.7899  0.7963  0.7931
r-b-bio-clinical-es-FT     0.1           0.7439  0.7872  0.7649
r-b-bio-clinical-es-FT     0.2           0.7798  0.7970  0.7883
bsc-bio-ehr-es             0.1           0.7367  0.7889  0.7619
bsc-bio-ehr-es             0.2           0.7865  0.7979  0.7922
r-es-clinical-trials-ner   0.1           0.7405  0.7923  0.7655
r-es-clinical-trials-ner   0.2           0.7877  0.7979  0.7927

As a result of these analyses, we determined the optimal hyperparameters to be a batch size of 4, a learning rate of 8e-5, and a weight decay of 0.2. We set the number of training epochs to 5 based on preliminary experiments indicating that the models tended to converge rapidly. Furthermore, we observed that performance no longer improved significantly after the fifth epoch. Therefore, to avoid overfitting and to optimize training time, we chose to stop training at the fifth epoch.
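The greedy, one-hyperparameter-at-a-time procedure described in this appendix can be sketched as follows. This is a minimal illustration: `train_and_eval` stands in for fine-tuning a model and returning its validation MiF1, and `fake_eval` is a toy scoring function used purely to make the sketch runnable; the candidate values are those explored above.

```python
# Greedy coordinate-descent hyperparameter search: tune one hyperparameter at a
# time, fixing each at its best-scoring value before moving on to the next.
# Dicts preserve insertion order (Python 3.7+), so hyperparameters are tuned in
# the order they appear in the search space: batch size, then learning rate,
# then weight decay, as in the appendix.

def greedy_search(train_and_eval, search_space, defaults):
    """search_space: {name: [candidates]}; defaults: starting configuration."""
    best_cfg = dict(defaults)
    for name, candidates in search_space.items():
        scores = {}
        for value in candidates:
            cfg = {**best_cfg, name: value}
            scores[value] = train_and_eval(cfg)  # validation MiF1 for this config
        best_cfg[name] = max(scores, key=scores.get)
    return best_cfg

# Toy stand-in for fine-tuning + evaluation (NOT a real model): highest score
# at batch_size=4, lr=8e-5, weight_decay=0.2, mirroring the reported optimum.
def fake_eval(cfg):
    return (-abs(cfg["batch_size"] - 4)
            - abs(cfg["lr"] - 8e-5)
            - abs(cfg["weight_decay"] - 0.2))

space = {"batch_size": [16, 8, 4, 2], "lr": [2e-5, 8e-5], "weight_decay": [0.1, 0.2]}
best = greedy_search(fake_eval, space, {"batch_size": 16, "lr": 2e-5, "weight_decay": 0.1})
# → {'batch_size': 4, 'lr': 8e-05, 'weight_decay': 0.2}
```

Unlike a full grid search over all 4 × 2 × 2 combinations, this greedy scheme needs only one training run per candidate value, at the cost of assuming the hyperparameters interact weakly.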