BIT.UA at MultiCardioNER: Adapting a Multi-head CRF for Cardiology
Notebook for the BioASQ Lab at CLEF 2024

Richard A. A. Jonker1,*, Tiago Almeida1 and Sérgio Matos1
1 IEETA/DETI, LASI, University of Aveiro, Aveiro, Portugal

Abstract
This paper presents the participation of the University of Aveiro Biomedical Informatics and Technologies (BIT.UA) group in the MultiCardioNER task at BioASQ 12, specifically in the CardioDis subtrack, which focuses on adapting Named Entity Recognition (NER) systems to Spanish cardiology case reports. We aimed to address two primary research questions: 1) the generalizability of a NER model trained on general medical concepts to the specialized sub-domain of cardiology, and 2) the robustness of our Multi-Head CRF model. Our team achieved the top result in the competition with an F1 score of 81.99, using the Multi-Head CRF model. Our findings indicate that task-specific data is beneficial to the overall performance of the model, although a model without this data can still be competitive. Additionally, our Multi-Head CRF model demonstrated consistent reliability and robustness, performing well on single-class NER tasks.

Keywords
Named Entity Recognition, Spanish Clinical Procedures, Transformers, Data Augmentation, Multi-head CRF, Robust ML

1. Introduction
Named Entity Recognition (NER) is a fundamental task in the field of natural language processing, especially crucial in the medical domain where it aids significantly in structuring unstructured text for enhanced patient care and medical research. While general NER technologies have seen considerable advancement, their application to medical texts presents unique challenges due to the complexity and specificity of the medical language. To address these challenges, numerous competitions have been organized to foster the development of NER systems specifically tailored to the biomedical domain.
Our team has continually engaged in these competitions, gaining experience through participation in BioCreative events such as the NLM-Chem Track [1, 2] and the BioRED Track [3, 4], both of which focused on English biomedical articles. These challenges allowed us to build a solid foundation in NER methodologies, initially leveraging state-of-the-art BERT-based models and masked Conditional Random Fields (CRF) [2, 5] over BIO-tagged sequences [6, 7]. We subsequently expanded our efforts to include Spanish medical texts, participating in challenges like MedProcNER [8] and SympTEMIST [9], where we secured first and second places respectively in the NER evaluations. This extensive experience has led to the creation of our versatile Multi-Head CRF model [10], a highly competitive NER solution that encapsulates our accumulated expertise.

This year, as part of the BioASQ challenge [11], the Text Mining Unit (TEMU) at the Barcelona Supercomputing Center (BSC) introduced the MultiCardioNER [12, 13] challenge. This challenge addresses the need for better recognition of clinical variables in cardiology, given the high mortality rate from cardiovascular diseases (CVDs), which cause approximately 17.9 million deaths annually [14]. It includes two subtracks: CardioDis, which adapts NER systems to Spanish cardiology case reports, and MultiDrug, which tests these systems on medication mentions in English, Spanish, and Italian. The dataset includes a training set of 1,000 general clinical case reports, a development set of 258 cardiology cases, and a test set of 250 cardiology reports. Systems are evaluated on micro-averaged Precision, Recall, and F-measure to determine their adaptability and accuracy in diverse medical settings.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: richard.jonker@ua.pt (R. A. A. Jonker); tiagomeloalmeida@ua.pt (T. Almeida); aleixomatos@ua.pt (S. Matos)
ORCID: 0000-0002-3806-6940 (R. A. A. Jonker); 0000-0002-4258-3350 (T. Almeida); 0000-0003-1941-3983 (S. Matos)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

This paper details the participation of the Biomedical Informatics and Technologies (BIT.UA) group of the University of Aveiro in the MultiCardioNER challenge, where we utilize our Multi-Head CRF model [10]. Due to time constraints, we focused our efforts solely on the CardioDis subtrack. Unlike other challenges, the CardioDis subtrack is designed to adapt general concept recognition systems, trained on the DisTEMIST dataset, to cardiology case reports. This leads to our first research question:

1. How effectively can a model trained on general concepts adapt to the specialized domain of cardiology cases?

Additionally, we are utilizing this challenge to test the robustness of our Multi-Head CRF model, prompting our second research question:

2. Is our Multi-Head CRF model capable of delivering competitive out-of-the-box performance in a specialized clinical setting?

The remainder of the paper is organized as follows: Section 2 provides a review of related work, focusing on the latest advancements in biomedical Named Entity Recognition. Section 3 describes our methodology, detailing the application of our Multi-Head CRF model to address the specific challenges of the MultiCardioNER competition. Section 4 presents our validation results and the official challenge evaluations, showcasing the performance of our model. In Section 5, we discuss the outcomes in relation to our initial research questions, providing insights into the adaptability and robustness of our approach. Section 6 concludes the paper, summarizing our key findings.

2. Related Work
Named Entity Recognition (NER) in the biomedical domain presents unique challenges due to the limited availability of annotated data. The annotation process is both time-consuming and requires a high level of expertise, making it expensive [15, 16]. Most research in biomedical NER has been focused on the English language [17], but there is a growing need to extend this work to other languages.

Several competitions have aimed to address clinical NER in the Spanish language, focusing on various entity types such as compounds and drugs (PharmaCoNER [18]), diseases (DisTEMIST [19]), tumor morphology (CANTEMIST [20]), medical procedures (MedProcNER [21]), and symptoms (SympTEMIST [22]). All of these competitions utilize the Spanish Clinical Case Corpus (SPACCC), which comprises 1,000 clinical case reports from Spanish medical publications (SciELO).

Recent advancements in NER predominantly utilize transformer-based models for sequence labelling, which have proven effective in managing complex entity recognition tasks [8, 23, 24, 25]. These models often leverage pretrained language-specific versions of BERT [26] and RoBERTa [27], such as BETO, a BERT model trained on a Spanish corpus [28], and bsc-bio-es, a RoBERTa model tailored to Spanish biomedical vocabulary [29]. Further enhancements have been observed by integrating masked Conditional Random Fields (CRFs) [2, 5] atop the transformer backbone, a technique that our research group has explored extensively [1, 2, 3, 4, 8, 9]. In these works, we also demonstrate strong transfer-learning capabilities, effectively adapting a Spanish NER model trained on clinical notes [30] to the diverse target domains.

Another significant development is the introduction of the SpanMarker model, which utilizes the novel Packed Levitated Markers (PL-Marker) approach [31].
This model enhances NER performance by employing a neighborhood-oriented packing strategy to accurately model entity boundaries and a subject-oriented strategy for complex span pair classification tasks. Concurrently, innovative methods like those in the AIONER system, which prepends special tokens to input texts, enable the adaptation to annotated corpora lacking comprehensive coverage of all entity classes [32]. Similarly, HunFlair2 utilizes a multi-class BIO tagging scheme, enhancing its ability to distinguish between various entity types such as genes and diseases, thereby showcasing the adaptability and versatility of these advanced NER systems [33].

3. Methodology
In this section, we describe the dataset and the evaluation metrics, and provide a brief overview of our system.

3.1. Dataset
The MultiCardioNER challenge utilizes two datasets: the previously released DisTEMIST dataset and the newly released CardioCCC dataset. The DisTEMIST dataset comprises 1,000 documents from the SPACCC corpus, annotated with disease mentions. For the validation and evaluation of the system, the CardioCCC dataset was created. This collection consists of 508 cardiology clinical case reports, split into 258 documents for development and 250 for testing. The goal of the task is to train a generic system capable of recognizing disease mentions and to evaluate it within the more specific cardiology domain. Whilst the goal of the competition is to evaluate the adaptation of the model, we also use the validation (development) set to train some of our models.¹

¹ The event organizers explicitly mentioned: “Participants are encouraged to experiment with the documents and annotations as they see fit.”

3.2. Evaluation Metrics
The official metrics used in this work are the standard micro-averaged precision, recall and F1-score.

• Precision (P): The ratio of true positive (TP) predictions to the total number of positive predictions (TP + FP). It is defined as:

\[ P = \frac{TP}{TP + FP}. \]

• Recall (R): The ratio of true positive (TP) predictions to the total number of actual positives (TP + FN). It is defined as:

\[ R = \frac{TP}{TP + FN}. \]

• F1-Score (F1): The harmonic mean of precision and recall. It is defined as:

\[ F1 = 2 \cdot \frac{P \cdot R}{P + R}. \]

3.3. System
The system uses the Multi-Head CRF model [10] as its basis, which lets us test the secondary objective of this work: the robustness of that model. All experiments presented here use the same code and methods as that work. Whilst the Multi-Head CRF architecture was designed and tested for multi-class NER, in this work we configure it to use only one head for single-class classification, as illustrated in Figure 1.

Although the general architecture is unchanged, we provide a brief overview of the Multi-Head CRF architecture to keep this work self-contained. The architecture was inspired by several existing works [1, 2, 8, 9], which achieved competitive results in various challenges. The main idea behind the architecture is to utilize several CRF heads, one per entity class, with a shared transformer as a base, in this case a Spanish RoBERTa model [30]. By having several classification heads, we can solve the problem of overlapping entity classes, since each entity class is predicted by its own head. At the same time, since all heads share the same transformer, the overhead is significantly lower than training several individual classifiers.

Going more in-depth, the work utilizes the well-known BIO tagging schema, where each entity class has its own tagging schema assigned to it. Each CRF classification head consists of a configurable number of dense (hidden) layers, a classification layer, and a CRF layer. Each of these heads then produces a series of labels corresponding to the BIO tagging for the specific entity class of the head. The model is trained using a joint loss function, aggregated from each classification head.

Figure 1: Multi-head CRF architecture, adapted for MultiCardioNER.
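To make the BIO scheme and the micro-averaged metrics of Section 3.2 concrete, the following is a minimal Python sketch, not the code used for the submissions. It assumes a single head that emits plain "B", "I" and "O" labels per token, and it assumes that scoring is done by exact match over token-level spans; the function names (`bio_to_spans`, `f1_from_pr`, `micro_prf`) are hypothetical. A stray "I" without a preceding "B" is treated here as the start of an entity, which is also an assumption (a masked CRF would normally rule such sequences out).

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start token index, end token index, exclusive)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Decode the BIO labels of one head into a set of entity spans."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # close the entity that was open
                spans.add((start, i))
            start = i
        elif tag == "I":
            if start is None:  # assumption: a stray "I" opens an entity
                start = i
        else:  # "O" closes any open entity
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def f1_from_pr(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_prf(gold_docs: List[Set[Span]], pred_docs: List[Set[Span]]):
    """Micro-average: pool TP, FP and FN over all documents, then score."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f1_from_pr(p, r)


if __name__ == "__main__":
    gold = [bio_to_spans(["O", "B", "I", "O", "B", "O"])]
    pred = [bio_to_spans(["O", "B", "I", "O", "O", "B"])]
    print(micro_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

In the toy example, one of the two predicted spans matches a gold span exactly, so TP = FP = FN = 1 and all three metrics equal 0.5.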
The model also employs a document splitting system to overcome the maximum context length of the transformer. Each document is split in a sliding window fashion, with each piece of the document being surrounded by a fixed-length context.

The work also utilizes data augmentation techniques, namely random token replacement and a variation, random token replacement with unknown. In the first technique, a random input token is replaced with a random token from the vocabulary, while in the latter, the token is replaced with the special token ‘[UNK]’. In this work we follow the conclusions of the original work and use only random token replacement, as it performed better. To better control the augmentation, two hyperparameters were put in place: one determines the chance of selecting a sample for augmentation (the augmentation probability), and the other determines how many tokens within the sample get augmented (percentage tags).

Finally, following our previous NER submissions [8, 9], we employed an entity-level ensemble to merge the outputs from various models, which improved overall results. The entity-level ensemble is a majority voting approach over the exact entities predicted by the models, where each entity is added to the final submission if enough models support it.

4. Results
In this section, we present the results obtained with the proposed system. We first evaluate the performance of the model on a validation set, and then discuss the results on the final test set of the competition.

4.1. Validation results
In order to find the optimal hyperparameters for the models to be submitted, we performed basic hyperparameter tuning, investigating varying numbers of training epochs, different augmentation configurations, and different context sizes and numbers of hidden layers. The validation set used for this work was the 258-document development set, containing cardiology case reports.
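The entity-level ensemble described in Section 3.3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the submitted implementation: the representation of an entity as a `(document id, start offset, end offset)` tuple, the function name `ensemble_vote`, and the `min_votes` threshold are all assumptions, since the text only states that an entity is kept when enough models support it.

```python
from collections import Counter
from typing import List, Set, Tuple

# Assumed entity representation: (document id, start offset, end offset).
Entity = Tuple[str, int, int]


def ensemble_vote(model_predictions: List[Set[Entity]], min_votes: int) -> Set[Entity]:
    """Keep every exact entity predicted by at least `min_votes` models."""
    votes: Counter = Counter()
    for predictions in model_predictions:
        # Each model contributes a set, so it votes at most once per entity.
        votes.update(predictions)
    return {entity for entity, count in votes.items() if count >= min_votes}


if __name__ == "__main__":
    model_a = {("doc1", 0, 5), ("doc1", 10, 20)}
    model_b = {("doc1", 0, 5), ("doc1", 10, 18)}  # different right boundary
    model_c = {("doc1", 0, 5), ("doc1", 10, 20), ("doc1", 30, 35)}
    models = [model_a, model_b, model_c]
    majority = len(models) // 2 + 1  # strict majority, an assumed threshold
    print(sorted(ensemble_vote(models, majority)))
    # [('doc1', 0, 5), ('doc1', 10, 20)]
```

Because voting is over exact entities, the span `(10, 18)` and the span `(10, 20)` count as different candidates; only the boundary that two of the three models agree on survives.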
The results of this basic hyperparameter search are shown in Figures 2-4. The best-performing model configurations on validation are listed in Table 1. Looking first at Figure 2, we can see a clear performance difference between random augmentation and no augmentation: the use of augmentation improves the overall performance of the system. This is in line with the conclusions drawn from the original Multi-Head CRF work; however, we did not investigate the ‘[UNK]’ augmentation technique. We also note that training for more epochs does not necessarily result in significant performance gains, especially for models with random augmentation.

Examining the optimal augmentation configuration in Figure 3, we observe that lower values of percentage tags perform better, especially with increasing augmentation probabilities. This corresponds to selecting a large number of documents and augmenting a small number of tokens in each. While this is not exactly the configuration used in the original Multi-Head CRF work, it is relatively similar, and similar conclusions can be drawn.

Figure 2: Line graph describing the best augmentation technique against no augmentation with varying number of epochs.

Figure 3: Heatmap showcasing the best augmentation configuration.

Figure 4: Heatmap showing the average performance of models with varying context size and number of hidden layers in the CRF head.

Table 1
Top 5 model configurations on validation data. Ctx. is the context size of the model, HL is the number of hidden layers used in the CRF head, Aug. is either random augmentation or None, PT is the percentage tags and AP is the augmentation probability.

Ctx.  HL  Epoch  Aug.    PT    AP    F1
64    1   90     random  0.25  0.75  74.87
64    1   60     random  0.25  0.75  74.56
32    3   90     random  0.25  0.75  74.40
64    3   90     None    0.2   0.5   74.30
64    3   60     random  0.25  0.75  74.21
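The roles of the augmentation probability (AP) and percentage tags (PT) hyperparameters can be illustrated with a short Python sketch of random token replacement. It rests on several assumptions that the text does not confirm: that PT is the fraction of tokens in a selected sample that get replaced, that replacements are drawn uniformly from the vocabulary, that special tokens are skipped, and that the BIO labels of the sample are left untouched. The function name and its parameters are hypothetical.

```python
import random
from typing import FrozenSet, List, Optional


def random_token_replacement(
    token_ids: List[int],
    vocab_size: int,
    augmentation_probability: float,
    percentage_tags: float,
    rng: Optional[random.Random] = None,
    special_ids: FrozenSet[int] = frozenset(),
) -> List[int]:
    """Return a copy of `token_ids`, possibly with some tokens replaced."""
    rng = rng or random.Random()
    # AP: decide whether this sample is augmented at all.
    if rng.random() >= augmentation_probability:
        return list(token_ids)
    # Assumption: special tokens (e.g. padding) are never replaced.
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    # PT: assumed to be the fraction of candidate tokens to replace.
    n_replace = int(len(candidates) * percentage_tags)
    augmented = list(token_ids)
    for i in rng.sample(candidates, n_replace):
        # Uniform draw from the vocabulary; it may occasionally pick the
        # original token or a special id, which this sketch does not prevent.
        augmented[i] = rng.randrange(vocab_size)
    return augmented


if __name__ == "__main__":
    ids = [101, 2054, 2003, 1996, 3007, 1997, 2605, 102]
    out = random_token_replacement(
        ids, vocab_size=30000, augmentation_probability=0.75,
        percentage_tags=0.25, rng=random.Random(0), special_ids=frozenset({101, 102}),
    )
    print(out)
```

With a high AP and a low PT, as in the best configurations of Table 1, most samples are selected but only a small share of their tokens is perturbed.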
Finally, regarding the context size and number of hidden layers shown in Figure 4, we note that a larger context size performs better on average, with the optimal number of hidden layers being either 1 or 3, which matches the optimal model in the original paper. Across the 123 models in our search, the F1-score ranged from 69.47 to 74.87, with an average of 72.89 and a median of 72.96.

4.2. Competition results
Below we present the official results of our systems in the competition. The competition uses the F1-score as the official metric, with the test set containing 250 documents.

For the competition, we submitted five different systems, following two separate approaches. The first approach was to keep the validation set separate in order to test how well a system trained exclusively on general disease data can directly identify cardiology-related concepts. Our second approach was to use the validation set for training, in combination with the generic disease data, with the intuition that more data is beneficial for training a model. A summary of our submitted systems is given below, with the performance of the models displayed in Table 2.

• run0: This submission used an ensemble of our top 5 validation models (ranging from 74.20 to 74.87), trained using all data including the validation set.
• run1: This submission used an ensemble of our top 17 runs (all above 74 on validation), trained using all data including the validation set.
• run2: This submission used our best-performing model on validation, trained without validation data.
• run3: This submission used an ensemble of our top 24 runs, all trained without validation data.
• run4: This submission used an ensemble of the runs from all submissions, 41 runs in total.

Table 2
Official competition results. The relative rank corresponds to the ranking of the systems against our best system, as the competition organizer did not provide an official ranking, with only the best and median systems available for comparison.

System          Precision  Recall  F1 Score  Relative Rank
run0-top5-full  81.10      81.81   81.45     2
run1-all-full   81.55      82.43   81.99     1
run2-best-val   74.80      75.42   75.11     5
run3-all-val    75.44      75.88   75.66     4
run4-all        79.81      78.27   79.03     3
Best            89.19      82.43   81.99     -
Median          -          -       72.29     -

Looking at the results, we obtained the best submission in the competition. This was achieved by using a large ensemble with multiple runs that included validation data for training. It is not surprising that run1 outperformed run0. From previous work, we have observed that larger ensembles tend to perform better. We also have no assurance that the top 5 models on validation would achieve the best results on the test set.

Next, we note that the models trained using the validation data greatly outperformed those that were not. While their F1-score was roughly 6 percentage points higher, the base system still performed relatively well, considering it was not directly trained for the cardiology domain. Similarly to the comparison of run0 and run1, we see a slight performance increase of run3 over run2 with the increase in the number of models. However, the performance gains were not as significant.

Finally, given these results, our final submission, run4, performed as expected, somewhere between our best and worst models, given the large performance gap between the two training-data settings. Overall, all our submissions achieved F1 scores above the median, with our best submission obtaining the top F1 score and recall in the competition.

5. Discussion
In this work, we proposed two research questions: (1) investigating the impact of training with validation data for domain adaptation, and (2) investigating the robustness of the Multi-Head CRF model.
With regard to the first research question, we can look to our submission results. We used two different approaches to measure the performance difference when using the validation data in training. The results indicate that the models gain significantly in performance when using the validation data. This was an expected outcome; however, the models that did not have access to this data still obtained competitive performance, scoring above the median submission (75.11 and 75.66 F1 versus a median of 72.29).

Turning to the second research question, we can assess the robustness of the model by looking at both its overall performance in the competition and our validation results. Our model performed well overall, obtaining the top performance within the competition. In our validation experiments, we obtained results comparable to those of the original work, indicating the overall robustness and reliability of the model and showcasing its ability to perform well not only on multi-class NER but also on single-class NER.

6. Conclusion
This study aimed to address two research questions using the MultiCardioNER challenge as a basis. The first question, related to the task, was whether an NER model trained on a specific domain (in this case, diseases) could generalize to a sub-domain (in this case, cardiology data). The conclusions we drew from our work indicate that while the system does generalize well, better performance was achieved when using task-specific data.

Our second research question was aligned with our previous work and focused on testing the robustness of our Multi-Head CRF architecture [10]. In this work, we demonstrated that the architecture is indeed robust, achieving top performance in a competition that involves only a single entity class, as opposed to previous work on multi-class NER. We further observed the robustness of the model within our validation tests, where we drew many similar conclusions to those of the original work.
Overall, we believe that this Multi-Head CRF architecture stands as a solid basis for future work in NER.

Acknowledgments
This work was funded by the Foundation for Science and Technology (FCT) in the context of the project doi.org/10.54499/UIDB/00127/2020. Tiago Almeida is funded by the grant doi.org/10.54499/2020.05784.BD. Richard A. A. Jonker is funded by the grant PRT/BD/154792/2023. This work was funded by FCT I.P. under the project Advanced Computing Project 2023.10766.CPCA.A0, platform Vision at University of Évora.

References
[1] T. Almeida, R. Antunes, J. F. Silva, J. R. Almeida, S. Matos, Chemical detection and indexing in PubMed full text articles using deep learning and rule-based methods, CDR 1500 (2021) 15943.
[2] T. Almeida, R. Antunes, J. F. Silva, J. R. Almeida, S. Matos, Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics, Database 2022 (2022) baac047. URL: https://doi.org/10.1093/database/baac047. doi:10.1093/database/baac047.
[3] T. Almeida, R. A. A. Jonker, D. da Silva, J. Almeida, S. Matos, BIT.UA at Biocreative VIII track 1: A joint model for relation classification and novelty detection, 2023. URL: https://doi.org/10.5281/zenodo.10117952. doi:10.5281/zenodo.10117952.
[4] T. Almeida, R. A. A. Jonker, R. Antunes, J. R. Almeida, S. Matos, Towards Discovery: An End-to-End System for Uncovering Novel Biomedical Relations, Database (to appear) 2024 (2024).
[5] T. Wei, J. Qi, S. He, S. Sun, Masked conditional random fields for sequence labeling, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2024–2035. URL: https://aclanthology.org/2021.naacl-main.163. doi:10.18653/v1/2021.naacl-main.163.
[6] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: S. Stevenson, X. Carreras (Eds.), Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), Association for Computational Linguistics, Boulder, Colorado, 2009, pp. 147–155. URL: https://aclanthology.org/W09-1119.
[7] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: K. Knight, A. Nenkova, O. Rambow (Eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 260–270. URL: https://aclanthology.org/N16-1030. doi:10.18653/v1/N16-1030.
[8] T. Almeida, R. A. Jonker, R. Poudel, J. M. Silva, S. Matos, BIT.UA at MedProcNER: discovering medical procedures in Spanish using transformer models with mcrf and augmentation, Working Notes of CLEF (2023).
[9] R. A. A. Jonker, T. Almeida, S. Matos, J. Almeida, Team BIT.UA @ BC8 SympTEMIST Track: A Two-Step Pipeline for Discovering and Normalizing Clinical Symptoms in Spanish, 2023. URL: https://doi.org/10.5281/zenodo.10103360. doi:10.5281/zenodo.10103360.
[10] R. A. A. Jonker, T. Almeida, R. Antunes, J. R. Almeida, S. Matos, Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes, Database (to appear) 2024 (2024).
[11] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[12] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz, G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[13] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Krallinger, MultiCardioNER Corpus: Multilingual Adaptation of Clinical NER Systems to the Cardiology Domain, 2024. URL: https://doi.org/10.5281/zenodo.11368861. doi:10.5281/zenodo.11368861.
[14] World Health Organization, Cardiovascular diseases (CVDs), 2021. URL: https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1, accessed: 08-06-2024.
[15] Z. Li, S. Zhang, Y. Song, J. Park, Extrinsic factors affecting the accuracy of biomedical NER, 2023. arXiv:2305.18152.
[16] D. Demner-Fushman, W. W. Chapman, C. J. McDonald, What can natural language processing do for clinical decision support?, J. Biomed. Inform. 42 (2009) 760–772.
[17] E. French, B. T. McInnes, An overview of biomedical entity linking throughout the years, Journal of biomedical informatics 137 (2023) 104252.
[18] A. Gonzalez-Agirre, M. Marimon, A. Intxaurrondo, O. Rabal, M. Villegas, M. Krallinger, PharmaCoNER: Pharmacological substances, compounds and proteins named entity recognition track, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 2019, pp. 1–10.
[19] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources, in: CLEF (Working Notes), 2022, pp. 179–203.
[20] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, corpus, guidelines, methods and results, IberLEF@SEPLN (2020) 303–323.
[21] S. Lima-López, E. Farré-Maduell, L. Gascó, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MedProcNER task on medical procedure detection and entity linking at BioASQ 2023, Working Notes of CLEF (2023).
[22] S. Lima-López, E. Farré-Maduell, L. Gasco-Sánchez, J. Rodríguez-Miret, M. Krallinger, Overview of SympTEMIST at BioCreative VIII: corpus, guidelines and evaluation of systems for the detection and normalization of symptoms, signs and findings from text, in: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 2023.
[23] S. Vassileva, G. Grazhdanski, S. Boytcheva, I. Koychev, Fusion @ BioASQ MedProcNER: Transformer-based approach for procedure recognition and linking in Spanish clinical text, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 190–205. URL: https://ceur-ws.org/Vol-3497/paper-017.pdf.
[24] E. Zotova, A. G. Pablos, M. Cuadros, G. Rigau, VICOMTECH at MedProcNER 2023: Transformers-based sequence-labelling and cross-encoding for entity detection and normalisation in Spanish clinical texts, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 206–218. URL: https://ceur-ws.org/Vol-3497/paper-018.pdf.
[25] G. Grazhdanski, S. Vassileva, I. Koychev, S. Boytcheva, Team Fusion@SU @ BC8 SympTEMIST track: Transformer-based Approach for Symptom Recognition and Linking, 2023. URL: https://doi.org/10.5281/zenodo.10103750. doi:10.5281/zenodo.10103750.
[26] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[28] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[29] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained biomedical language models for clinical NLP in Spanish, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 21st Workshop on Biomedical Language Processing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL: https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19.
[30] L. Campillos-Llanos, A. Valverde-Mateos, A. Capllonch-Carrión, A. Moreno-Sandoval, A clinical trials corpus annotated with UMLS© entities to enhance the access to evidence-based medicine, BMC Medical Informatics and Decision Making 21 (2021) 1–19.
[31] D. Ye, Y. Lin, P. Li, M. Sun, Packed levitated marker for entity and relation extraction, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4904–4917. URL: https://aclanthology.org/2022.acl-long.337. doi:10.18653/v1/2022.acl-long.337.
[32] L. Luo, C.-H. Wei, P.-T. Lai, R. Leaman, Q. Chen, Z. Lu, AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning, Bioinformatics 39 (2023) btad310.
[33] M. Sänger, S. Garda, X. D. Wang, L. Weber-Genzel, P. Droop, B. Fuchs, A. Akbik, U. Leser, HunFlair2 in a cross-corpus evaluation of named entity recognition and normalization tools, arXiv preprint arXiv:2402.12372 (2024).