Team NOVA LINCS @ BIOASQ12 MultiCardioNER Track: Entity Recognition with Additional Entity Types

Rodrigo Gonçalves¹,*, André Lamúrias¹
¹ NOVA LINCS, NOVA School of Science and Technology, Lisbon, Portugal
* Corresponding author. rmg.goncalves@campus.fct.unl.pt (R. Gonçalves); a.lamurias@fct.unl.pt (A. Lamúrias)
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France

Abstract
This paper presents the contribution of the NOVA LINCS team to the MultiCardioNER task at BioASQ12. BioASQ is a long-running challenge that focuses mostly on biomedical semantic indexing and question answering (QA). The MultiCardioNER task focuses specifically on the multilingual adaptation of clinical NER systems to the cardiology domain. We leverage a state-of-the-art Spanish pre-trained model to perform NER on the DisTEMIST and DrugTEMIST datasets provided by this task, aided by the additional clinical case corpus, CardioCCC, for validation. Experiments were carried out both with only the entity type to which each of those datasets refers and with the additional entity types from other datasets annotated over the same documents, in order to determine whether any advantage could be obtained by leveraging knowledge of the additional entities. The models trained on the combined dataset achieved a very slight, non-significant boost in F1-score compared to their single-entity counterparts, while all models suffered from low recall due to pre- and post-processing errors. However, one run achieved the highest precision of the task (0.9242). Code to reproduce our submission is available at https://github.com/Rodrigo1771/BioASQ12-MultiCardioNER-NOVALINCS.

Keywords
Cardiology, Clinical Cases, Named Entity Recognition, Language Models, Transfer Learning

1. Introduction
The MultiCardioNER [1] task, part of the BioASQ 2024 challenge [2], focuses on the automatic recognition of two key clinical concept types: diseases and medications. In this task, the extraction of these entity types is performed specifically over cardiology clinical case documents. This represents a worthwhile effort because, with cardiovascular diseases being the world's leading cause of death, it is imperative that better automatic semantic annotation resources and systems are developed for high-impact clinical domains such as cardiology. Furthermore, this task also represents an effort to aid the development of systems that can perform in multiple languages, not just English. This is much needed, as the prevalence and impact of cardiovascular diseases are global, requiring robust and adaptable tools that can cater to diverse linguistic and clinical environments.

The datasets provided by the two MultiCardioNER subtasks, DisTEMIST and DrugTEMIST, are both annotated over the same set of documents, the SPACCC corpus [3]. Additionally, this corpus is also used by MedProcNER [4] and SympTEMIST [5]. In practice, this means that the SPACCC corpus is annotated with four different types of entities: diseases for DisTEMIST, medications for DrugTEMIST, medical procedures for MedProcNER, and symptoms for SympTEMIST. We therefore focused on training two models per subtask: one with only the entity type of that subtask, and another with all four entity types. That way, we could evaluate the benefit, or lack thereof, of having annotations for additional entity types. For the second subtask, we considered only the documents in Spanish.

2. Data
This shared task uses two main training datasets, as previously mentioned: the DisTEMIST and DrugTEMIST datasets.
Each of these corpora is composed of the same 1000 documents, which belong to the Spanish Clinical Case Corpus (SPACCC), a manually classified collection of clinical case reports written in Spanish. While DisTEMIST is focused on extracting disease mentions, thus having ENFERMEDAD as its sole label, DrugTEMIST focuses on drug and medication extraction, with FARMACO as its entity type. As previously mentioned, two former subtasks, MedProcNER and SympTEMIST, also make use of the SPACCC corpus to extract medical procedures and symptoms, having PROCEDIMIENTO and SINTOMA as their entity types, respectively.

Additionally, the organization also provided a separate cardiology clinical case reports dataset, CardioCCC, to be used for the domain adaptation part of the task. It contains a total of 508 documents, split into 258 documents initially intended for validation and 250 for testing. As expected, these documents only include annotations for disease and medication mentions, the entity types considered for this competition. Nevertheless, we used this dataset for both training and validation of the model. Statistics on all of these datasets are shown in Table 1.

Table 1: Statistics of the unprocessed datasets.

Dataset                 Number of examples/sentences    Number of annotated entities
DisTEMIST               16160                           10664
DrugTEMIST              16160                           2778
SympTEMIST              16160                           12196
MedProcNER              16160                           14684
CardioCCC-DisTEMIST     17869                           10348
CardioCCC-DrugTEMIST    17869                           2510

After combining the four SPACCC-based datasets (all of the above apart from CardioCCC) into a single dataset that includes the annotations of all four entity types, and given the task objective of adapting clinical models specifically to the cardiology domain, we joined this combined dataset with CardioCCC and split the result so that 80% of the examples (sentences) were used for training and 20% for validation (a minimal sketch of this step is shown after Table 2). This way, cardiology clinical case reports are included in the training of the model. Table 2 shows statistics for every training and validation set. Note that the three training sets contain exactly the same sentences, as do the two validation sets; the only difference between them is which entity types are annotated in those examples.

Table 2: Statistics of the processed datasets (the final training and validation files).

Dataset                 Number of examples/sentences    Number of annotated entities
CombinedDataset-Train   27223                           39805
DisTEMIST-Train         27223                           16675
DrugTEMIST-Train        27223                           4199
DisTEMIST-Dev           6806                            4262
DrugTEMIST-Dev          6806                            1088
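The sketch below illustrates this merging and splitting step under simplifying assumptions: sentence-aligned CoNLL-style files (one token and BIO tag per line, blank lines between sentences) with hypothetical names such as distemist.conll, and a "first non-O tag wins" policy for the rare overlapping annotations. These details are illustrative and not taken from our actual implementation.

```python
import random

def read_conll(path):
    """Read a CoNLL-style file: one 'token tag' pair per line, blank lines separate sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
            else:
                parts = line.split()
                current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

def merge_tags(per_dataset_tags):
    """For each token, keep the first non-O tag found across the four annotation layers."""
    return [next((t for t in tags if t != "O"), "O") for tags in zip(*per_dataset_tags)]

# Hypothetical sentence-aligned files, one per entity type (same sentences, different tags).
paths = ["distemist.conll", "drugtemist.conll", "symptemist.conll", "medprocner.conll"]
datasets = [read_conll(p) for p in paths]

combined = []
for versions in zip(*datasets):  # the i-th sentence of every file
    tokens = [tok for tok, _ in versions[0]]
    tags = merge_tags([[tag for _, tag in sent] for sent in versions])
    combined.append(list(zip(tokens, tags)))

# 80/20 train/validation split over sentences (CardioCCC sentences would be appended beforehand).
random.seed(0)
random.shuffle(combined)
cut = int(0.8 * len(combined))
train_set, dev_set = combined[:cut], combined[cut:]
```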
3. Methodology and Experiments
For this task, we leveraged the capabilities of the pre-trained model bsc-bio-ehr-es [6] (https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es), a RoBERTa-based model [7] trained on a biomedical-clinical corpus in Spanish collected from several sources and totalling more than 1 billion tokens. This model has previously been fine-tuned on similar shared tasks such as CANTEMIST [8] and PharmaCoNER [9], achieving remarkable results.

We conducted two experiments. The first compares the model fine-tuned on the combined dataset with the four entity types against the model fine-tuned on just the DisTEMIST dataset. The second also uses the model fine-tuned on the combined dataset, but this time compares it against the model fine-tuned on just the DrugTEMIST dataset. This brings the total number of submitted runs to four:

• Track 1 (DisTEMIST):
  – 1_bsc-bio-ehr-es_distemist_4: bsc-bio-ehr-es trained on the combined dataset with the 4 entity types (ENFERMEDAD, FARMACO, PROCEDIMIENTO, SINTOMA), and validated on the validation set for DisTEMIST (only with ENFERMEDAD).
  – 2_bsc-bio-ehr-es_distemist_1: trained and validated on the training and validation sets, respectively, for DisTEMIST (only with ENFERMEDAD).
• Track 2 (DrugTEMIST), Spanish subset:
  – 3_bsc-bio-ehr-es_drugtemist_4: bsc-bio-ehr-es trained on the combined dataset with the 4 entity types (ENFERMEDAD, FARMACO, PROCEDIMIENTO, SINTOMA), and validated on the validation set for DrugTEMIST (only with FARMACO).
  – 4_bsc-bio-ehr-es_drugtemist_1: trained and validated on the training and validation sets, respectively, for DrugTEMIST (only with FARMACO).

During training, the model was evaluated on the validation set after each epoch and the best checkpoint was kept. The hyperparameters for all four runs were the following (a fine-tuning sketch using these settings is shown after this list):

• Learning rate: 5e-05
• Total train batch size: 16
• Epochs: 10
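To illustrate this setup, the following is a minimal fine-tuning sketch using the Hugging Face transformers Trainer with the hyperparameters listed above. The label ordering, output directory, seqeval-based entity-level F1 for checkpoint selection, and the train_dataset/eval_dataset variables (tokenized datasets with word-aligned BIO labels, whose construction is omitted) are illustrative assumptions rather than our exact training script.

```python
import numpy as np
from seqeval.metrics import f1_score
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)

# BIO label set for the combined four-entity-type configuration (assumed ordering).
labels = ["O",
          "B-ENFERMEDAD", "I-ENFERMEDAD",
          "B-FARMACO", "I-FARMACO",
          "B-PROCEDIMIENTO", "I-PROCEDIMIENTO",
          "B-SINTOMA", "I-SINTOMA"]

model_name = "PlanTL-GOB-ES/bsc-bio-ehr-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

def compute_metrics(eval_pred):
    """Convert token logits to tag sequences (skipping the -100 padding label) and score with seqeval."""
    logits, label_ids = eval_pred
    preds = np.argmax(logits, axis=-1)
    true_seqs, pred_seqs = [], []
    for pred_row, label_row in zip(preds, label_ids):
        true_seq = [labels[l] for l in label_row if l != -100]
        pred_seq = [labels[p] for p, l in zip(pred_row, label_row) if l != -100]
        true_seqs.append(true_seq)
        pred_seqs.append(pred_seq)
    return {"f1": f1_score(true_seqs, pred_seqs)}

args = TrainingArguments(
    output_dir="out",
    learning_rate=5e-5,               # as reported above
    per_device_train_batch_size=16,   # total train batch size of 16 (single device assumed)
    num_train_epochs=10,
    evaluation_strategy="epoch",      # evaluate on the validation set after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the best checkpoint
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # assumed: tokenized sentences with aligned BIO labels
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```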
4. Results and Discussion
The results for each run on our validation set, both during the development of our approach and after submission (the latter obtained with the official evaluation library, https://github.com/nlp4bia-bsc/multicardioner_evaluation_library, after it was released), are presented in Tables 3 and 4, respectively. Additionally, the official gold standard test set results are shown in Table 5. The naming convention for each run is {run_id}_{model_name}_{dataset}_{number_of_entity_types_in_training}.

Table 3: Internal validation set results (not obtained with the official evaluation library; calculated over token labels).

Run                              Precision    Recall    F1-Score
1_bsc-bio-ehr-es_distemist_4     0.7795       0.8278    0.8029
2_bsc-bio-ehr-es_distemist_1     0.8067       0.8067    0.8067
3_bsc-bio-ehr-es_drugtemist_4    0.9348       0.9485    0.9416
4_bsc-bio-ehr-es_drugtemist_1    0.9430       0.9586    0.9508

Table 4: Official validation set results (obtained with the official evaluation library; calculated over entity labels).

Run                              Precision    Recall    F1-Score
1_bsc-bio-ehr-es_distemist_4     0.7436       0.4308    0.5455
2_bsc-bio-ehr-es_distemist_1     0.7769       0.4305    0.5540
3_bsc-bio-ehr-es_drugtemist_4    0.8762       0.5919    0.7065
4_bsc-bio-ehr-es_drugtemist_1    0.8812       0.6002    0.7141

Table 5: Official test set results.

Run                              Precision    Recall    F1-Score
1_bsc-bio-ehr-es_distemist_4     0.8018       0.3525    0.4897
2_bsc-bio-ehr-es_distemist_1     0.8183       0.3398    0.4802
3_bsc-bio-ehr-es_drugtemist_4    0.9242       0.4965    0.6460
4_bsc-bio-ehr-es_drugtemist_1    0.9076       0.4919    0.6380

First, it is important to point out how we obtained our validation set results (Table 3) and the final submitted predictions for the runs whose model was trained on the combined dataset (runs 1 and 3), as those models naturally predict more entity types than the one relevant for the track. For those runs, after obtaining the predictions (whether the final test set predictions or the predictions used to calculate the validation results), any prediction whose entity type did not match the entity type of the track was ignored, so that only predictions of the track's entity type were considered. For example, in run 1, the predictions of the entity types FARMACO, PROCEDIMIENTO and SINTOMA were discarded and only ENFERMEDAD was included in the metrics reported in Table 3 and in the submitted predictions. Likewise, in run 3, the predictions of the entity types ENFERMEDAD, PROCEDIMIENTO and SINTOMA were discarded and only FARMACO was considered.
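A minimal sketch of this filtering step is shown below, assuming predictions stored in a TSV file with a "label" column holding the entity type; the column name and file layout are assumptions for illustration, not the exact task format.

```python
import csv

TRACK_ENTITY = {"track1": "ENFERMEDAD", "track2": "FARMACO"}

def filter_predictions(in_path, out_path, entity_type, label_column="label"):
    """Keep only the predictions whose entity type matches the track's entity type."""
    with open(in_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            if row[label_column] == entity_type:
                writer.writerow(row)

# e.g. run 1 (combined model, DisTEMIST track): keep ENFERMEDAD, drop the other three types.
filter_predictions("run1_raw_predictions.tsv", "run1_submission.tsv", TRACK_ENTITY["track1"])
```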
The significant difference between the recall scores of our internal evaluation and of the official evaluation is evident. In a situation of high precision and low recall, the system retrieves only a small portion of the relevant entities (low recall), but when it does retrieve one, it identifies it correctly the vast majority of the time (high precision). Our first hypothesis to explain this was our use of the BIO tagging scheme. In this scheme, a sequence containing an entity of type ENFERMEDAD such as "hipertensión pulmonar severa" should be classified as "B-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD". Our reasoning was that the model could not generalize to entities that were not present in the training data, especially since we picked the checkpoint with the best token classification score and observed some overfitting when comparing the training and validation losses, shown in Figure 1. This could lead to the model under- or over-classifying entities:

• Under-classification of "hipertensión pulmonar severa": "B-ENFERMEDAD I-ENFERMEDAD O" (missing "severa").
• Over-classification of "estenosis píloro - duodenal de carácter extrínseco": should be correctly classified as "B-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD O O O" but is over-classified as "B-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD" (including the additional information "de carácter extrínseco").

This under- and over-classification could lead to low recall, where the correct entities are not extracted and the number of false negatives grows. However, it would also lead to low precision, as both under- and over-classification produce incomplete entities, which in turn means more false positives, and our models achieved good and even great precision. To achieve high precision and poor recall, our models would have to extract a low number of entities, so as not to increase the number of false positives, and they do, as demonstrated by Tables 6 and 7.

[Figure 1: Training (blue) and validation (green) losses for each run. The red dashed line indicates the epoch in which the model achieved the best F1-score during the training and validation phase, and therefore represents the final chosen checkpoint.]

Table 6: Number of predictions (extracted entities) and gold standard annotations (entities that should be extracted) per run, when evaluated on the validation set with the official evaluation library.

Run                              Number of Predictions    Number of Gold Standard Annotations
1_bsc-bio-ehr-es_distemist_4     2469                     4262
2_bsc-bio-ehr-es_distemist_1     2362                     4262
3_bsc-bio-ehr-es_drugtemist_4    735                      1088
4_bsc-bio-ehr-es_drugtemist_1    741                      1088

Table 7: Number of predictions (extracted entities) and gold standard annotations (entities that should be extracted) per run, when evaluated on the test set.

Run                              Number of Predictions    Number of Gold Standard Annotations
1_bsc-bio-ehr-es_distemist_4     3466                     7884
2_bsc-bio-ehr-es_distemist_1     3274                     7884
3_bsc-bio-ehr-es_drugtemist_4    923                      1718
4_bsc-bio-ehr-es_drugtemist_1    931                      1718

Then, through further error analysis, we realized that this might be happening due to our data parsing approach. Because of how we parsed the datasets (the same approach used by PlanTL-GOB-ES when parsing CANTEMIST (https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es-cantemist) or PharmaCoNER (https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es-pharmaconer) for NER in order to fine-tune the same model that we used [6]), the model was not only trained but also locally evaluated on individual sentences, not entire documents. This differs from the way the official evaluation library evaluates runs, as models are expected to predict over the full content of each test document, making the spans of each extracted entity relative to the whole document rather than to individual sentences.

This conflict of approaches might explain the discrepancy in the quality of the results, particularly the recall, between our validation set with our evaluation method and the official evaluation library, both on our validation set and on the test set. As just mentioned, the examples fed to the model during training were individual sentences, and therefore much shorter (average length of 21 tokens, 124 characters) than the ones fed to the model when evaluating it on the test set (entire documents, average length of 1013 tokens, 5754 characters). The small size of the training examples might thus have induced the model to only extract entities up to a certain span when fed longer examples (i.e. documents). Furthermore, and possibly more relevant, the maximum input sequence length of this model is 512 tokens, which means that, when obtaining the test set predictions with entire documents as input, the tokens beyond this limit were not classified by our models.
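A possible mitigation, which we plan to explore, is to predict sentence by sentence and shift the resulting character spans back into document coordinates. The following is a minimal sketch of that idea using the transformers token-classification pipeline; the checkpoint path, the naive sentence splitter, and the output format are simplifying assumptions, not our submitted pipeline.

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint directory from the training step above.
ner = pipeline("token-classification",
               model="out/best-checkpoint",
               aggregation_strategy="simple")  # merges B-/I- tokens into entity spans

def split_sentences(document):
    """Naive sentence splitter on '. '; returns (sentence, start_offset) pairs.
    A proper splitter (e.g. spaCy) would be used in practice."""
    sentences, start = [], 0
    for chunk in document.split(". "):
        sentences.append((chunk, start))
        start += len(chunk) + 2  # account for the removed '. '
    return sentences

def predict_document(document):
    """Run NER per sentence and map spans back to document-level character offsets."""
    entities = []
    for sentence, offset in split_sentences(document):
        for ent in ner(sentence):
            entities.append({
                "label": ent["entity_group"],
                "start": ent["start"] + offset,  # document-level span
                "end": ent["end"] + offset,
                "text": ent["word"],
            })
    return entities
```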
These two reasons (the shorter training examples and the 512-token input limit) explain the low number of entities extracted, as the majority of the misses occur near the end of the documents. On the other hand, the results obtained with our evaluation method during training did not show this low recall because there the model was also evaluated on individual sentences. The information presented in Tables 8 and 9 supports these claims, as the average start span of (true and false) positives, i.e. retrieved entities, is much lower than the average start span of false negatives, i.e. relevant but not retrieved entities.

Table 8: Average start span of positive entities and of false negative entities, by run, when evaluated on the validation set with the official evaluation library.

Run                              Avg. Start Span Pos. Entities    Avg. Start Span False Neg. Entities
1_bsc-bio-ehr-es_distemist_4     1015                             3908
2_bsc-bio-ehr-es_distemist_1     996                              3910
3_bsc-bio-ehr-es_drugtemist_4    1247                             4807
4_bsc-bio-ehr-es_drugtemist_1    1250                             4889

Table 9: Average start span of positive entities and of false negative entities, by run, when evaluated on the test set.

Run                              Avg. Start Span Pos. Entities    Avg. Start Span False Neg. Entities
1_bsc-bio-ehr-es_distemist_4     1028                             4767
2_bsc-bio-ehr-es_distemist_1     990                              4706
3_bsc-bio-ehr-es_drugtemist_4    1084                             5577
4_bsc-bio-ehr-es_drugtemist_1    1088                             5538

We believe this is the reason why our approach showed consistent results across the board on our internal validation but failed to replicate them on the official testing: essentially, the model's input length when evaluating on a document basis is smaller than many documents, with the additional nuance that the difference in length between training and test examples (the former much shorter than the latter) may have induced the model to only predict entities up to a certain span. We plan to re-classify the dev and test sets using sentences instead of full documents, in order to verify whether recall would improve and become closer to our internal evaluation.

Nevertheless, focusing on the objective of the experiments themselves, we observe that training with the combined dataset did not yield any substantial improvement in F1-score. In fact, both tasks showed a barely significant disadvantage on the validation set: for task 1, a drop of 0.38pp with our evaluation method and of 0.85pp with the official evaluation library, and for task 2, a drop of 0.92pp with our evaluation method and of 0.76pp with the official library. On the test set, while the precision of all runs is quite good, even achieving the best precision score of any team on task 2 with run 3 (0.9242), the recall of all runs is very low, resulting in poor F1-scores: 20.64pp and 16.62pp below the mean for tasks 1 and 2, respectively, for our best run in each task. Furthermore, regarding the main objective of the experiments, we again observe no substantial advantage from training with the combined dataset, which increased the F1-score by only 0.95pp for task 1 and 0.80pp for task 2.

5. Conclusion and Future Work
We employed the state-of-the-art transformer model bsc-bio-ehr-es and explored the hypothesis that training with additional entity types within the biomedical domain would be beneficial for extracting a specific type; the results indicate that it does not make a significant difference. Furthermore, the scores obtained on the test set show that this model can achieve high-precision results with minimal fine-tuning. However, the model's poor recall on the test set is noteworthy and likely caused by the way in which the data is parsed for training and evaluation. In the future, we intend to perform hyperparameter optimization on this model to improve the scores obtained for the MultiCardioNER task (on both the DisTEMIST and DrugTEMIST subtasks), and to apply English and Italian biomedical pre-trained language models to the corresponding subsets of the DrugTEMIST dataset.

6. Acknowledgements
This work is supported by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020) and ref. UIDP/04516/2020 (https://doi.org/10.54499/UIDP/04516/2020) with the financial support of FCT.IP.

References
[1] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz, G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[2] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[3] A. Intxaurrondo, M. Krallinger, SPACCC, 2018. URL: https://zenodo.org/doi/10.5281/zenodo.1563762. doi:10.5281/ZENODO.1563762.
[4] S. Lima-López, E. Farré-Maduell, L. Gascó, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MedProcNER task on medical procedure detection and entity linking at BioASQ 2023, Working Notes of CLEF (2023).
[5] S. Lima-López, E. Farré-Maduell, L. Gasco-Sánchez, J. Rodríguez-Miret, M. Krallinger, Overview of SympTEMIST at BioCreative VIII: corpus, guidelines and evaluation of systems for the detection and normalization of symptoms, signs and findings from text, 2023. URL: https://zenodo.org/doi/10.5281/zenodo.10104547. doi:10.5281/ZENODO.10104547.
[6] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained Biomedical Language Models for Clinical NLP in Spanish, in: Proceedings of the 21st Workshop on Biomedical Language Processing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL: https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19.
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692 [cs].
[8] A. Miranda-Escalada, E. Farré-Maduell, M. Krallinger, Named Entity Recognition, Concept Normalization and Clinical Coding: Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results, 2020. doi:10.5281/zenodo.3773228.
[9] A. G. Agirre, M. Marimon, A. Intxaurrondo, O. Rabal, M. Villegas, M. Krallinger, PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1–10. URL: https://www.aclweb.org/anthology/D19-5701. doi:10.18653/v1/D19-5701.