Overview of BioNNE Task on Biomedical Nested Named Entity Recognition at BioASQ 2024

Vera Davydova1, Natalia Loukachevitch2,*, and Elena Tutubalina3,4,*
1 HSE University, Russia
2 Lomonosov Moscow State University, Russia
3 Kazan Federal University, Russia
4 Artificial Intelligence Research Institute, Russia

Abstract
Recognition of nested named entities, which may contain each other, can enhance the coverage of found named entities. This capability is particularly useful for tasks such as relation extraction, entity linking, and knowledge graph population. This paper presents the organizers' report on the BioNNE competition, which focused on nested named entity recognition systems in medical texts for both English and Russian. The competition includes three subtasks: Bilingual, English-oriented, and Russian-oriented. Training and validation sets were derived from a subset of the NEREL-BIO dataset, a corpus of PubMed abstracts. For the BioNNE evaluation, eight of the most common medical entity types were selected from the original dataset. Additionally, a novel test set was developed for the shared task, consisting of 154 abstracts in both English and Russian. Held within the framework of the BioASQ workshop, the competition aims to advance research in nested NER within the biomedical domain.

Keywords
BioNLP, Nested Named Entity Recognition, Biomedical Text Mining, Domain-specific Language Models

1. Introduction

Nested Named Entity Recognition (NNER) extends the capabilities of the standard Named Entity Recognition (NER) task by addressing the challenge of overlapping and nested entities within texts. While more complex, it provides a richer and more nuanced understanding of the entities in a document, making it invaluable for domains requiring precise and hierarchical entity extraction. One such domain is biomedical scientific texts, where nested structures are common.
Identification of named entities in scientific texts is essential for extracting valuable information and advancing biomedical research. NNER involves recognizing entities that are nested within other entities, which brings additional complexity in handling the hierarchical and overlapping nature of entities within the biomedical field. Traditional NER systems often fail to adequately capture this nested structure, leading to a loss of crucial information. Recent studies [1, 2, 3] leverage sequence-to-sequence models and reinforcement learning to handle these nested structures more efficiently. While most studies focus on flat (non-nested) NER tasks, a few general-domain datasets for nested entities exist [4, 5, 6]. The GENIA corpus [7], a popular NER dataset within the biomedical domain, includes over 100K annotations across 47 entity types, yet only 17% of the entities in the GENIA corpus are nested within another entity [8]. A recent dataset, NEREL [5], is annotated with over 56K named entities of 29 types, while its biomedical extension, NEREL-BIO [9], is annotated with over 70K entities of 37 types.

This paper provides a comprehensive overview of the Biomedical Nested Named Entity Recognition (BioNNE) task, predominantly focusing on the challenges and advancements in nested NER across annotated PubMed abstracts. The shared task includes both English and Russian biomedical texts and was part of the BioASQ Workshop 2024 [10]. The BioASQ shared tasks aim to advance systems that exploit diverse and extensive online information to meet the information needs of biomedical scientists.

CLEF 2024: Conference and Labs of the Evaluation Forum, 9-12 September 2024, Grenoble, France
* Corresponding author.
veranchos@gmail.com (V. Davydova); louk_nat@mail.ru (N. Loukachevitch); tutubalinaev@gmail.com (E. Tutubalina)
ORCID: 0009-0003-0845-2386 (V. Davydova); 0000-0002-1883-4121 (N. Loukachevitch); 0000-0001-7936-0284 (E. Tutubalina)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Sample of annotations of nested entities in the Russian abstract (PMID: 27456564).
Figure 2: Sample of annotations of nested entities in the English abstract (PMID: 27456564).

To this end, we set up both monolingual and bilingual subtasks based on the NEREL-BIO dataset. This paper (i) introduces the corpus (Section 3) and (ii) reports the results of the BioNNE Shared Task (Section 4). All the materials can be found on the BioNNE GitHub page1 and on the Codalab competition page2.

2. BioNNE Shared Task

The main task is to extract and classify biomedical nested named entities mentioned in unstructured medical abstract text. The main task consists of three tracks: (i) Bilingual, (ii) English-oriented, and (iii) Russian-oriented.

• Bilingual: participants in this track were required to train a single multilingual NER model using training data for both Russian and English. This model was then used to generate prediction files for each language's dataset. Predictions from any monolingual model were not allowed in this track.
• English-oriented: participants in this track were required to train a nested NER model for English scientific abstracts in the biomedical domain.
• Russian-oriented: participants in this track were required to train a nested NER model for Russian scientific abstracts in the biomedical domain.

The same predictions from track (i) were not allowed in tracks (ii) and (iii). Participants were allowed to train any model architecture on any publicly available data to achieve the best performance.

3. Dataset

Training and validation sets for the BioNNE competition were based on a subset of the NEREL-BIO dataset [9]. NEREL-BIO is a corpus of PubMed abstracts written in Russian and English.
It enhances the NEREL [5] dataset, originally designed for the general domain, by incorporating biomedical entity types. Biomedical entity types in NEREL-BIO are annotated according to UMLS definitions of relevant concepts.

1 https://github.com/nerel-ds/NEREL-BIO/tree/master/bio-nne/
2 https://codalab.lisn.upsaclay.fr/competitions/16464

Table 1
The list of entity types from the NEREL-BIO dataset that were used for the BioNNE task.

FINDING: does not have a direct correspondence in UMLS; it conveys the results of the scientific study described in the abstract. Examples: stabilize the glaucoma process, stopped the progression of keratoconus, longer hospital stay.
PHYS: biological function or process in the organism, including organism attributes (e.g., temperature) and excluding mental processes. Examples: blood flow, childbirth, uterine contraction, arterial pressure, body temperature.
INJURY_POISONING: damage inflicted on the body as the direct or indirect result of an external force, including poisoning. Examples: overdosing, burn, drowning, falling, childhood trauma.
DISO: any deviations from the normal state of the organism: diseases, symptoms, abnormalities of organs, excluding injuries or poisoning. Examples: appendicitis, haemorrhoids, magnesium deficiency, dysfunctions, Diabetes Mellitus, spine pain, complication, bone cyst, acute inflammation, deep vein thrombosis.
LABPROC: testing of body substances and other diagnostic procedures such as ultrasonography. Examples: biochemical analysis, polymerase chain reaction test, electrocardiogram, histological.
ANATOMY: comprises organs, body parts, cells and cell components, and body substances. Examples: eye, bone, brain, lower limb, oral cavity, blood, anterior lens capsule, right ventricle, lymphocyte.
CHEM: chemicals, including legal and illegal drugs and biological molecules. Examples: opioid, lipoprotein, iodine, adrenalin, memantine, methylprednisolone.
DEVICE: manufactured objects used for medical purposes. Examples: catheter, prosthesis, tonometer, tomograph, removable prosthesis, stent, metal stent.
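Entity annotations of this kind are distributed in the BRAT standoff format, where each text-bound annotation is a tab-separated line carrying an ID, a label with character offsets, and the covered text. A minimal reader sketch (the entity data below is illustrative, not taken from the corpus):

```python
# Minimal reader for BRAT standoff text-bound annotations ('T' lines), plus a
# helper that finds nested entity pairs. Illustrative sketch only.
from typing import List, NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(lines: List[str]) -> List[Entity]:
    """Parse text-bound annotation lines; skip relations, events, notes."""
    entities = []
    for line in lines:
        if not line.startswith("T"):
            continue
        ann_id, span, text = line.rstrip("\n").split("\t")
        # Discontinuous spans are ';'-separated; keep the outer extent here.
        label, offsets = span.split(" ", 1)
        start = int(offsets.split(" ")[0])
        end = int(offsets.split(";")[-1].split(" ")[-1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


def nested_pairs(entities: List[Entity]):
    """Yield (inner, outer) pairs where one entity span contains another."""
    for inner in entities:
        for outer in entities:
            if (inner is not outer
                    and outer.start <= inner.start and inner.end <= outer.end
                    and (inner.start, inner.end) != (outer.start, outer.end)):
                yield inner, outer


ann = [
    "T1\tDISO 0 20\tdeep vein thrombosis",
    "T2\tANATOMY 5 9\tvein",
]
ents = parse_ann(ann)
pairs = list(nested_pairs(ents))  # the ANATOMY span nests inside the DISO span
```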
All the abstracts are annotated in the BRAT format [11]. Figures 1 and 2 present parallel examples of nested named entities in NEREL-BIO for one abstract. Table 1 provides a comprehensive list of entity types, along with their explanations and examples. Compared to the original NEREL-BIO dataset, we fixed some annotation errors, merged the PRODUCT and DEVICE classes into a single DEVICE class, and selected the eight most common medical entity types from the dataset: FINDING, DISO, INJURY_POISONING, PHYS, DEVICE, LABPROC, ANATOMY, CHEM.

The resulting dataset comprises 662 annotated PubMed abstracts in Russian and 104 parallel abstracts in Russian and English. The 104 parallel abstracts were randomly split into training and validation sets for each subtask. A novel test set was developed for the shared task, consisting of 154 abstracts in English and Russian. To discourage manual annotation of the test set, 346 extra files were added for each language, resulting in 500 abstracts per target language. These supplementary files were excluded from the final evaluation.

Table 2 shows the number of entities represented in each part of the dataset. The observations can be summarized as follows. Entities labeled as DISO and ANATOMY are the most frequent across all sets, with DISO being particularly prevalent in both training and test sets. Categories such as DEVICE and INJURY_POISONING have far fewer entities than the others, highlighting potential areas where entity recognition might be more challenging due to data sparsity. The numbers of entities in the English test set (EN_test) and the Russian test set (RU_test) are relatively comparable, although slight variations are observed, particularly in the ANATOMY and DISO categories.

Table 2
Number of entities in each subset of the BioNNE 2024 dataset.
                         Refined NEREL-BIO                    Novel BioNNE test set
Entity Label          RU_train  EN_train  RU_dev  EN_dev      EN_test  RU_test
number of documents        716        54      50      50          154      154
number of entities
DISO                     11169      2043     914    1012         2995     2730
ANATOMY                   8346       911     869     897         2221     2157
PHYS                      5742       397     354     379         1569     1570
CHEM                      4741       579     527     575         1308     1203
FINDING                   4808       456     350     348         1365     1388
LABPROC                   1618       190     138     154          401      405
DEVICE                     734        20      33      28          168      165
INJURY_POISONING           643        90      25      20          129      120
Total                    37369      4686    3210    3413        10156     9738

4. Experiments

4.1. Evaluation Metric

F1 was used as the main evaluation metric. It is calculated according to the following formula:

    F1 = (1/n) * Σ_{c ∈ C} F1_c    (1)

where C = {"FINDING", "DISO", "INJURY_POISONING", "PHYS", "DEVICE", "LABPROC", "ANATOMY", "CHEM"}, n is the size of C, and F1_c is the F1 score for entity class c. The resulting metric is thus the macro-averaged F1 score over all entity classes.

4.2. Baseline Solution

We leveraged the BINDER model [12] as a baseline solution for the BioNNE task. BINDER utilizes two encoders to map text and entity types into a shared vector space. It efficiently reuses vector representations of entity types across text spans (and vice versa), leading to accelerated training and inference. Leveraging bi-encoder representations, BINDER introduces a contrastive learning framework for NER. This framework encourages similarity between the representations of entity types and their corresponding mentions while encouraging dissimilarity with non-entity text spans. Additionally, BINDER introduces a dynamic thresholding loss in contrastive learning. During testing, it employs candidate-specific dynamic thresholds to differentiate entity spans from non-entity ones.

For our backbone model, we utilized the multilingual BERGAMOT model3 [13], which is pre-trained on the Unified Medical Language System (UMLS) (version 2020AB) using a Graph Attention Network (GAT) encoder [14]. The best-performing results were achieved with the following hyperparameters: a learning rate of 3e-5 and 5 training epochs. AdamW was used as the optimizer [15].
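The metric in Eq. (1) amounts to computing an F1 score per entity class and averaging over the eight classes. A minimal sketch, under the assumptions that matching is exact on (start, end, label) spans and that a class with no gold or predicted mentions contributes an F1 of 0 (the official scorer may handle such edge cases differently):

```python
# Macro-averaged F1 over the eight BioNNE entity classes, as in Eq. (1).
# Spans are (start, end, label) triples; the data below is illustrative.
CLASSES = ["FINDING", "DISO", "INJURY_POISONING", "PHYS",
           "DEVICE", "LABPROC", "ANATOMY", "CHEM"]


def macro_f1(gold: set, pred: set, classes=CLASSES) -> float:
    total = 0.0
    for c in classes:
        g = {s for s in gold if s[2] == c}
        p = {s for s in pred if s[2] == c}
        tp = len(g & p)  # exact span-and-label matches
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1
    return total / len(classes)


gold = {(0, 20, "DISO"), (5, 9, "ANATOMY")}
pred = {(0, 20, "DISO"), (5, 9, "CHEM")}
score = macro_f1(gold, pred)  # DISO F1 = 1.0, the other seven classes contribute 0
```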
To address the cross-lingual transfer problem, the baseline model was trained and evaluated on several language configurations: RU, EN, and RU+EN. The highest scores were achieved on the combined RU and EN subsets (see Table 3). Additionally, models trained on one language showed comparatively high results when evaluated on the other, with training on Russian data proving more effective. This effectiveness can be attributed to the larger size of the Russian training data. Thus, combined with the results from the participants' models, we can conclude that cross-lingual techniques can be effectively applied to the NNER task.

Table 3
Results (F1 scores on the test sets) of bilingual and monolingual subtasks. The best result in each task is bolded.

Model                  Both (Track 1)  English (Track 2)  Russian (Track 3)
fulstock                       0.7044             0.6181             0.6981
hasin.rehana                   0.5053             0.5636             0.6007
wenxinzh                            -             0.3480                  -
vampire                             -             0.1203                  -
baseline EN_train              0.4061             0.5561             0.4041
baseline RU_train              0.4866             0.4041             0.5849
baseline RU+EN_train           0.6430             0.5655             0.6732

4.3. Official BioNNE Results

We observed strong interest in the shared task, with 26 teams registered on CodaLab. We received 155 submissions from 5 teams. One team opted to withdraw their results from the official publication. Performance for all tracks is summarized in Table 3. Below, we give an overview of the submitted approaches.

Team fulstock achieved the best results using the BINDER model. In contrast to the baseline architecture, it uses XLM-RoBERTa [16] as the backbone model. The participant experimented with different ways of describing entity types (prompts) for BINDER learning. The following prompt variants were used: keyword (the name of the entity type); the 2, 5, or 10 most frequent component words for the entity type in the training data; contextual prompt (an example of a sentence with the target entity); and lexical prompt (an example of a sentence in which the target entity is masked with the entity label) [17].
The model was trained for 64 epochs. The results show that the contextual Russian entity type description proved to be the best option for the bilingual track (achieving an F1 score of 0.704), and the same prompt type also worked best for the Russian-only track (F1 score of 0.698). In the English track, the prompt with the 10 most frequent English component words yielded an F1 score of 0.6181. These prompts thus secured first place in the BioNNE competition on all three tracks.

Team wenxinzh [18] used a combination of the pre-trained Mixtral model [19] and en_ner_bc5cdr_md, a spaCy NER model trained on the BC5CDR corpus [20]. They also adapted and customized rules based on the semantic types of UMLS (Unified Medical Language System). First, the system uses Mixtral and en_ner_bc5cdr_md to extract potential entities for each category from the text. Then, the system finds the UMLS semantic type associated with each entity to determine its final entity type. The team applied the system to the English subtask and achieved third place in the overall results, with an F1 score of 0.348.

Team hasin.rehana [21] processed the BioNNE dataset by splitting each abstract into sentences and mapping the corresponding annotations to these sentences. They then implemented the BIO-tagging scheme, a well-known encoding for named entity recognition: tokens are encoded as B-TYPE for the beginning of an entity, I-TYPE for subsequent tokens of the same entity, and O for tokens that do not belong to any entity class. Overall, six levels of BIO-tagging were applied to the BioNNE dataset. The core of the model for English NNER is the pre-trained PubMedBERT, which provides contextualized word embeddings [22]. For the Russian NNER task, the team used a pre-trained SBERT-Large-NLU-RU model4. For bilingual NNER, they employed BERT-Base-Multilingual-uncased [23]. A series of six classification layers were added to the base model.
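The multi-level BIO encoding described above can be sketched as follows. The depth-assignment policy here (each entity goes to the first layer whose token positions are still free, with wider spans placed first) is our assumption for illustration, not necessarily the team's exact preprocessing:

```python
# Flatten nested entities into several BIO layers, one per nesting level.
# Tokenisation is naive whitespace splitting; spans are token-index ranges.
def bio_layers(tokens, entities, n_layers=6):
    """entities: iterable of (start_tok, end_tok_exclusive, label)."""
    # Sort so that wider (outer) spans claim shallower layers first.
    entities = sorted(entities, key=lambda e: (e[0], -(e[1] - e[0])))
    layers = [["O"] * len(tokens) for _ in range(n_layers)]
    for start, end, label in entities:
        for depth in range(n_layers):  # find the first free layer for this span
            row = layers[depth]
            if all(tag == "O" for tag in row[start:end]):
                row[start] = f"B-{label}"
                for i in range(start + 1, end):
                    row[i] = f"I-{label}"
                break
    return layers


tokens = "deep vein thrombosis was diagnosed".split()
entities = [(0, 3, "DISO"), (1, 2, "ANATOMY")]
layers = bio_layers(tokens, entities, n_layers=2)
# layers[0] → ['B-DISO', 'I-DISO', 'I-DISO', 'O', 'O']
# layers[1] → ['O', 'B-ANATOMY', 'O', 'O', 'O']
```

Each layer then becomes a separate tagging target, which is what the six per-layer classification heads predict.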
Each layer was designed to output a specific level of NER tags, with each linear layer taking the hidden states from PubMedBERT and mapping them to the required number of labels for that layer. Although the original number of classes in the BioNNE dataset is eight, the total number of output classes for each classification layer is 17 to support the preprocessed BIO-tagged dataset: a "B-Class" and an "I-Class" tag for each of the eight original classes, plus the "O" class for any token that does not belong to any entity class. To enrich the NER process, the team leveraged the UMLS Metathesaurus for vocabulary expansion. They utilized the MRCONSO.RRF data file within UMLS to extract relevant concepts and their child concepts based on the concept IDs provided by the BioNNE challenge organizers, which broadened the model's ability to recognize entities by incorporating synonyms and related terms. For these experiments, the team employed 6 NVIDIA Tesla V100 GPUs with 32GB of HBM2 VRAM each.

3 https://huggingface.co/andorei/BERGAMOT-multilingual-GAT
4 https://huggingface.co/ai-forever/sbert_large_nlu_ru

5. Conclusion

In this paper, we present the organizers' report on the competition for nested named entity recognition systems in the biomedical domain (BioNNE). The competition included three subtasks: Bilingual, English-oriented, and Russian-oriented. The participants were asked to extract the eight most common medical entity types, which can contain each other, from PubMed abstracts in both Russian and English. The best results in the evaluation were achieved using the BINDER model, based on bi-encoder representations and a contrastive learning framework. The winner experimented with different ways of describing entity types (prompts) for BINDER learning. We hope that the outcomes of the competition will foster further research and development in nested NER for healthcare applications.

Acknowledgments

The work of E.T.
has been supported by the Russian Science Foundation grant # 23-11-00358. We would like to thank all the participating teams who contributed to the success of the shared task through their interesting approaches and experiments.

References

[1] Y. Yang, X. Hu, F. Ma, S. Li, A. Liu, L. Wen, P. S. Yu, Gaussian prior reinforcement learning for nested named entity recognition, arXiv preprint arXiv:2305.12003 (2023). URL: https://arxiv.org/abs/2305.12003.
[2] P. Wajsburt, Y. Taillé, X. Tannier, Effect of depth order on iterative nested named entity recognition models, arXiv preprint arXiv:2104.00542 (2021). URL: https://arxiv.org/abs/2104.00542.
[3] U. Yaseen, P. Gupta, H. Schütze, Linguistically informed relation extraction and neural architectures for nested named entity recognition in BioNLP-OST 2019, arXiv preprint arXiv:1910.03549 (2019). URL: https://arxiv.org/abs/1910.03549.
[4] H. Ming, J. Yang, L. Jiang, Y. Pan, N. An, Few-shot nested named entity recognition, arXiv preprint arXiv:2212.00968 (2022). URL: https://arxiv.org/abs/2212.00968.
[5] N. Loukachevitch, E. Artemova, T. Batura, P. Braslavski, V. Ivanov, S. Manandhar, A. Pugachev, I. Rozhkov, A. Shelmanov, E. Tutubalina, et al., NEREL: a Russian information extraction dataset with rich annotation for nested entities, relations, and Wikidata entity links, Language Resources and Evaluation (2023) 1–37.
[6] E. Artemova, M. Zmeev, N. Loukachevitch, I. Rozhkov, T. Batura, V. Ivanov, E. Tutubalina, RuNNE-2022 shared task: Recognizing nested named entities, Komp'juternaja Lingvistika i Intellektual'nye Tehnologii 2022 (2022) 33–41. doi:10.28995/2075-7182-2022-21-33-41.
[7] J.-D. Kim, T. Ohta, Y. Tateisi, J. Tsujii, GENIA corpus—a semantically annotated corpus for bio-textmining, Bioinformatics 19 (2003) i180–i182.
[8] A. Katiyar, C. Cardie, Nested named entity recognition revisited, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, 2018.
[9] N. Loukachevitch, S. Manandhar, E. Baral, I. Rozhkov, P. Braslavski, V. Ivanov, T. Batura, E. Tutubalina, NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities, Bioinformatics (2023). URL: https://doi.org/10.1093/bioinformatics/btad161. doi:10.1093/bioinformatics/btad161.
[10] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[11] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based tool for NLP-assisted text annotation, in: Proceedings of the Demonstrations Session at EACL 2012, Association for Computational Linguistics, Avignon, France, 2012.
[12] S. Zhang, H. Cheng, J. Gao, H. Poon, Optimizing bi-encoder for named entity recognition via contrastive learning, in: The Eleventh International Conference on Learning Representations, 2022.
[13] A. Sakhovskiy, N. Semenova, A. Kadurin, E. Tutubalina, Biomedical entity representation with graph-augmented multi-objective transformer, in: Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024.
[14] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, arXiv preprint arXiv:1710.10903 (2018).
[15] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.
[16] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116.
[17] I. Rozhkov, N. Loukachevitch, Prompts in few-shot named entity recognition, Pattern Recognition and Image Analysis 33 (2023) 122–131.
[18] W. Zhou, Biomedical Nested NER with Large Language Model and UMLS Heuristics, in: CLEF Working Notes, 2024.
[19] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mixtral of experts, 2024. arXiv:2401.04088.
[20] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database J. Biol. Databases Curation 2016 (2016). URL: https://doi.org/10.1093/database/baw068. doi:10.1093/database/baw068.
[21] H. Rehana, B. Bansal, N. Bengisu Çam, J. Zheng, Y. He, A. Özgür, J. Hur, Nested Named Entity Recognition using Multilayer BERT-based Model, in: CLEF Working Notes, 2024.
[22] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, ACM Transactions on Computing for Healthcare (HEALTH) 3 (2021) 1–23.
[23] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805.