
Recognition of Biodiversity-related Named Entities by Fine-tuning General-domain BERT-based Language Models

Geilah T. Tabanao

geilahtabanao67@gmail.com 0 2

Andrew Miguel V. Pagdanganan

avpagdanganan@up.edu.ph 0 2

Riza Batista-Navarro

riza.batista@manchester.ac.uk 0 1 3

Roselyn S. Gabud

rsgabud@up.edu.ph 0 2 3


Keywords: Named Entity Recognition, Biodiversity, Transformers, Information Extraction

0 15th International SWAT4HCLS Conference
1 Department of Computer Science, University of Manchester, UK
2 Department of Computer Science, University of the Philippines Diliman, Quezon City, Philippines
3 Institute of Computer Science, University of the Philippines Los Baños, Laguna, Philippines

2024

Named Entity Recognition (NER) is crucial for various Natural Language Processing (NLP) tasks, including uncovering insights from vast textual datasets. We evaluated Bidirectional Encoder Representations from Transformers (BERT) models pre-trained on general-domain data, fine-tuning them on the COPIOUS dataset for biodiversity NER. Our DeBERTa NER model achieved the best performance and was therefore employed in a biodiversity Information Extraction pipeline, which was applied to the forestry compendium of the Centre for Agricultural and Biosciences International Digital Library. We demonstrate that the pipeline enables the enrichment of descriptive information on the reproductive conditions and habitats of tree species.

that this is the one dataset where pre-training a BERT model on domain-specific data did not lead to any improved performance, thus prompting the question of whether other BERT-based models could perform better, even when pre-trained on general-domain data only. Amongst our fine-tuned models, DeBERTa obtained the best performance, with an F1-score of 84.18%. This is notable, considering that this model was not pre-trained on domain-specific data.

2. Knowledge Graph Curation

A popular application of NER is the extraction of fine-grained information from text, which can then be leveraged to populate or curate structured databases. In this vein, we set out to explore the extent to which an Information Extraction pipeline underpinned by NER and relation extraction (RE) can curate a biodiversity-focused database, based on information buried within textual descriptions of various tree species in the Centre for Agricultural and Biosciences International (CABI) Digital Library. Specifically, we integrated our best-performing NER model into the pipeline and applied an existing RE model to extract information on the habitats and reproductive conditions of species in the CABI Library forestry compendium.

Taking a corpus of CABI textual descriptions, our pipeline: (1) applies NER to extract mentions of geographic locations, habitats and temporal expressions; (2) applies RE to identify related habitats and geographic locations (i.e., habitat-geographic location relations) and related reproductive conditions and temporal expressions (i.e., reproductive condition-temporal expression relations); and (3) populates a graph database to store the related entities, to allow for querying and visualisation.
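The three pipeline steps can be illustrated with a minimal Python sketch. This is not the implementation used in this work: the fine-tuned DeBERTa NER model and the existing RE model are replaced by toy stand-ins (a phrase lookup and a type-compatibility rule), the graph database is replaced by an in-memory dictionary, and all names (`toy_ner`, `toy_re`, `run_pipeline`, the relation labels, the gazetteer and the example description) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Entity:
    text: str
    label: str


# Entity-type pairs that the toy RE step may link (relation names are hypothetical).
RELATION_TYPES = {
    ("Habitat", "GeographicLocation"): "habitat_in_location",
    ("ReproductiveCondition", "TemporalExpression"): "condition_at_time",
}


def toy_ner(sentence, gazetteer):
    """Stand-in for a fine-tuned NER model: case-insensitive phrase lookup."""
    lowered = sentence.lower()
    return [Entity(phrase, label) for phrase, label in gazetteer.items()
            if phrase.lower() in lowered]


def toy_re(entities):
    """Stand-in for an RE model: relate every type-compatible pair in a sentence."""
    relations = []
    for head, tail in product(entities, repeat=2):
        name = RELATION_TYPES.get((head.label, tail.label))
        if name is not None:
            relations.append((head.text, name, tail.text))
    return relations


def run_pipeline(descriptions, gazetteer):
    """Map each species to the set of (head, relation, tail) triples in its description."""
    graph = {}
    for species, text in descriptions.items():
        triples = set()
        for sentence in text.split(". "):                   # naive sentence splitting
            entities = toy_ner(sentence, gazetteer)         # step 1: NER
            triples.update(toy_re(entities))                # step 2: RE
        graph[species] = triples                            # step 3: store related entities
    return graph


# Hypothetical gazetteer and description, invented for this example.
gazetteer = {
    "lowland forest": "Habitat",
    "Luzon": "GeographicLocation",
    "flowering": "ReproductiveCondition",
    "March to May": "TemporalExpression",
}
descriptions = {
    "Species A": "It grows in lowland forest in Luzon. "
                 "Flowering occurs from March to May.",
}

graph = run_pipeline(descriptions, gazetteer)
for species, triples in graph.items():
    for head, relation, tail in sorted(triples):
        print(species, "|", head, relation, tail)
```

In a real setting, the triples stored in `graph` would instead be written to a graph database as nodes and edges, which is what makes the querying and visualisation described above possible.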

[1] Devlin, M.- Chang, Lee, Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. URL: http://arxiv.org/abs/1810.04805. doi: 10.48550/arXiv.1810.04805, arXiv: 1810.04805 [cs].

[2] K. S. Kalyan, Rajasekharan, S. Sangeetha, AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing, 2021. URL: http://arxiv.org/abs/2108.05542. doi: 10.48550/arXiv.2108.05542, arXiv: 2108.05542 [cs].

[3] N. T. Nguyen, R. S. Gabud, S. Ananiadou, COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature, Biodiversity Data Journal (2019) e29626. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351503/. doi: 10.3897/BDJ.7.e29626.

[4] Abdelmageed, Löfler, König-Ries, BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain, in: A. Yamaguchi, A. Splendiani, M. S. Marshall, C. Baker, J. T. Bolleman, A. Burger, L. J. Castro, O. Eigenbrod, S. Österle, M. Romacker, A. Waagmeester (Eds.), 14th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS 2023), Basel, Switzerland, February 13-16, 2023, volume 3415 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 62-71. URL: https://ceur-ws.org/Vol-3415/paper-7.pdf.