<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2025 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparing CRF vs BERT Models for Named Entity Recognition and Relation Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Pamio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents our participation in the CLEF 2025 GutBrainIE challenge, addressing tasks in Named Entity Recognition (NER) and Relation Extraction (RE) on biomedical texts related to the gut-brain axis. We explored both traditional and modern approaches, including Conditional Random Fields (CRFs) with hand-engineered features and fine-tuned BERT-based models. For RE, we focused on a simplified pipeline using BiomedBERT, coupled with NER outputs to extract binary and ternary relations. Our experiments revealed the limitations of CRFs in this domain and highlighted the variability and sensitivity of BERT-based models to training stability and dataset noise. While our NER performance was mid-ranked, we achieved competitive results in RE, particularly in ternary tag-based extraction. We also reflect on the effects of model selection, loss function design, and data configurations, offering insights for future work in biomedical IE.</p>
      </abstract>
      <kwd-group>
        <kwd>CRF model</kwd>
        <kwd>BERT model</kwd>
        <kwd>Fine tuning</kwd>
        <kwd>NER</kwd>
        <kwd>RE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Named Entity Recognition (NER) is a Natural Language Processing task whose objective is to classify
entities within text. Early approaches to this task relied on rule-based systems and feature
engineering, often using models such as Conditional Random Fields (CRFs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. With the rise of deep learning, neural network architectures
have become dominant. More specifically, transformer-based models such as BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have substantially improved
performance on this task. Relation Extraction (RE) similarly focuses on identifying relationships between
entities. Recent advances in
deep learning have significantly improved performance on this task as well. In particular, models such as
BiomedBERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], developed by Microsoft, have contributed substantially to progress in the biomedical
domain. The work presented in this paper builds upon the foundation established by BiomedBERT,
which serves as a core component of our approach.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We now briefly define the NER and RE subtasks addressed in the challenge.</p>
      <sec id="sec-3-1">
        <title>3.1. Named Entity Recognition</title>
        <p>Given a set L of labels, an ordered sequence T of tokens of size n, and a set F of functions that can map
a token sequence into a label sequence, we formally define the problem of NER as:</p>
        <p>f : T → L, where f((t_1, ..., t_n)) = (l_1, ..., l_n), l_i ∈ L (1)</p>
        <p>The objective of this task is to assign a label to each token in a given token sequence, minimizing the overall
loss with respect to a known ground truth. This process should ideally consider not only individual
tokens but the entire token sequence T for context-aware predictions.</p>
        <p>ℒ(f) = Σ_{t ∈ T} loss(f(t), y(t))</p>
        <p>f* = argmin_{f ∈ F} ℒ(f)</p>
        <p>where y(t) denotes the ground-truth label of token t.</p>
        <p>The loss can be defined in various ways. Ideally, it could be expressed as the negative of a reward
function, allowing us to optimize for the function that yields the best overall performance.</p>
        <p>The ideal approach to the task assumes that the loss can be computed efficiently for a given label
set L. However, in real-world scenarios, this loss function is often not directly computable due to the
inherent ambiguity in assigning a token to a specific label, as well as the subjective nature of human
annotation, which may label the same entity differently. In practice, applying this approach requires
defining the initial token set T as a sequence of tokens that, when combined, reconstruct the document.
Similarly, the label domain L is constrained by the task’s scope, and the number of labels is limited to a
finite, positive integer.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Relation Extraction</title>
        <p>The RE task is given a set of entities E, each with the attributes text span, position, and label, and a set
of predicates P; the objective of the task is to identify the relations that hold between entities, with
p ∈ P, e_1, e_2 ∈ E, l_1, l_2 ∈ L.
The labels l_1, l_2 are defined as the labels associated with the entities e_1, e_2 respectively. The task can be
specified in different ways depending on the subtask. In subtask 6.2.1, the objective is to determine
whether a relation exists between the two given labels. Subtask 6.2.2 extends this requirement by also
identifying the specific predicate that characterizes the relation. Subtask 6.2.3 further requires the
extraction of the text spans corresponding to the related entities, in addition to identifying the predicate.</p>
        <p>In the following formulas, R_bin refers to subtask 6.2.1 about binary tag-based RE, R_tt refers
to subtask 6.2.2 about ternary tag-based RE, and finally R_tm refers to subtask 6.2.3 about ternary
mention-based RE:</p>
        <p>R_bin = {(l_1, l_2) | e_1, e_2 ∈ E, if a relation exists between e_1, e_2} (2)</p>
        <p>R_tt = {(l_1, p, l_2) | e_1, e_2 ∈ E, p ∈ P, if a relation exists between e_1, e_2} (3)</p>
        <p>R_tm = {(s_1, p, s_2, l_1, l_2) | e_1, e_2 ∈ E, p ∈ P, if a relation exists between e_1, e_2} (4)</p>
        <p>In the equation defining R_tm, the spans s_1, s_2 are defined as the spans in the text associated with
the labels and entities l_1, l_2 and e_1, e_2.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. CRF model</title>
        <p>
          We began by developing a model based on Conditional Random Fields (CRFs), aiming to build it from
scratch and evaluate its performance in the specific domain of the GutBrainIE task. CRF models are
statistical modeling methods that incorporate contextual information, making them well-suited for
sequence labeling tasks like NER [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. CRFs rely heavily on feature design and transition probabilities.
Since the predictions are derived from input features, feature engineering plays a crucial role in
determining the model’s capabilities [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For this challenge, we designed a custom feature set tailored
to the biomedical domain and the structure of the provided texts.
        </p>
        <p>
          The CRF model was modified in different ways from the default configuration, and its hyperparameters
were tuned to obtain different performance profiles. The trained models were based
on the package sklearn_crfsuite2, which provides different training algorithms such as lbfgs [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], l2sgd [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
ap [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], pa [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and arow [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Among these algorithms, the best performance was obtained with the
lbfgs method, which was therefore chosen for integration in the final model.
        </p>
        <p>
          In addition to the training algorithm, several important parameters were tuned to control the model’s
regularization behavior and feature handling (see the configuration sketch after this list):
• c1, the coefficient responsible for L1 (LASSO) regularization [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
• c2, the coefficient responsible for L2 (ridge) regularization [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
• all_possible_transitions, a boolean controlling whether transitions not present
in the training dataset are also evaluated
• min_freq, the minimum frequency with which a feature needs to occur
in order to be taken into account by the model
        </p>
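        <p>As an illustration, a minimal sketch of this configuration using sklearn_crfsuite is shown below; the
feature dictionaries, label names, and hyperparameter values are illustrative placeholders, not the exact
ones used in our runs.</p>
        <preformat>
import sklearn_crfsuite

# Toy training data: one sentence, one feature dict per token, BIO-style labels.
# Feature keys and label names here are illustrative placeholders.
X_train = [
    [
        {"word.lower()": "lactobacillus", "word.istitle()": True, "postag": "NNP"},
        {"word.lower()": "reduces", "word.istitle()": False, "postag": "VBZ"},
        {"word.lower()": "anxiety", "word.istitle()": False, "postag": "NN"},
    ]
]
y_train = [["B-Entity", "O", "B-Entity"]]

# CRF configured as described above: lbfgs training, L1/L2 penalties (c1/c2),
# all possible transitions considered, and no minimum feature frequency.
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,                         # L1 (LASSO) regularization strength
    c2=0.1,                         # L2 (ridge) regularization strength
    min_freq=0,                     # keep every feature seen in training
    all_possible_transitions=True,
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
        </preformat>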
        <p>
          The feature engineering applied to these CRF models involves a standard set of features used to label
tokens and extract relations. The core idea behind feature engineering is to process an entire document
token by token, extracting specific features for each token as well as information about its surrounding
context [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In the specific case of this challenge, to represent the current token, we used the following
information (a sketch of the corresponding feature extractor follows this list):
1. ’word.lower()’: the lowercase representation of the token
2. ’word[-3:]’: the last 3 characters of the token
3. ’word[-2:]’: the last 2 characters of the token
4. ’word.isupper()’: whether the token is uppercase
5. ’word.istitle()’: whether the token is in title case
6. ’word.hasCapital()’: whether the token contains a capital letter
7. ’word.isdigit()’: whether the token is a digit
8. ’word.isGene()’: a custom implementation, whether the token is a scientific representation of a gene
9. ’postag’: the POS tag of the token
10. ’postag[:2]’: the first 2 characters of the POS tag
11. ’word.length()’: the length of the token
12. ’word.pos()’: the position of the token in the phrase
We also incorporated, whenever possible, features derived from the preceding and following tokens to
enrich the representation of the current token. These contextual features consist of a subset of those
used for the current token itself, specifically features 1, 4, 5, 9, and 10, i.e., word.lower, word.isupper,
word.istitle, postag, and postag[:2].
2https://sklearn-crfsuite.readthedocs.io/en/latest/
        </p>
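        <p>A minimal sketch of such a feature extractor is shown below. The implementation of word.isGene()
is an assumption on our part, approximated here with a simple pattern, since the exact rule is not
reproduced in this paper.</p>
        <preformat>
import re

def word2features(sent, i):
    """Feature dict for token i of a POS-tagged sentence [(word, postag), ...]."""
    word, postag = sent[i]
    features = {
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.hasCapital()": any(c.isupper() for c in word),
        "word.isdigit()": word.isdigit(),
        # Assumption: gene-like symbols such as "BDNF1"; the exact rule differs.
        "word.isGene()": bool(re.fullmatch(r"[A-Za-z]{2,5}[0-9]*", word)),
        "postag": postag,
        "postag[:2]": postag[:2],
        "word.length()": len(word),
        "word.pos()": i,
    }
    # Contextual features from neighbouring tokens (subset: features 1, 4, 5, 9, 10).
    for offset, prefix in [(-1, "-1:"), (1, "+1:")]:
        j = i + offset
        if j in range(len(sent)):
            w, p = sent[j]
            features.update({
                prefix + "word.lower()": w.lower(),
                prefix + "word.isupper()": w.isupper(),
                prefix + "word.istitle()": w.istitle(),
                prefix + "postag": p,
                prefix + "postag[:2]": p[:2],
            })
        else:
            features[prefix + "BOS/EOS"] = True
    return features
        </preformat>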
      </sec>
      <sec id="sec-3-4">
        <title>3.4. BERT models</title>
        <p>In addition to CRF models, we also adopted an approach based on fine-tuning pre-trained models. This
fine-tuning process aimed to improve the base performance of various models available through the
HuggingFace library3. Several types of models were considered, and each was specifically trained to
achieve the best possible performance within the subtasks’ constraints. The models we fine-tuned and
subsequently submitted to the challenge were:
• scibert-scivocab-uncased4
• biobert-base-cased-v1.25
• BiomedNLP-BiomedBERT-base-uncased-abstract6
• biosyn-sapbert-bc2gn7
• NuNER-v2.08
All of them (except NuNER-v2.0) were specifically pre-trained on scientific and/or bio-related corpora,
which enhanced their performance in our specific domain.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The datasets provided by the competition organizers are composed of entities and relationships between
them inside the titles and abstracts of PubMed articles.</p>
        <p>Regarding the challenge, the provided datasets include:
• Entity Mentions: Text spans classified into predefined categories.
• Relations: Associations between entities, specifying that a particular relationship holds between
two entities.</p>
        <p>In the specific instance of the GutBrainIE challenge, the corpus of documents was annotated in
different ways:
• Platinum collection: highest-quality annotations, expert-curated and reviewed by external
biomedical specialists.
• Gold collection: high-quality annotations, expert-curated.
• Silver collection: mid-quality annotations, created by trained students under expert supervision.
• Bronze collection: automatically generated annotations.
• Dev collection: used as test set.</p>
        <p>Working on the 6.1 subtask about NER, our setup was split into two main pipelines: a CRF model
(Section 3.3) trained from scratch and a pipeline to fine-tune BERT models (Section 3.4).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Named Entity Recognition</title>
        <p>The setup used in this subtask is mainly related to the hyperparameters of the models themselves. We
also tweaked the domain and format of the training set used for the task, although the main focus in
this part of the challenge was placed more on the models than on data processing.
3https://huggingface.co/docs/huggingface_hub/guides/overview
4https://huggingface.co/allenai/scibert_scivocab_uncased
5https://huggingface.co/dmis-lab/biobert-base-cased-v1.2
6https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract
7https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn
8https://huggingface.co/numind/NuNER-v2.0</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. CRF models</title>
          <p>The main differences between this setup and the overall models produced with CRF are shown in
Table 2. We mainly adjusted values associated with regularization functions, specifically L1 (c1_value)
and L2 (c2_value). The min_freq parameter was kept at 0 to ensure that every feature present in the
training dataset was captured. We also varied the amount and type of data used for training.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. BERT models</title>
          <p>
            BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model
pretrained on large corpora using a masked language modeling objective [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Its success in a wide range of
NLP tasks has made it a natural choice for sequence classification and token-level prediction tasks. A
key concern with BERT models and the training pipeline was the stability of the process. Indeed, in
some training iterations, the loss function fluctuated significantly, leading to considerable variation in
the results. To address this and improve stability, we adjusted the unstable models’ hyperparameters.
          </p>
          <p>We also decided to use only one model implementing the CustomWeight loss function, as most of
the domain-specific scientific or biomedical models did not yield the performance improvements we
had hoped for.</p>
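          <p>The exact formulation of the CustomWeight loss is not reproduced here; as an illustration only, one
plausible variant is a class-weighted cross-entropy over token labels, sketched below with arbitrary
weights.</p>
          <preformat>
import torch
import torch.nn as nn

# Illustrative class-weighted token loss; the actual CustomWeight function may differ.
num_labels = 3
class_weights = torch.tensor([0.1, 1.0, 1.0])  # e.g., down-weight the frequent "O" class

loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(2, 8, num_labels)           # (batch, seq_len, num_labels)
gold = torch.randint(0, num_labels, (2, 8))      # toy gold label ids
loss = loss_fn(logits.view(-1, num_labels), gold.view(-1))
print(loss.item())
          </preformat>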
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Relation Extraction</title>
        <p>For the RE subtasks, we relied heavily on a single model, the
BiomedNLP-BiomedBERT-base-uncased-abstract model, focusing on optimizing one single model for all the RE subtasks. To extract
relations from the text, the RE model had to be paired with a NER model capable of identifying the
entities to be used in the subsequent steps.</p>
        <p>We decided to experiment with the following fine-tuned NER models 9:
• biosyn-sapbert-bc2gn-1210
• scibert-27 11
• NuNerv2.0-22-CW-xtreme12
The biosyn-sapbert-bc2gn-12 model was chosen because it was expected to have the best theoretical
performance due to its scientific and bio-related pre-training.</p>
        <p>The scibert-27 model was chosen because the 47-epoch version appeared to have overfitted on some
of the data.</p>
        <p>The NuNerv2.0-22-CW-xtreme model was chosen because it had the most generic domain training
background, it had the best performance on unseen data, and it relied on our CustomWeight
loss function.</p>
        <p>During the development of these RE models, we defined a metric that was used as the main varying
parameter, called norel_ratio:</p>
        <p>norel_ratio = |N| / |R| (5)</p>
        <p>where N is a set of relations labeled as negative, denoting a non-existing link between two
entities in the text, and R is the set of existing relations between entities in the text. In the specific
instance of this study, we always used the entirety of the positive relation instances as a starting
point to compute the set N of non-existing relations. To create the set N of negative instances, we
used a random approach, extracting and inserting into this set relationships that did not exist between
random entities. These models were trained with 3 iterations of the BiomedBERT RE
model, where the norel_ratio was tweaked, ranging from 1 to 3.</p>
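        <p>A minimal sketch of this random negative sampling, under the assumption that candidate negatives
are drawn uniformly from unrelated entity pairs, is shown below; names and data are illustrative.</p>
        <preformat>
import random

def sample_negative_relations(entities, positive_pairs, norel_ratio, seed=0):
    """Sample entity pairs with no annotated relation, sized as
    norel_ratio times the positive set (a sketch, not the exact code)."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    candidates = [
        (e1, e2)
        for e1 in entities
        for e2 in entities
        if e1 != e2 and (e1, e2) not in positives
    ]
    k = min(len(candidates), int(norel_ratio * len(positives)))
    return rng.sample(candidates, k)

# Toy usage: 2 positive relations and norel_ratio = 2 give up to 4 negatives.
ents = ["probiotic", "anxiety", "microbiome", "inflammation"]
pos = [("probiotic", "anxiety"), ("microbiome", "inflammation")]
print(sample_negative_relations(ents, pos, norel_ratio=2))
        </preformat>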
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The total number of submitted runs was 37. Out of these 37, 10 were related to the first subtask about
NER (3.1), and the remaining 27 were distributed equally over the 3 RE (3.2) subtasks. As shown in
Table 5, the results on the NER subtask 6.1 show that BERT models had the best performance overall
(considering the micro-F1 score as the reference metric).</p>
      <p>The customCRF models, trained from scratch (see Section 3.3), did not perform as well as the other
approaches. Similarly, the Custom Weight scheme, which was applied to the BERT models through a
custom loss function and initially showed promising results during early evaluation, ultimately ranked
lower both in terms of average position and Micro F1-score when compared to other BERT-based
models. This result was expected, as we anticipated that the most general-purpose configuration would
yield the weakest performance among the BERT variants.</p>
      <p>Concerning the RE (3.2) subtasks, the average performances of the proposed models are similar. Analyzing
the models’ behaviors reported in Tables 6, 7, and 8, we can see that overall the best micro-F1 score was
obtained with models having a higher ratio of no_relation over effective relations in the training dataset.</p>
      <p>Even though the overall F1-score distribution was variable, it is worth noting that, in Task 6.2.2, some
models trained with a ratio of 1 achieved a high macro-F1 score. This indicates strong performance
across all relation classes, suggesting that these models were effective in distinguishing between different
types of relations.
9These models have been fine-tuned in the NER subtask
10Base model at https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn
11Base model at https://huggingface.co/allenai/scibert_scivocab_uncased
12Base model at https://huggingface.co/numind/NuNER-v2.0</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>Our participation in this task showed that for the NER subtasks, although we explored different
approaches, our results were not among the top performers. However, the trend was different for the
RE subtasks. We achieved satisfying results in subtask 6.2.2, and overall, our performances in the 6.2
subtasks were better than in subtask 6.1; these results are summarized in Table 9.</p>
      <p>
        Promising directions for future work include the evaluation of larger models and the performance gains they
may bring in this specific domain. Additionally, we aim to investigate the optimal no_rel ratio and
how changes to this parameter affect model performance, clarifying whether this value has a generally
applicable threshold or if it is domain-dependent. In addition, we aim to integrate a semantic perspective
grounded in linguistic analysis to enrich the linguistic and conceptual interpretation of extracted terms
and relations. Specifically, we would like to apply semic analysis, which decomposes terms into minimal
semantic units, as a structured approach to uncovering the internal organization of meaning in medical
terminology [
        <xref ref-type="bibr" rid="ref13">13, 14</xref>
        ]. Incorporating this technique may enhance our ability to align terminological
outputs with underlying conceptual structures, improving not only model interpretability but also the
precision of the extraction of named entities and objects in domain-specific biomedical contexts.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the HEREDITARY Project, as a part of the European Union’s Horizon
Europe research and innovation programme under grant agreement No GA 101137074.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <article-title>Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , volume TBA of
          <source>Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          , p. TBA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Irrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          , Overview of GutBrainIE@CLEF 2025:
          <article-title>Gut-Brain Interplay Information Extraction</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          ,
          <source>in: Proceedings of the Eighteenth International Conference on Machine Learning</source>
          , ICML '
          <fpage>01</fpage>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2001</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          , CoRR abs/2007.15779 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2007.15779. arXiv:2007.15779.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nocedal</surname>
          </string-name>
          ,
          <article-title>On the limited memory bfgs method for large scale optimization</article-title>
          ,
          <source>Mathematical programming 45</source>
          (
          <year>1989</year>
          )
          <fpage>503</fpage>
          -
          <lpage>528</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , Stochastic Gradient Descent Tricks, Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2012</year>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>436</lpage>
          . URL: https://doi.org/10.1007/978-3-642-35289-8_25. doi:10.1007/978-3-642-35289-8_25.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <article-title>Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms</article-title>
          ,
          <source>in: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP</source>
          <year>2002</year>
          ), Association for Computational Linguistics,
          <year>2002</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://aclanthology.org/W02-1001/. doi:10.3115/1118693.1118694.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Crammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dekel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Keshet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <article-title>Online passive-aggressive algorithms</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          )
          <fpage>551</fpage>
          -
          <lpage>585</lpage>
          . URL: http://jmlr.org/papers/v7/ crammer06a.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Crammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulesza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <article-title>Adaptive regularization of weight vectors</article-title>
          , in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>22</volume>
          , Curran Associates, Inc.,
          <year>2009</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2009/file/8ebda540cbcc4d7336496819a46a1b68-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <article-title>Regression shrinkage and selection via the lasso</article-title>
          ,
          <source>Journal of the Royal Statistical Society. Series B (Methodological) 58</source>
          (
          <year>1996</year>
          )
          <fpage>267</fpage>
          -
          <lpage>288</lpage>
          . URL: http://www.jstor.org/stable/2346178.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Hoerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Kennard</surname>
          </string-name>
          ,
          <article-title>Ridge regression: Biased estimation for nonorthogonal problems</article-title>
          ,
          <source>Technometrics</source>
          <volume>42</volume>
          (
          <year>2000</year>
          )
          <fpage>80</fpage>
          -
          <lpage>86</lpage>
          . URL: http://www.jstor.org/stable/1271436.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <article-title>A Novel Approach to Semic Analysis: Extraction of Atoms of Meaning to Study Polysemy and Polyreferentiality</article-title>
          ,
          <source>Languages</source>
          <volume>9</volume>
          (
          <year>2024</year>
          ) 121. URL: https://www.mdpi.com/2226-471X/9/4/121. doi:10.3390/languages9040121.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <article-title>Preliminary Considerations on a Systematic Approach to Semic Analysis: The Case Study of Medical Terminology</article-title>
          ,
          <source>Umanistica Digitale</source>
          (
          <year>2021</year>
          )
          <fpage>211</fpage>
          -
          <lpage>234</lpage>
          . URL: https://umanisticadigitale.unibo.it/article/view/12621. doi:10.6092/issn.2532-8816/12621.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>