<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CLEF 2025: Gut-Brain Interplay Information Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Martinelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanessa Bonato</string-name>
          <email>vanessa.bonato@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>giorgiomaria.dinunzio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>nicola.ferro@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ornella Irrera</string-name>
          <email>ornella.irrera@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Marchesin</string-name>
          <email>stefano.marchesin@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Menotti</string-name>
          <email>laura.menotti@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Vezzani</string-name>
          <email>federica.vezzani@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Linguistic and Literary Studies, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent studies link the gut microbiota to mental health conditions and to neurodegenerative diseases such as Parkinson's and Alzheimer's. However, the rapid pace at which this research field is evolving presents a significant challenge for clinicians and researchers who have to keep pace with an ever-expanding volume of biomedical literature. In this context, automatic tools for extracting and structuring information from scientific texts are becoming essential to support the understanding of the gut-brain axis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <p>
        In response to this challenge, the GutBrainIE-2025 task, part of the BioASQ Lab [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] and situated in the context of the EU-funded project HEREDITARY,1 introduces a Natural Language Processing (NLP)
challenge focused on extracting structured information from PubMed abstracts related to the gut–brain
axis. The task aims to foster the development of robust and effective Information Extraction (IE) systems
that support experts in analyzing the scientific literature, thereby contributing to biomedical knowledge
discovery and, in the long term, informed clinical decision-making.
      </p>
      <p>In its first edition, GutBrainIE proposes four subtasks of increasing complexity:
• Subtask 6.1 - Named Entity Recognition (NER): participants are asked to identify and classify
specific text spans (entity mentions) into one of the 13 predefined categories (e.g., bacteria,
chemical, microbiota).
• Subtask 6.2.1 - Binary Tag-based Relation Extraction (BT-RE): participants are provided
with a set of predefined relation types, each defined by a combination of compatible head and
tail entities (e.g., Chemical → Microbiome via Impact or Produced by), and are asked to identify
which entities are in relation within a document, without specifying the exact predicate or entity
mentions involved.
• Subtask 6.2.2 - Ternary Tag-based Relation Extraction (TT-RE): this subtask extends BT-RE
by requiring participants to predict the specific relation predicate connecting each head-tail entity
pair.
• Subtask 6.2.3 - Ternary Mention-based Relation Extraction (TM-RE): this is the most
challenging subtask, requiring participants to identify the exact entity mentions involved in a relation and
to assign the correct relation predicate.</p>
      <p>All subtasks target PubMed abstracts, leveraging a corpus of biomedical documents related to the
gut-brain axis. Each document contains a title and abstract, both annotated with entity mentions and
relations. Specifically, the GutBrainIE-2025 dataset consists of over 1500 annotated documents, split
into Training, Development, and Test sets. A noteworthy feature of the dataset is its tiered annotation
quality, organized as follows:
• Platinum Annotations: highest-quality annotations, expert-curated and internally reviewed;
• Gold Annotations: high-quality annotations and expert-curated;
• Silver Annotations: mid-quality annotations, created by trained students under expert
supervision;
• Bronze Annotations: automatically generated annotations with no manual correction.</p>
      <p>In particular, the Development and Test sets contain only expert annotations (Platinum and
Gold Standard Annotations).</p>
      <p>Submissions are evaluated using standard macro- and micro-averaged Precision, Recall, and F1
metrics. Results are compared against a baseline system shared with participants at the beginning of
the challenge to serve as a reference.</p>
      <p>This paper provides a comprehensive overview of the GutBrainIE-2025 task. Section 2 presents the
subtasks and their structure; Section 3 introduces the dataset structure and annotation schema; Section
4 presents participating teams and evaluation procedures; Section 5 reports the results and leaderboards
across subtasks; Section 6 describes the systems, models, and approaches employed by participating
teams; finally, Section 7 concludes the paper and proposes future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Overview</title>
      <p>In its first edition, GutBrainIE-2025 featured four subtasks:
1. Named Entity Recognition (NER).
2. Binary Tag-based Relation Extraction (BT-RE).
3. Ternary Tag-based Relation Extraction (TT-RE).
4. Ternary Mention-based Relation Extraction (TM-RE).</p>
      <p>Participants were free to develop their systems without constraints on architecture, training
methodology, or external resources, aiming to achieve the best possible performance. Overall, 17 teams submitted
a total of 395 runs. In the remainder of this section, we describe each task in detail.</p>
      <sec id="sec-2-1">
        <title>2.1. Subtask 1: Named Entity Recognition (NER)</title>
        <p>The NER subtask focuses on classifying entity mentions into one of the 13 predefined categories.
Participants were provided with PubMed abstracts related to the gut-brain axis and asked to identify
specific text spans corresponding to one of the 13 categories defined in Table 1.</p>
        <p>Each entity mention consists of the following elements:
• Location, indicating whether the entity mention appears in the title or in the abstract.
• Start and end indices, denoting character offsets of the entity mention within the text.
• Text span, representing the actual string of text corresponding to the mention.
• Label, specifying the entity label assigned to the mention.</p>
        <p>A predicted entity mention is considered correct only if all its fields exactly match an entry in the
ground truth.</p>
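        <p>For illustration, this exact-match criterion can be sketched in Python as follows (an illustrative sketch using the field names of the run format, not the official evaluation code):</p>
        <preformat>
```python
# Exact-match criterion for NER (illustrative sketch; field names follow
# the run format, this is not the official evaluation code).
def to_tuple(mention):
    """Normalize an entity mention dict into a hashable tuple of all fields."""
    return (mention["location"], mention["start_idx"], mention["end_idx"],
            mention["text_span"], mention["label"])

def exact_match_tp(predicted, ground_truth):
    """Count predictions whose fields all exactly match a gold mention."""
    gold = {to_tuple(m) for m in ground_truth}
    return sum(1 for m in predicted if to_tuple(m) in gold)

gold = [{"location": "title", "start_idx": 75, "end_idx": 82,
         "text_span": "patients", "label": "human"}]
pred = [{"location": "title", "start_idx": 75, "end_idx": 82,
         "text_span": "patients", "label": "human"},
        {"location": "title", "start_idx": 75, "end_idx": 82,
         "text_span": "patients", "label": "animal"}]  # wrong label: not a TP
print(exact_match_tp(pred, gold))  # 1
```
        </preformat>
        <p>Any mismatch in a single field (e.g., a correct span with a wrong label) yields no credit for that prediction.</p>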
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Subtask 2: Binary Tag-based Relation Extraction (BT-RE) Subtask</title>
        <p>The BT-RE subtask is one of the three GutBrainIE-2025 subtasks dealing with RE. In this subtask,
participants have to determine which pairs of entities are in relation within a document, considering
the set of relations defined in Table 2.</p>
        <p>Within BT-RE, participants are not required to predict a relation predicate. Therefore, a predicted
relation for this subtask will be a pair (subjectEntityLabel; objectEntityLabel), where entity labels are
taken from the ones reported in Table 1.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Subtask 3: Ternary Tag-based Relation Extraction (TT-RE) Subtask</title>
        <p>The TT-RE subtask complements BT-RE by requiring participants to predict, along with the pair of
entities in relation, the predicate of the relation holding among them. As in BT-RE, the set of relations
to be considered is reported in Table 2.</p>
        <p>Predicted relations for TT-RE will be triples (subjectEntityLabel; relationPredicate; objectEntityLabel).</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Subtask 4: Ternary Mention-based Relation Extraction (TM-RE) Subtask</title>
        <p>
          The TM-RE subtask is, among the three RE subtasks, the one most aligned with the standard NLP task
of Relation Extraction [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Here, participants are required to identify the entity mentions involved in a
relation, predict their entity labels, and specify the relation predicate that links them.
        </p>
        <p>Predicted relations for TM-RE will be tuples (subjectEntityTextSpan; subjectEntityLabel;
relationPredicate; objectEntityTextSpan; objectEntityLabel).</p>
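        <p>The three RE subtasks form a strict hierarchy: each TM-RE tuple determines a TT-RE triple by dropping the text spans, and each TT-RE triple determines a BT-RE pair by additionally dropping the predicate. A minimal Python sketch of this derivation (illustrative field names, not official tooling):</p>
        <preformat>
```python
# Derive TT-RE triples and BT-RE pairs from TM-RE tuples (illustrative sketch).
def to_ternary_tags(mention_relations):
    """Drop text spans, keep unique (subject_label, predicate, object_label)."""
    return sorted({(r["subject_label"], r["predicate"], r["object_label"])
                   for r in mention_relations})

def to_binary_tags(mention_relations):
    """Drop the predicate as well, keep unique (subject_label, object_label)."""
    return sorted({(r["subject_label"], r["object_label"])
                   for r in mention_relations})

tm = [{"subject_text_span": "intestinal microbiome", "subject_label": "microbiome",
       "predicate": "located in",
       "object_text_span": "patients", "object_label": "human"}]
print(to_ternary_tags(tm))  # [('microbiome', 'located in', 'human')]
print(to_binary_tags(tm))   # [('microbiome', 'human')]
```
        </preformat>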
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The released dataset for GutBrainIE-2025 consists of titles and abstracts of biomedical articles retrieved
from PubMed, focusing on the gut-brain axis and its implications in neurological and mental health.
Articles were manually annotated, either by experts or trained students,2 for entity mentions (i.e., text
spans mapped to one of the categories defined in Table 1) and relations (i.e., associations between
entities defined in Table 2).
2The students we are referring to are enrolled in the Master Degree in Modern Languages for International Communication
and Cooperation of the University of Padua. They received specific training on medical terminology during the course of
Translation-Oriented Terminography.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Creation</title>
        <p>To build the GutBrainIE-2025 dataset, we first retrieved documents from PubMed using two separate
queries: "gut microbiota" AND "Parkinson" and "gut microbiota" AND "Mental Health". The
first retrieval was performed on 09/05/2024 and yielded 828 documents. A second retrieval using the
same queries was conducted on 31/10/2024, resulting in 834 additional documents not included in the
first batch. We then filtered out documents from the years 2014–2019 (for the “Mental Health” query)
and 2013–2020 (for the “Parkinson” query) due to the limited volume of relevant literature in those
periods, discarding 16 documents in total. The final collection includes 1,647 documents.</p>
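        <p>Such retrievals can be reproduced through the NCBI E-utilities API; the following sketch only constructs the esearch query URL for one of the two queries (the retmax value is illustrative):</p>
        <preformat>
```python
from urllib.parse import urlencode

# Build an NCBI E-utilities esearch URL for one of the two PubMed queries
# used to assemble the corpus (illustrative; retmax chosen arbitrarily).
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {"db": "pubmed",
          "term": '"gut microbiota" AND "Parkinson"',
          "retmax": 1000}
url = BASE + "?" + urlencode(params)
print(url)
```
        </preformat>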
        <p>
          Before starting manual annotation, documents were pre-annotated for NER using GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] in a zero-shot setting, aiming to speed up and facilitate the annotation process. We decided not to
pre-annotate documents for RE since, in a zero-shot setting, the likelihood of introducing noise was
significantly higher than that of adding valid relations. Excessive noise in pre-annotations could lead to
biases among annotators, ultimately impacting the quality of the final annotated dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>The 13 entity categories of GutBrainIE-2025, with their ontology identifiers and explanations.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Category</th><th>Identifier</th><th>Explanation</th></tr>
            </thead>
            <tbody>
              <tr><td>Anatomical Location</td><td/><td>Named locations of or within the body.</td></tr>
              <tr><td>Animal</td><td>NCIT_C14182</td><td>A non-human living organism that has membranous cell walls, requires oxygen and organic foods, and is capable of voluntary movement, as distinguished from a plant or mineral.</td></tr>
              <tr><td>Biomedical Technique</td><td>NCIT_C15188</td><td>Research concerned with the application of biological and physiological principles to clinical medicine.</td></tr>
              <tr><td>Bacteria</td><td>NCBITaxon_2</td><td>One of the three domains of life (the others being Eukarya and Archaea), also called Eubacteria. They are unicellular prokaryotic microorganisms which generally possess rigid cell walls, multiply by cell division, and exhibit three principal forms: round or coccal, rodlike or bacillary, and spiral or spirochetal.</td></tr>
              <tr><td>Chemical</td><td>CHEBI_59999</td><td>A chemical substance is a portion of matter of constant composition, composed of molecular entities of the same type or of different types. This category also includes metabolites, which in biochemistry are the intermediate or end product of metabolism, and neurotransmitters, which are endogenous compounds used to transmit information across the synapses.</td></tr>
              <tr><td>Dietary Supplement</td><td>MESH_68019587</td><td>Products in capsule, tablet or liquid form that provide dietary ingredients, and that are intended to be taken by mouth to increase the intake of nutrients. Dietary supplements can include macronutrients, such as proteins, carbohydrates, and fats; and/or micronutrients, such as vitamins, minerals, and phytochemicals.</td></tr>
              <tr><td>Disease, Disorder, or Finding (DDF)</td><td>NCIT_C7057</td><td>A condition that is relevant to human neoplasms and non-neoplastic disorders. This includes observations, test results, history and other concepts relevant to the characterization of human pathologic conditions.</td></tr>
              <tr><td>Drug</td><td>CHEBI_23888</td><td>Any substance which when absorbed into a living organism may modify one or more of its functions. The term is generally accepted for a substance taken for a therapeutic purpose, but is also commonly used for abused substances.</td></tr>
              <tr><td>Food</td><td>NCIT_C1949</td><td>A substance consumed by humans and animals for nutritional purposes.</td></tr>
              <tr><td>Gene</td><td>SNOMEDCT_67261001</td><td>A functional unit of heredity which occupies a specific position on a particular chromosome and serves as the template for a product that contributes to a phenotype or a biological function.</td></tr>
              <tr><td>Human</td><td>NCBITaxon_9606</td><td>Members of the species Homo sapiens.</td></tr>
              <tr><td>Microbiome</td><td>OHMI_0000003</td><td>This term refers to the entire habitat, including the microorganisms (bacteria, archaea, lower and higher eukaryotes, and viruses), their genomes (i.e., genes), and the surrounding environmental conditions.</td></tr>
              <tr><td>Statistical Technique</td><td>NCIT_C19044</td><td>A method of calculating, analyzing, or representing statistical data.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Articles were then distributed between expert and student annotators. In total, 7 experts and 26
students annotated documents. Documents from the first retrieval were annotated exclusively by
experts, while those from the second retrieval were assigned to students.</p>
        <p>The annotation process was conducted in two phases, each followed by iterative refinement. At
the end of each phase, expert annotators conducted a meeting to review progress, discuss critical
challenges noted during the annotation phase, and make any necessary adjustments to the guidelines.
These guidelines, publicly available at https://hereditary.dei.unipd.it/challenges/gutbrainie/2025/files/
GutBrainIE_2025_Annotation_Guidelines.pdf, were also shared with task participants so they could
better tailor and tune their systems.</p>
        <p>
          Once manual annotation was completed, we fine-tuned GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for NER and ATLOP [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for RE using the annotated entities and relations and used them to annotate the remaining unannotated
documents from both batches of the original retrieval. More detailed information about the fine-tuning
of these models can be found in Section 4.4.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Folds</title>
        <p>The training set is divided into four parts:
1. Platinum Collection: highest-quality annotations, expert-curated and revised internally by a
subgroup of annotators to ensure consistency, uniformity, and alignment with the final annotation
guidelines;
2. Gold Collection: high-quality annotations, expert-curated and produced after the finalization of
the annotation guidelines. No subsequent revision was performed;
3. Silver Collection: mid-quality annotations, created by trained students under expert supervision.
Students were divided into two clusters:
• StudentA, including those with more consistent annotation performance;
• StudentB, including those with less consistent annotation performance;
4. Bronze Collection: automatically generated annotations obtained using fine-tuned GLiNER (for
NER) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and fine-tuned ATLOP (for RE) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. No manual revision was performed on this subset.</p>
        <p>The development and test sets are held-out selections of documents from the gold and platinum
collections, selected to ensure full representativeness and coverage of all entity and relation types.</p>
        <p>3.3. Dataset Format
Annotations are provided in JSON format. Each entry corresponds to a PubMed article, keyed by its
PubMed ID (PMID), and contains the following fields:
• Metadata: article-level information, including:
– title, author, journal, year, abstract;
– annotator_id: one of expert_1–expert_7, student_A, student_B, or distant
(automatically generated). Participants may decide to filter or weight examples differently based on
the annotator.
• Entities: an array of objects, each with:
– start, end: character offsets of the text span associated with the entity mention;
– location: “title” or “abstract”;
– text_span: the actual text span of the mention;
– label: the annotated entity label (such as bacteria, microbiome).
• Relations: an array of objects representing relations, each with:
– subject_start, subject_end, subject_location, subject_text_span, subject_label: the subject entity
mention;
– predicate and the corresponding object_* fields describing the object entity mention.</p>
        <p>3.3.1. Alternative Dataset Formats
For users preferring tabular data, each field above is also provided in both CSV and TSV formats:
• metadata.csv — metadata.tsv
• entities.csv — entities.tsv
• relations.csv — relations.tsv
• binary_tag_relations.csv — binary_tag_relations.tsv
• ternary_tag_relations.csv — ternary_tag_relations.tsv
• ternary_mention_relations.csv — ternary_mention_relations.tsv</p>
        <p>CSV files use the pipe symbol ( |) as a delimiter, while TSV files use the tab character ( \t).</p>
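        <p>As an illustration, the JSON annotations can be loaded and filtered by annotator in a few lines (the in-line example data below is fabricated; field names follow the format described above):</p>
        <preformat>
```python
import json

# Keep only expert-annotated documents from the parsed annotation JSON
# (the example data below is fabricated; field names follow Section 3.3).
def expert_docs(data):
    return {pmid: doc for pmid, doc in data.items()
            if doc["metadata"]["annotator_id"].startswith("expert")}

data = json.loads('''{
  "34870091": {"metadata": {"annotator_id": "expert_1"}, "entities": [], "relations": []},
  "12345678": {"metadata": {"annotator_id": "distant"},  "entities": [], "relations": []}
}''')
print(sorted(expert_docs(data)))  # ['34870091']
```
        </preformat>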
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Participation and Evaluation</title>
      <p>This section provides a concise overview of the teams that participated in GutBrainIE-2025. A
comprehensive description of the submitted systems can be found in Section 6 and in the participants’
individual papers reported in Table 5.</p>
      <p>Teams could participate in any of the four subtasks independently and submit up to 25 runs per
subtask.</p>
      <p>Although 85 teams from 29 different countries registered for the challenge, the final number of
teams submitting at least one run was 17, resulting in 395 submitted runs. Among these, 15 teams
also submitted a participant paper describing their methodologies, approaches, and systems. However,
the discussion presented in Section 6 includes all 17 teams that submitted at least one run. Table 4
summarizes participation across the various subtasks.</p>
      <p>The task began on February 3, 2025, with the release of the training and development sets. The test
set was made available on April 28, and final submissions were due by May 10.</p>
      <sec id="sec-4-1">
        <title>4.1. Guidelines</title>
        <p>Participating teams were required to satisfy the following guidelines:
• Runs should be submitted in the JSON format described below;
• Each team can submit a maximum of 25 runs per subtask.
4.1.1. Subtask 1 (NER) Run Format
Runs must be submitted as a JSON file ( .json) with the following structure:
"34870091": {
  "entities": [
    {
      "start_idx": 75,
      "end_idx": 82,
      "location": "title",
      "text_span": "patients",
      "label": "human"
    },
    {
      "start_idx": 250,
      "end_idx": 270,
      "location": "abstract",
      "text_span": "intestinal microbiome",
      "label": "microbiome"
    }
  ]
}
where:
• The top‐level key (e.g. “34870091”) is the PubMed ID of the document.
• entities is a list of entity objects.
• Each entity object represents a predicted entity and contains:
– start_idx and end_idx: character offsets of the span,
– location: “title” or “abstract”,
– text_span: the actual text,
– label: the entity type.
4.1.2. Subtask 2 (BT-RE) Run Format
Runs must be submitted as a JSON file ( .json) with the following structure:
"34870091": {
  "binary_tag_based_relations": [
    {
      "subject_label": "microbiome",
      "object_label": "human"
    }
  ]
}
where:
• The top‐level key (e.g. “34870091”) is the PubMed ID of the document.
• binary_tag_based_relations is a list of relation objects.
• Each relation object represents a predicted binary tag-based relation and contains:
– subject_label: the entity type of the relation’s subject,
– object_label: the entity type of the relation’s object.
4.1.3. Subtask 3 (TT-RE) Run Format
Submissions must be provided as a JSON file ( .json) with the following structure:
"34870091": {
  "ternary_tag_based_relations": [
    {
      "subject_label": "microbiome",
      "predicate": "located in",
      "object_label": "human"
    }
  ]
}
where:
• The top-level key (e.g. “34870091”) is the PubMed ID of the document.
• ternary_tag_based_relations is a list of relation objects.
• Each relation object represents a predicted ternary tag-based relation and contains:
– subject_label: the entity type of the relation’s subject,
– predicate: the relation type between the subject and object,
– object_label: the entity type of the relation’s object.
4.1.4. Subtask 4 (TM-RE) Run Format
Submissions must be provided as a JSON file ( .json) with the following structure:
"34870091": {
  "ternary_mention_based_relations": [
    {
      "subject_text_span": "intestinal microbiome",
      "subject_label": "microbiome",
      "predicate": "located in",
      "object_text_span": "patients",
      "object_label": "human"
    }
  ]
}
where:
• The top-level key (e.g. “34870091”) is the PubMed ID of the document.
• ternary_mention_based_relations is a list of relation objects.
• Each relation object represents a predicted ternary mention-based relation and contains:
– subject_text_span: the exact character sequence of the subject mention,
– subject_label: the entity type of the subject mention,
– predicate: the relation type between the subject and object,
– object_text_span: the exact character sequence of the object mention,
– object_label: the entity type of the object mention.
4.1.5. Submission Upload
All runs must be submitted as a single ZIP archive named &lt;teamID&gt;_GutBrainIE_2025.zip.
Within this archive, each run has to be placed in its own folder named
&lt;teamID&gt;_&lt;taskID&gt;_&lt;runID&gt;_&lt;systemDesc&gt; (without spaces or special characters), where:
• &lt;teamID&gt; is the name of the participating team;
• &lt;taskID&gt; is the identifier of the subtask the run is being submitted to (one of T61 for NER, T621
for BT-RE, T622 for TT-RE, or T623 for TM-RE);
• &lt;runID&gt; is a unique alphanumeric string (a–z, A–Z, 0–9) chosen by the team to distinguish among
their runs;
• &lt;systemDesc&gt; is an optional short label describing the system.</p>
        <p>Each run folder is required to contain exactly two files:
• &lt;teamID&gt;_&lt;taskID&gt;_&lt;runID&gt;_&lt;systemDesc&gt;.json
• &lt;teamID&gt;_&lt;taskID&gt;_&lt;runID&gt;_&lt;systemDesc&gt;.meta
The .json file holds the team’s predictions for the specified subtask on the test set. The accompanying
.meta file must include the following information:
• Team ID, Task ID, and Run ID;
• Type of training applied;
• Pre‐processing methods;
• Training data used;
• Relevant details of the run;
• A link to a public repository enabling reproducibility.</p>
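        <p>The packaging scheme above can be sketched as follows (team, run, and system identifiers are placeholders):</p>
        <preformat>
```python
import os, zipfile

# Assemble one run folder and the submission ZIP following the naming scheme
# (team/run/system identifiers below are placeholders).
team, task, run, desc = "MyTeam", "T61", "run1", "EnsembleBERT"
stem = f"{team}_{task}_{run}_{desc}"
os.makedirs(stem, exist_ok=True)
with open(os.path.join(stem, stem + ".json"), "w") as f:
    f.write("{}")  # the predictions for the test set go here
with open(os.path.join(stem, stem + ".meta"), "w") as f:
    f.write("Team ID: MyTeam\n")  # plus task/run IDs, training details, repo link
with zipfile.ZipFile(f"{team}_GutBrainIE_2025.zip", "w") as z:
    for name in sorted(os.listdir(stem)):
        z.write(os.path.join(stem, name))
print(sorted(os.listdir(stem)))
```
        </preformat>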
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Participants</title>
        <p>A total of 85 teams registered for the GutBrainIE-2025 task, of which 17 submitted at least one run and
thus participated in the evaluation.</p>
        <p>In total, 391 runs were submitted: 101 for NER, 100 for BT-RE, and 95 each for TT-RE and TM-RE.
Table 4 shows which tasks each team participated in and how many runs they submitted, while Table 5
reports their affiliations, countries of origin, and associated resources.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation</title>
        <p>All submitted runs are evaluated using the standard IE metrics of precision (\(P\)), recall (\(R\)), and F1‐score (\(F_1\)),
assessed with both macro‐ and micro‐averaging. The same metrics apply to all four subtasks.</p>
        <p>Let \(TP_\ell\), \(FP_\ell\), and \(FN_\ell\) denote, respectively, the number of true positives, false positives, and false
negatives for label \(\ell\). We define the label set \(\mathcal{L}\) as:
• for subtask 1 (NER): the set of entity types;
• for subtask 2 (BT-RE): the set of pairs (subject label, object label);
• for subtasks 3 and 4 (TT-RE and TM-RE): the set of triples (subject label, predicate, object label).</p>
        <p>The macro-averaged metrics are computed as:
\(P_{\mathrm{macro}} = \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}} \frac{TP_\ell}{TP_\ell + FP_\ell}\), (1a)
\(R_{\mathrm{macro}} = \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}} \frac{TP_\ell}{TP_\ell + FN_\ell}\), (1b)
\(F_{1,\mathrm{macro}} = \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}} \frac{2 P_\ell R_\ell}{P_\ell + R_\ell}\), (1c)
where \(P_\ell\) and \(R_\ell\) denote the precision and recall for label \(\ell\).</p>
        <p>The micro‐averaged metrics aggregate counts before division:
\(P_{\mathrm{micro}} = \frac{\sum_{\ell \in \mathcal{L}} TP_\ell}{\sum_{\ell \in \mathcal{L}} (TP_\ell + FP_\ell)}\), (2a)
\(R_{\mathrm{micro}} = \frac{\sum_{\ell \in \mathcal{L}} TP_\ell}{\sum_{\ell \in \mathcal{L}} (TP_\ell + FN_\ell)}\), (2b)
\(F_{1,\mathrm{micro}} = \frac{2 P_{\mathrm{micro}} R_{\mathrm{micro}}}{P_{\mathrm{micro}} + R_{\mathrm{micro}}}\). (2c)</p>
        <p>For each subtask, the micro‐averaged F1‐score (Eq. 2c) is adopted as the reference metric for the final
leaderboard.</p>
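        <p>The macro- and micro-averaged metrics can be sketched as follows, starting from per-label true-positive, false-positive, and false-negative counts (an illustrative computation, not the official evaluation script):</p>
        <preformat>
```python
# Macro- and micro-averaged precision/recall/F1 over per-label counts
# (illustrative sketch, not the official evaluation script).
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro(counts):
    """counts: dict mapping label to (tp, fp, fn); returns (macro, micro) P/R/F1."""
    per_label = [prf(*c) for c in counts.values()]
    n = len(per_label)
    macro = tuple(sum(vals) / n for vals in zip(*per_label))  # average per-label scores
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, prf(tp, fp, fn)  # micro aggregates counts before division

counts = {"bacteria": (8, 2, 2), "human": (1, 1, 3)}
macro, micro = macro_micro(counts)
print(round(micro[2], 3))  # 0.692
```
        </preformat>
        <p>Note how the micro average is dominated by frequent labels, while the macro average weights all labels equally.</p>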
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Baseline</title>
        <p>To support participants and provide a reference for performance evaluation, we developed a baseline
system for all four GutBrainIE subtasks. This system is the same one used to generate the automatic
annotations included in the Bronze fold of the training set (see Section 3.2).</p>
        <p>
          The system consists of two independent modules: a NER module based on GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and a RE
module based on ATLOP [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The NER module employs GLiNER, a bidirectional transformer encoder
trained for instruction-based named entity recognition [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We used the NuNERZero checkpoint [24]
and fine-tuned the model on the Platinum, Gold, and Silver portions of the training data, applying a
confidence threshold of 0.6. After inference, we merged predicted entities having adjacent or overlapping
spans.
        </p>
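        <p>The merging of adjacent or overlapping predicted spans can be sketched as follows (illustrative; the baseline may resolve the labels of merged spans differently, here the label of the first span in a group is kept):</p>
        <preformat>
```python
# Merge adjacent or overlapping predicted spans (illustrative sketch;
# the label of the first span in a merged group is kept).
def merge_spans(spans):
    """spans: list of dicts with start_idx, end_idx, label; returns merged list."""
    spans = sorted(spans, key=lambda s: (s["start_idx"], s["end_idx"]))
    merged = []
    for s in spans:
        # adjacent or overlapping with the previous merged span
        if merged and merged[-1]["end_idx"] + 1 >= s["start_idx"]:
            merged[-1]["end_idx"] = max(merged[-1]["end_idx"], s["end_idx"])
        else:
            merged.append(dict(s))
    return merged

pred = [{"start_idx": 10, "end_idx": 15, "label": "microbiome"},
        {"start_idx": 16, "end_idx": 26, "label": "microbiome"},
        {"start_idx": 40, "end_idx": 45, "label": "human"}]
print(merge_spans(pred))
```
        </preformat>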
        <p>
          The RE module uses ATLOP, a document-level relation extraction model that employs localized
context pooling and adaptive thresholding [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. ATLOP receives the document text and the entities
predicted by the NER module and predicts relational triples within each document. The resulting
relations are filtered to exclude any relation not listed in Table 2. For fine-tuning, we used the Platinum, Gold,
and Silver collections as the manually annotated sets, and the Bronze collection as the distantly supervised
annotated set.
        </p>
        <p>Table 6 reports, for each participating team, the number of submitted runs that surpassed the baseline
system out of the total number of runs submitted for each subtask (considering the micro-averaged F1
score as the reference metric).</p>
        <p>The code implementing the baseline system is available at the following GitHub repository:
https://github.com/MMartinelli-hub/GutBrainIE_2025_Baseline.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section presents the performance results for each subtask, based on the evaluation metrics described
in Section 4.3.</p>
      <p>For each subtask, we report the leaderboard tables showing the best-performing run per team, ranked
by micro-averaged F1 score. Complete scores for every submitted run can be found in the appendix.</p>
      <sec id="sec-5-1">
        <title>5.1. Subtask 1 (NER) Results</title>
        <p>Most participating teams in the NER subtask adopted supervised fine-tuning of transformer-based
models pre-trained on large-scale biomedical corpora, with the most employed ones being PubMedBERT
[25], BioBERT [26], BioLinkBERT [27], and ELECTRA [28]. In addition to these, several teams employed
GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] fine-tuned on the training data. Ensemble approaches were widely utilized to improve
effectiveness, often combining models trained with different data, seeds, and configurations.</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption>
            <p>Number of submitted runs surpassing the baseline system. For each team and subtask, the table reports the
number of submitted runs that achieved a higher micro-averaged F1 score than the baseline system out of the
total number of runs submitted.</p>
          </caption>
        </table-wrap>
        <table-wrap id="tab7">
          <label>Table 7</label>
          <caption>
            <p>Performance metrics of each team’s top run for NER. For each evaluation metric, the best result is in bold.</p>
          </caption>
          <table>
            <thead>
              <tr><th>team_id</th><th>run_id</th><th>system_desc</th></tr>
            </thead>
            <tbody>
              <tr><td>GutUZH</td><td>2</td><td>AugEnsemble</td></tr>
              <tr><td>Gut-Instincts</td><td>5eedev</td><td>ensemble1</td></tr>
              <tr><td>NLPatVCU</td><td>ensemble1</td><td>th10</td></tr>
              <tr><td>ICUE</td><td>ensemble5</td><td>EnsembleBERT</td></tr>
              <tr><td>LYX-DMIIP-FDU</td><td>run1</td><td>tranformer</td></tr>
              <tr><td>ata2425ds</td><td>trf</td><td>llmner</td></tr>
              <tr><td>greenday</td><td>1</td><td>NERWise</td></tr>
              <tr><td>Graphwise-1</td><td>13</td><td>NuNerZero-Finetuned</td></tr>
              <tr><td>BASELINE</td><td>Organizers</td><td/></tr>
              <tr><td>ataupd2425-gainer</td><td>ma</td><td>trainplatinumandgold</td></tr>
              <tr><td>DS@GT-bioasq-task6</td><td>1</td><td>glinerbiomed</td></tr>
              <tr><td>DS@GT-BioNER</td><td>run2</td><td>pubmedbert</td></tr>
              <tr><td>ataupd2425-pam</td><td>3</td><td>biosyn-sapbert-bc2gn-12</td></tr>
              <tr><td>Schemalink</td><td>1</td><td>SchemaBasedMultiPrompt</td></tr>
              <tr><td>BIU-ONLP</td><td>3</td><td>3_gliner_large_bio-v0.1</td></tr>
              <tr><td>lasigeBioTM</td><td>R1</td><td>BENTMistral</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>While the majority of teams used the platinum, gold, and silver folds, a few also included the noisier
bronze data, applying cleaning or re-weighting strategies. Some systems also incorporated additional
knowledge from external corpora or pseudo-labeled texts to enhance training coverage.</p>
        <p>A smaller number of teams experimented with prompt-based or zero-shot methods using Large
Language Models (LLMs). These approaches avoided traditional supervised learning and relied on
structured prompting and schema-guided extraction.</p>
        <p>Overall, systems that combined strong biomedical backbones with fine-tuning and ensemble strategies
tended to outperform others.</p>
        <p>[Table: Performance metrics of each team’s top run for BT-RE. For each evaluation metric, the best result
is in bold, the second-best is underlined. Per-team entries are not recoverable from this extraction.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Subtask 2 (BT-RE) Results</title>
        <p>A discussion of the systems and methodologies employed for BT-RE is provided in the section dedicated
to the TM-RE subtask (see Section 5.4), which offers an overview valid across all RE subtasks.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Subtask 3 (TT-RE) Results</title>
        <p>[Table: Performance metrics of each team’s top run for TT-RE. For each evaluation metric, the best result
is in bold, the second-best is underlined. Entries in ranking order (team_id, run_id, system_desc); metric
values are not recoverable from this extraction.
Gut-Instincts 6229eedev3re
ataupd2425-pam B7 RE-BiomedNLP-3NoRel-1epoch-COMPLETE_DATASET
ONTUG union ElectraCLEANR
Graphwise-1 105 AtlopOnto
ICUE run22 biolinkbertl_pp
BIU-ONLP 4 RobertaLarge
BASELINE Organizers Atlop-Finetuned
NLPatVCU C19 mixedCNNWLabModel4Preds
LYX-DMIIP-FDU run1 BioLinkBERT
Schemalink 1 gpt4re
ataupd2425-gainer td trainplatinumandgold
ToGS hermes8bragreorder CLEANR
lasigeBioTM R1 ConstParsing]</p>
        <p>A discussion of the systems and methodologies employed for TT-RE is provided in the section
dedicated to the TM-RE subtask (see Section 5.4), which offers an overview valid across all RE subtasks.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Subtask 4 (TM-RE) Results</title>
        <p>[Table: Performance metrics of each team’s top run for TM-RE. For each evaluation metric, the best result
is in bold, the second-best is underlined. Entries in ranking order (team_id, run_id, system_desc); metric
values are not recoverable from this extraction.
Gut-Instincts 6239eedev3re
Graphwise-1 107 AtlopOnto
ICUE run23 biolinkbertl_pp
LYX-DMIIP-FDU run1 BioLinkBERT
ONTUG union ElectraCLEANR
BASELINE Organizers Atlop-Finetuned
Schemalink 1 gpt4re
ataupd2425-pam C7 RE-BiomedNLP-3NoRel-1epoch-COMPLETE_DATASET
ataupd2425-gainer tms trainplatinumandgold
NLPatVCU C11 ensembleWLabModel4Preds
BIU-ONLP 4 RobertaLarge
ToGS hermes3bloraragreorder CLEANR
lasigeBioTM R1 ConstParsing]</p>
        <p>Most participating teams approached RE as a supervised classification task, using fine-tuned
biomedical transformers such as BioBERT [26], BioLinkBERT [27], PubMedBERT [25], and BioMedElectra [28].
Entity pairs were detected via upstream NER modules, explicitly marked in input texts and used to
generate relation-specific training instances.</p>
        <p>Some teams tackled RE at the document level, incorporating sampling strategies (e.g., negative
sampling, class-weighted losses) and architectural enhancements (e.g., query-based encoders, hypergraph
neural networks) to better capture long-tail relations. Data augmentation, input filtering, and relation
predicate-based constraints were also employed to refine candidate relation sets.</p>
        <p>Ensemble techniques, including majority voting and model fusion, were used by several
top-performing teams to improve systems’ effectiveness across the three RE subtasks.</p>
        <p>Few teams experimented with prompt-based or zero-shot approaches using LLMs guided by structured
templates or relation schemas, without any form of supervised training or fine-tuning.</p>
        <p>Overall, the most effective submissions combined strong biomedical encoders with supervised
fine-tuning and ensemble mechanisms.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>This section provides an overview of the approaches adopted by participating teams in the
GutBrainIE-2025 task. We organize the discussion into two subsections: one dedicated to NER (subtask 1) and
another covering the RE subtasks (subtasks 2, 3, and 4).</p>
      <sec id="sec-6-1">
        <title>6.1. Subtask 1 (NER) Discussion</title>
        <p>Han et al. [18] (Team GutUZH) fine-tuned a BioMedBERT model [29] augmented with a Conditional
Random Field (CRF) layer to improve label dependency modeling [30]. Titles and abstracts were
processed separately, with special tokens ([TITLE], [ABSTRACT]) used to mark structural components
of the documents.</p>
        <p>The team experimented with multiple runs involving data augmentation and model ensembling. In
one setup, they pseudo-labeled 500 additional abstracts using ensemble predictions, integrating them
into a second training phase. Another variant trained on the full labeled set, including also
bronze-quality annotations, while a final run retrained the top-performing model using only manual annotations
(platinum, gold, silver sets) to reinforce the patterns learned from the most reliable examples.</p>
        <p>Training employed weighted loss functions for class imbalance, mixed-precision optimization, and
early stopping based on entity-level F1 score [31, 32]. Inference relied on Viterbi decoding [33], with
evaluation using the seqeval library [34].</p>
        <p>Andersen et al. [17] (Team Gut-Instincts) built a large ensemble system integrating multiple
biomedical transformers, including BioLinkBERT [27], BioMedBERT [29], and BioMedElectra [28], with different
decoding heads (dense layers, CRFs, LSTM-CRFs). In their runs, they combined from 3 to 17 models.</p>
        <p>All available training data were used, including a cleaned version of the silver and bronze sets and, in
some runs, also the development set.</p>
        <p>Preprocessing included boundary corrections using manually crafted dictionaries, while training
involved class-weighted losses to give more importance to high-quality data during optimization and
a custom learning rate scheduler. Post-processing rules were used to merge overlapping or adjacent
entities.</p>
        <p>Taylor et al. [22] (Team NLPatVCU) submitted ensembles of fine-tuned GLiNER models specialized for
biomedical NER [35]. These models differed in pretraining sources, training subsets, and configuration
parameters.</p>
        <p>
          Training data included all annotation tiers, and some models were additionally pretrained on external
corpora such as BC5CDR [36]. To improve training stability, the team adopted GLiNER’s probabilistic
masking mechanism [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], selectively ignoring potentially mislabeled non-entity spans during training.
In addition, focal loss was used to emphasize harder examples and counter class imbalance.
        </p>
        <p>Ensemble predictions were constructed by combining the outputs of the three models. Model 1,
based on GLiNER-BioMed [35], was trained on all annotation tiers and served as the primary model;
Model 2 introduced a two-stage training pipeline with initial fine-tuning on BC5CDR [36] to improve
performance on disease-like entities; Model 3 reused the same training data as Model 1 but employed
diferent focal loss parameters to adjust class sensitivity. Post-processing involved per-entity confidence
thresholds and merging rules derived heuristically from the development set.</p>
        <p>Lee et al. [19] (Team ICUE) explored both token classification and span-based approaches. Their
primary models were transformer-based classifiers using IOB2 tagging [37] and ensembled predictions
across 11 models trained separately with variations in architectural choices and span manipulation
strategies.</p>
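        <p>The IOB2 scheme cited above can be illustrated with a minimal sketch (the token/span offset conventions are assumptions for illustration, not the team’s code):</p>
        <p>
```python
# A minimal illustration of the IOB2 scheme: the first token of an
# entity is tagged "B-" plus the type, subsequent entity tokens "I-",
# and all remaining tokens "O". Offsets are assumed end-inclusive-free,
# i.e. a token is inside a span when fully covered by it.

def to_iob2(tokens, spans):
    """tokens: (start, end) char offsets; spans: (start, end, label) entities."""
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        first = True
        for i, (t_start, t_end) in enumerate(tokens):
            if t_start >= s_start and s_end >= t_end:  # token lies inside the span
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags
```
        </p>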
        <p>Training data comprised platinum, gold, and silver sets, with preprocessing involving token alignment,
label assignment, and filtering based on entity presence.</p>
        <p>The team employed BioLinkBERT [27] and PubMedBERT [25] as models, while span strategies
included union-span and bigger-span [38]. Some configurations further integrated PubTator annotations
as training data [39].</p>
        <p>Liu [21] (Team LYX-DMIIP-FDU) used a majority-vote ensemble of BioMedBERT [29], BioLinkBERT
[27], and a clinical variant of XLM-RoBERTa [40]. Each model was fine-tuned in a multi-task learning
setup, treating each entity class as a distinct prediction objective.</p>
        <p>Input annotations were converted to PubTator format before training [39]. Models were trained
on platinum, gold, silver, and development sets. During inference, span-level voting was applied to
determine final entity labels. Specifically, after separate inference with each model, they used the
average predicted probability of each token as the probability of each entity span, and filtered the
predicted entity spans based on the total probability across all models.</p>
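        <p>A minimal sketch of span-level voting in this spirit follows; the per-span probabilities and the acceptance threshold are assumed for illustration and are not taken from the team’s implementation:</p>
        <p>
```python
# Minimal sketch of span-level ensemble voting: sum each candidate
# span's probability across models and keep spans whose total clears a
# threshold (probabilities and threshold here are assumed values).

def vote_spans(model_predictions, min_total=1.5):
    """model_predictions: one {span: probability} dict per ensemble member."""
    totals = {}
    for preds in model_predictions:
        for span, prob in preds.items():
            totals[span] = totals.get(span, 0.0) + prob
    # keep spans whose summed probability across models clears the threshold
    return sorted(span for span, total in totals.items() if total >= min_total)
```
        </p>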
        <p>Team ata2425ds trained spaCy-based NER models using both static word embeddings and transformer
backbones [41].</p>
        <p>Two main pipelines were implemented: one using en_core_web_lg with tok2vec + NER layers, the
other based on en_core_web_trf with RoBERTa as the underlying encoder [42, 43]. Models were
trained on the full dataset, including bronze-quality annotations, with different tokenization and input
cleaning configurations.</p>
        <p>Preprocessing involved HTML tag removal using BeautifulSoup [44] and tokenization adjustments
to preserve annotated spans.</p>
        <p>Gupta et al. [16] (Team greenday) proposed a generation-based NER model by fine-tuning
GPT-4.1-mini [45] to perform entity annotation using inline text markers, following the approach adopted by
the GPT-NER framework [46].</p>
        <p>Training was conducted via OpenAI’s API on platinum and gold subsets, using specific prompts that
directly instructed entity tagging. The team experimented with zero- and few-shot settings, utilizing a
FAISS-based vector database of training examples for retrieval-augmented few-shot prompting [47, 48].</p>
        <p>Post-processing involved recovering token-level entity spans from the annotated output by resolving
discrepancies and misalignments introduced by inline annotations and hallucinations.</p>
        <p>
          Datseris et al. [15] (Team Graphwise-1) developed an ensemble approach combining fine-tuned
biomedical transformers, GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and data-augmentation strategies. Their pipeline integrates
BioBERT [26], ELECTRA-based models [28], and GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] fine-tuned on the full annotated dataset.
        </p>
        <p>
          To mitigate data imbalance in low-resource categories, they applied a data augmentation strategy
based on distant supervision. Specifically, they queried the PubMed API using MeSH-based queries
tailored to each entity type. Retrieved abstracts were then annotated using multiple NER systems,
including GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and BioBERT [26], and incorporated into an expanded bronze-quality collection.
        </p>
        <p>To further improve system robustness, the team experimented with spaCy pipelines enhanced with
domain-specific gazetteers [49].</p>
        <p>Ensemble predictions were constructed by selecting the best-performing model for each entity type.
Post-processing rules were applied to adjust entity boundaries based on systematic validation error
analysis.</p>
        <p>
          Piron et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (Team ataupd2425-gainer) trained GliNER-based models initialized from the
NuNer_Zero checkpoint [24]. Training variants explored different dataset combinations: platinum+gold,
platinum+gold+dev, and platinum+gold+silver.
        </p>
        <p>Preprocessing involved concatenating titles and abstracts, applying the DeBERTa-v3-large tokenizer
[50], and mapping entity offsets across fields. Training used cosine learning rate scheduling, fixed batch
size (2), and variable training steps (6k-12k depending on the setting).</p>
        <p>
          Mehta [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] (Team DS@GT-bioasq-task6) submitted a single run using the GLiNER-biomed checkpoint
fine-tuned on platinum, gold, and silver annotations [35].
        </p>
        <p>Post-processing involved a dictionary-based refinement using external biomedical lexicons to correct
low-confidence or invalid predictions.</p>
        <p>Team DS@GT-BioNER submitted three runs based on BioBERT [26] and PubMedBERT [25] models
fine-tuned on platinum, gold, and silver folds. All annotations were converted to BIO format before
training [51].</p>
        <p>The first and second runs used BioBERT and PubMedBERT individually, while the third run ensembled
their outputs. Models were trained with HuggingFace’s default settings.</p>
        <p>
          Pamio et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (Team ataupd2425-pam) explored CRF- and transformer-based models across ten
runs. Transformer models included BioBERT [26], BioMedBERT [29], NuNER [24], SapBERT [52], and
SciBERT [53].
        </p>
        <p>Some models were trained with class-weighted loss functions to address label imbalance. CRF-based
models used custom F1/F2 loss weighting strategies. For most of their submitted runs, models were
trained on the full dataset (all training and development sets), with data preprocessed by parsing entities
into token-label sequences.</p>
        <p>Team Schemalink applied a schema-driven in-context learning approach using OpenAI’s GPT-4o
[45]. No supervised training was employed.</p>
        <p>A LinkML schema derived from the ontology provided in the challenge materials was used to guide the
LLM [54], along with the incorporation of few-shot examples in the prompt. For each entity class, they
generated a separate prompt and used OpenAI’s response_format field to enforce structured extraction.
UTF-8 normalization was applied as a preprocessing step to improve model input compatibility.</p>
        <p>
          Keinan et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] (Team BIU-ONLP) fine-tuned five variants of GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] on platinum, gold, and
silver tiers. Preprocessing included lowercasing and space normalization.
        </p>
        <p>Models differed by GLiNER backbone (e.g., domain-specific or multilingual). All were trained with
the same hyperparameters: 384-token inputs, a learning rate of 5e-5, a batch size of 8, and 3k training
epochs. The confidence threshold was fixed to 0.9 to retain only highly reliable predictions.</p>
        <p>Conceição et al. [20] (Team LasigeBioTM) submitted two zero-shot runs using Mistral-7B [55].</p>
        <p>The first run used the BENT tool [56] to insert inline entity annotations with unique IDs and label
types, which were then passed to Mistral for processing. The second run applied Mistral directly to raw
texts without tagging. No fine-tuning or labeled data was used in either run.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Subtasks 2, 3, and 4 (RE) Discussion</title>
        <p>Andersen et al. [17] (Team Gut-Instincts) extended their ensemble-based approach to all three RE
subtasks. Their approach combined fine-tuned transformers (BioLinkBERT [27], BioMedBERT [29],
BioMedElectra [28]) with specific adaptations to accommodate task-specific output structures.</p>
        <p>To improve training quality, they cleaned the silver and bronze datasets by correcting or removing
entity spans with misalignments and filtering out documents having more than 100 relations annotated.
Candidate entity pairs were marked in input texts, and a 10:1 negative sampling ratio was used to
balance the training data.</p>
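        <p>The 10:1 negative-sampling step can be sketched as follows; the data representation and the fixed seed are illustrative assumptions:</p>
        <p>
```python
# Illustrative sketch of 10:1 negative sampling for RE training: keep
# every annotated (positive) pair and at most ten sampled negatives per
# positive. The pair representation and seed are assumptions.
import random

def sample_negatives(positives, negatives, ratio=10, seed=0):
    """Return all positives plus at most ratio * len(positives) negatives."""
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, k)
```
        </p>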
        <p>Training used class-weighted loss and a custom learning rate schedule. Final predictions were
generated via ensemble voting across the three top-performing models per configuration.</p>
        <p>Kantz et al. [23] (Team ToGS) submitted runs for all RE subtasks using a hybrid system combining
retrieval-augmented generation (RAG) [57], LoRA fine-tuning [58], transformer-based models such as
BioMedElectra [28] and Hermes-3 (LLaMA-3.2 3B and LLaMA-3.1 8B variants) [59], and prompting
with GPT-4o-mini [45].</p>
        <p>Prompts were dynamically built using training examples retrieved from a VectorDB and reordered
to prioritize high-quality (platinum and gold) annotations. In addition to prompting, LoRA-based
fine-tuning was applied to improve models’ specialization efficiently [58].</p>
        <p>Furthermore, Teams ToGS [23] and Graphwise-1 [15] submitted collaborative runs as Team ONTUG.
Here, the BiomedElectra [28] model was first fine-tuned on the binary relation extraction task (BT-RE)
and subsequently adapted for the mention-level task (TM-RE), leveraging shared entity representations
across subtasks.</p>
        <p>All models were trained for 100 epochs with the same hyperparameters: batch size of 2, gradient
accumulation of 2 steps, learning rate of 5e-5, and a warm-up ratio of 0.06. Two different output fusion
strategies (union and intersection) were evaluated to assess the impact of conservative and inclusive
inference fusion.</p>
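        <p>The two fusion strategies can be sketched over sets of predicted relation triples (an illustrative fragment, not the teams’ code):</p>
        <p>
```python
# The two output-fusion strategies over sets of predicted triples:
# union is the inclusive variant (higher recall), intersection the
# conservative one (higher precision). Illustrative sketch only.

def fuse(predictions, mode="union"):
    """predictions: list of sets of triples, one per model or run."""
    if not predictions:
        return set()
    result = set(predictions[0])
    for preds in predictions[1:]:
        result = result | preds if mode == "union" else result.intersection(preds)
    return result
```
        </p>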
        <p>Datseris et al. [15] (Team Graphwise-1) participated in all three RE subtasks, exploring
transformer- and encoder-based classifiers.</p>
        <p>For encoder models, BioMedElectra [28] and XLM-RoBERTa [40] were fine-tuned sequentially for
BT-RE, TT-RE, and TM-RE using consistent settings (up to 200 epochs, learning rate of 5e-5). Some
variants experimented with masked language modeling pre-training [60].</p>
        <p>They also employed fine-tuned REBEL-large [61] to perform end-to-end relation generation.</p>
        <p>
          Pamio et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (Team ataupd2425-pam) submitted models for all RE subtasks using transformer
classifiers trained on relation-centric instances.
        </p>
        <p>Entity mentions were extracted via upstream NER (e.g., SapBERT [52], NuNER [24]) and injected
into text using marker tokens. These instances were then used to fine-tune multiple RE models, trained
for one epoch on the full dataset (platinum, gold, silver, bronze, and development sets). No significant
run-specific modifications or hyperparameter variations were reported across submissions.</p>
        <p>
          Keinan et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] (Team BIU-ONLP) submitted twelve runs across all RE subtasks based on fine-tuning
ATLOP [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] on different language models, including SapBERT [52], BioBERT [26], and RoBERTa [43].
        </p>
        <p>Each model was trained using a standardized configuration (learning rate of 5e-5, batch size of 4, 500
training epochs, warmup ratio of 0.06).</p>
        <p>Only the platinum, gold, and silver folds were used. Preprocessing involved lowercasing and
whitespace normalization. No ensemble, augmentation, or post-processing strategies were applied.</p>
        <p>Liu [21] (Team LYX-DMIIP-FDU) used a unified binary classification approach across all RE subtasks.
Entity pairs were filtered by type compatibility and distance (&lt;200 characters) and formatted in PubTator
style with markers and contextual windows [39].</p>
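        <p>Candidate-pair construction of this kind can be sketched as follows. Only the 200-character window comes from the text above; the type-compatibility set is a hypothetical stub and the entity representation is assumed:</p>
        <p>
```python
# Sketch of candidate-pair filtering by type compatibility and distance.
# The 200-character window follows the text; the compatibility set is a
# hypothetical stub, and offsets are assumed 0-based, end-exclusive.

COMPATIBLE = {("bacteria", "DDF"), ("microbiome", "human")}  # stub pairs

def candidate_pairs(entities, max_dist=200, compatible=COMPATIBLE):
    """entities: dicts with 'start', 'end', 'type'."""
    pairs = []
    for a in entities:
        for b in entities:
            if a is b:
                continue
            gap = b["start"] - a["end"]  # characters separating the mentions
            if (a["type"], b["type"]) in compatible and max_dist > gap >= 0:
                pairs.append((a, b))
    return pairs
```
        </p>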
        <p>BioLinkBERT [27] was employed as the backbone, fine-tuned using platinum, gold, silver, and
development sets. The same model and pipeline were reused across all RE subtasks, with no task-specific
variation or augmentation.</p>
        <p>Taylor et al. [22] (Team NLPatVCU) explored two families of models: sentence-level CNN classifiers
[62] and document-level Hypergraph Neural Networks (HGNN) [63].</p>
        <p>CNNs were trained on sentences labeled with relations and sampled sentences with no relation, using
platinum, gold, and silver training datasets. Entity spans were derived from prior NER submissions,
and final outputs were aggregated via ensemble logic.</p>
        <p>HGNNs modeled entities and their interactions as nodes and hyperedges, using BioBERT embeddings
[26] and a hypergraph convolution layer [64]. The outputs obtained with these approaches supported
BT-RE and TT-RE predictions, but did not address TM-RE predictions.</p>
        <p>Team Schemalink used prompting-based approaches via OpenAI’s GPT-4o [45], operating in a fully
zero-shot setting.</p>
        <p>
          Entity mentions identified by GLiNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] were inserted into sentence-level prompts using custom
tags. Prompts included few-shot examples from the platinum set and targeted predefined relation
patterns (e.g., [bacteria] LOCATED IN [host]). The same system was applied to all subtasks with
no fine-tuning or augmentations.
        </p>
        <p>
          Piron et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (Team ataupd2425-gainer) submitted runs for all three RE subtasks using
PubMedBERT [25] and BioBERT [26] trained via HuggingFace’s classification pipeline.
        </p>
        <p>Entity spans were marked using [E1] and [E2] tokens. Sentences were tokenized to a max length of
256 or 356, and negative sampling (0.2 or 0.3) was applied.</p>
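        <p>The marker-insertion step can be sketched as follows. Only the [E1] and [E2] markers are mentioned above; the closing [/E1] and [/E2] markers and the 0-based, end-exclusive offsets are assumptions for illustration:</p>
        <p>
```python
# Sketch of entity-marker insertion for RE input construction. The
# closing [/E1]/[/E2] markers and the offset conventions are assumed,
# not taken from the team's implementation.

def mark_entities(text, e1, e2):
    """e1, e2: (start, end) character spans, with e1 preceding e2."""
    s1, t1 = e1
    s2, t2 = e2
    assert s2 >= t1, "e1 must precede e2"
    return (text[:s1] + "[E1]" + text[s1:t1] + "[/E1]" + text[t1:s2]
            + "[E2]" + text[s2:t2] + "[/E2]" + text[t2:])
```
        </p>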
        <p>Across runs, models were fine-tuned for 5 to 8 epochs on stratified 80/20 train–validation splits with
batch sizes between 8 and 12, and learning rates ranging from 1e-5 to 2e-5. Training data spanned
diferent combinations of the platinum, gold, silver, and dev sets. No ensembles or post-processing
were used.</p>
        <p>Lee et al. [19] (Team ICUE) participated in all RE subtasks by framing RE as binary classification
over entity combinations using a query-based BioLinkBERT model [27]. Inputs were constructed by
inserting tagged entities and a natural language query representing the candidate relations.</p>
        <p>Balanced sampling was used to mitigate class imbalance. Some runs included second-stage reasoning
with a distilled LLM trained on synthetic binary-choice prompts: given a candidate relation and
supporting context, the LLM is asked whether the relation holds, choosing between a positive or
negative restatement. The LLM confidence was then fused with the classifier logits.</p>
        <p>Ensemble strategies were also explored, with final outputs selected based on majority voting across
models trained on distinct splits or using diferent sampling thresholds.</p>
        <p>Conceição et al. [20] (Team LasigeBioTM) participated in TT-RE and TM-RE using a zero-shot
approach combining BENT for entity tagging [56] and Mistral-7B for relation extraction [55].</p>
        <p>Tagged inputs contained nested entity labels and IDs. In some runs, syntactic features
(dependency paths, constituency parses) were added using spaCy [49]. All configurations relied solely on
prompting and required no model fine-tuning or training data.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>GutBrainIE-2025 marked the first edition of a shared task dedicated to information extraction on the
gut–brain axis, a research area of growing relevance in both neuroscience and microbiology.</p>
      <p>This first edition saw 85 teams registering and 17 teams submitting a total of 395 runs. Participants
tackled a diverse set of subtasks, from Named Entity Recognition to increasingly fine-grained Relation
Extraction, with results highlighting the efectiveness of ensemble-based methods and biomedical
transformers fine-tuned on domain-specific data.</p>
      <p>The released dataset, which includes over 1600 annotated PubMed abstracts stratified into annotation
quality tiers, represents a valuable resource for training and evaluating biomedical NLP systems.</p>
      <p>As future work, we plan to further improve the overall quality of the dataset by manually reviewing
and annotating the current bronze fold, which currently consists of fully automatic, unrevised
annotations. Additionally, we will leverage the pool of submitted predictions to identify possible annotation
errors, such as wrongly annotated entities or relations, as well as missing annotations that may have
been overlooked during the annotation process. Finally, we aim to extend the task by incorporating
entity linking. This will enable the inclusion of two additional subtasks: entity linking itself, and the
classical NLP task of Relation Extraction framed at the concept level rather than at the mention level.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This project has received funding from the HEREDITARY Project, as part of the European Union’s
Horizon Europe research and innovation programme under grant agreement No GA 101137074.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used GPT-4o and Grammarly in order to: Grammar
and spelling check. After using these tools, the author reviewed and edited the content as needed and
takes full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>[14] Dictionary-Based Post-processing for BioASQ 2025 task 6, in: G. Faggioli, N. Ferro, P. Rosso,
D. Spina (Eds.), Working Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum,
CEUR Workshop Proceedings, CEUR-WS, 2025.
[15] A. Datseris, M. Kuzmanov, I. Nikolova-Koleva, D. Taskov, S. Boytcheva, Graphwise @ CLEF-2025
GutBrainIE: Towards Automated Discovery of Gut-Brain Interactions: Deep Learning for NER
and Relation Extraction from PubMed Abstracts, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.),
Working Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, CEUR Workshop
Proceedings, CEUR-WS, 2025.
[16] H. P. Gupta, R. Banerjee, LLMs for Biomedical NER, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina
(Eds.), Working Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, CEUR
Workshop Proceedings, CEUR-WS, 2025.
[17] L. R. Andersen, M. I. Gardshodn, M. H. Dolmer, J. M. Rodriguez, D. Dell’Aglio, Trusting Gut
Instincts: Transformer-Based Extraction of Structured Data from Gut-Brain Axis Publications, in:
G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 – Conference and
Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS, 2025.
[18] J. Han, Y. Liu, GutUZH at CLEF2025 BioASQ Task 6: a method of SOTA performance with the best
results at GutBrainIE NER subtask 1, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working
Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings,
CEUR-WS, 2025.
[19] C. Lee, S. Doneva, M. Rodriguez-Cubillos, E. Castagnari, A. Lain, J. Posma, T. I. Simpson,
Understanding Gut-Brain Interplay in Scientific Literature: A Hybrid Approach from Classification to
Generative LLM Reasoning, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes
of CLEF 2025 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings,
CEUR-WS, 2025.
[20] S. I. R. Conceição, P. R. C. Lopes, F. M. Couto, lasigeBioTM at BioASQ25 Task GutBrainIE - Lean
Large language models with syntactic features, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.),
Working Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, CEUR Workshop
Proceedings, CEUR-WS, 2025.
[21] Y. Liu, LYX_DMIIP_FDU at BioASQ 2025: Utilizing BERT embeddings for biomedical text mining,
in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 – Conference and
Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS, 2025.
[22] S. Taylor, C. Dil, A. Shah, Jannat, C. Oldham, A. Upadhyay, J. Varughese, N. Yazbeck, B. T. McInnes,
NLP@VCU at BioASQ2025: Information Extraction on the GutBrainIE dataset, in: G. Faggioli,
N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 – Conference and Labs of the
Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS, 2025.
[23] B. Kantz, P. Waldert, S. Lengauer, T. Schreck, Constrained Linked Entity ANnotation using RAG
(CLEANR), in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 –
Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS, 2025.
[24] S. Bogdanov, A. Constantin, T. Bernard, B. Crabbé, E. Bernard, NuNER: Entity Recognition Encoder
Pre-training via LLM-Annotated Data, 2024. arXiv:2402.15343.
[25] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific
language model pretraining for biomedical natural language processing, ACM Transactions
on Computing for Healthcare (HEALTH) 3 (2021) 1–23.
[26] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: pre-trained biomedical language
representation model for biomedical text mining, arXiv preprint arXiv:1901.08746 (2019).
[27] M. Yasunaga, J. Leskovec, P. Liang, LinkBERT: Pretraining Language Models with Document
Links, in: Association for Computational Linguistics (ACL), 2022.
[28] S. Alrowili, V. Shanker, BioM-Transformers: Building Large Biomedical Language Models with
BERT, ALBERT and ELECTRA, in: Proceedings of the 20th Workshop on Biomedical Language
Processing, Association for Computational Linguistics, Online, 2021, pp. 221–227. URL:
https://www.aclweb.org/anthology/2021.bionlp-1.24.
[29] S. Chakraborty, E. Bisong, S. Bhatt, T. Wagner, R. Elliott, F. Mosconi, BioMedBERT: A pre-trained
biomedical language model for QA and IR, in: Proceedings of the 28th international conference
on computational linguistics, 2020, pp. 669–679.
[30] C. Sutton, A. McCallum, et al., An introduction to conditional random fields, Foundations and
Trends® in Machine Learning 4 (2012) 267–373.
[31] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston,
O. Kuchaiev, G. Venkatesh, et al., Mixed precision training, arXiv preprint arXiv:1710.03740 (2017).
[32] Z. Ji, J. Li, M. Telgarsky, Early-stopped neural networks are consistent, Advances in Neural
Information Processing Systems 34 (2021) 1805–1817.
[33] A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding
algorithm, IEEE Transactions on Information Theory 13 (1967) 260–269.
[34] H. Nakayama, seqeval: A Python framework for sequence labeling evaluation, 2018. Software
available from https://github.com/chakki-works/seqeval.
[35] A. Yazdani, I. Stepanov, D. Teodoro, GLiNER-biomed: A Suite of Efficient Models for Open
Biomedical Named Entity Recognition, 2025. URL: https://arxiv.org/abs/2504.00676. arXiv:2504.00676.
[36] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C.</p>
      <p>Wiegers, Z. Lu, BioCreative V CDR task corpus: a resource for chemical disease relation extraction,
Database 2016 (2016).
[37] L. A. Ramshaw, M. P. Marcus, Text chunking using transformation-based learning, in: Natural
language processing using very large corpora, Springer, 1999, pp. 157–176.
[38] C. Liu, H. Fan, J. Liu, Span-based nested named entity recognition with pretrained language model,
in: Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021,
Taipei, Taiwan, April 11–14, 2021, Proceedings, Part II 26, Springer, 2021, pp. 620–628.
[39] C.-H. Wei, H.-Y. Kao, Z. Lu, PubTator: a web-based text mining tool for assisting biocuration,</p>
      <p>Nucleic acids research 41 (2013) W518–W522.
[40] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv
preprint arXiv:1911.02116 (2019).
[41] H. Shelar, G. Kaur, N. Heda, P. Agrawal, Named entity recognition approaches and their comparison
for custom NER model, Science &amp; Technology Libraries 39 (2020) 324–337.
[42] R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor,
J. Kaufman, M. Franchini, et al., OntoNotes Release 5.0 LDC2013T19, Linguistic Data Consortium,
Philadelphia, PA 23 (2013) 20.
[43] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,</p>
      <p>RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[44] L. Richardson, Beautiful soup documentation, 2007.
[45] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,</p>
      <p>S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[46] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, G. Wang, GPT-NER: Named entity
recognition via large language models, arXiv preprint arXiv:2304.10428 (2023).
[47] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini,</p>
      <p>H. Jégou, The Faiss library, arXiv preprint arXiv:2401.08281 (2024).
[48] M. Dong, Z. Cheng, C. Luo, T. He, Retrieval-Augmented Generation for Large Language Model
based Few-shot Chinese Spell Checking, in: Proceedings of the 31st International Conference on
Computational Linguistics, 2025, pp. 10767–10780.
[49] Y. Vasiliev, Natural language processing with Python and spaCy: A practical introduction, No</p>
      <p>Starch Press, 2020.
[50] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training
with Gradient-Disentangled Embedding Sharing, 2021. arXiv:2111.09543.
[51] L. Ramshaw, M. Marcus, Text Chunking using Transformation-Based Learning, in: Third Workshop
on Very Large Corpora, 1995. URL: https://aclanthology.org/W95-0107/.
[52] F. Liu, E. Shareghi, Z. Meng, M. Basaldella, N. Collier, Self-alignment pretraining for biomedical
entity representations, in: Proceedings of the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Association
for Computational Linguistics, Online, 2021, pp. 4228–4238. URL: https://aclanthology.org/2021.naacl-main.334.
doi:10.18653/v1/2021.naacl-main.334.
[53] I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained Language Model for Scientific Text, in: EMNLP,
2019. arXiv:1903.10676.
[54] S. A. Moxon, H. Solbrig, D. R. Unni, D. Jiao, R. M. Bruskiewich, J. P. Balhoff, G. Vaidya, W. D.</p>
      <p>Duncan, H. Hegde, M. Miller, et al., The Linked Data Modeling Language (LinkML): A General-Purpose
Data Modeling Framework Grounded in Machine-Readable Semantics, ICBO 3073 (2021)
148–151.
[55] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F.
Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao,
T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825.
arXiv:2310.06825.
[56] P. Ruas, F. M. Couto, NILINKER: attention-based approach to NIL entity linking, Journal of</p>
      <p>Biomedical Informatics 132 (2022) 104137.
[57] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances
in neural information processing systems 33 (2020) 9459–9474.
[58] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., LoRA: Low-rank
adaptation of large language models, ICLR 1 (2022) 3.
[59] R. Teknium, J. Quesnelle, C. Guang, Hermes 3 technical report, arXiv preprint arXiv:2408.11857
(2024).
[60] K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, D. Kiela, Masked language modeling and the
distributional hypothesis: Order word matters pre-training for little, arXiv preprint arXiv:2104.06644
(2021).
[61] P.-L. Huguet Cabot, R. Navigli, REBEL: Relation extraction by end-to-end language generation,
in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for
Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2370–2381. URL:
https://aclanthology.org/2021.findings-emnlp.204.
[62] A. Jarrahi, R. Mousa, L. Safari, SLCNN: Sentence-level convolutional neural network for text
classification, arXiv preprint arXiv:2301.11696 (2023).
[63] Y. Feng, H. You, Z. Zhang, R. Ji, Y. Gao, Hypergraph neural networks, in: Proceedings of the AAAI
conference on artificial intelligence, volume 33, 2019, pp. 3558–3565.
[64] S. Bai, F. Zhang, P. H. Torr, Hypergraph convolution and hypergraph attention, Pattern Recognition
110 (2021) 107637.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Subtask 6.1 (NER) Overall Results</title>
      <p>team_id run_id system_desc
ICUE single3 biolinkbert
ICUE single4 pubmedbertb
ICUE single5 pubmedbertb
ICUE single6 biolinkbertpubtator
LYX-DMIIP-FDU run1 EnsembleBERT
NLPatVCU ensemble1 ensemble1
NLPatVCU ensemble2 ensemble2
NLPatVCU ensemble3 ensemble3
NLPatVCU model4 model4
NLPatVCU model6 model6
Schemalink 1 SchemaBasedMultiPrompt
ata2425ds HTMLremoval tranformer
ata2425ds hyperparams
ata2425ds trf
ataupd2425-gainer ma trainplatinumandgold
ataupd2425-gainer md trainplatinumgolddev
ataupd2425-gainer ms trainplatinumgoldsilver
ataupd2425-pam 10 customCRF
ataupd2425-pam 1 biobert-base-cased-v1.2-14-CW-xtreme
ataupd2425-pam 2 biosyn-sapbert-bc2gn-8
ataupd2425-pam 3 biosyn-sapbert-bc2gn-12
ataupd2425-pam 4 BiomedNLP-BiomedBERT
ataupd2425-pam 5 NuNerv2.0-22-CW-xtreme
ataupd2425-pam 6 scibert-47
ataupd2425-pam 7 scibert-27
ataupd2425-pam 8 customCRF-LowF
ataupd2425-pam 9 customCRF-LowF
greenday 1 llmner
lasigeBioTM R1 BENTMistral
lasigeBioTM R1 MistralBaseline</p>
    </sec>
    <sec id="sec-11">
      <title>B. Subtask 6.2.1 (BT-RE) Overall Results</title>
      <p>team_id run_id system_desc
ToGS hermes3breorder CLEANR
ToGS hermes8b CLEANR
ToGS hermes8bragreorder CLEANR
ToGS hermes8breorder CLEANR
ToGS openai4omini CLEANR
ToGS openai4ominirag CLEANR
ToGS openai4ominiragreorder CLEANR
ToGS openai4ominireorder CLEANR
ataupd2425-gainer ba1 trainplatinumgolddev
ataupd2425-gainer ba2 trainplatinumgoldsilver
ataupd2425-gainer ba trainplatinumandgold
ataupd2425-gainer bd1 trainplatinumgolddev
ataupd2425-gainer bd2 trainplatinumgoldsilver
ataupd2425-gainer bd trainplatinumandgold
ataupd2425-gainer bp1 trainplatinumgolddev
ataupd2425-gainer bp2 trainplatinumgoldsilver
ataupd2425-gainer bp trainplatinumandgold
ataupd2425-gainer bs1 trainplatinumgolddev
ataupd2425-gainer bs2 trainplatinumgoldsilver
ataupd2425-gainer bs trainplatinumandgold
ataupd2425-pam A0 RE-BiomedNLP-1NoRel-1epoch
ataupd2425-pam A1 RE-BiomedNLP-1NoRel-1epoch
ataupd2425-pam A2 RE-BiomedNLP-1NoRel-1epoch
ataupd2425-pam A3 RE-BiomedNLP-2NoRel-1epoch
ataupd2425-pam A4 RE-BiomedNLP-2NoRel-1epoch
ataupd2425-pam A5 RE-BiomedNLP-2NoRel-1epoch
ataupd2425-pam A6 RE-BiomedNLP-3NoRel-1epoch
ataupd2425-pam A7 RE-BiomedNLP-3NoRel-1epoch
ataupd2425-pam A8 RE-BiomedNLP-3NoRel-1epoch</p>
    </sec>
    <sec id="sec-12">
      <title>C. Subtask 6.2.2 (TT-RE) Overall Results</title>
      <p>team_id run_id system_desc
ToGS openai4ominiragreorder CLEANR
ToGS openai4ominireorder CLEANR
ataupd2425-gainer ta1 trainplatinumgolddev
ataupd2425-gainer ta2 trainplatinumgoldsilver
ataupd2425-gainer ta trainplatinumandgold
ataupd2425-gainer td1 trainplatinumgolddev
ataupd2425-gainer td2 trainplatinumgoldsilver
ataupd2425-gainer td trainplatinumandgold
ataupd2425-gainer ts1 trainplatinumgolddev
ataupd2425-gainer ts2 trainplatinumgoldsilver
ataupd2425-gainer ts trainplatinumandgold
ataupd2425-pam B0 RE-BiomedNLP-1NoRel-1epoch
ataupd2425-pam B1 RE-BiomedNLP-1NoRel-1epoch
ataupd2425-pam B2 RE-BiomedNLP-1NoRel-1epoch
ataupd2425-pam B3 RE-BiomedNLP-2NoRel-1epoch
ataupd2425-pam B4 RE-BiomedNLP-2NoRel-1epoch
ataupd2425-pam B5 RE-BiomedNLP-2NoRel-1epoch
ataupd2425-pam B6 RE-BiomedNLP-3NoRel-1epoch
ataupd2425-pam B7 RE-BiomedNLP-3NoRel-1epoch
ataupd2425-pam B8 RE-BiomedNLP-3NoRel-1epoch
lasigeBioTM R1 BENTMistral
lasigeBioTM R1 BENTMistralSemantic
lasigeBioTM R1 Baseline
lasigeBioTM R1 ConstParsing</p>
    </sec>
    <sec id="sec-13">
      <title>D. Subtask 6.2.3 (TM-RE) Overall Results</title>
      <p>team_id run_id system_desc F1
ToGS openai4ominiragreorder CLEANR 0.0034
ToGS openai4ominireorder CLEANR 0.0000
ataupd2425-gainer tma1 trainplatinumgolddev 0.1552
ataupd2425-gainer tma2 trainplatinumgoldsilver 0.1492
ataupd2425-gainer tma trainplatinumandgold 0.2051
ataupd2425-gainer tmd1 trainplatinumgolddev 0.2035
ataupd2425-gainer tmd2 trainplatinumgoldsilver 0.1965
ataupd2425-gainer tmd trainplatinumandgold 0.2437
ataupd2425-gainer tms1 trainplatinumgolddev 0.2078
ataupd2425-gainer tms2 trainplatinumgoldsilver 0.2012
ataupd2425-gainer tms trainplatinumandgold 0.2542
ataupd2425-pam C0 RE-BiomedNLP-1NoRel-1epoch 0.2447
ataupd2425-pam C1 RE-BiomedNLP-1NoRel-1epoch 0.2439
ataupd2425-pam C2 RE-BiomedNLP-1NoRel-1epoch 0.2395
ataupd2425-pam C3 RE-BiomedNLP-2NoRel-1epoch 0.2602
ataupd2425-pam C4 RE-BiomedNLP-2NoRel-1epoch 0.2607
ataupd2425-pam C5 RE-BiomedNLP-2NoRel-1epoch 0.2553
ataupd2425-pam C6 RE-BiomedNLP-3NoRel-1epoch 0.2716
ataupd2425-pam C7 RE-BiomedNLP-3NoRel-1epoch 0.2738
ataupd2425-pam C8 RE-BiomedNLP-3NoRel-1epoch 0.2645
lasigeBioTM R1 BENTMistral 0.0101
lasigeBioTM R1 Baseline 0.0026
lasigeBioTM R1 ConstParsing 0.0000
lasigeBioTM R1 BENTMistralSemantic 0.0000</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Appleton</surname>
          </string-name>
          ,
          <article-title>The gut-brain axis: influence of microbiota on mood and mental health</article-title>
          ,
          <source>Integrative Medicine: A Clinician's Journal</source>
          <volume>17</volume>
          (
          <year>2018</year>
          )
          <fpage>28</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Carabotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scirocco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Maselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Severi</surname>
          </string-name>
          ,
          <article-title>The gut-brain axis: interactions between enteric microbiota, central and enteric nervous systems</article-title>
          ,
          <source>Annals of gastroenterology: quarterly publication of the Hellenic Society of Gastroenterology</source>
          <volume>28</volume>
          (
          <year>2015</year>
          )
          <fpage>203</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>O'Riordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sandhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dinan</surname>
          </string-name>
          ,
          <article-title>The gut microbiome in neurological disorders</article-title>
          ,
          <source>The Lancet Neurology</source>
          <volume>19</volume>
          (
          <year>2020</year>
          )
          <fpage>179</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghaisas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kanthasamy</surname>
          </string-name>
          ,
          <article-title>Gut microbiome in health and disease: Linking the microbiome-gut-brain axis and environmental factors in the pathogenesis of systemic and neurodegenerative diseases</article-title>
          ,
          <source>Pharmacology &amp; therapeutics 158</source>
          (
          <year>2016</year>
          )
          <fpage>52</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <article-title>Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , volume TBA of
          <source>Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          , p. TBA.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          , E. Tutubalina, G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          , G. Silvello, G. Paliouras, BioASQ at CLEF 2025:
          <article-title>The Thirteenth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          , N. Tonellotto (Eds.),
          <source>Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part V</source>
          , volume
          <volume>15576</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          , pp.
          <fpage>407</fpage>
          -
          <lpage>415</lpage>
          . URL: https://doi.org/10.1007/978-3-031-88720-8_61. doi:10.1007/978-3-031-88720-8_61.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Badaskar</surname>
          </string-name>
          ,
          <article-title>A review of relation extraction</article-title>
          ,
          <source>Literature review for Language and Statistics II</source>
          <volume>2</volume>
          (
          <year>2007</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U.</given-names>
            <surname>Zaratiana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tomeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holat</surname>
          </string-name>
          , T. Charnois,
          <article-title>GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer</article-title>
          , in: K. Duh,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.),
          <source>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>5364</fpage>
          -
          <lpage>5376</lpage>
          . URL: https://aclanthology.org/2024.naacl-long.300/. doi:10.18653/v1/2024.naacl-long.300.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Learning from others' mistakes: Avoiding dataset biases without modeling them</article-title>
          , arXiv preprint arXiv:2012.01300 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          , T. Ma, J. Huang,
          <article-title>Document-Level Relation Extraction with Adaptive Thresholding and Localized Context Pooling</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Piron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <article-title>Named Entity Recognition with GLiNER and Relation Extraction with LLMs</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pamio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <article-title>BioASQ task GutBrainIE 2025 Task 6: Comparing CRF vs BERT Models for Named Entity Recognition and Relation Extraction</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Keinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D. N.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tsarfaty</surname>
          </string-name>
          , From Named Entities to Relations:
          <article-title>End-to-End Biomedical Information Extraction</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>Enhancing Biomedical Named Entity Recognition using GLiNER-BioMed with Targeted</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>