<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graphwise @ CLEF-2025 GutBrainIE: Towards Automated Discovery of Gut-Brain Interactions - Deep Learning for NER and Relation Extraction from PubMed Abstracts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksis Datseris</string-name>
          <email>aleksis.datseris@graphwise.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Kuzmanov</string-name>
          <email>mario.kuzmanov@graphwise.ai</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivelina Nikolova-Koleva</string-name>
          <email>ivelina.nikolova@graphwise.ai</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitar Taskov</string-name>
          <email>dimtaskov@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svetla Boytcheva</string-name>
          <email>svetla.boytcheva@graphwise.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FMI, Sofia University</institution>
          ,
          <addr-line>5 "James Bourchier" Blvd., 1164 Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IICT, Bulgarian Academy of Sciences</institution>
          ,
          <addr-line>Acad. G. Bonchev Str, bl.2, 1113 Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Medical University of Sofia</institution>
          ,
          <addr-line>15 Akademik I. E. Geshov Blvd., 1431 Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Ontotext</institution>
          ,
          <addr-line>111R Tsarigradsko Shosse blvd., 1784 Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Tübingen University</institution>
          ,
          <addr-line>Geschwister-Scholl-Platz, 72074 Tübingen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a set of approaches to named entity recognition and relation extraction from scientific literature, specifically targeting gut-brain axis related terms and the relationships between them. The proposed methods participated in the GutBrainIE Task at CLEF 2025 BioASQ Lab. The solutions rely on fine-tuned BERT-based models (BioBERT, BiomedNLP ELECTRA, BioBERT PubMed) and GLiNER for the named entity recognition task, and on fine-tuned ATLOP and REBEL for the relation extraction task. Hybrid models and ensembles of models are also demonstrated for end-to-end tasks. Notably, one of our proposed solutions ranked 2nd on the most difficult task of the challenge - Ternary Mention-based Relation Extraction - achieving micro-F1 of 37.29%. Our best system for Named Entity Recognition achieved micro-F1 of 80.1% over the test set. On the Binary Tag-based Relation Extraction subtask, our best solution achieved micro-F1 of 65.38% on the test set, and on the Ternary Tag-based Relation Extraction subtask our best result was micro-F1 of 63.72%. All of the proposed approaches demonstrated good performance, consistently outperforming baseline results across all subtasks of the GutBrainIE Task at CLEF 2025 BioASQ Lab.</p>
      </abstract>
      <kwd-group>
<kwd>named entity recognition</kwd>
        <kwd>relation extraction</kwd>
        <kwd>gut-brain axis</kwd>
        <kwd>biomedical NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        An increasing amount of research indicates that there exists a complex interaction between the gut
and the brain [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
]. However, many of the biological mechanisms underlying this relationship remain
unclear. In-depth study of the various aspects of these relationships, as published in the scientific
literature, could be essential for further progress of biomedical research.
      </p>
<p>One step in this direction is to use the wealth of scientific publications in the biomedical field available in PubMed. PubMed contains an extensive collection of peer-reviewed articles from leading scientific journals and conferences in the field of biomedicine and offers a rich resource for systematically identifying, extracting, and synthesizing new scientific insights. However, the rapid pace at which new articles are published and the richness of the resource make manual exploration of current results challenging; it is therefore important to provide tools for automated tracking of new developments to speed up research in the area.</p>
<p>By applying modern natural language processing (NLP) techniques to scientific literature, researchers can efficiently collect and analyze evidence, reveal hidden patterns, explore comorbidities and risk factors, and accelerate the discovery of new connections within the gut-brain axis, which may ultimately contribute to better understanding, prevention, and treatment of related diseases.</p>
      <p>
        In this paper, we present our research in developing deep learning and hybrid models for NLP,
for automated information extraction from PubMed articles. The results we present are part of the
GutBrainIE Task at CLEF 2025 BioASQ Lab [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
]. This task is divided into two main subtasks. The first one is Named Entity Recognition (NER), with a focus on important terms such as genes, diseases, microbiomes, chemicals, etc. The second subtask addresses relation extraction (RE), covering the full palette of levels of detail, from binary relationships to ternary relationships with an explicit indication of the type of relationship and the entities associated with it. The dataset includes titles and abstracts from PubMed, organized into four categories based on the quality of annotations, from expert-validated to automatically generated labels.
      </p>
      <p>
        The proposed solutions below are based on deep learning models including GLiNER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], BioBERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
BiomedNLP ELECTRA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], BioBERT PubMed [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for NER and REBEL [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ATLOP [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for RE.
      </p>
<p>The paper is organized as follows: Section 2 outlines the related work and state-of-the-art models and achievements on the tasks; Section 3 defines the tasks and data; Section 4 lists the approaches applied for NER; Section 5 presents the NER results; Section 6 discusses the approaches to RE; Section 7 presents the RE results; and Section 8 provides the discussion and conclusions of the study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
State-of-the-art NLP approaches for information extraction from biomedical literature use a range of
techniques, from deep learning (BERT, BiLSTM-CRF, XLM-R, and LLMs) to classical ML methods (CRF,
SVM) and even traditional solutions such as dictionary- and rule-based systems. The best F1 scores achieved
for the NER task on PubMed articles reach 89.3%, and for RE the performance varies in the range
47.7-88.5%, depending on the types of relations and entities [
        <xref ref-type="bibr" rid="ref11">11, 12, 13, 14</xref>
]. In this overview we focus
primarily on NER and RE approaches applied to PubMed articles only, due to the specific nature of
scientific literature in contrast with the terminology and vocabulary used in clinical trials and clinical texts.
The main categories of extracted entities include genes, proteins, diseases, drugs, etc.; the types of
identified relations include gene-disease, drug-treatment, etc. Most approaches address both the NER
and RE tasks [12], but there are also techniques for RE only [15], [14] and for NER only. Luo et
al. [12] apply BiLSTM-CRF, BioBERT-CRF, and PubMedBERT-CRF for NER over 600 PubMed abstracts,
achieving F1 scores of 89.3% (strict) and 93.5% (relaxed), and BERT-GT and PubMedBERT for RE, achieving F1
scores of 47.7% (novelty), 72.9% (entity pair), and 58.9% (relation type). Hassan et al. [15] focus only on the
RE task and propose a solution for PubMed abstracts using an unsupervised approach based on BERT,
part-of-speech tagging, and verb embeddings, achieving an F1 score of 88.5% for Drug-Drug Interaction
(DDI dataset) and 85.8% on the ChemProt dataset. Sänger and Leser [14] also focus on
RE only, proposing an approach based on neural networks, corpus-level entity embeddings, and pair
embeddings, achieving an overall improvement in F1 score in the range of 4-29% over traditional methods.
      </p>
      <p>
        One of the famous approaches to the problem is to use the innovative method GLiNER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In contrast
to the more traditional NER models, GLiNER does not need a pre-defined set of categories which
saves the eforts for labeling data and/or retraining. It uses eficient Bidirectional Language Models
(BiLMs: BERT, DeBERTa) to process the input: entity type prompt concatenated with the sentence or
text. Instead of an autoregressive generation, the core concept of GLiNER is to find the best match
between entity type embedding and textual span representation. While GLiNER outperforms large
general-purpose models such as ChatGPT and Vicuna in a zero-shot context, it is eficient and does
not need extensive resources to run. Due to this great flexibility and easy pipeline for inference, we
employ GLiNER as a zero shot classifier to give us a strong starting point by setting the very first
baseline. The family of BERT models has the obvious drawback that it has been pretrained on too many
general domain texts, which is a strong reason to perform worse in domain-specific tasks. That’s why
for biomedical text mining, we also consider BioBERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
] to improve on the GLiNER baseline. It builds
on the same "short encoders" mentioned above, being initialized with the weights of BERT
pretrained on English Wikipedia and BooksCorpus, and takes further advantage of a subsequent training phase
in which a large number of PubMed abstracts and PubMed Central (PMC) full-text articles are used for
continued pretraining. As a result, the fine-tuned BioBERT set a new state of the art for biomedical
NER and biomedical relation classification/extraction (RE).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Tasks and Data</title>
      <p>
PubMed is one of the largest databases of biomedical literature, comprising more than 38 million
citations. The documents provided by the challenge organizers consist of the title, abstract, and author
and journal metadata of a publication, retrieved from PubMed with an explicit focus on the gut-brain
interplay and its implications for neurological and mental health. To foster the development of effective
Information Extraction (IE) systems within the context of the GutBrainIE Task at CLEF 2025 BioASQ
Lab (https://hereditary.dei.unipd.it/challenges/gutbrainie/2025/) [
        <xref ref-type="bibr" rid="ref4">4</xref>
], the annotated training data is organized into 4 collections, as demonstrated in Table 1. Each
collection has a different name, indicating the quality of its annotations. The Platinum-Standard
annotations are of the highest quality - expert-curated and reviewed by external biomedical specialists. The
Platinum corpus consists of only 111 documents, making it the smallest among the training collections.
Next come the Gold-Standard annotations, which are expert-curated only and cover 208 documents, followed by the
Silver-Standard annotations, which are created by trained students under expert supervision. The Silver corpus
comprises 499 examples. Finally, the largest collection, but the one with the lowest annotation quality, is the Bronze-Standard data.
This collection is annotated by distant supervision using fine-tuned GLiNER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for NER and fine-tuned
ATLOP [16] for RE. The organizers also provide a separate set of documents for validation - Dev set.
The final evaluation is performed on an external Test set made up of only Platinum and Gold Standard
articles which was released about two weeks before the oficial deadline.
3.1. Task 6.1 - NER
The systems for biomedical NER, described in depth in the next sections, are trained or fine-tuned on
different subsets of the aforementioned data. For this task, only the corresponding "entities" annotations
are taken into account. An entity refers to a tuple and is expressed as shown in the following example:
{
  "start_idx": 26,
  "end_idx": 35,
  "location": "title",
  "text_span": "BrainBiota",
  "label": "microbiome"
}
The property label is the category of the entity which is located in location (either "title" or "abstract")
and starts at position start_idx and ends at end_idx inclusive.
      </p>
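<p>Since end_idx is inclusive, recovering the annotated span from the raw text requires an off-by-one adjustment. A minimal sketch (the document dictionary and the offsets below are constructed for illustration, not taken from the corpus):</p>

```python
def get_span(doc: dict, entity: dict) -> str:
    """Recover the annotated text span from a document.

    `doc` is assumed (for this sketch) to map "location" values
    ("title" / "abstract") to raw text; `end_idx` is inclusive,
    matching the challenge annotation format.
    """
    text = doc[entity["location"]]
    return text[entity["start_idx"] : entity["end_idx"] + 1]

# Constructed example (not from the corpus):
doc = {"title": "Gut microbiota and the BrainBiota connection", "abstract": ""}
entity = {"start_idx": 23, "end_idx": 32, "location": "title",
          "text_span": "BrainBiota", "label": "microbiome"}
```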
<p>The goal of this task is to classify a text span (entity mention) into one of 13 pre-defined categories/labels (see the X-axis of Figure 1). We will use only the term "category" in the next sections, as the term "label" used by the organizers can be ambiguous in the data description (meaning either (1) an alternative name of the entity or (2) the entity category in the context of the given task). However, some of the categories have only very few annotations. This poses a potential challenge for larger pretrained models, which need a substantial amount of data and resources to generalize well. Figure 1 shows the distribution of labels in the Platinum-Standard collection; this distribution remains almost unchanged for the other annotated datasets. The dominating labels are Disorder-Disease-Finding (DDF), chemical, bacteria, human, and microbiome. Categories like food, gene, drug, and statistical technique are rarely found in the texts.</p>
      <p>[Figure 1: Frequency of entity mentions per category in the Platinum-Standard collection. The most frequent category, DDF, has 1,232 mentions; categories such as dietary supplement, biomedical technique, anatomical location, statistical technique, and drug appear far less often.]</p>
      <p>
3.2. Task 6.2 - RE
The second main task is on RE and consists of three subtasks of increasing difficulty. The first of them
is Binary Tag-based Relation Extraction (BT-RE), requiring the identification of whether two entity categories
are in relation. Given the input text and/or the found entities, a binary relation is defined as an ordered
tuple of two entity categories, for example (DDF, human), where DDF and human are both named entity categories.
It is important to note that the least represented
categories are usually those involved in relations with only a small set of other categories as objects.
For instance, gene only participates in relations with three other categories, in contrast to human or
microbiome, which appear as subjects in relations with more than twice as many categories.</p>
      <p>
In the next subtask, Ternary Tag-based Relation Extraction (TT-RE), the relation type (predicate)
between the entity categories is also considered, so a ternary tag-based relation is a triple of subject
category, predicate, and object category. The annotated relation types
include target, impact, influence, change effect, located in, is linked to, and others. All relation types
are asymmetric, meaning that (subject, predicate, object) != (object, predicate, subject). On average,
influence, target, and located in are among the most common predicates. It is worth noting that the
Silver-Standard collection has the largest number of annotated relations - 10,616, which is 5 times more than
the relations annotated in the Gold collection - 1,994. Even though the number of documents differs,
the average number of annotated relations per document is 21.27 for Silver versus 9.59 for Gold. The highest
quality collection, Platinum, has a total of 1,455 relations and an average of 13.11 relations per document.
For a more detailed description of the statistics for the different relation types in the data, refer to the
extended overview paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
<p>Finally, the last subtask, Ternary Mention-based Relation Extraction (TM-RE), aims to extract every text mention of the entities with their corresponding relation. The annotated training data has the following format:
{
  "subject_text_span": "IgA-Biome",
  "subject_label": "microbiome",
  "predicate": "located in",
  "object_text_span": "AR and TD patients",
  "object_label": "human"
}</p>
<p>Across all subtasks, a prediction counts as a true positive only when it fully matches a gold instance. Partial matches are therefore not rewarded, and near-misses by the systems count as errors.</p>
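<p>This strict-match scheme reduces to plain set arithmetic over full tuples. A minimal sketch (micro_prf and the example tuples are illustrative, not the official scorer):</p>

```python
def micro_prf(gold: set, predicted: set):
    """Strict-match micro metrics: a prediction is a true positive
    only if the full tuple exactly matches a gold annotation."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("microbiome", "located in", "human"),
        ("bacteria", "influence", "DDF")}
pred = {("microbiome", "located in", "human"),
        ("bacteria", "target", "DDF")}  # near-miss: counts as both a FP and a FN
```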
      <sec id="sec-3-1">
        <title>3.3. Augmented Data</title>
        <p>As a potential solution to the observed scarcity of a number of important categories, the PubMed API is used to fetch more articles and augment the training data. As a matching criterion, only articles in the period 2015/01/01 - 2025/01/01 are considered, and a specially designed search query is created to extract all articles in which the gut-brain axis or the gastrointestinal microbiome is defined as a major topic, following the MeSH ontology.</p>
        <p>To fully ensure that the newly extracted articles are related to the task requirements, the search query is extended to also account for specific labels relevant to the needed categories. This way, for the rare category gene the query extracts all relevant articles within the gut-brain axis whose main topics are defined to include genes, DNA, or genetics. The designed search query looks like this:
((Brain-Gut Axis[MeSH Major Topic]) OR (Gastrointestinal Microbiome[MeSH Major Topic])) AND
(genes[MeSH Major Topic] OR DNA[MeSH Major Topic] OR genetics[MeSH Subheading])</p>
        <p>Following the strategy shown in Figure 2 (designed search query, PubMed API fetch, retrieved PubMed documents, annotation by distant supervision with GLiNER, BioBERT, etc., yielding the expanded Bronze collection) for all categories - changing the query and annotating the entities with different combinations of systems - resulted in the creation of our own Bronze-standard annotated corpus with a total of 6,728 articles.</p>
      </sec>
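<p>Such a fetch can be sketched with NCBI's E-utilities esearch endpoint. The snippet below only composes the request URL for the query above (the parameter names follow the public E-utilities API; the retmax value is illustrative) and does not execute it:</p>

```python
from urllib.parse import urlencode

# The MeSH query from the text, restricted to the 2015-2025 window.
term = ('((Brain-Gut Axis[MeSH Major Topic]) OR '
        '(Gastrointestinal Microbiome[MeSH Major Topic])) AND '
        '(genes[MeSH Major Topic] OR DNA[MeSH Major Topic] OR '
        'genetics[MeSH Subheading])')

params = {
    "db": "pubmed",
    "term": term,
    "mindate": "2015/01/01",
    "maxdate": "2025/01/01",
    "datetype": "pdat",   # filter on publication date
    "retmax": 10000,      # illustrative cap on returned PMIDs
    "retmode": "json",
}
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urlencode(params))
```

Sending a GET request to this URL returns the matching PMIDs, which can then be passed to the efetch endpoint to retrieve titles and abstracts.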
    </sec>
    <sec id="sec-4">
      <title>4. NER Approaches</title>
<p>For the NER task several different techniques are presented. Two different pretrained language models (LMs) - BERT [17] and ELECTRA [18] - are considered. BERT-based models are pretrained on objectives such as Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), in contrast to the objective ELECTRA-based models learn. The ELECTRA family uses a small generator network to corrupt the input, similar to BERT, by replacing some tokens with plausible alternatives, and then trains a discriminator to decide whether each token was in the original input or was artificially generated. Changing the learning objective leads to better contextual representations and outperforms BERT on the GLUE natural language understanding benchmark. Furthermore, the learned embeddings are especially more representative for small models, which are actually more suitable in the context of the gut-brain task. To leverage the capabilities of more compute-efficient and smaller bidirectional LMs, fine-tuned GLiNER models are also included in our experiments. GLiNER can extract any entity type by maximizing the probability of a span (i, j) being of the correct entity type t and minimizing the probability of the same span (i, j) being of any other type t'. Finally, our highest-precision and highest-recall system is an ensemble - a combination of several models. When using multiple models, the final decision is based on pre-defined rules or majority voting; for example, if model X performs best for category statistical technique, then at test time we take the predictions for this category, if any, only from model X.
To fine-tune models from the BERT/ELECTRA family, standard token classification with Hugging Face is employed. The first step is to adjust the training data, in other words to "tag" the input in the widely adopted BIO format, first introduced by Ramshaw and Marcus [19]. However, the training data is not always the same: we experiment with different parts of the whole dataset. In some configurations, the models are only given the Platinum- and Gold-Standard collections; in others, they are fine-tuned on all collections, including the expanded Bronze corpus created by us. To convert the provided format to BIO, ScispaCy [20] is used - a Python package that provides spaCy models for scientific text preprocessing. Due to its size and complexity, the en_core_sci_sm model was chosen. Inference happens by creating a pipeline with the saved model and making predictions on entity level, where the input is no longer a token but a whole chunk of text. The most successful tokenizer setup was selected through experiments with different maximum sequence lengths, starting from 128 and going up to 512. For simplicity, no additional tricks are considered to expand this window. Most of the documents do not exceed 400 tokens, including subwords, as shown in Figure 3. Experiments with larger maximum lengths, such as the one in kiddothe2b/biomedical-longformer-large, do not show better performance on the data.</p>
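<p>The offset-to-BIO conversion can be sketched as follows; a naive whitespace tokenizer stands in for the ScispaCy en_core_sci_sm tokenizer so the example stays self-contained, and the example sentence and offsets are constructed for illustration:</p>

```python
def to_bio(text: str, entities: list) -> list:
    """Convert character-offset annotations to token-level BIO tags.

    Naive whitespace tokenization stands in for ScispaCy here;
    `end_idx` is inclusive, matching the challenge format.
    """
    tokens, spans, offset = [], [], 0
    for tok in text.split():
        start = text.index(tok, offset)
        spans.append((start, start + len(tok) - 1))
        tokens.append(tok)
        offset = start + len(tok)
    tags = ["O"] * len(tokens)
    for ent in entities:
        inside = False
        for i, (s, e) in enumerate(spans):
            if s >= ent["start_idx"] and e <= ent["end_idx"]:
                tags[i] = ("I-" if inside else "B-") + ent["label"]
                inside = True
    return list(zip(tokens, tags))

pairs = to_bio("BrainBiota alters gut microbiome composition",
               [{"start_idx": 0, "end_idx": 9, "label": "microbiome"},
                {"start_idx": 18, "end_idx": 31, "label": "microbiome"}])
```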
      <p>[Figure 3: Distribution of documents by number of tokens (including subwords), tokenized with biobert-base-cased-v1.1: 121 documents with up to 200 tokens, 63 with up to 400, 19 with up to 600, 3 with up to 800, and 2 with more than 800.]</p>
      <sec id="sec-4-2">
        <title>4.1. Models</title>
        <p>To arrive at this conclusion, the data is tokenized with biobert-base-cased-v1.1. Although different tokenizers are used with the different models during fine-tuning, almost all of them use the WordPiece tokenization algorithm. The models in these experiments are domain-specific, pretrained on large biomedical corpora, usually consisting of PubMed abstracts and/or PubMed Central (PMC) full texts.</p>
        <p>Here is the list of our most successful models on the development set. Their results on the official test dataset are shown in the next section, Section 5. The exact hyperparameter choices for the submitted models are shown in Table 7.</p>
        <p>
          • dmis-lab/biobert-base-cased-v1.1 [
          <xref ref-type="bibr" rid="ref6">6</xref>
] - The base version of the model performs better than the large one for all categories; a likely reason is the size of our training corpus. This model is pretrained on PubMed abstracts and PMC full texts. Surprisingly, our most successful submission turned out to be the one fine-tuned only on the Bronze corpus. Notably, the model exhibits one of the best precisions on the development set and is used further in a pipeline to form a stronger ensemble.
• microsoft/BiomedNLP-BiomedELECTRA-base-uncased-abstract [
          <xref ref-type="bibr" rid="ref7">7</xref>
] - When using this model, pretrained on PMC full texts, the performance decreases. For the remainder of this article, this model is referred to as pubmed-electra-base. The initial model configuration has 110M parameters before fine-tuning. The best results on the development set are achieved by fine-tuning on all annotated data plus the expanded Bronze collection. The specific annotator for this collection is another model, dmis-lab/biobert-base-cased-v1.1, fine-tuned beforehand only on the standard Bronze collection. The model achieves the best precision after the ensemble approach, which we describe further below. Its recall is slightly worse than that of the ensemble and GLiNER, but overall this configuration shows clear improvements. However, on the official test set the performance decreases: the recall drops by approximately 7% (82.45% dev vs. 74.86% test), which also results in a drop in the micro-F1 score.
• monologg/biobert_v1.1_pubmed [
          <xref ref-type="bibr" rid="ref8">8</xref>
] - A lighter version of biobert-base, pretrained only on PubMed abstracts. Similarly to pubmed-electra-base, the model is fine-tuned on all data plus the additionally augmented corpus, whose annotations are made by the same version of BioBERT. Biobert-PubMed is the most consistent model across the development and test sets, suggesting better generalization.
• numind/NuNER_Zero [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] - Generalist Model for NER using Bidirectional Transformer (GLiNER).
        </p>
        <p>This is a lightweight zero-shot NER model based on the GLiNER architecture, fine-tuned on all data excluding the Bronze collection. We refer to it as GLiNER. The model does not achieve the best performance on the dev set, but it has the highest micro precision and recall, and therefore the highest micro-F1 score, on the test set.</p>
        <sec id="sec-4-2-1">
          <title>4.2. Ensembles</title>
          <p>Leveraging the power of a subset of systems, the most successful ensemble approach on the test set turns out, this time matching expectations, to be built from the best models on the development set. In particular, the highest-precision and highest-recall system includes monologg/biobert_v1.1_pubmed, pubmed-electra-base, GLiNER, monologg/biobert_v1.1_pubmed, and another version of pubmed-electra-base fine-tuned on all data and the expanded Bronze corpus, but with a changed annotator model; it is referred to in Table 3 as microsoft/BiomedNLP-BiomedELECTRA-base-uncased (v2). In this case, the additional data is annotated by pubmed-electra-base and GLiNER. In total, 5 different models are used. The strategy to form this ensemble is two-fold. First, error analysis on the development set shows per-label performance for each model. Then, on this basis, hand-crafted rules are prepared to take into account only the predictions for the selected categories during testing; for example, GLiNER is used only for gene and chemical. The whole model composition is shown in Table 3.</p>
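<p>The rule-based routing can be sketched as a category-to-model map; the routing table and model predictions below are illustrative, the actual assignment being the one in Table 3:</p>

```python
# Illustrative per-category routing; the real assignment follows Table 3.
ROUTES = {"gene": "GLiNER", "chemical": "GLiNER",
          "human": "pubmed-electra-base"}
DEFAULT_MODEL = "biobert_v1.1_pubmed"

def ensemble(predictions: dict) -> list:
    """`predictions` maps model name -> list of (span, category) pairs.
    A prediction is kept only if it comes from the model responsible
    for its category (the default model covers unrouted categories)."""
    kept = []
    for model, preds in predictions.items():
        for span, category in preds:
            if ROUTES.get(category, DEFAULT_MODEL) == model:
                kept.append((span, category))
    return kept

preds = {
    "GLiNER": [("FMR1", "gene"), ("E. coli", "bacteria")],
    "biobert_v1.1_pubmed": [("E. coli", "bacteria")],
}
```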
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. NER Results</title>
      <p>[Table 3 residue: ensemble category groups - bacteria; statistical technique; gene, chemical; human; anatomical location, animal, bacteria, biomedical technique, DDF, dietary supplement, drug, food, microbiome.]</p>
      <p>
Table 4 lists the results of our models on both the development and test sets for micro precision. The
second column shows which subset of the data is used for fine-tuning to achieve the results, including the
model split for the ensemble. The ensemble approach discussed in the previous section, Subsection 4.2,
improves on precision by nearly 10% compared to GLiNER on the development set. However, on the
test set GLiNER is the only model out of all to perform better.</p>
      <p>Similarly to the precision table, Table 5 shows once again the better performance of our best models on the development set, this time by micro recall. Although some of the other fine-tuned versions of pubmed-electra-base and biobert-base-v1.1 are not far behind, the ensemble we form turns out to account for the largest number of categories according to the authors' annotations. In contrast, on the test set the micro recall of all systems drops. The biggest drop is in pubmed-electra-base, while the most consistent model remains GLiNER.</p>
      <p>Finally, Table 6 shows the micro-F1 score. This metric is used to rank the systems in the official leaderboard (https://hereditary.dei.unipd.it/challenges/gutbrainie/2025/#six). GLiNER has the lowest score on the development set because of its weak precision compared to the other systems. On the test set, all of the systems show decreasing performance except GLiNER. It is the only model to generalize well and improve on its scores, making it our most successful submission.</p>
      <p>Before the final submission, progressive fine-tuning was applied to all of the selected models. Surprisingly, after including the development set and training for a small number of epochs, usually 3-5, the performance decreases on average by around 5% micro-F1, including for the ensemble using these models. In conclusion, lighter models indeed prove to be more beneficial within the context of this task, and GLiNER is especially good for NER with a limited amount of data.
The error analysis on the dev set shows that only 15% of the false positives are not entities of interest. Evidently the model is good enough at guessing the mentions in the text, but not as good at guessing their type. Also, the model does not learn the annotation guidelines very well in terms of nested entities - e.g., it extracts gene mentions such as "16 S rDNA gene" which are part of a longer mention, "amplification and sequencing of the V4 region of the 16 S rDNA gene", which is a biomedical technique in the gold standard. There are also cases of extracting entities which are true positives in our view but are missing from the gold standard, such as "discriminant taxa analyses". Although NER is a classical NLP task with a long history and significant progress, this problem remains challenging in its detail.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Relation Extraction Approaches</title>
      <sec id="sec-6-1">
        <title>6.1. Babelscape/rebel-large</title>
        <p>
          One of the applied strategies for RE is to use an autoregressive transformer-based approach to extract the
entities in relation within a document. For this purpose, Babelscape/rebel-large is employed. REBEL [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
is a seq2seq encoder-decoder model that uses BART-large as a base and solves the RE problem as an
end-to-end language generation. This architecture is commonly used in machine translation, where
given an input language, the task is to produce an output ("translation") into a target one. In the context
of the GutBrainIE challenge, the input is the raw text containing the entities and their implicit relations
and the target consists of linearized triplets, explicitly showing the entity mentions and their relations.
A triplet follows this structure:
&lt;triplet&gt; head-entity-mention &lt;subj&gt; type-relation &lt;rel&gt; tail-entity-mention
        </p>
        <p>At this stage, the goal is to generate the target linearized triplets as accurately as possible from the
input text.</p>
        <p>What makes REBEL a good choice are its pretraining methods. The authors of the algorithm solved
the problem with the scarce RE datasets by creating a new one. This so-called silver corpus consists of
Wikipedia abstracts where Wikipedia hyperlinks are matched with WikiData entities. From these, all
the present relations are extracted. In total, REBEL is able to find 220 relation types based on the data it
has been trained on.</p>
        <p>Here, REBEL is fine-tuned to solve all RE-related subtasks. Because of the nature of the model, first all
entity mentions and their corresponding relations are extracted. To make the decoding fit our criteria
for the other subtasks, the target is modified to also include the entity types along with the mentions.
Therefore, the training linear triplets are in the following format:
&lt;triplet&gt; head-entity-mention %head-entity-type% &lt;subj&gt; type-relation &lt;rel&gt;
tail-entity-mention %tail-entity-type%</p>
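A sketch of how such linearized output could be parsed back into structured triplets. The marker handling, the regex, and the example labels are our assumptions for illustration, not REBEL's actual decoding code:

```python
import re

def parse_triplets(decoded):
    """Parse linearized output of the form described above:
    <triplet> head %head-type% <subj> relation <rel> tail %tail-type%
    Sketch of the post-decoding step; marker handling is an assumption."""
    triplets = []
    for chunk in decoded.split("<triplet>")[1:]:
        m = re.match(
            r"\s*(.+?)\s*%(.+?)%\s*<subj>\s*(.+?)\s*<rel>\s*(.+?)\s*%(.+?)%\s*$",
            chunk)
        if not m:
            continue  # skip malformed generations
        head, head_type, relation, tail, tail_type = m.groups()
        triplets.append({"head": head, "head_type": head_type,
                         "relation": relation,
                         "tail": tail, "tail_type": tail_type})
    return triplets

# Illustrative entity/relation labels, not necessarily the exact schema names
out = "<triplet> gut microbiota %microbiome% <subj> influence <rel> anxiety %DDF%"
print(parse_triplets(out))
```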
        <p>
          As a last postprocessing step, also aiming to improve precision, the rules pre-defined by the organisers
for the domain and range of the predicates are applied. Therefore, only relations that are valid according to
the annotation guidelines [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] are selected as the final REBEL output.
        </p>
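This domain/range filtering can be sketched as follows, with a hypothetical subset of the guideline rules (the predicate names and allowed type pairs here are illustrative placeholders, not the full schema):

```python
# Hypothetical subset of domain/range rules:
# predicate -> set of allowed (head-type, tail-type) pairs.
ALLOWED = {
    "influence": {("microbiome", "DDF")},
    "located in": {("bacteria", "human")},
}

def filter_by_schema(triplets, allowed=ALLOWED):
    """Keep only triplets whose (head type, relation, tail type) combination
    is valid under the domain/range rules; a sketch of the final
    postprocessing step applied to the generated triplets."""
    return [t for t in triplets
            if (t["head_type"], t["tail_type"]) in allowed.get(t["relation"], set())]

preds = [
    {"head": "gut microbiota", "head_type": "microbiome",
     "relation": "influence", "tail": "anxiety", "tail_type": "DDF"},
    # invalid direction under the toy rules above, so it is dropped
    {"head": "anxiety", "head_type": "DDF",
     "relation": "influence", "tail": "gut microbiota", "tail_type": "microbiome"},
]
print(filter_by_schema(preds))
```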
        <p>To obtain the most successful version of REBEL, all annotated collections (excluding the development
set) are used during fine-tuning. The most efficient combination of hyperparameters is with a batch
size of 16 and a learning rate of 5e-5. Furthermore, the maximum length of the generated output during
decoding is limited to 512 tokens with the number of beams for beam search set to 5. The model is
trained for just 100 iterations on a single RTX 6000 ADA Generation GPU, suggesting strong potential
for the encoder-decoder architecture in solving RE-related tasks. The training time was about 2 hours.
During it, the model with the best F1-micro score on the development set was selected. Similarly,
other experiments with different combinations of input data and hyperparameters showed lower
performance.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Adaptive Thresholding and Localized Context Pooling</title>
        <p>
          The most successful approach for relation extraction proved to be the use of Adaptive Thresholding
and Localized Context Pooling (ATLOP) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The ATLOP method uses a standard transformer encoder
[21], [17] as its base model. The method requires the extracted entities to be provided as input, so it
relies on the output of the NER model to produce a prediction, unlike REBEL, which is an end-to-end
approach. Given the input text x = [x_t], t = 1, ..., l, and a set of entities {e_i}, i = 1, ..., n, ATLOP
marks the positions of entity mentions by inserting a special symbol "*" at the start and end of each
mention. After feeding the input to a pretrained encoder model, we obtain the contextualized token
embeddings H = [h_1, h_2, ..., h_l] = BERT([x_1, x_2, ..., x_l]). The embedding of the "*" at the start of a
mention is used to represent that mention, and all mention embeddings of an entity are pooled using
logsumexp pooling [22]. After the logsumexp-pooled embeddings (h_{e_s}, h_{e_o}) of an entity pair
(e_s, e_o) are obtained, the entities are mapped to hidden states z_s, z_o with linear layers followed by
tanh activations, and then the probability of relation r is calculated by a bilinear function with sigmoid
activation. This process is formulated as:

z_s = tanh(W_s h_{e_s}),   z_o = tanh(W_o h_{e_o}),   (1)

P(r | e_s, e_o) = sigma(z_s^T W_r z_o + b_r).   (2)
        </p>
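The scoring step of Eq. (1)-(2) can be sketched in NumPy as follows. The hidden size, random weights, and stand-in mention embeddings are toy assumptions; in the real model they come from training and from the encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def logsumexp_pool(mention_embs):
    """Pool an entity's mention embeddings with logsumexp pooling."""
    m = mention_embs.max(axis=0)
    return m + np.log(np.exp(mention_embs - m).sum(axis=0))

# Learned parameters; random stand-ins for illustration
W_s = rng.normal(size=(d, d))
W_o = rng.normal(size=(d, d))
W_r = rng.normal(size=(d, d))
b_r = 0.0

def relation_probability(h_s, h_o):
    z_s = np.tanh(W_s @ h_s)              # Eq. (1): subject hidden state
    z_o = np.tanh(W_o @ h_o)              # Eq. (1): object hidden state
    logit = z_s @ W_r @ z_o + b_r         # bilinear score for relation r
    return 1.0 / (1.0 + np.exp(-logit))   # Eq. (2): sigmoid

# Two mentions per entity; random stand-ins for encoder output embeddings
h_s = logsumexp_pool(rng.normal(size=(2, d)))
h_o = logsumexp_pool(rng.normal(size=(2, d)))
p = relation_probability(h_s, h_o)
print(float(p))
```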
        <p>ATLOP introduces the "Adaptive Thresholding" (AT) mechanism to address the limitations of global
thresholding in the multi-label scenario. This technique replaces the static global threshold with a
learnable, entity-dependent threshold, aiming to reduce decision errors during inference. The core
of Adaptive Thresholding is the introduction of a special "threshold class," denoted as TH. This TH
class is treated and learned similarly to other actual relation classes within the model’s architecture.
Its purpose is to act as a learned decision boundary. The main benefit of AT is that, instead of using
the usual approach of optimizing a value in the range (0, 1) and picking the one that maximizes the
evaluation metrics, AT provides a way to learn the optimal threshold during training. These techniques,
together with the use of "Localized Context Pooling" and some other innovations, make ATLOP an
efficient and easy-to-use method.</p>
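With a learned TH class, decoding reduces to comparing each relation logit against the TH logit. A minimal sketch, where placing the TH class at index 0 is our own assumption:

```python
def at_decode(logits, th_index=0):
    """Adaptive-thresholding decoding sketch: the class at `th_index` plays
    the role of the learned TH class; a relation is predicted only if its
    logit exceeds the TH logit. Returns predicted class indices."""
    th = logits[th_index]
    return [i for i, s in enumerate(logits) if i != th_index and s > th]

# Toy logits for classes [TH, rel_1, rel_2, rel_3]
print(at_decode([0.5, 1.2, 0.1, 0.7]))  # → [1, 3]
```

Only rel_1 and rel_2 positions above the TH score are emitted; if no logit exceeds TH, the pair is labeled as having no relation.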
        <p>
          To try to improve the model’s performance, different pretrained models are used, such as BERT [17],
RoBERTa [23], XLM-R [24], PubMedBERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], BiomedELECTRA [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], as base encoder models. Another
experiment was done with domain adaptation by continuing the models’ pretraining using the masked
language modeling objective [17], and task pretraining. The approach for task pretraining is to first
fine-tune the base model on the NER objective before plugging the base encoder model into the ATLOP
method for further fine-tuning. The training hyperparameters used for fine-tuning are listed in Table
8. While evaluating on the dev set, most models reached their best performance at around the 100th
epoch. The models sent for submission are trained on both the train and dev sets. For the test, some
submissions are made with intermediate versions of the models, taken before the 200th epoch, as well as
with the models at the 200th epoch.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Relation Extraction Results</title>
      <p>7.1. Subtask 6.2.1 - BT-RE
For the BT-RE subtask, results on the dev set are shown in Table 9, demonstrating that ATLOP is a very
effective method for relation extraction. While REBEL is a good end-to-end alternative, its performance is
slightly worse than ATLOP models, which suggests that there is a strong potential for encoder-decoder
or decoder-only models for relation extraction. Another observation is that task pretraining gives
improvement for BiomedELECTRA, but it isn’t very significant. Here, all ATLOP models use NER
predictions from the baseline GLiNER model.</p>
      <p>The results on the test set are shown in Table 10. While XLM-R and REBEL do not fall too much behind
BiomedELECTRA, BiomedELECTRA, as a base model with task pretraining, is the best-performing
model. Here again, all ATLOP models use NER predictions from the baseline GLiNER model.
</p>
      <sec id="sec-7-3">
        <title>ATLOP + BiomedELECTRA + TP (200 epochs)</title>
        <p>7.2. Subtask 6.2.2 - TT-RE
The dev and test set results for TT-RE are shown in Table 11 and Table 12, respectively. Here, the
figures are similar to those from the previous task, with the main difference being that on the test set
"ATLOP + BiomedELECTRA + TP" trained for 200 epochs performs slightly better than the version
trained for 100 epochs. Here, all ATLOP models use the NER predictions from the baseline
GLiNER model.</p>
        <sec id="sec-7-3-1">
          <title>7.3. Subtask 6.2.3 - Ternary Mention-based Relation Extraction</title>
          <p>For the final subtask, Ternary Mention-based Relation Extraction (TM-RE), the results are again fairly
similar. The results on the dev set can be found in Table 13 and again demonstrate that ATLOP with
BiomedELECTRA as a base model is the best performing model, with XLM-R being slightly behind and
rebel-large showing strong performance while being an end-to-end model. Here, all ATLOP models use
the NER predictions from the baseline GLiNER model.</p>
          <p>The results on the test set are shown in Table 14. This time XLM-R slightly outperforms
BiomedELECTRA, and rebel-large is not far behind. A significant degradation in performance is observed
compared to the previous tasks. This is partially due to the accumulation of errors from each task,
as the final task contains each of the other subtasks in itself. It could also partially be due to the fact
that the search space becomes much larger for the final subtask. Let N be the input length (number
of tokens) and let S be the number of possible spans (ordered contiguous sub-sequences of tokens) in an
input of length N; then S = N(N+1)/2. Let the number of possible entity classes be C. For the NER
subtask, the search space is C^S, which is exponential with respect to the input length. For the binary
tag-based relation extraction subtask, the number of possible relations is C^2, assuming that each entity
type can be in relation with each other; since any subset of these relations could hold in a document,
the total search space is 2^(C^2). In our case this is much smaller, because the number of possible
types is much smaller than the number of tokens in a document, yet we still see some performance
degradation for this subtask. Some of the models, like ATLOP, require the predictions of the NER
subtask, which explains part of the degradation. REBEL is a generative model, so technically it needs to
search through an unbounded output space to generate the answer; instead of generating the answer,
one could score each candidate answer by its averaged token log probabilities under the model, which
could achieve better performance, but this was not done in time for the competition. For the TT-RE
subtask, the search space increases to 2^(R·C^2), where R is the number of possible relation types,
assuming that all relation types and all entity-type pairs are possible. This is not a significant increase
in the search space, which is also reflected in the results, as the performance for this subtask is just
a few percent lower than for the BT-RE subtask. For TM-RE, the search space increases to
(2R + 1)^(S^2) = (2R + 1)^((N(N+1)/2)^2), assuming that each entity mention can be in relation with
each other mention under every relation type, and that mentions may overlap. Although this is an
overestimation of the search space, it still shows that for this subtask the search space is significantly
larger, thus at least partially explaining the significantly lower results for the last subtask compared to
the other subtasks. Here, all ATLOP models use the NER predictions from the baseline GLiNER model.</p>
        </sec>
        <sec id="sec-7-3-2">
          <title>7.4. Hybrid Systems</title>
          <p>For subtasks 6.2.1, 6.2.2 and 6.2.3, a hybrid system with another participant in the challenge,
CLEANR [25], is proposed. The results are combined by taking the union and intersection of both
solutions in order to combine the systems’ strengths. The union achieves higher results than either of
the systems alone on subtasks 6.2.1 and 6.2.2. On subtask 6.2.3 our system alone demonstrates better
capabilities. This means that both systems are complementary. Results are shown in Table 15 and
Table 16.</p>
          <p>CLEANR (https://github.com/Dakantz/CLEANR) utilizes RAG as its approach to incorporate
detailed training data through semantic retrieval processes in the prompt for the language model (LM).
This few-shot approach, combined with dynamic retrieval, enables the system to be extended or
“retrained” by simply adding or reweighting the training samples. CLEANR extends the approach by
introducing a reweighting of the samples in the retrieval process to prefer samples with a higher degree
of confidence (i.e., prefer the Gold annotations over the Bronze annotations in our setting). A
sentence-transformer system is used to embed the given training samples and store them in a Postgres
database using the pgvector extension. CLEANR utilizes llama-cpp and llama-cpp-agent for both
efficient inference of pre-trained models and constrained generation from a provided grammar. The
grammar is generated using dynamically created models, which are transformed into the GBNF syntax,
which is then used to constrain the LM output to the exact schema provided by the challenge. CLEANR’s
output was combined with the results from ATLOP + BiomedELECTRA + TP (100 epochs).</p>
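The confidence-reweighted retrieval idea can be sketched as follows. This is our own illustration, not CLEANR's actual code: the cosine-times-confidence scoring rule and the Gold/Bronze weights are assumptions:

```python
import numpy as np

def reweighted_retrieve(query_emb, sample_embs, confidences, k=2):
    """Sketch of confidence-reweighted retrieval: cosine similarity scaled
    by a per-sample confidence weight (e.g. Gold > Bronze annotations).
    Returns the indices of the top-k training samples."""
    sims = sample_embs @ query_emb / (
        np.linalg.norm(sample_embs, axis=1) * np.linalg.norm(query_emb))
    scores = sims * confidences        # prefer high-confidence samples
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(1)
embs = rng.normal(size=(5, 16))                 # stand-in sample embeddings
conf = np.array([1.0, 0.5, 0.5, 1.0, 0.5])      # Gold=1.0, Bronze=0.5 (illustrative)
top = reweighted_retrieve(rng.normal(size=16), embs, conf)
print(top)
```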
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Discussion &amp; Conclusion</title>
      <p>This challenge is of major importance for bootstrapping the development of tools for automated analysis
of gut-brain related literature, thereby facilitating research in the area. The provided data is
of good quantity as a starting point for fine-tuning deep learning models; however, some categories
of named entities are underrepresented and need alternative approaches. For gene extraction, for example,
gazetteers may increase recall, but considerable effort to remove noise then becomes necessary. And, of course,
more annotated data would potentially allow better training of these models.</p>
      <p>Our takeaways from the NER task are twofold: (i) GLiNER deserves more attention - it is our top performing
system on the NER task, it shows the best capability to generalize, and it is worth exploring in more depth;
(ii) progressive training of the models on the dev set does not yield effective results on the test set - all
these models degrade in performance on the test set in comparison with the dev set. This could also
mean that the dev and train sets are not really close in terms of annotation agreement.</p>
      <p>For the relation extraction task, and ATLOP especially, we come to the conclusion that the model
is very sensitive to the initial set of named entities it works on. If the model is provided with more
entities than those entering into relations, then the performance drops heavily. E.g. the ATLOP results
are worse when the gold named entities are provided as input than when working on the GLiNER
entities. Therefore initial pre-processing of the supplied named entities improves the results.</p>
      <p>The hybrid system in which our RE model took part outperformed our own system, which means that
the results of both systems are complementary rather than overlapping; therefore combining a
transformer-based language model with an LLM-based approach seems promising for further research.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work is partially supported by European Union’s Horizon research and innovation programme
projects RES-Q PLUS [Grant Agreement No. 101057603] and HEREDITARY [Grant Agreement No.
101137074]. Views and opinions expressed are however those of the author only and do not necessarily
reflect those of the European Union. Neither the European Union nor the granting authority can be
held responsible for them.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>comprehensive survey, recent advancements, and future research directions, Neurocomputing
(2024) 129171.
[12] L. Luo, P.-T. Lai, C.-H. Wei, C. N. Arighi, Z. Lu, Biored: a rich biomedical relation extraction
dataset, Briefings in Bioinformatics 23 (2022) bbac282.
[13] A. Névéol, R. I. Doğan, Z. Lu, Semi-automatic semantic annotation of pubmed queries: a study on
quality, efficiency, satisfaction, Journal of biomedical informatics 44 (2011) 310–318.
[14] M. Sänger, U. Leser, Large-scale entity representation learning for biomedical relationship
extraction, Bioinformatics 37 (2021) 236–242.
[15] N. A. A. Hassan, R. A. A. A. A. Seoud, D. A. Salem, Open information extraction methodology for
a new curated biomedical literature dataset, International Journal of Advanced Computer Science
and Applications 14 (2023).
[16] W. Zhou, K. Huang, T. Ma, J. Huang, Document-level relation extraction with adaptive thresholding
and localized context pooling, Proceedings of the AAAI Conference on Artificial Intelligence 35
(2021) 14612–14620. doi:10.1609/aaai.v35i16.17717.
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, 2019. URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[18] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as
discriminators rather than generators (2020).
[19] L. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: Third Workshop
on Very Large Corpora, 1995. URL: https://aclanthology.org/W95-0107/.
[20] M. Neumann, D. King, I. Beltagy, W. Ammar, Scispacy: Fast and robust models for biomedical
natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task,
2019, pp. 319–327. URL: https://arxiv.org/abs/1902.07669.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, 2023. URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
[22] R. Jia, C. Wong, H. Poon, Document-level n-ary relation extraction with multiscale representation
learning, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics,
Minneapolis, Minnesota, 2019, pp. 3693–3704. URL: https://aclanthology.org/N19-1370/. doi:10.
18653/v1/N19-1370.
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, 2019. URL: https://arxiv.org/abs/1907.
11692. arXiv:1907.11692.
[24] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020.</p>
      <p>URL: https://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[25] B. Kantz, P. Waldert, S. Lengauer, T. Schreck, Constrained linked entity annotation using rag
(cleanr), in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 –
Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Dong</surname>
          </string-name>
          , E. Mayer,
          <article-title>Advances in brain-gut-microbiome interactions: a comprehensive update on signaling mechanisms, disorders, and therapeutic implications</article-title>
          ,
          <source>Cellular and molecular gastroenterology and hepatology 18</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Mohajeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>La</surname>
          </string-name>
          <string-name>
            <surname>Fata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Steinert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Relationship between the gut microbiome and brain function</article-title>
          ,
          <source>Nutrition reviews 76</source>
          (
          <year>2018</year>
          )
          <fpage>481</fpage>
          -
          <lpage>496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Irrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          , Overview of GutBrainIE@CLEF 2025:
          <article-title>Gut-Brain Interplay Information Extraction</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>U.</given-names>
            <surname>Zaratiana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tomeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holat</surname>
          </string-name>
          , T. Charnois,
          <article-title>GLiNER: Generalist model for named entity recognition using bidirectional transformer</article-title>
          , in: K. Duh,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.),
          <source>Proceedings of the</source>
          <year>2024</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>5364</fpage>
          -
          <lpage>5376</lpage>
          . URL: https://aclanthology.org/2024.naacl-long.300/. doi:10.18653/v1/2024.naacl-long.300.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, Y. Gu,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Fine-tuning large neural language models for biomedical natural language processing</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2112.07869. doi:10.48550/ARXIV.2112.07869.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          ,
          <year>2020</year>
          . arXiv:2007.15779.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.-L.</given-names>
            <surname>Huguet Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>REBEL: Relation extraction by end-to-end language generation</article-title>
          , in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>2370</fpage>
          -
          <lpage>2381</lpage>
          . URL: https://aclanthology.org/2021.findings-emnlp.204.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Document-level relation extraction with adaptive thresholding and localized context pooling</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2010.11304. arXiv:2010.11304.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Named entity recognition and relationship extraction for biomedical text: A</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>