<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From Named Entities to Relations: End-to-End Biomedical Information Extraction⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ron Keinan</string-name>
          <email>ronke21@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amir David Nissan Cohen</string-name>
          <email>amirdnc@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reut Tsarfaty</string-name>
          <email>reut.tsarfaty@biu.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bar Ilan University</institution>
          ,
          <addr-line>Ramat-Gan</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CLEF 2025 Working Notes</institution>
          ,
          <addr-line>9 - 12</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper details our contribution to the BioASQ CLEF Lab 2025 [1] GutBrainIE shared task (task 6) [2]. The mission focuses on Named Entity Recognition (NER) and Relation Extraction (RE) from biomedical abstracts concerning the gut-brain axis, Parkinson's disease, and mental health. We developed a system leveraging large language models (LLMs), employing GLiNER for NER and ATLOP for RE, fine-tuned on various backbones including GLiNER Large Bio and roberta large. Our approach involved a three-stage pipeline: data processing, model fine-tuning, and prediction generation. We participated in four subtasks: NER (6.1), Binary Tag-Based RE (6.2.1), Ternary Tag-Based RE (6.2.2), and Ternary Mention-Based RE (6.2.3). Our results indicate strong performance in tag-based RE, with our roberta-large model achieving a micro-F1 score of 0.6122 in binary RE (ranked 5th out of 11) and 0.5911 in Ternary tag-based RE (ranked 6th out of 12), outperforming the baseline in both cases. However, our system struggled with NER (micro-F1 0.4816, ranking 15th) and particularly with Ternary Mention-Based RE (micro-F1 0.1799, ranking 11th), highlighting challenges with fine-grained entity detection and mention-level relation identification. We conclude that while large transformers are efective for extraction of biomedical relationships, future work must address domain adaptation for NER and explore joint modeling approaches to improve mention-level performance.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Biomedical Information Extraction</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Relation Extraction</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Transformers</kwd>
        <kwd>GLiNER</kwd>
        <kwd>ATLOP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>We present a system for extracting biomedical entities and their relations concerning the gut microbiota,
Parkinson’s disease, and mental health from PubMed abstracts. These associations, while increasingly
evidenced, remain buried within unstructured scientific texts. The challenge aims to promote the
development of NLP systems for the identification of key biomedical entities and their relations within
PubMed abstracts.</p>
      <p>The task comprises two primary subtasks: Named Entity Recognition (NER) and Relation Extraction
(RE). Subtask 6.1 (NER) involves detecting and classifying spans into 13 biomedical categories, including
microorganisms, diseases, and chemicals. Subtask 6.2 (RE) is divided into three phases: binary relation
detection (6.2.1), ternary tag-based classification (6.2.2) and mention-level relation identification (6.2.3).
The dataset includes multiple annotation tiers (Platinum, Gold, Silver, Bronze), ofering varied training
and evaluation scenarios.</p>
      <p>
        To address these tasks, we developed a system based on state-of-the-art Transformer models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
specifically leveraging encoders [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] within a pipeline architecture. The preprocessing steps included
annotation conversion and text normalization. For NER, we fine-tuned multiple GLiNER model variants[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
including domain-adapted backbones. For RE, we used the ATLOP architecture[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] , experimenting
with SapBERT from PubMedBERT fulltext [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], biobert v1.1 pubmed [8] and roberta large [9]. Training
was conducted in a supervised manner using combined datasets from the higher annotation tiers:
      </p>
      <p>Platinum (externally reviewed expert annotations), Gold (in-house expert annotations), Silver (student
annotations: A – high accuracy, B – moderate accuracy).</p>
      <p>We submitted multiple runs per subtask, each based on a diferent transformer model and varying
in training epochs, batch size, and filtering threshold. Notably, our best results were achieved in
the tag-based RE subtasks (6.2.1 and 6.2.2). Our system involved data preprocessing and fine-tuning
RuBERTa-Large with an ALTOP-style approach, ranked 5th and 6th and outperforming the baseline. In
contrast, our performance in the NER subtask (15th) and the mention-level RE subtask (11th) lagged
behind the top systems reveal ongoing dificulties with fine-grained biomedical NER and complex
relation extraction. However, our performance in the NER (15th) and mention-level RE (11th) subtasks
lagged, indicating that fine-grained biomedical entity and relation extraction remains a persistent
challenge.</p>
      <p>The reminder of this paper is organized as follows. We begin with a review of related work, followed by
a description of our system architecture and data processing pipeline. We then present our experimental
setup and results, including comparisons with other participants. We conclude with a discussion of the
insights gained, limitations encountered, and directions for future research, particularly in improving
domain adaptation, joint modeling, and knowledge integration for biomedical NLP.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The exponential growth of textual data, particularly in specialized domains such as biomedicine[10][11],
requires advanced Information Extraction (IE) techniques to help researchers and practitioners manage
the influx of information and uncover new insights [ 12]. IE encompasses a variety of tasks, with Named
Entity Recognition (NER) and Relation Extraction (RE) being fundamental for transforming unstructured
text into structured knowledge.</p>
      <sec id="sec-2-1">
        <title>2.1. Named Entity Recognition</title>
        <p>Named Entity Recognition (NER), is a foundational task in Natural Language Processing (NLP) that
involves identifying and categorizing predefined entities in unstructured text, such as names of persons,
organizations, locations, dates, and monetary values [13]. Its primary objective is to locate spans of
text that correspond to specific semantic categories, thereby enabling downstream applications such as
information retrieval, question answering, and knowledge base construction.</p>
        <p>To illustrate the application of NER in the biomedical domain, consider the following sentence
extracted from a biomedical abstract [14]:
"Probiotics are live microorganisms that confer health benefits on the host when
administered in adequate amounts."</p>
        <p>In this example, an efective NER system should correctly identify and classify “ Probiotics” as a
TherapeuticAgent and “microorganisms” as a BiologicalEntity. Recognizing such fine-grained
entities is essential for downstream biomedical applications, such as knowledge base construction,
therapeutic analysis, and personalized medicine.</p>
        <p>Historically, early NER systems were built using hand-crafted rules and domain-specific dictionaries.
While efective in constrained environments, these rule-based methods lacked generalization across
domains. The advent of statistical machine learning approaches—such as Hidden Markov Models (HMMs)
[15], Support Vector Machines (SVMs) [16], and Conditional Random Fields (CRFs) [17]—marked a
significant paradigm shift. These models leveraged annotated corpora and engineered features such as
word shape, orthographic patterns, part-of-speech tags, and lexical resources like gazetteers to improve
generalization and adaptability.</p>
        <p>
          The advent of deep learning has significantly advanced NER capabilities[ 18]. Recurrent Neural
Networks (RNNs), particularly Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)
networks, often combined with CRFs (e.g., BiLSTM-CRF architectures), became the standard for their
ability to capture contextual information efectively [ 19, 20]. More recently, transformer-based models,
pre-trained on vast text corpora, have set new benchmarks by learning rich contextual representations
of words, with advancements seen in models like BERT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], RoBERTa [21], and ELECTRA [22]. Large
Language Models (LLMs) are also increasingly being explored for their capabilities in NER, often through
few-shot or zero-shot learning approaches [23].
        </p>
        <p>
          In the biomedical domain (BioNER), the focus shifts to identifying biologically significant entities such
as genes, proteins, diseases, chemicals, cell lines, and microbial taxa [24]. The complexity and ambiguity
of biomedical terminology[25][? ], along with the constant emergence of new terms, make BioNER a
particularly challenging task [26]. Initial BioNER systems also relied on dictionary-based methods and
rule-based systems, followed by traditional machine learning models like CRFs [27]. However, deep
learning methodologies, especially transformer-based pre trained language models like BERT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and
its domain-specific variants such as biobert v1.1 pubmed [8], PubMedBERT [28], and ClinicalBERT [29],
have demonstrated state-of-the-art performance. These models leverage vast amounts of unlabeled
biomedical text to learn contextual representations, and are then fine-tuned on task-specific annotated
datasets to recognize specific entity types relevant to a particular study.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Relation Extraction</title>
        <p>Relation Extraction (RE) is a key task in Information Extraction that seeks to identify and classify
semantic relationships between entities. In a general domain context, RE might identify that a person
"works for" an organization, or that a company is "headquartered in" a particular city. Traditional
RE systems often relied on hand-crafted linguistic patterns, dependency parsing, or kernel-based
methods to capture syntactic and semantic interactions between entity pairs [30, 31]. These approaches
typically involved extensive feature engineering to encode lexical, syntactic, and positional information
surrounding the target entities.</p>
        <p>To illustrate, consider the following sentence from a biomedical abstract:
"Reduced levels of Lactobacillus have been associated with increased severity of
depression symptoms in individuals diagnosed with Parkinson’s disease."</p>
        <p>
          A successful RE system should detect a ternary relation of type
Microbe-Disease-MentalHealthLink,
involving the entities: “Lactobacillus” (a Bacterium), “Parkinson’s disease” (a Disease), and
“depression” (a MentalHealthCondition). This example highlights the challenge of n-ary relation
extraction, wherein more than two entities must be jointly considered within a unified relational frame.
Such n-ary structures are prevalent in biomedical texts and are central to the knowledge discovery
objectives of the GutBrainIE task [
          <xref ref-type="bibr" rid="ref2">32, 33, 2</xref>
          ].
        </p>
        <p>In the biomedical domain, Relation Extraction (BioRE) is tasked with identifying clinically and
biologically significant associations, such as gene-disease links, protein-protein interactions, or therapeutic
drug-disease relationships. The GutBrainIE task extends this paradigm to encompass a range of entities,
including microbial species, neurodegenerative diseases, mental health conditions, and physiological
factors. The biomedical literature’s syntactic complexity, domain-specific vocabulary, and frequent
presence of long-range dependencies demand robust, domain-adapted modeling strategies [34, 35].</p>
        <p>The advent of deep learning has significantly transformed RE by enabling models to learn
discriminative features automatically from raw text. Convolutional Neural Networks (CNNs) have been employed
to extract local contextual features [36], while Recurrent Neural Networks (RNNs), have been utilized
to capture sequential dependencies [37]. More recently, attention mechanisms and transformer-based
encoder architectures have achieved state-of-the-art performance by enabling models to dynamically
attend to the most informative parts of the sentence [38].</p>
        <p>Transformer-based encoder models, including those fine-tuned on biomedical corpora such as
PubMedBERT or BioLinkBERT, have shown considerable eficacy in handling the intricacies of BioRE.
These models can be adapted to treat relation extraction as a classification task over entity tuples, or
through more sophisticated formulations using structured prediction or graph-based reasoning [35, 33].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Encoder-based Approaches for NER and RE</title>
        <p>The landscape of Natural Language Processing has seen rapid advancements. These models, pre-trained
on exceptionally large and diverse datasets, exhibit remarkable few-shot or even zero-shot learning
capabilities for various tasks, including NER and RE. For instance, language models can be prompted
for direct entity identification or fine-tuned with relative eficiency on smaller annotated datasets [ 39],
and they are being explored in specialized areas like the biomedical domain for rapid adaptation and
broader text applicability [40, 41]. Similarly, for RE, language models ofer potential for open-ended
relation extraction and inferential capabilities [42, 43].</p>
        <p>However, alongside these developments, Transformer-based encoder architectures, such as BERT
[44] and its numerous variants, remain highly efective and widely utilized for NER and RE tasks. These
models excel at learning rich contextual representations from text and can be fine-tuned to achieve
state-of-the-art performance on specific datasets and domains. For NER, architectures focusing on
robust entity span detection and classification, exemplified by models like GLINER, leverage powerful
encoder backbones. In the realm of RE, techniques often involve identifying entity pairs and classifying
their relationships, with models such as ATLOP demonstrating sophisticated approaches to this task.
Furthermore, specialized pre-training objectives, as seen in models like SuMeBERTs , can enhance
performance on downstream tasks by tailoring the encoder’s understanding to particular nuances of
language or information structure.</p>
        <p>While LLMs present exciting avenues, particularly for generative tasks and broad-domain applications,
dedicated encoder-based models ofer advantages in terms of computational eficiency for fine-tuning
and inference, focused performance on specific predictive tasks, and often more direct interpretability of
task-specific layers. Challenges in LLMs related to factual accuracy, hallucination mitigation (especially
for specific biomedical entities), and structured prediction for complex relations [ 42] also underscore
the continued relevance of developing and employing well-established and robust encoder-centric
approaches. This work focuses on leveraging the strengths of such encoder-based models for NER and
RE.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Biomedical Shared Tasks and Benchmark Challenges</title>
        <p>Shared tasks have been instrumental in catalyzing progress within biomedical natural language
processing by delivering high-quality benchmark datasets, well-defined evaluation metrics, and platforms
for community engagement. Notable initiatives—including the BioCreative series for assessment of
Information Extraction Systems in Biology [45], the BioNLP Shared Task for structural information
extraction from biomedical literature [46], and the BioASQ challenge (focusing on biomedical semantic
indexing and question answering) [47]. These shared tasks have each contributed to advances in entity
recognition, relation extraction, and broader information-extraction tasks through the provision of
domain-specific corpora and rigorous, competitive evaluation frameworks. These challenges foster
methodological innovation by encouraging participants to devise models that can handle complex
nomenclature, ambiguous terminology, and varied syntactic constructions inherent to biomedical text.</p>
        <p>Building on this lineage, the Gut-BrainIE Task1 concentrates specifically on the gut–brain axis, a
domain characterized by intricate, multi-scale interactions between microbial, molecular, and
physiological entities. By curating a corpus annotated with specialized entity types (e.g., microbial taxa,
neuroactive compounds) and their interrelations (e.g., modulatory efects, transport mechanisms),
GutBrainIE extends the shared-task paradigm to a highly interdisciplinary context. Moreover, its evaluation
schema emphasizes not only exact-match accuracy but also the correct identification of nested and
overlapping spans, as well as fine-grained relation categories that reflect causal or correlational links.</p>
        <p>
          Our work leverages state-of-the-art NER architectures—such as GLiNER [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which employs span
classification—and sophisticated document-level RE models—like ATLOP [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which integrates adaptive
thresholding with localized context pooling—to address the specific challenges posed by the gut–brain
literature. In doing so, we aim to demonstrate how modern transformer-based encoders [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], when
        </p>
        <sec id="sec-2-4-1">
          <title>1https://hereditary.dei.unipd.it/challenges/gutbrainie/2025/#</title>
          <p>coupled with task-tailored pre and post-processing strategies, can achieve robust performance on a
domain that demands both linguistic precision and biological insight.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Background and Model Foundations</title>
      <sec id="sec-3-1">
        <title>3.1. GLiNER for Named Entity Recognition</title>
        <p>
          GLiNER (Generalist Language model for NER) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a span classification framework built upon a
bidirectional language model (BiLM), such as BERT or DeBERTa. At inference time, GLiNER constructs
a unified input sequence by prepending special marker tokens for each target entity type to the input
sentence, forming a structure such as:
[ENT] PERSON [ENT] LOCATION [ENT] ORGANIZATION ...
sentence.
[SEP] The input
        </p>
        <p>This augmented sequence is processed by the BiLM, which generates contextualized embeddings
for all tokens, including both the entity-type and the sentence words. GLiNER consists of three
main components. First, the pretrained BiLM encodes the full sequence, capturing the contextual
relationships across tokens. Second, a span representation module computes embeddings for each
candidate span—defined as any contiguous sequence of words—by concatenating the embeddings of
the span’s start and end tokens and passing them through a two-layer feedforward network. Third, an
entity representation module constructs embeddings for each entity type by refining the BiLM output
for the corresponding [ENT] token using another feedforward network.</p>
        <p>Both span and entity-type embeddings are projected into a shared latent space, where GLiNER
calculates a similarity score for each span–type pair using a dot product. This score is passed through a
sigmoid activation to estimate the probability that the span belongs to the given entity type. During
training, the model is optimized using binary cross-entropy loss over all such span–type combinations.
Positive pairs are those that appear in the annotated training data, while negative pairs are generated by
sampling mismatched entity types from other training examples within the same batch. This negative
sampling strategy helps the model learn to diferentiate between correct and incorrect associations.
It helps recognize that real-world data often lacks certain entity types. The model’s performance is
evaluated using diferent negative sampling ratios. Training with only positive entities leads to more
false positives (lower precision), while a high negative sampling ratio makes the model overly cautious,
resulting in missed entities (lower recall).</p>
        <p>For decoding, GLiNER uses a greedy span selection algorithm that identifies the most probable spans
while enforcing task-specific constraints. In flat NER mode, only non-overlapping spans are selected.
In nested NER mode, the algorithm allows for nested spans—those that are entirely contained within
other spans—while avoiding partial overlaps. This approach allows GLiNER to eficiently extract both
lfat and hierarchical named entities in a single forward pass.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ATLOP for Document-Level Relation Extraction</title>
        <p>
          Zhou et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] introduce the ATLOP (Adaptive Thresholding and Localized Context Pooling) model for
document-level multi-label relation extraction. ATLOP is built on a pretrained transformer encoder and
avoids the use of intermediate graph structures by leveraging two key mechanisms:
• Localized Context Pooling. ATLOP repurposes the self-attention heads of the pretrained
encoder to derive entity-level attention distributions. For each entity in a candidate pair, the
model averages the self-attention scores over all mentions of that entity to obtain an attention
distribution. These two distributions are then combined (via elementwise multiplication) to
highlight tokens jointly relevant to both entities. The resulting fused weights are used to pool the
encoder’s token representations into a localized context embedding, which serves as a pair-specific
representation for relation classification.
• Adaptive Thresholding. To address the multi-label nature of document-level RE, ATLOP
replaces the conventional fixed decision threshold with a learnable threshold class. During
training, a rank-based loss encourages true relation logits to exceed the threshold logit and
nonrelation logits to fall below it. At inference, any relation whose score surpasses the threshold class
is predicted, thereby removing the need for heuristic threshold tuning and allowing flexibility
across diferent entity pairs.
        </p>
        <p>
          Extensive experiments on RE datasets such as DocRED, CDR, and GDA [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] demonstrate that ATLOP
achieves state-of-the-art performance, validating the efectiveness of localized context pooling and
adaptive thresholding in document-level relation extraction.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The GutBrainIE Task</title>
      <sec id="sec-4-1">
        <title>4.1. Overview of the GutBrainIE Task</title>
        <p>
          We participated in Task #6 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] of the BioASQ [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] CLEF Lab 2025 (GutBrainIE1), which focuses on
extracting entities and relations relevant to the gut–brain axis from PubMed abstracts. This task is
divided into two main subtasks. The first subtask, Named Entity Recognition (NER, Subtask 6.1), involves
identifying and classifying text spans into one of 13 predefined biomedical categories, such as bacteria,
chemical, or microbiota. The expected output format includes the entity label, its location within the title
or abstract, and the start and end character ofsets. The second subtask, Relation Extraction (RE, Subtask
6.2), consists of three progressive phases. In Subtask 6.2.1, Binary RE, the goal is to detect whether a
relation exists between any pair of entities within a PubMed abstract, without the need to specify the
relation type. Subtask 6.2.2, Ternary tag-based RE, requires not only identifying related entities but also
predicting the type of relation between them. Finally, Subtask 6.2.3, Ternary mention-based RE, focuses
on pinpointing the specific entity mentions involved in a relation and classifying the type of relation
they share.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Overview of the GutBrainIE Data</title>
        <p>The annotated GutBrainIE corpus comprises titles and abstracts of biomedical articles retrieved from
PubMed, focusing on the gut–brain interplay and its implications for neurological and mental health.
Each entry in the dataset corresponds to a PubMed article identified by its PubMed ID (PMID) and is
stored in JSON format for ease of integration with NLP pipelines. Entries include rich metadata—such
as title, authorship, journal, publication year, and abstract—alongside the identifier of the annotator.
Annotators are categorized as expert annotators, student annotators, and automated distant annotations.</p>
        <p>The dataset contains two main types of annotations: entities and relations. Entity annotations consist
of individual mentions, each defined by character ofsets (start and end), location (title or abstract), the
text span itself, and a semantic label (e.g., bacteria, microbiome). Relation annotations represent links
between two entity mentions and include detailed information for both the subject and object: their
character ofsets, location, text span, and semantic label, as well as the predicate describing the type of
relationship.</p>
        <p>For ease of use in downstream tasks, relations are also provided in three derived formats: (1) binary
tagbased relations, capturing label pairs of subject and object; (2) ternary tag-based relations, representing
triplets of subject label, predicate, and object label; and (3) ternary mention-based relations, which
include mention-level tuples comprising the subject and object text spans, their respective labels, and
the predicate.</p>
        <p>Annotations are organized into four hierarchical tiers, reflecting varying levels of quality and
provenance. The overall distribution of annotations across these tiers is summarized in Table 1. Specifically,
the tiers are defined as follows:
• Platinum-Standard Annotations: Highest-quality annotations, curated and externally
reviewed by biomedical specialists to ensure maximal precision.
• Gold-Standard Annotations: High-quality annotations produced in-house by domain experts.
• Silver-Standard Annotations: Mid-quality annotations created by trained students under expert
supervision, subdivided into two clusters:
– Student A: Annotators demonstrating consistently high accuracy.</p>
        <p>– Student B: Annotators with less consistent performance.
• Bronze-Standard Annotations: Automatically generated annotations using fine-tuned GLiNER
for NER and ATLOP for relation extraction.</p>
        <p>These hierarchical tiers enable a nuanced understanding of annotation quality and provenance, which
is critical for downstream evaluation and model training. By distinguishing between levels of human
expertise and automation, the dataset supports flexible experimentation across a spectrum of reliability,
allowing researchers to benchmark their systems under varying degrees of annotation fidelity.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. System Architecture and Workflow</title>
        <p>Our pipeline comprises three main stages: data processing, model fine-tuning, and prediction. During
data processing, we standardize and aggregate annotated data to prepare it for training, structuring
named entity recognition (NER) data in the GLiNER format, which includes entity spans and types
in a structured JSON representation, and transforming relation extraction (RE) data into the ATLOP
format, which captures entity pairs and their associated relations within the context of a document.
In the model fine-tuning stage, we adapt the GLiNER model for NER using five variants—GLiNER
multipii-v1, Medium, Large, Large Bio v0.1, and Large Bio v0.2—each ofering diferent balances between
general language capabilities, biomedical specialization, and model size to evaluate performance across
diverse settings. For RE, we fine-tune the ATLOP model on four pretrained language models: SapBERT
from PubMedBERT fulltext, bert-base-cased, biobert v1.1 pubmed v1.1, and roberta large, leveraging their
distinct strengths in general and biomedical language representation to enhance relation extraction.
Finally, in the prediction stage, we generate entity and relation outputs, merge the results from NER
and RE, and convert them into the appropriate evaluation format.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Key Pre-processing Steps</title>
        <p>Consistent pre-processing was applied to optimize model performance:
1. Lowercase: All input text was converted to lowercase to reduce vocabulary size and mitigate
data sparsity. This standardizes input, treating "Organization" and "organization" as identical,
which aids generalization despite potential loss of casing signals.
2. Space Normalization: This involved standardizing whitespace by collapsing multiple
consecutive spaces into one and removing leading/trailing spaces. This ensures input consistency,
preventing models from learning spurious patterns from irregular spacing.
3. Tokenization: Text was broken down into tokens using the specific tokenizer trained with the
corresponding fine tuned model. This sub-word tokenization handles out-of-vocabulary words
and captures morphological similarities. For NER, entity labels were carefully aligned with tokens,
often assigning the primary label to the first sub-token.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Data and Training</title>
        <p>For both Named Entity Recognition (NER) and Relation Extraction (RE) tasks, we utilized a combination
of Platinum, Gold, and Silver datasets to maximize training coverage and robustness.</p>
        <p>The NER model was trained using the GLiNER architecture. Training was conducted for 30 epochs
with a batch size of 8 and a maximum sequence length of 384 tokens. The detailed hyperparameters are
listed in Table 2.</p>
        <p>For the RE task, we adopted the ATLOP model. Training was performed over 500 epochs with a batch
size of 4 and a maximum input length of 1024 tokens. The complete configuration is summarized in
Table 3.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Evaluation Metrics</title>
        <p>To rigorously evaluate system performance across both subtasks, we used the metrics provided by
the task organizers2. The metrics Precision, Recall, and F1-score, computed under both macro- and
micro-averaging schemes. These metrics allow for the assessment of model efectiveness at both the
label level and across the entire label distribution. The evaluation criteria are consistent across all
subtasks, and system outputs are benchmarked against manually annotated ground truth labels.</p>
        <p>Let  denote the set of target labels:
• For Subtask 6.1,  includes all entity labels.
• For Subtask 6.2.1,  refers to pairs of (subject label, object label).</p>
        <p>• For Subtasks 6.2.2 and 6.2.3,  comprises triples of (subject label, predicate, object label).</p>
        <p>The macro-average scores compute metric values independently for each label and then average them,
treating all labels equally regardless of their frequency. In contrast, micro-average scores aggregate the
contributions of all classes to compute overall metrics, giving more weight to frequent labels.</p>
        <p>The evaluation metrics are formally defined as follows:
macro =
1 ∑︁</p>
        <p>|| ∈   +</p>
        <p>macro · macro ,
 1macro = 2 · macro + macro
,
macro =
micro = ∑︀
1 ∑︁</p>
        <p>|| ∈   +</p>
        <p>∑︀∈  
∈(  +  )
,
micro = ∑︀
∑︀∈</p>
        <p>micro · micro .</p>
        <p>∈(  +  ) ,  1micro = 2 · micro + micro</p>
        <p>The primary leaderboard metric for ranking participating systems is the micro-averaged F1-score, as it
more efectively captures model performance in the presence of class imbalance, a common characteristic
of real-world relation extraction tasks. Nonetheless, macro-averaged scores and per-relation metrics
are also reported to provide a comprehensive performance profile, including sensitivity to rare and
long-tail classes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Participation Overview</title>
        <p>The shared task comprised four subtasks. Subtask 6.1 (NER) saw 16 teams submit at least one run,
ranging from the organizer’s baseline to ensemble and transformer-based systems. Subtask 6.2.1 (Binary
Tag-Based RE) included 11 teams, with top systems leveraging both rule-based and deep-learning
approaches. Subtask 6.2.2 (Ternary Tag-Based RE) attracted 12 teams, many extending their binary-tag
RE pipelines to support a neutral relation label. Subtask 6.2.3 (Ternary Mention-Based RE) saw 12 teams
tackling mention-level relation extraction with three labels, including several strong neural-model
submissions.
2https://hereditary.dei.unipd.it/challenges/gutbrainie/2025/#five</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. System Results</title>
        <p>The results for the BIU-ONLP submissions across all subtasks and runs are shown in Table 4. For Subtask
T6.1, the multi pii v1 model achieved the highest Macro-F1 and Micro-F1 scores among all submissions,
indicating strong overall performance on entity recognition. Among the GLiNER variants, gliner large
bio-v0.1 performs best in terms of Micro-F1. For Subtask T6.2.1 and T6.2.2, roberta large consistently
outperforms other models across both macro and micro metrics, demonstrating its robustness in diferent
settings. Finally, Subtask T6.2.3 shows significantly lower scores overall, reflecting the higher dificulty
of this task, although roberta large still maintains a marginal lead in performance.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Comparison to Other Participants</title>
        <p>Table 5 presents our rank, micro-F1, the best micro-F1, and the organizer’s baseline for each subtask.
We see that our system performed competitively on the Binary and Tag-Based Relation Extraction (RE)
subtasks, ranking mid-field and surpassing the baseline in both cases. In Binary RE (6.2.1), we achieved
a micro-F1 of 0.6122, placing 5th out of 11, and similarly in Tag-Based RE (6.2.2), we ranked 6th out of
12 with a score of 0.5911. However, our performance in the NER (6.1) and Mention-Based RE (6.2.3)
subtasks was substantially lower, with our NER system ranking near the bottom and our Mention-Based
RE system showing the largest performance gap relative to the best and baseline systems.
• NER (Subtask 6.1): Our best micro-F1 (0.4816) places us near the bottom (15th of 16), trailing the
baseline by over 0.31. The large gap to the top system (0.8408) suggests specialized biomedical
NER models or ensembling are crucial.
• Binary Tag-Based RE (Subtask 6.2.1): Ranking 5th of 12 with 0.6122 micro-F1, we
outperform the baseline by 0.0175. The margin to the best (0.6864) is 0.0742, indicating competitive
performance but room for relation-specific fine-tuning.
• Ternary Tag-Based RE (Subtask 6.2.2): Our 6th place with 0.5911 micro-F1 is above the
baseline by 0.0160. The 0.0955 gap to the top performer highlights challenges in neutral-relation
distinction.
• Ternary Mention-Based RE (Subtask 6.2.3): Our performance (0.1799) is well below both
baseline and top, reflecting the challenge of mention-level extraction. Low recall suggests the
mention detection module needs enhancement via joint modeling.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion and Conclusions</title>
      <p>Our participation in the CLEF 2025 biomedical shared task focused on four challenging subtasks,
including named entity recognition (NER) and multiple forms of relation extraction (RE). The evaluation
of our systems yields several important insights into model performance, dataset complexity, and
directions for future research.</p>
      <p>The strongest results were obtained using large pretrained transformer architectures, such as roberta
large. These models demonstrated particularly robust performance in the RE subtasks, achieving a
micro-F1 score of 0.6122 in Binary RE and 0.5911 in Ternary tag based RE. These outcomes highlight
the value of contextualized representations in modeling biomedical relations, especially under limited
supervision.</p>
      <p>In contrast, the NER subtask proved significantly more dificult in our settings. Our best-performing
NER model achieved a micro-F1 score of 0.4816. This performance gap underscores the challenges posed
by the dataset, including the presence of fine-grained and domain-specific entity types that are often
rare or absent in general pretraining corpora. The resulting domain shift and data sparsity hindered
generalization, particularly in recognizing low-frequency entities. Broadening the training dataset to
include a wider range of annotated entities across related domains could enhance the model’s ability to
generalize to rare or unseen entity types.</p>
      <p>In stark contrast to the RE tasks, our performance on the NER subtask (6.1) was unexpectedly poor,
with our best system ranking 15th out of 16 participants. This result fell significantly short of our
expectations, as our Micro-F1 score (0.4816) trailed the oficial baseline (0.7927) by a substantial margin.
This wide gap suggests that our chosen models were ill-suited for the specific challenges of this dataset.
While we initially hypothesized a domain shift, the scale of the underperformance indicates a more
fundamental mismatch. The top-performing systems and the baseline likely leveraged models with
extensive pre-training on biomedical corpora (e.g., biobert v1.1 pubmed, PubMedBERT), giving them
a decisive advantage in recognizing the domain-specific entity types that our more general models
struggled with.</p>
      <p>A particularly noteworthy and surprising finding from this subtask was the relative performance
of our GLiNER model variants. Counterintuitively, gliner medium v2.1 achieved a Macro-F1 score
(0.3864) that was not only comparable but slightly superior to its larger counterparts, gliner large v2.1
(0.3842) and gliner large bio v0.1 (0.3711). This outcome was somewhat unexpected. While larger models
usually outperform smaller ones due to their greater capacity, in this case, gliner medium v2.1 may
have benefited from better regularization or simply a more favorable initialization. Alternatively, the
performance convergence may indicate that the available training data was insuficient for the larger
models to fully realize their capacity, leading to overfitting or unstable learning. This points to the
importance of model-data fit, particularly in low-resource domains such as biomedical NER.</p>
      <p>The low recall in Ternary Mention-Based RE (macro-F1 &lt; 0.09) points to limitations in current pipeline
architectures for span detection and relation classification. It might be that the low reults in our NER
module created a cascade of errors, as relations cannot be correctly identified if the constituent entities
are missed. This confirms that for complex, mention-based RE, a pipeline architecture is suboptimal. We
suggest that future systems may benefit from joint models that integrate entity and relation extraction
into a unified framework. Second, while large transformer models excel in high-resource conditions,
they are prone to overfitting when applied to tasks with limited annotated data. Addressing this requires
exploring lightweight alternatives such as domain-specific data augmentation to enhance generalization.</p>
      <p>Moreover, error patterns in both NER and RE indicate confusion between semantically similar entity
types and fine-grained relation labels. These ambiguities suggest that leveraging external knowledge
bases to better disambiguate closely related entities and relations.</p>
      <p>In conclusion, while our models achieved competitive performance in several subtasks, the CLEF
2025 dataset exposes persistent challenges in biomedical information extraction. Progress in this field
will require not only more sophisticated architectures but also greater emphasis on domain adaptation,
structured knowledge integration, and robust learning from sparse data.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <sec id="sec-8-1">
        <title>This work was supported by the Israeli Innovation Authority.</title>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT, Gemini, and Grammarly in
order to: Grammar and spelling check, Paraphrase, and reword. After using these tools, the
authors reviewed and edited the content as needed and take full responsibility for the publication’s content.
[8] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical language
representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>URL: https://arxiv.org/abs/1907.11692.
[10] R. Keinan, E. Margalit, D. Bouhnik, Analysis of user trends in digital health communities using
big data mining, Plos one 19 (2024) e0290803.
[11] R. Keinan, E. A. Margalit, D. Bouhnik, Impacts of a public health crisis on health-centered online
social networks, Informing Science: The International Journal of an Emerging Transdiscipline 28
(2025) 022.
[12] Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi,
S. Sohn, H. Liu, Clinical information extraction applications: A literature review, Journal of
biomedical informatics 77 (2018) 34–49.
[13] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae</p>
      <p>Investigationes 30 (2007) 3–26.
[14] H. Mirzaei, S. Sedighi, E. Kouchaki, E. Barati, E. Dadgostar, M. Aschner, O. R. Tamtaji, Probiotics
and the treatment of parkinson’s disease: An update, Cellular and Molecular Neurobiology 42
(2022) 2449–2457. doi:10.1007/s10571-021-01128-w, epub 2021 Jul 20.
[15] L. R. Rabiner, A tutorial on hidden markov models and selected applications in speech recognition,</p>
      <p>Proceedings of the IEEE 77 (1989) 257–286.
[16] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[17] J. Laferty, A. McCallum, F. Pereira, Conditional random fields: Probabilistic models for segmenting
and labeling sequence data, in: Proceedings of the 18th International Conference on Machine
Learning, Morgan Kaufmann, 2001, pp. 282–289.
[18] D. Bareket, R. Tsarfaty, Neural modeling for named entities and morphology (NEMO2),
Transactions of the Association for Computational Linguistics 9 (2021) 909–928. URL: https://aclanthology.
org/2021.tacl-1.54/. doi:10.1162/tacl_a_00404.
[19] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named
entity recognition, in: Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, 2016, pp. 260–270.
[20] J. P. Chiu, E. Nichols, Named entity recognition with bidirectional lstm-cnns, Transactions of the</p>
      <p>Association for Computational Linguistics 4 (2016) 357–370.
[21] Y. Liu, H. Zao, M. Ghosal, A. Goyal, Y. Ding, J. Yang, L. Yang, D. Huang, D. Ma, Z. Yang, X. He,
L. Du, H. Hu, J. Wang, J. Yang, L. Yang, D. Huang, D. Ma, Z. Yang, X. He, L. Du, H. Hu, J. Wang,
RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692
(2019).
[22] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as
discriminators rather than generators, International Conference on Learning Representations (2020).
[23] I. Keraghel, S. Morbieu, M. Nadif, Recent advances in named entity recognition: A comprehensive
survey and comparative study, arXiv preprint arXiv:2401.10825 (2024).
[24] S. Yadav, S. Bethard, A survey on recent advances in named entity recognition from deep learning
models, in: Proceedings of the 27th International Conference on Computational Linguistics, 2019,
pp. 2145–2158.
[25] R. Keinan, E. Margalit, D. Bouhnik, Emotional analysis in a morphologically rich language:</p>
      <p>Enhancing machine learning with psychological feature lexicons, Electronics 14 (2025) 3067.
[26] S. Kim, J. Lee, S.-Y. Shin, J. Kang, Bioner: a comprehensive survey of named entity recognition in
the biomedical domain, Briefings in bioinformatics 21 (2020) 2111–2127.
[27] C. E. Kusuma, K. H. Chen, M. H. Hsieh, W. J. Lee, Y. C. Chang, Bioner: A survey of pre-trained
language models for biomedical named entity recognition, Briefings in Bioinformatics 24 (2023)
bbad346. doi:10.1093/bib/bbad346, published online Oct 2023, covers recent trends. If your
Kusuma2024 is diferent, please replace this.
[28] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Pubmedbert:
Domain-specific language model pretraining for biomedical text mining, in: Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8537–8547.
[29] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, M. McDermott, Publicly
available clinical bert embeddings, in: Proceedings of the 2nd Clinical Natural Language Processing
Workshop, 2019, pp. 72–78.
[30] D. Zelenko, C. Aone, A. Richardella, Kernel methods for relation extraction, Journal of machine
learning research 3 (2003) 1083–1106.
[31] R. C. Bunescu, R. J. Mooney, A shortest path dependency kernel for relation extraction, in:
Proceedings of the conference on Human Language Technology and Empirical Methods in Natural
Language Processing, 2005, pp. 724–731.
[32] N. Peng, H. Poon, C. Quirk, K. Toutanova, W.-t. Yih, Cross-sentence n-ary relation extraction with
graph lstms, Transactions of the Association for Computational Linguistics 5 (2017) 101–115.
[33] G. Westerfield, K. Lee, R. Xu, K. C.-C. Chang, TRIDENT: A TRansformer-based method for
openworld n-ary relation extraction, in: The Thirty-Seventh AAAI Conference on Artificial Intelligence,
AAAI 2023, Washington, DC, USA, February 7-14, 2023, AAAI Press, 2023, pp. 13889–13897. URL:
https://ojs.aaai.org/index.php/AAAI/article/view/26615.
[34] S. Zhao, L. Wang, Z. Liu, F. Li, Gutbrainie: A dataset for information extraction of gut-brain axis
from scientific literature, Database 2021 (2021) baab028. doi: 10.1093/database/baab028.
[35] C.-H. Lin, D. Ji, F. Li, A survey on biomedical relation extraction: methods and applications,</p>
      <p>Briefings in Bioinformatics 22 (2021) bbab149.
[36] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation classification via convolutional deep neural
network, in: Proceedings of COLING 2014, the 25th International Conference on Computational
Linguistics: Technical Papers, 2014, pp. 2335–2344.
[37] D. Zhang, D. Wang, Relation classification via recurrent neural network, arXiv preprint
arXiv:1508.01006 (2015).
[38] A. D. Cohen, S. Rosenman, Y. Goldberg, Supervised relation classification as two-way
spanprediction., in: AKBC, 2022.
[39] J. Wang, Y. Li, S. Chang, S. Liu, P. S. Yu, Is chatgpt a good ner annotator? a preliminary study,
arXiv preprint arXiv:2304.12790 (2023).
[40] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis,</p>
      <p>S. Pfohl, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.
[41] M. Agrawal, S. Hegselmann, D. Sontag, Large language models are diverse nlu models, arXiv
preprint arXiv:2305.12539 (2023).
[42] X. Han, W. Zhao, Z. Liu, Language Model Is a Good ReLation Extractor, in: Findings of the
Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics,
Toronto, Canada, 2023, pp. 2896–2907. URL: https://aclanthology.org/2023.findings-acl.184.
[43] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, T.-Y. Liu, Biogpt: generative pre-trained transformer
for biomedical text generation and mining, Briefings in bioinformatics 23 (2022) bbac409.
[44] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. URL:
https://aclanthology.org/N19-1423.
[45] R. Islamaj, P.-T. Lai, C.-H. Wei, L. Luo, T. Almeida, R. A. A. Jonker, S. I. R. Conceição, D. F. Sousa,
C.-P. Phan, J.-H. Chiang, et al., The overview of the biored (biomedical relation extraction dataset)
track at biocreative viii, Database 2024 (2024) baae069.
[46] D. Demner-Fushman, S. Ananiadou, M. Miwa, K. Roberts, J. Tsujii (Eds.), Proceedings of the 23rd
Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics,
Bangkok, Thailand, 2024. URL: https://aclanthology.org/2024.bionlp-1.0/.
[47] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, Bioasq-qa: A manually curated corpus for
biomedical question answering, Scientific Data 10 (2023) 170.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          ,
          <source>volume TBA of Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          , p.
          <source>TBA.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Irrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          , Overview of GutBrainIE@CLEF 2025:
          <article-title>Gut-Brain Interplay Information Extraction</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          , et al.,
          <article-title>Transformers: State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>U.</given-names>
            <surname>Zaratiana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tomeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holat</surname>
          </string-name>
          , T. Charnois, Gliner:
          <article-title>Generalist model for named entity recognition using bidirectional transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2311.08526</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/ abs/2311.08526.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          , T. Ma, J. Huang,
          <article-title>Document-level relation extraction with adaptive thresholding and localized context pooling</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>35</volume>
          , AAAI Press,
          <year>2021</year>
          , pp.
          <fpage>7093</fpage>
          -
          <lpage>7101</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/ view/17164.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shareghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Basaldella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <article-title>Self-alignment pretraining for biomedical entity representations</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4220</fpage>
          -
          <lpage>4230</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .naacl-main.
          <volume>334</volume>
          /.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>