<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CLEF 2025 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Understanding Gut-Brain Interplay in Scientific Literature: A Hybrid Approach from Classification to Generative LLM Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chaeeun Lee</string-name>
          <email>chaeeun.lee@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona E. Doneva</string-name>
          <email>simona.doneva@uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Juliana Rodriguez-Cubillos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Castagnari</string-name>
          <email>e.castagnari@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antoine D. Lain</string-name>
          <email>a.lain@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joram M. Posma</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>T. Ian Simpson</string-name>
          <email>ian.simpson@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Reproducible Science, University of Zurich</institution>
          ,
          <addr-line>Hirschengraben 84, 8001 Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <addr-line>10 Crichton Street, EH8 9AB, Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism</institution>
          ,
          <addr-line>Digestion, and Reproduction</addr-line>
          ,
          <institution>Faculty of Medicine, Imperial College London</institution>
          ,
          <addr-line>London W12 0NN</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In this work we present our approach to Task 6 (GutBrainIE) of the CLEF 2025 BioASQ lab, in which we develop Natural Language Processing (NLP) systems to extract structured information from biomedical literature related to the gut microbiome and its connection to neurological disease and mental health. The task consists of a multi-class Named Entity Recognition (NER) subtask (6.1) and three Relation Extraction (RE) subtasks (6.2.1 Binary Relation Extraction, 6.2.2 Ternary Tag-Based Relation Extraction, and 6.2.3 Ternary Mention-Based Relation Extraction). Our system adopts a two-stage pipeline. First, we addressed NER as a token classification task with encoder-only BERT-based models. To address the complexity of multi-class NER, including significant class imbalance, we explored a range of training and post-processing strategies, such as span-based ensembling of predictions from models trained on different subsets of labels. For RE, we investigated both an encoder-based classification approach and a generative approach in which we fine-tuned a large language model (LLM) on generated reasoning traces. Our systems achieved competitive performance on both the NER and RE subtasks, with our best RE system ranking 3rd for mention-level RE on the official leaderboard and our best NER system ranking 4th, demonstrating the effectiveness of combining structured classification with generative reasoning in biomedical information extraction. In addition, we provide qualitative insights into the challenges of multi-class NER on a domain-specific corpus and the complementary strengths and limitations of encoder-based and generative approaches for RE. Our findings underscore the value of combining structured classification with interpretability-oriented generative reasoning in information extraction pipelines.</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical Natural Language Processing</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Relation Extraction</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Gut Microbiota</kwd>
        <kwd>Gut-Brain</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The gut–brain axis is a complex biological system enabling bidirectional signalling between the brain
and the gut. A growing body of studies has highlighted the potential role of the gut microbiome
in neurological and psychiatric conditions [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Much of this information is only accessible as
unstructured text in scientific journal articles [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] limiting our ability to use these data to deepen our
understanding of the gut-brain axis. Natural Language Processing (NLP)-based information extraction
(IE) methods offer a promising way to harness this information for research use through reliable IE
tools that are capable of identifying and organising relevant biomedical knowledge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The GutBrainIE
task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in BioASQ Laboratory [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provides an annotated dataset for developing such tools, with a focus
on extracting structured information from biomedical abstracts related to gut microbiota and their roles
in psychiatric and neurological disease.
      </p>
      <p>
        The GutBrainIE task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is composed of four subtasks designed to evaluate biomedical IE tools for
gut-brain interplay. The first, Subtask 6.1: Named Entity Recognition (NER), requires systems to identify
and classify entity mentions into one of thirteen predefined biomedical categories. The remaining three
subtasks focus on Relation Extraction (RE) at varying levels of detail. Subtask 6.2.1: Binary RE asks
participants to detect whether a relation exists between two identified entities within a document,
without specifying the type of relation. Subtask 6.2.2: Ternary Tag-Based RE extends this by requiring
systems to predict not only the presence of a relation but also its type from a predefined set of relation
predicates. Finally, Subtask 6.2.3: Ternary Mention-Based RE requires participants to identify the exact
entity mentions involved in a relation and classify the relation type between them. These four subtasks
were created using a fine-grained annotation schema, which defines 13 distinct entity categories and
25 relation types. This level of granularity supports richer biomedical understanding and enables
applications such as knowledge-graph construction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and evidence synthesis.
      </p>
      <p>Biomedical IE is challenging because many biomedical categories overlap and entity meanings are
often ambiguous. For instance, ‘bacteria’ typically refers to bacterial taxonomy, whereas ‘microbiome’
covers the broader microbial community and thus requires contextual interpretation. Likewise,
distinguishing among ‘chemical’, ‘dietary supplement’, and ‘drug’ mentions is difficult, since the same
compound may belong to different categories depending on use context or regulatory status. Gene
mentions also require nuanced understanding of the context. In the literature, a ‘gene’ can denote the
gene itself, its protein product, an enzyme, or even an entire biological pathway, again depending on
context. Such contextual variability complicates the design of a fine-grained annotation schema that
generalises, especially when combined with relation-extraction tasks that must identify both interaction
type and directionality.</p>
      <p>
        This paper presents the work of our team (ICUE) for the GutBrainIE task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in BioASQ Laboratory
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We discuss the dataset analysis and distribution, our methodology for each subtask, experimental
setup, results, limitations and directions for future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        With much of biomedical knowledge often only accessible through unstructured scientific literature,
biomedical information extraction continues to be a rapidly evolving area of research. Pioneering
resources such as the GENIA corpus [9] laid the foundation two decades ago, and the field has since
progressed from feature-engineering approaches to transformer-based models (e.g., BioBERT [10],
SciBERT [11]) and, most recently, to LLMs capable of few-shot reasoning. Over the years, many
datasets and systems have been developed for both NER and RE tasks. In NER, previous research has
primarily focused on identifying entities such as diseases, drugs/chemicals, genes/proteins, and species
[12, 13, 14, 15, 16, 17, 18, 19]. For RE, the focus has typically been on identifying the presence or absence
of a relationship between entities such as genes and diseases or proteins and chemicals [
        <xref ref-type="bibr" rid="ref9">20, 21, 22, 23</xref>
        ].
However, while these well-known and widely used datasets cover established biomedical categories,
they do not capture the finer distinctions introduced in this challenge, such as separating chemicals from
drugs or dietary supplements, or distinguishing bacteria from the broader microbiome. Additionally,
entity types such as biomedical technique and statistical technique are rarely annotated in existing
corpora.
      </p>
      <p>
        In terms of methods, the advent of the Transformer architecture [
        <xref ref-type="bibr" rid="ref10">24</xref>
        ] spurred the development
of biomedical language models based on fine-tuned transformers. BioBERT [10], a domain-specific
variant of BERT pre-trained on PubMed and PMC articles, achieved state-of-the-art performance on
various biomedical NER benchmarks, including F1 scores of 89.71% on NCBI Disease [12] and 87.15%
on BC5CDR Disease [14]. PubMedBERT [
        <xref ref-type="bibr" rid="ref11">25</xref>
], trained exclusively on PubMed abstracts, offers improved
performance in specific biomedical categories, particularly gene (F1 score of 79.10% on JNLPBA [19])
and disease mentions (F1 score of 85.62% on BC5-disease [14]), and outperformed BioBERT for RE with
a F1 score of 83.96% on GAD [20] and 77.24% on ChemProt [
        <xref ref-type="bibr" rid="ref9">23</xref>
        ].
      </p>
      <p>
        There has been notable parallel progress in RE methods. Following early successes with
transformer-based encoder-only models for sequence classification, generative sequence-to-sequence approaches
using encoder-decoder architectures have also shown strong potential. REBEL [
        <xref ref-type="bibr" rid="ref12">26</xref>
        ] is an autoregressive
sequence-to-sequence model for RE. It frames RE as a generation task, translating raw text into structured
relation triplets. Built on a BART-based Transformer architecture, REBEL uses a linearisation approach
with special tokens to represent triplets, enabling efficient autoregressive decoding. The model has
been evaluated on standard RE benchmarks such as TACRED [
        <xref ref-type="bibr" rid="ref13">27</xref>
        ], DocRED [
        <xref ref-type="bibr" rid="ref14">28</xref>
        ], and CONLL04 [
        <xref ref-type="bibr" rid="ref15">29</xref>
        ],
where it achieves competitive or state-of-the-art results, as well as better generalisation in low-resource
settings. While REBEL neither includes a distinct NER module nor requires one as a preliminary step, its
end-to-end generation process directly identifies and outputs the entity spans and their types as part of
generating the complete relation triplet. SciSpacy [
        <xref ref-type="bibr" rid="ref16">30</xref>
        ], a spaCy extension with pre-trained NER models
for biomedical texts, has proven effective in lightweight applications but lacks the contextual reasoning
capabilities of transformer-based approaches.
      </p>
      <p>
        LLMs have shown promise in few-shot biomedical classification tasks [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">31, 32, 33</xref>
        ]. However, these
models face limitations when applied to NER. A key challenge arises from the divergence between the
parameter knowledge of LLMs, learned during pretraining, and the specific annotation guidelines of
the biomedical corpus of interest. This misalignment often results in trade-offs between precision and
recall, especially in context-dependent cases. Entity types like gene or microbiome are particularly
difficult, as their meanings are heavily dependent on the surrounding context. Despite these challenges,
LLMs have demonstrated remarkable capabilities in natural language understanding and generation
tasks, driven by Transformer-based architectures that have scaled to hundreds of billions of parameters.
Pre-trained on vast text corpora using self-supervised objectives, these models are typically fine-tuned
on task-specific data using strategies like prompting or supervised training. One notable advancement
has been the incorporation of chain-of-thought (CoT) prompting, which enables LLMs to perform
complex, multi-step reasoning tasks by introducing explicit reasoning traces into few-shot examples
[
        <xref ref-type="bibr" rid="ref20">34</xref>
        ]. This has substantially improved both accuracy and interpretability in many tasks.
      </p>
      <p>
        Recent research also focuses on scaling reasoning at inference time, where models generate explicit
reasoning tokens interspersed with normal output, allowing for more interpretable and structured chains
of thought [
        <xref ref-type="bibr" rid="ref21 ref22">35, 36</xref>
        ]. Building on these advances, knowledge distillation has emerged as a promising
method for transferring the reasoning capabilities of larger models to smaller ones. In this setup, a
teacher model generates both final answers and intermediate rationales, which are then used to train a
student model to replicate both outputs and reasoning traces [
        <xref ref-type="bibr" rid="ref23">37</xref>
        ]. Incorporating synthetic reasoning
data during distillation has been shown to significantly boost performance. Distilled models trained in
this way can match the zero-shot performance of larger models on specific reasoning tasks. However,
due to their smaller parameter size, these models are still constrained when handling
knowledge-intensive problems, and some information loss is inevitable during the distillation process. While these
approaches have shown limited effectiveness for NER, they appear more promising for RE tasks, where
reasoning plays a greater role than static entity knowledge.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset used in this study consists of titles and abstracts from PubMed articles, with a thematic
focus on the gut microbiota and its connection to Parkinson’s disease and mental health. The data is
divided into three primary subsets: a training set of 1,567 articles, a development set of 40 articles, and
a test set of 40 articles.</p>
      <p>
        Training data is stratified by annotation quality into four levels. The highest-quality, platinum-standard
annotations are expert-curated and externally reviewed by biomedical professionals. Gold-standard
annotations are also expert-curated but without external review. Silver-standard annotations
were created by trained student annotators under supervision and are further subgrouped by
annotator consistency: documents in StudentA were annotated by annotators with more consistent
performance than those in StudentB. Bronze-standard annotations were generated automatically using
GLiNER [
        <xref ref-type="bibr" rid="ref24">38</xref>
        ] for NER and ATLOP [39] for RE.
      </p>
      <p>Each article is annotated with entity mentions and, where applicable, relations between entity pairs.
Entities are labeled according to a predefined schema of 13 biomedical categories. The test set is drawn
from the gold and platinum standard data and includes only titles and abstracts. It is constructed to
provide broad coverage of the entity and relation types relevant to the task.</p>
      <p>Table 1 summarises the entity labels, definitions, and their frequency in the training data.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>4.1. NER
We describe here the methods and configurations used to develop our systems for the NER task. We
utilised encoder-only Transformer models for token classification, mainly focusing on domain-specific
pretrained models with various preprocessing, fine-tuning, and postprocessing steps. Each system was
trained as a token-level sequence tagger using the IOB2 labeling scheme [40], with model configurations
varying in backbone architecture, class subset coverage, and ensemble composition. Data preparation
involved token alignment, label assignment, and filtering based on entity presence. Postprocessing
recovered entity spans from token-level predictions, resolving subword splits and validating offset
mappings. The systems were trained and evaluated using a unified framework built on the HuggingFace
Transformers library [41]. To improve clarity and reproducibility, we defined a set of standardised codes
for backbone models, class coverage, ensemble strategies, and post-processing steps (Table 2). These
codes are then used in Table 3 to describe each of the top five systems based on test set performance.</p>
      <sec id="sec-4-1">
        <title>4.1.1. Preprocessing</title>
        <p>To prepare the data for model training, article texts were first merged with their corresponding
annotations using PubMed IDs. Tokenization was then performed using either the bert-base-cased or
bert-base-uncased tokenizer, depending on the model configuration.</p>
        <p>During tokenisation, words may be split into subwords. These subwords were re-aligned to form
original tokens, and entity labels were propagated across subwords according to the standard IOB2
tagging scheme. Tokens were labeled as B-, I-, or O based on whether they marked the beginning,
continuation, or absence of an entity span. Each document was processed independently, grouped by
PubMed ID and sentence location. For each token, we checked for overlaps with annotated entity spans
and assigned corresponding labels. The system supported multi-label settings and handled overlapping
annotations by prioritising the longest match.</p>
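        <p>The label-assignment step with longest-match priority can be sketched as follows (a minimal, self-contained illustration; the function and variable names are ours, not code from our actual pipeline):</p>
        <preformat>
```python
def iob2_labels(token_offsets, entities):
    # token_offsets: list of (start, end) character offsets, one per token
    # entities: list of (start, end, label) annotated spans
    # Longest-match-first: sort spans by length so that when annotations
    # overlap, the longest span claims the tokens and shorter ones are skipped.
    entities = sorted(entities, key=lambda e: e[1] - e[0], reverse=True)
    tags = ["O"] * len(token_offsets)
    for start, end, label in entities:
        covered = [i for i, (ts, te) in enumerate(token_offsets)
                   if ts >= start and end >= te and tags[i] == "O"]
        for rank, i in enumerate(covered):
            tags[i] = ("B-" if rank == 0 else "I-") + label
    return tags
```
        </preformat>
        <p>For the text “gut microbiota influences mood” with a microbiome span over the first two tokens, this yields B-microbiome, I-microbiome, O, … in line with the IOB2 scheme described above.</p>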
        <p>Label coverage was defined by a configurable label set. By default, models were trained on the full
set of 13 entity types, but we also trained models on subsets (e.g., high-frequency labels or task-specific
classes like food, DDF, or microbiome). All IOB2-converted data were exported to JSON format. For
sequences exceeding 512 tokens, the input was split into overlapping chunks (window size 512 tokens, overlap 12) to conform to the model input size limit.</p>
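        <p>The sliding-window chunking can be sketched as below (an illustrative helper under the stated window and overlap sizes; names are ours):</p>
        <preformat>
```python
def chunk(tokens, max_len=512, overlap=12):
    # Sliding window: at most max_len tokens per chunk, with `overlap`
    # tokens repeated at the boundary of consecutive chunks so that no
    # entity span is silently cut at a hard boundary.
    if not tokens:
        return []
    step = max_len - overlap
    chunks, i = [], 0
    while True:
        chunks.append(tokens[i:i + max_len])
        if i + max_len >= len(tokens):
            break
        i += step
    return chunks
```
        </preformat>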
        <sec id="sec-4-1-1">
          <title>4.2. Model Architecture</title>
          <p>We experimented with four model families for NER. The organiser baseline system was GLiNER [38], a
lightweight span classification model designed to generalise to unseen entity types via transfer learning.
It does not require predefining a label set, making it suitable for few-shot or open-domain scenarios.</p>
          <p>
            In our systems, mainly domain-specific BERT variants trained on biomedical
corpora were utilised: BiomedNLP-BiomedBERT-base-uncased-abstract and
BiomedNLP-BiomedBERT-large-uncased-abstract [
            <xref ref-type="bibr" rid="ref11">25</xref>
            ], as well as BioLinkBERT-large
[42].
          </p>
          <p>Each model was trained either on all 13 entity types or on a focused subset of entity types. This
design enabled experimentation with label subset selection to test performance trade-offs.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2.1. Inference and Postprocessing</title>
        <p>At inference time, models produced IOB2-encoded label predictions. Subword tokens were merged back
into full words, and offset alignment was re-computed to extract contiguous entity spans from
token-level labels. Predictions were filtered using span-level validation checks to ensure consistency with
input text and IOB2 rules. For ensemble systems, we applied span-based majority voting. Specifically, in
Table 3, MINK denotes a voting strategy where a span was retained if predicted as an entity by at least
K models.</p>
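        <p>The span-based MINK voting can be sketched as follows (a hypothetical helper, not our actual implementation):</p>
        <preformat>
```python
from collections import Counter

def ensemble_spans(predictions, k):
    # predictions: one set of (start, end, label) spans per ensemble member.
    # A span is retained if at least k models predicted it (the MINK rule).
    votes = Counter(span for model in predictions for span in model)
    return {span for span, n in votes.items() if n >= k}
```
        </preformat>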
        <p>Given the pronounced class imbalance across the 13 entity labels, we experimented with targeted
post-processing techniques tailored to individual classes. We observed that the food entities in the
development dataset were commonly recognized, but misclassified as dietary supplement. This may
be attributed to the overlapping contextual usage of these concepts, combined with the dominance of
annotations for the dietary supplement class. To address this, we used WordNet, a lexical database of
English, to extract hierarchically structured sets of food- and beverage-related terms [43]. Specifically,
we retrieved all hyponyms of the synsets “food.n.02” and “beverage.n.01”. The use of those subsets was
motivated by their similarity to the guideline definition of food: “a group of solid, semi-solid, and liquid
substances which are consumed by humans and animals”. This closely matches the WordNet definitions:
“any solid substance (as opposed to liquid) that is used as a source of nourishment” (“food.n.02”) and
“any liquid suitable for drinking” (“beverage.n.01”).</p>
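        <p>The hyponym retrieval and the subsequent dietary-supplement-to-food relabelling can be sketched as below. In the real system the hierarchy comes from WordNet via NLTK (the hyponym closures of “food.n.02” and “beverage.n.01”); here a toy taxonomy stands in so the sketch is self-contained, and lemmatisation is omitted:</p>
        <preformat>
```python
# Toy stand-in for the WordNet hyponym hierarchy and synset lemmas.
TAXONOMY = {
    "food.n.02": ["dairy_product.n.01"],
    "dairy_product.n.01": ["milk.n.01"],
    "beverage.n.01": ["tea.n.01"],
}
LEMMAS = {
    "food.n.02": ["food"],
    "dairy_product.n.01": ["dairy_product"],
    "milk.n.01": ["milk"],
    "beverage.n.01": ["beverage"],
    "tea.n.01": ["tea"],
}

def hyponym_terms(root):
    # Collect normalised lemma names (lowercased, underscores removed)
    # for the root synset and all of its transitive hyponyms.
    terms, stack = set(), [root]
    while stack:
        syn = stack.pop()
        terms.update(l.lower().replace("_", " ") for l in LEMMAS[syn])
        stack.extend(TAXONOMY.get(syn, []))
    return terms

FOOD_TERMS = hyponym_terms("food.n.02") | hyponym_terms("beverage.n.01")

def relabel(mention, label, food_terms=FOOD_TERMS):
    # Overwrite "dietary supplement" with "food" when the phrase or any
    # constituent word matches the food/beverage term sets.
    if label == "dietary supplement":
        words = mention.lower().split()
        if mention.lower() in food_terms or any(w in food_terms for w in words):
            return "food"
    return label
```
        </preformat>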
        <p>Extracted terms were normalized by lowercasing, removing underscores, and applying lemmatization
to reduce morphological variance. These normalized term sets were then used to relabel entities initially
labeled as “dietary supplement”: if a phrase or any of its constituent words matched the food or drink
term sets, the label was overwritten as “food”. In addition, a small manually curated keyword list, based
on the annotation guidelines for food, was included to capture relevant edge cases not covered by
WordNet2. Example relabelings included: “dairy products” → “food” and “unpasteurised milk” → “food”.
4.3. RE
We approached the RE tasks using two main strategies: (1) sequence classification with BERT-based
encoder-only models, and (2) supervised fine-tuning of an LLM using generated reasoning traces. In
addition, we implemented two baseline methods: one based on rule-based dataset statistics, and another
using a generative encoder-decoder model for RE. Below, we describe the baselines, our two primary
approaches, and various pre- and post-processing methods that we explored.</p>
        <p>We approached all three RE subtasks (6.2.1–6.2.3) using the same underlying methods. Predictions
were generated uniformly and task-specific outputs were derived by including the appropriate fields
as required by each subtask. Model selection was based on performance on the development set for
subtask 6.2.3 ternary mention-based RE.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3.1. RE Baseline 1: Rule-based Method</title>
        <p>We implemented a simple baseline for RE using relation frequency statistics and co-occurrence patterns
observed in the training data. This method relies on identifying commonly annotated (subject, object)
label pairs and using these to predict relations in unseen documents, with optional filtering based on
distance and likelihood.</p>
        <p>To support this, we first computed relation statistics from the training corpus. For each annotated
relation, we extracted the subject and object labels, the predicate, and the character distance between
the subject’s end and the object’s start index. We tracked the frequency of each (subject, object) pair, the
number of unique annotators who labeled it, and predicate frequencies per pair (excluding annotations
by distant supervision). Additionally, we computed distance-based metrics from the character distances
between the subject’s end and the object’s start index, including mean, median, minimum, maximum,
and robust percentiles (5th and 95th) for each pair.</p>
        <p>We also calculated entity label co-occurrence frequencies across all documents, independent of
whether a relation was annotated. These co-occurrence statistics allowed us to define a relation
likelihood as the ratio of annotated frequency to total co-occurrence frequency for each pair. For
example, the pair &lt;DDF, animal&gt; co-occurred 5,285 times in the dataset, but the relation “target” was
annotated only 547 times, resulting in a relation likelihood of 0.10.</p>
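        <p>The relation-likelihood computation can be sketched as below (an illustrative helper with our own names; with the &lt;DDF, animal&gt; counts quoted above it would yield 547/5285 ≈ 0.10):</p>
        <preformat>
```python
from collections import Counter

def relation_likelihood(annotated_pairs, cooccurring_pairs):
    # annotated_pairs: (subj_label, obj_label) for every gold relation
    # cooccurring_pairs: (subj_label, obj_label) for every label pair
    # co-occurring in a document, whether or not a relation was annotated
    ann, cooc = Counter(annotated_pairs), Counter(cooccurring_pairs)
    return {pair: ann[pair] / cooc[pair] for pair in ann if cooc[pair]}
```
        </preformat>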
        <p>For binary relation prediction, we used a filtering-based approach grounded in the training statistics
described above. A (subject, object) pair was considered valid if it met all of the following criteria: (1)
it was annotated by at least one non-distant annotator, (2) the total number of predicate annotations
for the pair met a minimum frequency threshold, and (3) its relation likelihood exceeded a predefined
cutof (default: 0.01). We further refined the candidate entity pairs by comparing their character-level
distance against the learned statistics from the training data. A candidate pair was retained only if it
satisfied two conditions: (1) the direction and magnitude of the distance had to be consistent with the
average distance observed in training (e.g., subjects typically preceding objects), and (2) the distance
had to lie within the 5th to 95th percentile range of training distances for that label pair.</p>
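        <p>The distance-based filter can be sketched as follows (a hypothetical helper; stats holds the per-pair training statistics described above):</p>
        <preformat>
```python
def distance_ok(subj_end, obj_start, stats):
    # stats: mean and 5th/95th percentiles of the signed character distance
    # (object start minus subject end) learned from training for this pair.
    d = obj_start - subj_end
    # (1) direction must match the average direction seen in training
    same_direction = (d >= 0) == (stats["mean"] >= 0)
    # (2) magnitude must lie within the robust percentile range
    return same_direction and d >= stats["p5"] and stats["p95"] >= d
```
        </preformat>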
        <p>To extend binary predictions to full relation triples, we assigned predicates to each filtered (subject,
object) pair using frequency statistics from the training data. For each pair, we retrieved the distribution
of observed predicates from the training set. If predict_all was enabled, all known predicates for
the pair were predicted, simulating an upper-bound scenario. Otherwise, we selected the single most
frequently annotated predicate as the predicted relation.</p>
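        <p>Predicate assignment, including the predict_all switch, can be sketched as (illustrative names):</p>
        <preformat>
```python
from collections import Counter

def assign_predicates(pair, predicate_counts, predict_all=False):
    # predicate_counts: (subj_label, obj_label) -> Counter of predicates
    # observed for that pair in the training data.
    preds = predicate_counts.get(pair)
    if not preds:
        return []
    if predict_all:
        return sorted(preds)              # upper bound: every known predicate
    return [preds.most_common(1)[0][0]]   # else the single most frequent one
```
        </preformat>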
      </sec>
      <sec id="sec-4-4">
        <title>4.3.2. RE Baseline 2: REBEL</title>
        <p>We employed the REBEL framework using the publicly available implementation from https://github.com/Babelscape/rebel. To adapt the model to the domain-specific dataset, we fine-tuned Babelscape/rebel-large with the custom entity and relation types from the challenge. This process involved creating new configuration files for data and training parameters, as well as implementing a dataset loader to handle our specific data format. We also modified core REBEL source files, including pl_modules.py, train.py, and test.py, to integrate the custom relation and entity definitions. The linearised target sequences follow the REBEL format, for example:
&lt;triplet&gt; gut microbiota &lt;microbiome&gt; central nervous system &lt;anatomical_location&gt; located in &lt;microbiome&gt; depression &lt;ddf&gt; is linked to &lt;microbiome&gt; depressive disorder &lt;ddf&gt; is linked to
&lt;triplet&gt; neurotransmitters &lt;chemical&gt; gut microbiota &lt;microbiome&gt; impact
&lt;triplet&gt; gut peptides &lt;chemical&gt; gut microbiota &lt;microbiome&gt; produced by
&lt;triplet&gt; gut microbiota &lt;microbiome&gt; mental health &lt;ddf&gt; is linked to
...</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.3.3. Sequence Classification with Encoder-only Transformers</title>
        <p>As our main approach to RE, we formulated RE as a binary sequence classification task. Given a sentence
containing two tagged entities, the model predicts whether a given relation exists between them. To
this end, we fine-tuned a BERT-based Transformer model for sequence classification, which consists of
a pre-trained encoder followed by a classification head. For each input sequence, the model leverages
the contextual representation of the special [CLS] token, which serves as a summary embedding of the
entire input sequence. This representation is passed through a classification head to predict a binary
label indicating the presence or absence of the candidate relation. We fine-tuned the model end-to-end
using binary cross-entropy loss ℒ_BCE (Equation 1), where N is the number of training examples in a
batch, y_i is the ground-truth binary label for the i-th example, and ŷ_i is the predicted probability that
the relation exists. We evaluated the model on the development set using standard classification metrics
including precision, recall, and F1 score.</p>
        <p>ℒ_BCE = − (1/N) ∑_{i=1}^{N} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],   (1)</p>
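        <p>As a numerical check of Equation (1), a direct implementation (names are ours):</p>
        <preformat>
```python
import math

def bce(y_true, y_pred):
    # Equation (1): mean negative log-likelihood over the batch of
    # ground-truth labels y_true and predicted probabilities y_pred.
    n = len(y_true)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred))
    return -total / n
```
        </preformat>
        <p>For example, a batch with labels (1, 0) and predictions (0.9, 0.1) gives ℒ_BCE = −log(0.9) ≈ 0.105.</p>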
        <p>To construct input instances for the binary classification model we first generated all legal entity
pairs from the ground truth NER labels, based on the annotation guidelines (Section 4.1 Relation Labels),
which defines valid combinations of head entity type, tail entity type, and relation predicate. Using
ground-truth NER annotations, we extracted all entity pairs that matched one of these legal entity type
combinations. For each such pair, we created a classification instance by inserting tags around the
head and tail entities in the full sentence. Since there were cases where multiple valid relation types
could exist between two entity types, we added a prefix sentence to explicitly indicate which relation
was being classified (Figure 4). This enables the model to disambiguate between different predicates
applicable to the same entity pairs. Each resulting sentence was treated as a binary classification
example, with the label indicating whether the specific relation was present or not.</p>
        <p>Is "influence" the correct relation between the subject entity &lt;chemical&gt; proinflammatory
cytokines &lt;/chemical&gt; and the object entity &lt;DDF&gt; depression &lt;/DDF&gt; in the following text?</p>
      </sec>
      <sec id="sec-4-6">
        <title>TEXT:</title>
        <p>Moreover &lt;DDF&gt; depression &lt;/DDF&gt; can be induced by administration of &lt;chemical&gt;
proinflammatory cytokines &lt;/chemical&gt;, including IL-2 or IFN- .</p>
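        <p>Construction of such a prompt-style instance can be sketched as follows (our own helper; square-bracket tags stand in for the angle-bracket entity tags used in our actual inputs):</p>
        <preformat>
```python
def build_instance(text, head, tail, relation):
    # head / tail: (start, end, entity_label) character spans within `text`.
    tagged = text
    for s, e, label in sorted([head, tail], reverse=True):
        # Insert tags from the right so earlier offsets stay valid.
        tagged = tagged[:s] + f"[{label}] " + tagged[s:e] + f" [/{label}]" + tagged[e:]
    (hs, he, hl), (ts, te, tl) = head, tail
    # Prefix question naming the candidate predicate, so each predicate
    # applicable to the same entity pair is classified independently.
    question = (f'Is "{relation}" the correct relation between the subject entity '
                f'[{hl}] {text[hs:he]} [/{hl}] and the object entity '
                f'[{tl}] {text[ts:te]} [/{tl}] in the following text?')
    return question + " TEXT: " + tagged
```
        </preformat>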
        <p>To prepare input sequences, we experimented with three different strategies. In the first setting, we
included only the sentences that explicitly contained both the head and tail entities, ensuring the input
was focused on the mention span where the relation might be expressed. To account for cases where
the entities were mentioned in separate sentences but still shared a contextual link, we also tried a
broader context window by selecting the sentences containing the head and tail entities along with all
sentences between them. Lastly, we also experimented with using full-text including both the title and
full abstract as input and tagging head and tail entities where they appear. Due to the nature of the task
setup where binary classification instances were created for all legal subject-object entity pairs, the
resulting dataset was highly imbalanced, with far fewer positive instances (i.e., ground-truth annotated
relations) compared to the large number of negative pairs. To address this, we generated multiple
balanced versions of the training dataset by randomly sampling different subsets of negative instances,
while keeping the full set of positive instances fixed across all splits. This ensured that each model
variant saw the complete set of annotated relations while being exposed to diverse, representative
samples of negatives. We trained separate models on each of these balanced training splits, evaluated
them individually on a shared development set, and ultimately ensembled their predictions to improve
robustness and mitigate the effects of label imbalance and sampling variance.</p>
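<p>The sampling-and-ensembling scheme can be sketched as follows; the number of splits and the voting rule shown here are illustrative assumptions.</p>

```python
import random

def balanced_splits(positives, negatives, n_splits=5, seed=0):
    """Build training splits that keep every positive instance and add a
    freshly sampled, equally sized subset of negatives."""
    rng = random.Random(seed)
    return [
        positives + rng.sample(negatives, k=min(len(positives), len(negatives)))
        for _ in range(n_splits)
    ]

def ensemble_vote(predictions):
    """Majority vote over the binary predictions of the split-specific models."""
    return sum(predictions) / len(predictions) >= 0.5

# Illustrative instance IDs: 100 positives, 900 negatives.
splits = balanced_splits(list(range(100)), list(range(100, 1000)))
```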
        <sec id="sec-4-6-1">
          <title>4.4. LLM-based RE via Supervised Fine-Tuning</title>
          <p>
            Beyond encoder-based binary classifiers, we explored the use of LLMs for RE via supervised fine-tuning
(SFT) with reasoning traces. Inspired by recent work on interpretable reasoning with LLMs [
            <xref ref-type="bibr" rid="ref21">35, 44</xref>
            ], we
framed relation classification as a two-option multiple-choice question answering (QA) task, where the
model must not only classify a relation but also justify it through an intermediate reasoning trace.
          </p>
          <p>To build the training corpus, we used the more capable DeepSeek-R1 [45] as a teacher model to
generate reasoning traces via API for binary-choice QA instances. Each prompt contained: (1) an
instruction clarifying the relation to be classified, (2) the full document text with the subject and object
entities highlighted using custom tags (in the same format as our encoder models), and (3) two candidate
options, one affirming the given relation between the tagged entities and the other indicating
an absence of such relation between the entities. We present an example input provided to both the
teacher and student LLMs in Table 4.</p>
          <p>The reasoning model outputs a chain-of-thought (CoT) explanation followed by a final answer label,
either “A” or “B”. From the generated dataset, we curated a fine-tuning corpus by keeping only those
traces whose final answer matched the gold label. This filtered set was then used to fine-tune a smaller
LLM with token-level cross-entropy loss ℒCE (Equation 2), where B is the batch size, T_i is the length of
the i-th sequence, y_{i,t} is the ground-truth token at position t, and p is the model’s predicted probability
for the given token. Each reasoning trace and its final answer were concatenated and treated as a single
target sequence.</p>
          <p>ℒCE = − (1/B) ∑_{i=1}^{B} ∑_{t=1}^{T_i} log p(y_{i,t} | y_{i,&lt;t}), (2)</p>
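<p>The curation step that precedes this fine-tuning keeps only teacher generations whose final answer agrees with the gold label; a minimal sketch with hypothetical field names:</p>

```python
def curate_sft_corpus(generations):
    """Keep (prompt, target) pairs where the teacher's final answer
    ("A" or "B") matches the gold label; the reasoning trace and the
    answer are concatenated into a single target sequence."""
    corpus = []
    for g in generations:
        if g["final_answer"] == g["gold_label"]:
            target = g["trace"] + "\nFinal Answer. " + g["final_answer"]
            corpus.append({"prompt": g["prompt"], "target": target})
    return corpus

gens = [
    {"prompt": "p1", "trace": "...", "final_answer": "A", "gold_label": "A"},
    {"prompt": "p2", "trace": "...", "final_answer": "B", "gold_label": "A"},
]
corpus = curate_sft_corpus(gens)  # only the first generation survives
```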
          <p>We used deepseek-ai/DeepSeek-R1-Distill-Qwen-7B as our student model. Fine-tuning was
performed using HuggingFace TRL with DeepSpeed ZeRO-3 for efficient multi-GPU training. We
evaluated the resulting model in two settings:
(1) Standalone prediction – the model directly outputs reasoning and a relation decision.
(2) Post-processing verifier – used as a verifier after the BERT-based classification.</p>
          <p>The latter approach was motivated by the observation that the BERT-based models achieved high
recall but comparatively lower precision and F1 scores; the LLM was used to verify or filter predicted
positive relations, thus acting effectively as a re-ranker.</p>
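<p>A sketch of the two-stage pipeline, with the fine-tuned LLM stubbed out as a lookup of its binary answer (interfaces are hypothetical):</p>

```python
def verify_predictions(candidate_relations, llm_confirms):
    """Second-stage filter: keep only the classifier's positive predictions
    that the fine-tuned LLM also confirms, trading recall for precision."""
    return [rel for rel in candidate_relations if llm_confirms(rel)]

# Stubbed LLM answers for two candidate (subject, predicate, object) triples.
answers = {("Alistipes", "part of", "faecal gut microbiota"): "A",
           ("depression", "is a", "depression"): "B"}
kept = verify_predictions(list(answers), lambda rel: answers[rel] == "A")
```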
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>All experiments were conducted using NVIDIA H100 GPUs. The encoder-based NER and RE classifiers
were trained on a single H100 GPU with 40GB of memory. For both NER and RE tasks, we fine-tuned
pretrained BERT models with a batch size of 16 and a learning rate of 2 × 10−5. Models were trained
for up to 15 epochs, and the best checkpoint was selected based on overall F1 score on the development
set. No hyperparameter tuning was performed.</p>
      <p>For supervised fine-tuning (SFT) of LLMs using reasoning traces, we utilised four H100 GPUs with
80GB of memory each. We used a batch size of 2 per device, 1 epoch, a cosine learning rate schedule
(minimum ratio 0.1), a learning rate of 5 × 10−5, and bfloat16 precision. Flash Attention 2 and gradient
checkpointing were enabled for memory efficiency.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>We submitted 25 system runs for each of the four subtasks. Given the number of configurations explored, we
present here the top 5 systems for the NER task and the top 3 systems for each RE subtask, based on their
official test set performance. These represent the most competitive and informative combinations of
model architecture, training strategy, and post-processing, providing insight into which methods and
configurations were most effective across subtasks.</p>
      <p>PMID 36244970: Depression and fatigue in active IBD from a microbiome perspective-a Bayesian
approach to faecal metagenomics
Depression and fatigue in active IBD from a microbiome perspective-a Bayesian approach to faecal
metagenomics. Extraintestinal symptoms are common in inflammatory bowel diseases (IBD) and
include depression and fatigue.
. . .</p>
      <p>Based on taxonomic and functional metagenomic profiles of &lt;microbiome&gt;faecal gut
microbiota&lt;/microbiome&gt;, we used Bayesian statistics to investigate the associative networks
and triangle motifs between bacterial genera, functional modules and symptom severity of
self-reported fatigue and depression. Associations with moderate to strong evidence were found
for 3 genera (Odoribacter, Anaerotruncus and &lt;bacteria&gt;Alistipes&lt;/bacteria&gt;) and 3 functional
modules (pectin, glycosaminoglycan and central carbohydrate metabolism) with regard to
depression and for 4 genera
. . .</p>
      <p>A) &lt;bacteria&gt;Alistipes&lt;/bacteria&gt; is part of &lt;microbiome&gt;faecal gut microbiota&lt;/microbiome&gt;.
B) &lt;bacteria&gt;Alistipes&lt;/bacteria&gt; is not part of &lt;microbiome&gt;faecal gut microbiota&lt;/microbiome&gt;.</p>
      <sec id="sec-6-1">
        <title>Teacher LLM Reasoning</title>
        <p>Okay, let’s try to figure out whether Alistipes is part of the faecal gut microbiota based on the
provided study.
. . .
clearly lists Alistipes as one of the genera identified in the faecal samples. Since the study is about
the microbiome in fecal samples of IBD patients, and Alistipes is listed among the genera found
there, that supports option A.
. . .</p>
      </sec>
      <sec id="sec-6-2">
        <title>Final Answer. A</title>
        <p>6.1. Subtask 6.1 NER</p>
        <p>All top five systems were constructed by ensembling predictions from multiple NER models. Our
best-performing system (Run ID ensemble5) was an ensemble of predictions from 11 different single-model
runs. The ensembling process was span-based, meaning we aggregated predictions at the level of entity
spans rather than tokens. In this system, we applied the MINK ensemble strategy, where a predicted span
was retained as an entity mention only if at least K individual models (in this case, K=10) predicted the
exact same start and end offsets for that span, as well as the same entity type. This voting-based filtering
helped reduce spurious predictions while preserving spans that were consistently identified across
multiple models, improving the consistency of the final annotations, especially in cases involving
overlapping or ambiguous spans.</p>
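<p>The MINK rule described above can be sketched as a vote count over exact (start, end, entity type) triples; the toy spans below are illustrative.</p>

```python
from collections import Counter

def mink_ensemble(model_spans, k):
    """Keep a span only if at least k models predicted the exact same
    (start, end, entity_type) triple."""
    votes = Counter(span for spans in model_spans for span in set(spans))
    return {span for span, n in votes.items() if n >= k}

# Three toy model runs; only the first span reaches the k=2 threshold.
runs = [
    [(0, 10, "DDF"), (29, 54, "chemical")],
    [(0, 10, "DDF")],
    [(0, 10, "DDF"), (12, 15, "bacteria")],
]
kept = mink_ensemble(runs, k=2)
```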
        <p>The ensemble included models trained on different subsets of entity classes (e.g., ALL, DDF,
Microbiome) as well as those built on different backbone models. Configurations for each of the top 5
submissions can be found in Table 3, and Table 5 shows development and test set metrics. The results indicate that
span-based ensembling contributes meaningfully to NER performance, particularly when aggregating
predictions from diverse models. All top five submissions employed ensemble methods.</p>
        <p>We also observed that training models on subsets of entity classes, rather than the full
label set, can still be effective when such models are integrated within an ensemble. Several top systems
included models specialised in high-frequency or semantically similar labels (e.g., DDF, Microbiome, or
Food), which, while not necessarily strong on their own, contributed to performance gains when
ensembled with broader models. These results suggest that selective training on entity type subsets,
when paired with robust ensembling, can be a practical strategy in complex multi-label NER tasks.
</p>
        <p>6.2. RE Subtasks 6.2.1 - 6.2.3</p>
        <p>We report here results on the RE subtasks from the rule-based and generative baseline methods,
the BERT-based binary classification models, and post-processing with LLM-based reasoning.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.2.1. Baseline 1: Rule-based Method</title>
        <p>The rule-based system, which relies on corpus statistics, performs strongly in terms of recall, achieving
0.90 in both binary and ternary tagging tasks and 0.68 in the more challenging mention-level setting
(Table 6). This suggests that the system is effective at identifying a wide range of relation instances by
leveraging frequently co-occurring patterns observed in the training data. However, this broad matching
strategy comes at the expense of precision, which remains below 0.50 across all tasks, indicating that
many extracted relations are incorrect. Performance deteriorates especially sharply in the more
fine-grained mention-level task, where the F1 score falls to just 0.18. These results highlight a key
limitation of rule-based, corpus-driven methods: while they can achieve high coverage, they often lack
the specificity needed for accurate RE.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.2.2. Baseline 2: REBEL</title>
        <p>The REBEL model demonstrates consistent performance across the three RE tasks, with F1 scores of 0.59
for binary tagging, 0.57 for ternary tagging, and 0.35 for the more challenging mention-level extraction.</p>
        <p>In the binary relation-extraction setting, REBEL produced several recurring false positives. The pair
(microbiome → anatomical location) was falsly predicted most often. Two other pairs—(bacteria →
DDF) and (microbiome → DDF)—each was predicted wrongly six times. In the tag-based ternary setting,
the most frequent false positive was the complete relation (microbiome located in anatomical location),
predicted seven times. Two further ternary relations were over-predicted six times each: (microbiome
is linked to DDF) and (microbiome located in animal).</p>
        <p>Turning to false negatives, the relation (DDF is a DDF) was missed fourteen times, making it the single
most common omission. The model also failed to recover (DDF affects DDF) ten times and overlooked
two other DDF-centric relations, (DDF strikes anatomical location) and (DDF targets human), on six
occasions each. Overall, these patterns indicate that REBEL tends to over-generate microbiome-related
links while struggling to capture intra-DDF interactions and DDF relations to human or anatomical
entities.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.2.3. BERT-based sequence classification</title>
        <p>The BERT-based binary classifiers trained on balanced datasets achieved consistently high recall,
often well above the organiser baseline and the best leaderboard systems (Table 7), accurately
identifying most of the true positive relations. However, due to the large number of negative pairs,
precision was comparatively lower. Across multiple training splits (with fixed positives and randomly
sampled negatives), the performance was stable, and all models performed comparably on the
development set. To improve robustness and mitigate sampling noise, we ensembled
predictions from these independently trained models, which led to a modest but consistent increase
in F1 score. We experimented with BiomedNLP-BiomedBERT-large-uncased-abstract,
BiomedNLP-BiomedBERT-base-uncased-abstract, and BioLinkBERT-large, and all top three
systems for each subtask were based on BioLinkBERT-large as the backbone model. We report
development and test set metrics for each system in Table 7.</p>
      </sec>
      <sec id="sec-6-6">
        <title>6.2.4. LLM Supervised fine-tuning</title>
        <p>In Table 7, systems marked with the note “LLM verifier” refer to configurations where a SFT-trained
student LLM was used to verify predictions made by the base BioLinkBERT-large classifier. This
two-stage setup was motivated by the observation that the classifier achieved high recall on the development
set, and the LLM was used to improve precision by filtering false positives. For the mention-based RE
task, our best-performing system submission was based on the LLM-based verifier, which unexpectedly
achieved the highest recall among all submissions despite our initial assumption that it might trade
recall for precision.</p>
        <p>On the development set, systems with an LLM verifier exhibited improved precision but a drop
in recall. We hypothesize that this trade-off is due in part to the model’s limited ability to capture
global document-level annotation patterns. Specifically, the LLM was trained on individual (subject,
predicate, object) triples along with full-text in isolation, without access to the surrounding annotation
distribution. In cases where the same textual mention of an entity pair occurred in diferent positions
across the document, the model often predicted the negative option, since there were many more
negative instances with that entity pair text span in the training set.</p>
        <p>To compensate for these limitations, we applied the fine-tuned LLM as a post-hoc verifier on the
outputs of the BERT-based classifier. In this setting, the LLM was used to re-evaluate positive predictions
from the previous step with encoder-only models, with the goal of reducing false positives. This
post-processing approach led to a modest increase in development and test set metrics, as shown in
Table 7.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work, we explored multiple approaches to biomedical information extraction, addressing both
NER and RE tasks within a unified framework. For NER, we framed the task as a multi-label token
classification problem and experimented with a diverse set of strategies, including training on individual
subsets of labels and combining predictions via span-level ensembling. This enabled more balanced
handling of under-represented classes and improved overall entity coverage. Our analysis revealed that
while larger backbone models helped with complex entity types, ensemble strategies were especially
effective in reducing false positives and improving consistency across classes.</p>
      <p>For RE, we combined classical encoder-based classification with the reasoning capabilities of LLMs. We
formulated RE as a binary classification task using BERT-based sequence classifiers, incorporating
explicit entity markers in the input. To improve robustness, we addressed class imbalance by generating
multiple training splits, each containing all positive instances and a different subset of sampled negative
examples. Independent models were trained on each split, and their predictions were combined through
ensembling to enhance recall and reduce variance, although the overall improvement was modest. To
harness the reasoning capabilities of LLMs, we generated natural language reasoning traces using a
more capable teacher LLM. From these generations, we selected only those that concluded with the
correct label to construct a supervised fine-tuning dataset. A smaller student LLM was then fine-tuned
using token-level cross-entropy loss. Given that our classification-based system achieved high recall but
lower precision, we used the fine-tuned LLM as a second-step verifier to better filter out false positives
and improve overall precision.</p>
      <p>Together, our findings highlight the strength of encoder-based classification for NER, and the benefit
of combining a classical classification system with LLM-based reasoning for RE. This hybrid approach
was particularly beneficial in cases where consistency with annotation patterns across the whole corpus
was important, a setting where document-level LLM reasoning alone often fell short, but classical
classification models were effective when used as a first step.</p>
      <p>C.L. was supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre
for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. For the
purpose of open access, the author has applied a creative commons attribution (CC BY) licence to any
author accepted manuscript version arising. M.J.R.C. was supported by EASTBIO - East of Scotland
Biosciences consortium, UKRI doctoral training program. E.C. was supported by the United Kingdom
Research and Innovation (grant EP/Y030869/1), UKRI AI Centre for Doctoral Training in Biomedical
Innovation at the University of Edinburgh. For the purpose of open access, the author has applied a
creative commons attribution (CC BY) licence to any author accepted manuscript version arising. J.M.P.
and A.D.L. are supported by the CoDiet project. The CoDiet project is funded by the European Union
under Horizon Europe grant number 101084642 and supported by UK Research and Innovation (UKRI)
under the UK government’s Horizon Europe funding guarantee [grant number 101084642].</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly to check grammar and spelling, to paraphrase and reword, and to improve writing style. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>[9] J.-D. Kim, T. Ohta, Y. Tateisi, J. Tsujii, Genia corpus - a semantically annotated corpus for bio-textmining, Bioinformatics 19 Suppl 1 (2003) i180–2. URL: https://academic.oup.com/bioinformatics/article/19/suppl_1/i180/227927.</p>
      <p>[10] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. URL: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506.</p>
      <p>[11] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, in: Conference on Empirical Methods in Natural Language Processing, 2019. URL: https://aclanthology.org/D19-1371/.</p>
      <p>[12] R. I. Dogan, R. Leaman, Z. Lu, Ncbi disease corpus: A resource for disease name recognition and concept normalization, Journal of Biomedical Informatics 47 (2014) 1–10. URL: https://www.sciencedirect.com/science/article/pii/S1532046413001974?via%3Dihub.</p>
      <p>[13] Ö. Uzuner, B. R. South, S. Shen, S. L. Duvall, 2010 i2b2/va challenge on concepts, assertions, and relations in clinical text, Journal of the American Medical Informatics Association 18 (5) (2011) 552–6. URL: https://academic.oup.com/jamia/article/18/5/552/830538.</p>
      <p>[14] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, Biocreative v cdr task corpus: a resource for chemical disease relation extraction, Database: The Journal of Biological Databases and Curation 2016 (2016). URL: https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414.</p>
      <p>[15] M. Krallinger, O. Rabal, F. Leitner, M. Vázquez, D. Salgado, Z. Lu, R. Leaman, Y. Lu, D.-H. Ji, D. M. Lowe, R. A. Sayle, R. T. Batista-Navarro, R. Rak, T. Huber, T. Rocktäschel, S. Matos, D. Campos, B. Tang, H. Xu, T. Munkhdalai, K. H. Ryu, S. V. Ramanan, P. S. Nathan, S. Žitnik, M. Bajec, L. Weber, M. Irmer, S. A. Akhondi, J. A. Kors, S. Xu, X. An, U. K. Sikdar, A. Ekbal, M. Yoshioka, T. M. Dieb, M. Choi, K. M. Verspoor, M. Khabsa, C. L. Giles, H. Liu, K. E. Ravikumar, A. Lamurias, F. M. Couto, H.-J. Dai, R. T.-H. Tsai, C. Ata, T. Can, A. Usie, R. Alves, I. Segura-Bedmar, P. Martínez, J. Oyarzábal, A. Valencia, The chemdner corpus of chemicals and drugs and its annotation principles, Journal of Cheminformatics 7 (2015) S2. URL: https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-7-S1-S2.</p>
      <p>[16] L. L. Smith, L. K. Tanabe, R. Ando, C.-J. Kuo, I.-F. Chung, C.-N. Hsu, Y.-S. Lin, R. Klinger, C. Friedrich, K. Ganchev, M. Torii, H. Liu, B. Haddow, C. A. Struble, R. J. Povinelli, A. Vlachos, W. A. Baumgartner, L. E. Hunter, B. Carpenter, R. T.-H. Tsai, H.-J. Dai, F. Liu, Y. Chen, C. Sun, S. Katrenko, P. W. Adriaans, C. Blaschke, R. Torres, M. L. Neves, P. Nakov, A. Divoli, M. Maña-López, J. Mata, W. J. Wilbur, Overview of biocreative ii gene mention recognition, Genome Biology 9 (2008) S2. URL: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-s2-s2.</p>
      <p>[17] M. Gerner, G. Nenadic, C. M. Bergman, Linnaeus: A species name identification system for biomedical literature, BMC Bioinformatics 11 (2010) 85. URL: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-85.</p>
      <p>[18] E. Pafilis, S. P. Frankild, L. Fanini, S. Faulwetter, C. Pavloudi, A. Vasileiadou, C. Arvanitidis, L. J. Jensen, The species and organisms resources for fast and accurate identification of taxonomic names in text, PLoS ONE 8 (2013). URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390.</p>
      <p>[19] J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, N. Collier, Introduction to the bio-entity recognition task at jnlpba, in: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications, Citeseer, 2004, pp. 70–75.</p>
      <p>[20] K. G. Becker, K. C. Barnes, T. J. Bright, S. A. Wang, The genetic association database, Nature Genetics 36 (2004) 431–432. URL: https://www.nature.com/articles/ng0504-431.</p>
      <p>[21] E. M. van Mulligen, A. Fourrier-Réglat, D. Gurwitz, M. Molokhia, A. Nieto, G. Trifirò, J. A. Kors, L. I. Furlong, The eu-adr corpus: Annotated drugs, diseases, targets, and their relationships, Journal of Biomedical Informatics 45 (5) (2012) 879–84. URL: https://www.sciencedirect.com/science/article/pii/S1532046412000573.</p>
      <p>[22] M. Krallinger, O. Rabal, A. Miranda-Escalada, A. Valencia, Drugprot corpus: Biocreative vii track 1 - text mining drug and chemical-protein interactions, 2021. URL: https://academic.oup.com/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cryan</surname>
          </string-name>
          , T. G. Dinan,
          <article-title>Mind-altering microorganisms: the impact of the gut microbiota on brain and behavior</article-title>
          ,
          <source>Nature Reviews Neuroscience</source>
          <volume>13</volume>
          (
          <year>2012</year>
          )
          <fpage>701</fpage>
          -
          <lpage>712</lpage>
          . doi:10.1038/nrn3346.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-A.</given-names>
            <surname>McVey Neufeld</surname>
          </string-name>
          ,
          <article-title>Gut-brain axis: how the microbiome influences anxiety and depression</article-title>
          ,
          <source>Trends in Neurosciences</source>
          <volume>36</volume>
          (
          <year>2013</year>
          )
          <fpage>305</fpage>
          -
          <lpage>312</lpage>
          . doi:10.1016/j.tins.2013.01.005.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tripathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dada</surname>
          </string-name>
          ,
          <article-title>Role of gut microbiota in neurological disorders and its therapeutic significance</article-title>
          ,
          <source>J Clin Med</source>
          <volume>12</volume>
          (
          <year>2023</year>
          )
          <fpage>1650</fpage>
          . doi:10.3390/jcm12041650. PMID: 36836185; PMCID: PMC9965848.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>The role of gut microbiota in various neurological and psychiatric disorders-an evidence mapping based on quantified evidence</article-title>
          ,
          <source>Mediators of Inflammation</source>
          <year>2023</year>
          (
          <year>2023</year>
          ). URL: https://onlinelibrary.wiley.com/doi/10.1155/2023/5127157.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Loh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Q.</given-names>
            <surname>Mak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Yeow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Foo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. W.</given-names>
            <surname>How</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Y.</given-names>
            <surname>Khaw</surname>
          </string-name>
          ,
          <article-title>Microbiota-gut-brain axis and its therapeutic applications in neurodegenerative diseases</article-title>
          ,
          <source>Signal Transduction and Targeted Therapy</source>
          <volume>9</volume>
          (
          <year>2024</year>
          ). URL: https://www.nature.com/articles/s41392-024-01743-1.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Irrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          , Overview of GutBrainIE@CLEF 2025:
          <article-title>Gut-Brain Interplay Information Extraction</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <article-title>Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025)</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Lain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Posma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kozdoba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Perets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mannor</surname>
          </string-name>
          ,
          <article-title>From medical literature to predictive features: An evidence-based knowledge graph approach</article-title>
          ,
          <source>in: Proceedings of the LMRL Workshop at the International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2025</year>
          . URL: https://openreview.net/forum?id=qCSNi1BRPc.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the BioCreative VI chemical-protein interaction track</article-title>
          ,
          <source>Proceedings of the BioCreative VI Workshop</source>
          , pp.
          <fpage>141</fpage>
          -
          <lpage>146</lpage>
          (
          <year>2017</year>
          ). URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vi/track-5/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Neural Information Processing Systems</source>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          ,
          <year>2020</year>
          . arXiv:2007.15779.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.-L.</given-names>
            <surname>Huguet Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>REBEL: Relation extraction by end-to-end language generation</article-title>
          , in:
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2021</source>
          , Association for Computational Linguistics, Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>2370</fpage>
          -
          <lpage>2381</lpage>
          . URL: https://aclanthology.org/2021.findings-emnlp.204/. doi:10.18653/v1/2021.findings-emnlp.204.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Angeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Position-aware attention and supervised data improve slot filling</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>45</lpage>
          . URL: https://aclanthology.org/D17-1004/. doi:10.18653/v1/D17-1004.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>DocRED: A large-scale document-level relation extraction dataset</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Traum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>764</fpage>
          -
          <lpage>777</lpage>
          . URL: https://aclanthology.org/P19-1074/. doi:10.18653/v1/P19-1074.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <article-title>A linear programming formulation for global inference in natural language tasks</article-title>
          ,
          <source>in: Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004</source>
          , Association for Computational Linguistics, Boston, Massachusetts, USA,
          <year>2004</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://aclanthology.org/W04-2401/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ammar</surname>
          </string-name>
          ,
          <article-title>ScispaCy: Fast and robust models for biomedical natural language processing</article-title>
          ,
          <source>ArXiv abs/1902.07669</source>
          (
          <year>2019</year>
          ). URL: https://aclanthology.org/W19-5034/.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <article-title>An extensive benchmark study on biomedical text generation and mining with ChatGPT</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>39</volume>
          (
          <year>2023</year>
          ). URL: https://academic.oup.com/bioinformatics/article/39/9/btad557/7264174.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>I.</given-names>
            <surname>Jahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T. R.</given-names>
            <surname>Laskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>A comprehensive evaluation of large language models on benchmark biomedical text processing tasks</article-title>
          ,
          <source>Computers in Biology and Medicine</source>
          <volume>171</volume>
          (
          <year>2024</year>
          )
          <fpage>108189</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0010482524002737.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Keloth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Adelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations</article-title>
          ,
          <source>Nature Communications</source>
          (
          <year>2025</year>
          ). URL: https://www.nature.com/articles/s41467-025-56989-2.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain of thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>ArXiv abs/2201.11903</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2201.11903.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>ReAct: Synergizing reasoning and acting in language models</article-title>
          ,
          <source>ArXiv abs/2210.03629</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2210.03629.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Scaling LLM test-time compute optimally can be more effective than scaling model parameters</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.03314. arXiv:2408.03314.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Unveiling the key factors for distilling chain-of-thought reasoning</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.18001. arXiv:2502.18001.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Crabbé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <article-title>NuNER: Entity recognition encoder pre-training via LLM-annotated data</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>