<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2025 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Constrained Linked Entity ANnotation using RAG (CLEANR)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benedikt Kantz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Lengauer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Waldert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Schreck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graz University of Technology</institution>
          ,
          <addr-line>Rechbauerstrasse 12, Graz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Structured information extraction from text relies heavily on natural language processing tools and a robust understanding of the target structure. Language Models (LMs) provide the text understanding for long and unstructured input, even on domain-specific data. The generative aspect of these systems, however, can be unstructured and quickly return data that does not conform to the intended structural constraints. Our system, Constrained Linked Entity ANnotation using RAG (CLEANR), introduces structured output by imposing the ontological constraints on the LM through a grammar. This addition enables us to reliably utilize relatively small and inexpensive models in our pipeline to process domain-specific data for information extraction in the CLEF GutBrainIE task, resulting in good precision in the Relation Extraction (RE) tasks and improving the Graphwise solution when taking the union of both systems’ results.</p>
      </abstract>
      <kwd-group>
        <kwd>RAG</kwd>
        <kwd>LM</kwd>
        <kwd>Semantic retrieval</kwd>
        <kwd>Structured Output</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure 1: Overview of the CLEANR pipeline. Training samples are embedded with a sentence-transformers text embedding model; for a given text to annotate, semantically similar examples are retrieved and assembled into a constructed prompt (instructions, example texts with their annotations, and the text to annotate) for a (finetuned) LM with structured and constrained output, which produces the finalized annotations.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>CLEANR is inspired by existing Relation Extraction systems such as RAG4RE [4], which uses RAG to incorporate detailed training data into the LM prompt through semantic retrieval. This few-shot approach, combined with dynamic retrieval, enables the system to be extended or “retrained” by simply adding or re-weighting the training samples, allowing test-time adaptation and generalization with just a few new examples online, without redeploying or retraining the model.</p>
      <p>Prior systems, such as REBEL [5], train a supervised model to perform RE using special output tokens, which requires hours of fine-tuning; the 2021 REBEL model, for example, was trained for nine hours. Our system aims to reduce the effort and time required for training.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>In this paper, we investigate Subtasks 6.2.1, 6.2.2, and 6.2.3 within the GutBrainIE Task [6] of the BioASQ Lab [7]. These tasks focus on RE from titles and abstracts of PubMed articles on the topic of gut-brain interplay. The subtasks we explore require three levels of detail: just the entities, entities and relation type, and, finally, the entities, relation, and location within the text. The task provides a labeled dataset, split into four tiers of annotated samples: platinum, gold, silver, and bronze. Human annotators with varying degrees of expertise in the field annotate the first three tiers, while the last tier is automated using a “[. . . ] distantly supervised [approach] [. . . ] comprising automatically generated annotations.” [6]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>CLEANR extends the RAG-for-RE approach with two key contributions. The first novelty of our methodology is the addition of constrained LM generation for RE. The second is the introduction of a re-weighting of the samples in the retrieval process to prefer samples with a higher degree of confidence (i.e., to prefer the Gold annotations over the Bronze annotations in our setting). We use the sentence-transformers system [8] to embed the given training samples and store them in a Postgres database using the pgvector extension.</p>
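      <p>As a rough sketch of this step (the table layout, connection string, and use of the psycopg driver are our illustrative assumptions, not the exact implementation), the embedding and storage could look as follows:</p>
      <p># Sketch: embed training samples and store them in Postgres/pgvector.
# Table name, columns, and DSN are illustrative assumptions.
import psycopg
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

def store_samples(samples, dsn="postgresql://localhost/cleanr"):
    texts = [s["text"] for s in samples]
    embeddings = embedder.encode(texts)
    with psycopg.connect(dsn) as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS samples (id serial PRIMARY KEY, "
            "text text, annotations text, tier text, embedding vector(384))"
        )
        for s, emb in zip(samples, embeddings):
            conn.execute(
                "INSERT INTO samples (text, annotations, tier, embedding) "
                "VALUES (%s, %s, %s, %s)",
                (s["text"], s["annotations"], s["tier"], str(emb.tolist())),
            )</p>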
      <p>We furthermore utilize llama-cpp1 and llama-cpp-agent2 for both efficient inference of pretrained models and constrained generation from a provided grammar. The grammar is generated using dynamically created Python types from the provided schema, as shown in Appendix A.1. The necessary entities and links are taken from the schema provided by the GutBrainIE Task [6]. The schema can be constructed by taking the set of relations between head entities, tail entities, and predicates and converting these into allowed outputs for the LM, e.g., Bacteria|Interact|Drug. These entities and links could be exchanged for any other domain or setting, making our system very straightforward to adapt. The generated types are then automatically transformed into the GGML Backus-Naur Form (GBNF) syntax using the llama-cpp-agent package, which is then used to constrain the LM output to the exact schema provided by the task description. We extend the existing grammar features of llama-cpp-agent to include enumerable and literal support, to properly constrain the LM to only allow correct relations, including directions within the relations (i.e., the object and subject may not be switched). The contribution is already present as a pull request on GitHub for the original project3. We also repair any JSONs that may be incomplete due to output sequence length limitations.</p>
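      <p>To illustrate the constraint mechanism, the sketch below uses llama-cpp-python’s JSON-schema grammar support rather than the llama-cpp-agent GBNF generator of our actual pipeline; the model path and generation parameters are placeholders:</p>
      <p># Sketch: constrain generation to the schema of the dynamic Pydantic type
# from Appendix A.1. Uses llama-cpp-python's JSON-schema grammar support
# instead of the llama-cpp-agent GBNF path of the actual pipeline.
import json
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="Hermes-3-Llama-3.2-3B.Q8_0.gguf", n_ctx=8192)

def annotate(prompt, relation_model):
    schema = json.dumps(relation_model.model_json_schema())
    grammar = LlamaGrammar.from_json_schema(schema)
    out = llm(prompt, grammar=grammar, max_tokens=1024)
    # The grammar guarantees (possibly truncated) JSON; validate it with the
    # same Pydantic type that generated the schema.
    return relation_model.model_validate_json(out["choices"][0]["text"])</p>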
      <p>We also fine-tune a small 3B-parameter model from the Hermes 3 family of models [9] on the dataset and the generative use case with few-shot prompts, to illustrate the strength of our method compared to a fine-tuned system. This is achieved using the torchtune framework to apply a Low-Rank Adaptation (LoRA) [10] to the network.</p>
      <p>Our RE with the constrained and finetuned model is then used within the architecture illustrated in Figure 1, where we use a classic few-shot approach with RAG [11] to perform the RE4. This architecture utilizes the sentence-transformer to retrieve semantically similar samples from the database based on the text to be annotated. These are then used to build the prompt for the constrained LM, whose outputs are then parsed into the final annotation format required by the task.</p>
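      <p>A condensed sketch of this retrieval and prompt-building step follows; the SQL, the prompt wording, and the choice of k are illustrative assumptions, and the query uses pgvector’s cosine-distance operator:</p>
      <p># Sketch: retrieve the k most similar training samples and build the
# few-shot prompt. SQL, prompt wording, and k are illustrative.
import psycopg
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_prompt(text, k=4, dsn="postgresql://localhost/cleanr"):
    query = str(embedder.encode(text).tolist())
    with psycopg.connect(dsn) as conn:
        rows = conn.execute(
            "SELECT text, annotations FROM samples "
            "ORDER BY embedding &lt;=&gt; %s LIMIT %s",  # pgvector cosine distance
            (query, k),
        ).fetchall()
    examples = "\n\n".join(f"Text: {t}\nAnnotations: {a}" for t, a in rows)
    return ("Extract the gut-brain relations from the text as JSON.\n\n"
            f"{examples}\n\nText: {text}\nAnnotations:")</p>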
      <sec id="sec-4-1">
        <title>4.1. Combination of Results</title>
        <p>We also collaborated with the Graphwise team [12] to combine the precision strength of our test-time method with their strong method. We took the set union and intersection between the CLEANR results and theirs, based on the Subject-Predicate-Object triplets predicted by our approaches. The results are presented in Appendix A.3.</p>
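        <p>The combination itself reduces to set operations over the predicted triplets; a minimal sketch (function and argument names are ours):</p>
        <p># Sketch: combine two systems' predictions per document. Each prediction
# is a (subject, predicate, object) triplet.
def combine(cleanr_triplets, graphwise_triplets, mode="union"):
    a, b = set(cleanr_triplets), set(graphwise_triplets)
    return a.union(b) if mode == "union" else a.intersection(b)</p>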
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Methodology</title>
        <p>CLEANR was initially evaluated using our own implementation of the micro-F1 metric, which yielded promising results because that evaluation script counted each duplicate entry. The results presented in this report, however, were all generated using the latest version of the final evaluation script of the task [6].</p>
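        <p>For reference, a minimal sketch of micro-averaged scores with set semantics, which avoids the duplicate-counting issue mentioned above (names are ours, not the official script):</p>
        <p># Sketch: micro precision/recall/F1 over deduplicated triplets.
def micro_scores(predicted, gold):
    predicted, gold = set(predicted), set(gold)  # each triplet counts once
    tp = len(predicted.intersection(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1</p>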
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Training setup</title>
        <p>We utilize the torchtune system to fine-tune the Hermes-3-Llama-3.2-3B model5 on the provided training data, aiming to develop a multi-turn query-response system. The finetuned model is used for comparison with our few-shot RAG system. Our training parameters can be found in Table 1.</p>
        <p>We used a single RTX 8000 to fine-tune the model using LoRA, which took about 12 hours.</p>
        <sec id="sec-5-1-1">
          <title>1https://github.com/ggml-org/llama.cpp</title>
          <p>2https://github.com/Maximilian-Winter/llama-cpp-agent
3https://github.com/Maximilian-Winter/llama-cpp-agent/pull/89.
4The Named Entity Recognition (NER) results from Appendix A.2 are obtained using the same methodology
5https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. RE Process</title>
        <p>Our approach focuses on test-time retrieval and relies mainly on fixed-weight models; we list them in Table 2. As CLEANR uses a RAG approach, we show the generative parameters in Table 3.</p>
        <p>For the class-based reweighting of the RAG samples, we first retrieve the top-k matching documents (by cosine similarity) from the collections. The embeddings are generated using a sentence-transformer model [8]6; we then reweigh the retrieved samples slightly by multiplying their distances with the coefficients in Table 4, reranking them and taking the resulting top-k results.</p>
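        <p>A sketch of this reranking step follows; the coefficients below are placeholders standing in for Table 4, and the candidate format is our assumption:</p>
        <p># Sketch: rerank retrieved candidates by tier-weighted cosine distance.
# Coefficients are placeholders for Table 4; lower weighted distance ranks
# first, so coefficients below 1 promote higher-confidence tiers.
TIER_COEFFICIENTS = {"platinum": 0.8, "gold": 0.9, "silver": 1.0, "bronze": 1.2}

def rerank(candidates, k):
    # candidates: list of (distance, tier, sample) from the initial retrieval.
    weighted = sorted(candidates, key=lambda c: c[0] * TIER_COEFFICIENTS[c[1]])
    return weighted[:k]</p>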
        <p>Our system uses a Postgres database (version 17) with the pgvector extension, as documented in our Docker Compose file, for storage and efficient, fast retrieval for the RAG, with an RTX 4090 used for inference of the Open-Weight models.</p>
        <p>6: Using the all-MiniLM-L6-v2 model.</p>
        <p>[Figure 2: Cross-Entropy (CE) loss for the Hermes 3B LoRA fine-tune; loss (0.8 to 1.2) plotted over training steps (0 to 100).]</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Reproducibility</title>
        <p>Our code is available at https://github.com/Dakantz/CLEANR and includes all necessary details to reproduce our results, such as dependency versions, training setups, and the annotation system.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>We perform our evaluation on the dev set provided within the GutBrainIE tasks using the latest evaluation script. The results are below the baseline posted by the task. Nevertheless, our system combines RAG and structured generation to retrieve data without the need for fine-tuning or adaptation of the model, even with comparatively small LMs, and still achieves comparatively good precision. We perform additional finetuning on the LM (the Cross-Entropy (CE) loss is plotted in Figure 2), where the micro-F1 score increases only for the last task. The recall, however, does benefit significantly from the finetune.</p>
      <p>The strength of our system is evident in its very competitive precision, which indicates that the system retrieves the correct results, reaching up to 0.8 and outperforming the baseline and many other submitted systems for Subtasks 6.2.1 and 6.2.2. The system, however, retrieves too few results, resulting in a very weak recall, which significantly drops our micro-F1 result.</p>
      <p>Our results show that the addition of retrieved data significantly improves the output, as almost all methods that utilize it experience a notable performance increase. We also observe a small impact of fine-tuning on the micro-F1 score for the first two tasks, similar to our reordering approach. The best model using our methodology is the OpenAI 4o-mini model, primarily due to its high recall with our RAG approach. There appears to be some merit to our method, as it slightly improves the solution of Graphwise, most likely due to the higher precision shown in Appendix A.3.</p>
      <sec id="sec-6-1">
        <title>6.1. Test set results</title>
        <p>We additionally compare our results to the test set results to put them into context. Tables 8 to 10 contain the test results for CLEANR. These results align quite well with our dev set evaluation, with only a minor difference: the best micro-F1 is achieved by the Hermes 8B model, which applies both our RAG and Reorder approaches. The micro precision is not as high as on the dev set, but still higher than the best results in this category on the leaderboard. This indicates that our efficient method has merit in situations where high micro precision is important, particularly when only a few good relations are required. The worse scores on Subtask 6.2.3, however, indicate that our system is still unable to properly pinpoint the correct entities from which the relations originate.</p>
        <p>Our combined results in Tables 11 to 13 tell a similar story to our observations on the test set, where
the union performs very well, and the intersection has a very high micro precision.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this paper, we present CLEANR, a resource-efficient test-time system that combines existing systems to perform information extraction efficiently. Our system benefits from structured output and RAG approaches, demonstrating that fine-tuning may not be necessary when a strong enough model is available. The evaluated performance of CLEANR, however, indicates that we need to further improve the retrieval approach, especially the recall. The system, nevertheless, appears to have some merit, as its precision is high compared to other systems on the leaderboard.</p>
      <p>We, however, identify a few possible improvements for our model, namely:
• add more information to the system prompt, i.e., describe the task better and add the schema to the input, such that the model is not only constrained at the output but can better decide on the results,
• use more domain-specific models (like a BERT model trained specifically on PubMed data) for the retrieval,
• constrain the returned data, either manually using a heuristic afterwards, or by parsing the response during generation and eliminating results that may not fit, e.g., by semantic search. A straightforward approach could be to limit or extend the generated output sequence length, as we repair any “broken” JSON anyway, or even to extend the result by running the prompts multiple times or with a higher temperature,
• increase the model output to force the model to return more relations and thereby improve the recall,
• additionally, CLEANR does not implement any NER functionality, as the LM does not build upon any prior entities. The NER task, however, could be solved using a very similar approach.</p>
      <p>These improvements can be implemented through minor adjustments to the system, which could slightly enhance performance. We explore some of these suggestions in Appendix A.2, discussing them and possible reasons why they might fail or have some merit. A significant improvement could come from improved model performance, i.e., through a reasoning step allowing the model to “contemplate” the relations, or through more recent agentic approaches. However, little improvement can be made in Subtask 6.2.3, as the task requires the model to accurately pinpoint the text segment from which the result was obtained. A possible remedy for this issue could be further improving the structured output by only allowing valid pairs from the text, which might even be preselected using a different NER model.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the HEREDITARY Project, as part of the European Union’s Horizon
Europe research and innovation programme under grant agreement No GA 101137074.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling checking. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Appendix</title>
      <sec id="sec-10-1">
        <title>A.1. Model Constraints</title>
        <p>Our CLEANR system relies, at its core, on dynamically generated types from the GutBrainIE schema. This enables our system to perform two tasks at the same time:
• validate the input data to check whether it fits the schema,
• constrain the LM to the correct relations.</p>
        <p>We therefore provide the code to build our schema in Listing 1. The function requires the relations as a list of allowed combinations, enumerates all possibilities, and combines them in a single Enum type that is set as a field of the dynamic Pydantic type.</p>
        <p>Listing 1: Dynamic types generated from the relations. The listing was truncated in the source; the enumeration and model construction below follow the description above.</p>
        <p>from enum import Enum
from itertools import product
from pydantic import create_model

def build_model(relations=relations):
    possible_links = {}
    for relation in relations:
        heads = [clean_label(head) for head in relation["heads"]]
        tails = [clean_label(tail) for tail in relation["tails"]]
        predicates = [clean_label(pred) for pred in relation["predicate"]]
        # Enumerate every allowed head|predicate|tail combination.
        for h, p, t in product(heads, predicates, tails):
            possible_links[f"{h}_{p}_{t}"] = f"{h}|{p}|{t}"
    # One Enum over all allowed combinations, set as a field of the
    # dynamically created Pydantic type.
    LinkType = Enum("LinkType", possible_links)
    return create_model("Relation", link=(LinkType, ...))</p>
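        <p>Assuming the names above, the same dynamic type then serves both purposes; a hypothetical usage, with Bacteria|Interact|Drug in the schema and its reversed direction absent:</p>
        <p>Relation = build_model()
Relation.model_validate({"link": "Bacteria|Interact|Drug"})  # validates
Relation.model_validate({"link": "Drug|Interact|Bacteria"})  # raises ValidationError</p>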
      </sec>
      <sec id="sec-10-2">
        <title>A.2. Further Experiments</title>
        <p>We also conduct additional experiments with our approach using the small Hermes 3B model to investigate some of the possible improvements we suggest in Section 7 to address the weaknesses in our approach. We present them in Tables 14 to 16. These results indicate that our variations do not improve the scores, suggesting that we have either reached the limits of our small models or require further research and adjustments to our methodology. The additional, longer training for the model (indicated by LoRA+) did help the model achieve performance similar to that of the OpenAI models, beating them by only a small margin. This fine-tune of 3 epochs, however, took significantly longer than using the base model directly with our constrained output, and imposed a significant reduction in precision. The output loss is shown in Figure 3. We also employ a new embedding model, NeuML/pubmedbert-base-embeddings8, for the RAG embeddings, showing only minor improvements compared to our initial results. We also experimented with variations in output token lengths, including fewer allowed tokens, which resulted in slightly lower overall performance. Adding the possible entities and descriptions to the prompts also slightly reduced performance.</p>
        <p>These experiments suggest that our approach, in combination with our small models, cannot beat the specifically trained baseline. We did not attempt larger models, which could still offer improved performance, as the RAG4RE approach has been shown to do [4].</p>
        <p>We additionally explore the NER task in a limited setting in Table 17. These experiments yield
similarly poor performance, most likely due to the approach’s inability to accurately pinpoint the
correct locations of the entities in the input texts, and thus failing to extract the proper indices required
for validation. We address this shortcoming by extracting the indices from the text based on the
predicted text spans, with little apparent performance impact.</p>
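        <p>A minimal sketch of this index recovery (the function name is ours):</p>
        <p># Sketch: recover character offsets by locating the predicted span verbatim.
def span_to_indices(text, span):
    start = text.find(span)
    if start == -1:
        return None  # span not found verbatim; entity cannot be located
    return start, start + len(span)</p>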
        <sec id="sec-10-2-1">
          <title>A.2.1. Experimenting with the output lengths</title>
          <p>Further experiments include a study of the scores for capped outputs and ground truths, effectively calculating the micro averages for different k in Figure 4. These evaluations suggest that our method may initially return the best-effort results and does not generate too many relations at once, indicating that the model’s performance is at fault here, or that the model should output more results, which is also supported by the improved performance of our extended fine-tuning.</p>
          <p>8: https://huggingface.co/NeuML/pubmedbert-base-embeddings</p>
        </sec>
      </sec>
      <sec id="sec-10-3">
        <title>A.3. Graphwise collaboration</title>
        <p>We also collaborated with the Graphwise team to combine our results, taking both the intersection and the union between our result sets. The results of this collaboration can be found in Tables 18 to 20, matching our test results quite well. These results indicate that the LoRA fine-tuned models perform best in this combined setting. The union performs significantly better, suggesting that our model indeed produces a few very good results. This is even more evident when the precision of the intersection is investigated, reaching a score of 0.96 for Subtasks 6.2.1 and 6.2.2, which is significantly higher than any other model on the leaderboards.</p>
        <p>[Figure 3: output loss for the extended LoRA+ fine-tune. Figure 4: micro-averaged F1 for capped output lengths k = 1 to 8.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          , E. Cambria,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marttinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graphs: Representation, acquisition, and applications</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>33</volume>
          (
          <year>2022</year>
          )
          <fpage>494</fpage>
          -
          <lpage>514</lpage>
          . doi:
          <volume>10</volume>
          .1109/TNNLS.
          <year>2021</year>
          .
          <volume>3070843</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>