<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2025 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Constrained Linked Entity ANnotation using RAG (CLEANR)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benedikt Kantz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Lengauer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Waldert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Schreck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graz University of Technology</institution>
          ,
          <addr-line>Rechbauerstrasse 12, Graz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Structured information extraction from text relies heavily on natural language processing tools and a robust understanding of the target structure. Language Models (LMs) provide the text understanding for long and unstructured input, even on domain-specific data. The generative aspect of these systems, however, can be unstructured and quickly return data that does not conform to the intended structural constraints. Our system, Constrained Linked Entity ANnotation using RAG (CLEANR), introduces structured output by imposing the ontological constraints on the LM through a grammar. This addition enables us to reliably utilize relatively small and inexpensive models in our pipeline to process domain-specific data for information extraction in the CLEF GutBrainIE task, resulting in good precision in the Relation Extraction (RE) tasks and improving the Graphwise solution when taking the union of both systems’ results.</p>
      </abstract>
      <kwd-group>
        <kwd>RAG</kwd>
        <kwd>LM</kwd>
        <kwd>Semantic retrieval</kwd>
        <kwd>Structured Output</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure 1: Overview of the CLEANR pipeline. Training samples are embedded with a sentence-transformers text embedding model; for a given text to annotate, semantically similar examples are retrieved and assembled into a constructed prompt (instructions, example texts with their annotations, and the text to annotate) for a (finetuned) LM with structured and constrained output, which produces the finalized annotations.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>CLEANR is inspired by existing Relation Extraction systems such as RAG4RE [4], which uses RAG to incorporate detailed training data into the LM prompt through semantic retrieval. This few-shot approach, combined with dynamic retrieval, enables the system to be extended or “retrained” by simply adding or re-weighting the training samples, allowing test-time adaptation and generalization with just a few new examples online, without redeploying or retraining the model.</p>
      <p>Prior systems, such as REBEL [5], train a supervised model to perform RE using special output tokens, which requires hours of fine-tuning; the 2021 REBEL model, for example, was trained for nine hours. Our system aims to reduce the effort and time required for training.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>In this paper, we investigate Subtasks 6.2.1, 6.2.2, and 6.2.3 within the GutBrainIE Task [6] of the BioASQ Lab [7]. These tasks focus on RE from titles and abstracts of PubMed articles on the topic of gut-brain interplay. The subtasks we explore require three levels of detail: just the entities, entities and relation type, and, finally, the entities, relation, and location within the text. The task provides a labeled dataset, split into four tiers of annotated samples: platinum, gold, silver, and bronze. Human annotators with varying degrees of expertise in the field annotate the first three tiers, while the last tier is automated using a “[. . . ] distantly supervised [approach] [. . . ] comprising automatically generated annotations.” [6]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>CLEANR extends the RAG-for-RE approach with two key contributions. The first novelty of our methodology is the addition of constrained LM generation for RE. The second is the introduction of a re-weighting of the samples in the retrieval process to prefer samples with a higher degree of confidence (i.e., to prefer the Gold annotations over the Bronze annotations in our setting). We use the sentence-transformers system [8] to embed the given training samples and store them in a Postgres database using the pgvector extension.</p>
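      <p>As a rough sketch of this step (the table layout, connection string, and use of the psycopg driver are our illustrative assumptions, not the exact implementation), the embedding and storage could look as follows:</p>
      <p># Sketch: embed training samples and store them in Postgres/pgvector.
# Table name, columns, and DSN are illustrative assumptions.
import psycopg
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

def store_samples(samples, dsn="postgresql://localhost/cleanr"):
    texts = [s["text"] for s in samples]
    embeddings = embedder.encode(texts)
    with psycopg.connect(dsn) as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS samples (id serial PRIMARY KEY, "
            "text text, annotations text, tier text, embedding vector(384))"
        )
        for s, emb in zip(samples, embeddings):
            conn.execute(
                "INSERT INTO samples (text, annotations, tier, embedding) "
                "VALUES (%s, %s, %s, %s)",
                (s["text"], s["annotations"], s["tier"], str(emb.tolist())),
            )</p>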
      <p>We furthermore utilize llama-cpp1 and llama-cpp-agent2 for both efficient inference of pretrained models and constrained generation from a provided grammar. The grammar is generated using dynamically created Python types from the provided schema, as shown in Appendix A.1. The necessary entities and links are taken from the schema provided by the GutBrainIE Task [6]. The schema can be constructed by taking the set of relations between head entities, tail entities, and predicates and converting these into allowed outputs for the LM, e.g., Bacteria|Interact|Drug. These entities and links could be exchanged for any other domain or setting, making our system very straightforward to adapt. The generated types are then automatically transformed into the GGML Backus-Naur Form (GBNF) syntax using the llama-cpp-agent package, which is then used to constrain the LM output to the exact schema provided by the task description. We extend the existing grammar features of llama-cpp-agent to include enumerable and literal support, to properly constrain the LM to only allow correct relations, including directions within the relations (i.e., the object and subject may not be switched). The contribution is already present as a pull request on GitHub for the original project3. We also repair any JSONs that may be incomplete due to output sequence length limitations.</p>
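      <p>To illustrate the constraint mechanism, the sketch below uses llama-cpp-python’s JSON-schema grammar support rather than the llama-cpp-agent GBNF generator of our actual pipeline; the model path and generation parameters are placeholders:</p>
      <p># Sketch: constrain generation to the schema of the dynamic Pydantic type
# from Appendix A.1. Uses llama-cpp-python's JSON-schema grammar support
# instead of the llama-cpp-agent GBNF path of the actual pipeline.
import json
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="Hermes-3-Llama-3.2-3B.Q8_0.gguf", n_ctx=8192)

def annotate(prompt, relation_model):
    schema = json.dumps(relation_model.model_json_schema())
    grammar = LlamaGrammar.from_json_schema(schema)
    out = llm(prompt, grammar=grammar, max_tokens=1024)
    # The grammar guarantees (possibly truncated) JSON; validate it with the
    # same Pydantic type that generated the schema.
    return relation_model.model_validate_json(out["choices"][0]["text"])</p>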
      <p>We also fine-tune a small 3B-parameter model from the Hermes 3 family of models [9] on the dataset and the generative use case with few-shot prompts, to illustrate the strength of our method compared to a fine-tuned system. This is achieved using the torchtune framework to apply a Low-Rank Adaptation (LoRA) [10] to the network.</p>
      <p>Our RE with the constrained and finetuned model is then used within the architecture illustrated in Figure 1, where we use a classic few-shot approach with RAG [11] to perform the RE4. This architecture utilizes the sentence-transformer to retrieve semantically similar samples from the database based on the text to be annotated. These are then used to build the prompt for the constrained LM, whose outputs are then parsed into the final annotation format required by the task.</p>
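      <p>A condensed sketch of this retrieval and prompt-building step follows; the SQL, the prompt wording, and the choice of k are illustrative assumptions, and the query uses pgvector’s cosine-distance operator:</p>
      <p># Sketch: retrieve the k most similar training samples and build the
# few-shot prompt. SQL, prompt wording, and k are illustrative.
import psycopg
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_prompt(text, k=4, dsn="postgresql://localhost/cleanr"):
    query = str(embedder.encode(text).tolist())
    with psycopg.connect(dsn) as conn:
        rows = conn.execute(
            "SELECT text, annotations FROM samples "
            "ORDER BY embedding &lt;=&gt; %s LIMIT %s",  # pgvector cosine distance
            (query, k),
        ).fetchall()
    examples = "\n\n".join(f"Text: {t}\nAnnotations: {a}" for t, a in rows)
    return ("Extract the gut-brain relations from the text as JSON.\n\n"
            f"{examples}\n\nText: {text}\nAnnotations:")</p>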
      <sec id="sec-4-1">
        <title>4.1. Combination of Results</title>
        <p>We also collaborated with the Graphwise team [12] to combine the precision strength of our test-time method with their strong method. We took the set union and intersection between the CLEANR results and theirs, based on the Subject-Predicate-Object triplets predicted by our approaches. The results are presented in Appendix A.3.</p>
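        <p>The combination itself reduces to set operations over the predicted triplets; a minimal sketch (function and argument names are ours):</p>
        <p># Sketch: combine two systems' predictions per document. Each prediction
# is a (subject, predicate, object) triplet.
def combine(cleanr_triplets, graphwise_triplets, mode="union"):
    a, b = set(cleanr_triplets), set(graphwise_triplets)
    return a.union(b) if mode == "union" else a.intersection(b)</p>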
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Methodology</title>
        <p>CLEANR was initially evaluated using our own implementation of the micro-F1 metric, which yielded promising results because that evaluation script counted each duplicate entry. The results presented in this report, however, were all generated using the latest version of the final evaluation script of the task [6].</p>
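        <p>For reference, a minimal sketch of micro-averaged scores with set semantics, which avoids the duplicate-counting issue mentioned above (names are ours, not the official script):</p>
        <p># Sketch: micro precision/recall/F1 over deduplicated triplets.
def micro_scores(predicted, gold):
    predicted, gold = set(predicted), set(gold)  # each triplet counts once
    tp = len(predicted.intersection(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1</p>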
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Training setup</title>
        <p>We utilize the torchtune system to fine-tune the Hermes-3-Llama-3.2-3B model5 on the provided training data, aiming to develop a multi-turn query-response system. The finetuned model is used for comparison with our few-shot RAG system. Our training parameters can be found in Table 1.</p>
        <p>We used a single RTX 8000 to fine-tune the model using LoRA, which took about 12 hours.</p>
        <sec id="sec-5-1-1">
          <title>1https://github.com/ggml-org/llama.cpp</title>
          <p>2https://github.com/Maximilian-Winter/llama-cpp-agent
3https://github.com/Maximilian-Winter/llama-cpp-agent/pull/89.
4The Named Entity Recognition (NER) results from Appendix A.2 are obtained using the same methodology
5https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. RE Process</title>
        <p>Our approach focuses on test-time retrieval and relies mainly on fixed-weight models; we list them in Table 2. As CLEANR uses a RAG approach, we show the generative parameters in Table 3.</p>
        <p>For the class-based reweighting of the RAG samples, we first retrieve the top-k matching documents (by cosine similarity) from the collections. The embeddings are generated using a sentence-transformer model [8]6; we then reweigh the retrieved samples slightly by multiplying their distances with the coefficients in Table 4, reranking them and taking the resulting top-k results.</p>
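        <p>A sketch of this reranking step follows; the coefficients below are placeholders standing in for Table 4, and the candidate format is our assumption:</p>
        <p># Sketch: rerank retrieved candidates by tier-weighted cosine distance.
# Coefficients are placeholders for Table 4; lower weighted distance ranks
# first, so coefficients below 1 promote higher-confidence tiers.
TIER_COEFFICIENTS = {"platinum": 0.8, "gold": 0.9, "silver": 1.0, "bronze": 1.2}

def rerank(candidates, k):
    # candidates: list of (distance, tier, sample) from the initial retrieval.
    weighted = sorted(candidates, key=lambda c: c[0] * TIER_COEFFICIENTS[c[1]])
    return weighted[:k]</p>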
        <p>Our system uses a Postgres database (version 17) with the pgvector extension, as documented in our Docker Compose file, for storage and efficient, fast retrieval for the RAG, with an RTX 4090 used for inference of the Open-Weight models.</p>
        <p>6: Using the all-MiniLM-L6-v2 model.</p>
        <p>[Figure 2: Cross-Entropy (CE) loss for the Hermes 3B LoRA fine-tune; loss (0.8 to 1.2) plotted over training steps (0 to 100).]</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Reproducibility</title>
        <p>Our code is available at https://github.com/Dakantz/CLEANR and includes all necessary details to reproduce our results, such as dependency versions, training setups, and the annotation system.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>We perform our evaluation on the dev set provided within the GutBrainIE tasks using the latest evaluation script. The results are below the baseline posted by the task. Nevertheless, our system combines RAG and structured generation to retrieve data without the need for fine-tuning or adaptation of the model, even with comparatively small LMs, and still achieves comparatively good precision. We perform additional finetuning on the LM (the Cross-Entropy (CE) loss is plotted in Figure 2), where the micro-F1 score increases only for the last task. The recall, however, does benefit significantly from the finetune.</p>
      <p>The strength of our system is evident in its very competitive precision, which indicates that the system retrieves the correct results, reaching up to 0.8 and outperforming the baseline and many other submitted systems for Subtasks 6.2.1 and 6.2.2. The system, however, retrieves too few results, resulting in a very weak recall, which significantly drops our micro-F1 result.</p>
      <p>Our results show that the addition of retrieved data significantly improves the output, as almost all methods that utilize it experience a notable performance increase. We also observe a small impact of fine-tuning on the micro-F1 score for the first two tasks, similar to our reordering approach. The best model using our methodology is the OpenAI 4o-mini model, primarily due to its high recall with our RAG approach. There appears to be some merit to our method, as it slightly improves the solution of Graphwise, most likely due to the higher precision shown in Appendix A.3.</p>
      <sec id="sec-6-1">
        <title>6.1. Test set results</title>
        <p>We additionally compare our results to the test set results to put them into context. Tables 8 to 10 contain the test results for CLEANR. These results align quite well with our dev set evaluation, with only a minor difference: the best micro-F1 is achieved by the Hermes 8B model, which applies both our RAG and Reorder approaches. The micro precision is not as high as on the dev set, but still higher than the best results in this category on the leaderboard. This indicates that our efficient method has merit in situations where high micro precision is important, particularly when only a few good relations are required. The worse scores on Subtask 6.2.3, however, indicate that our system is still unable to properly pinpoint the correct entities from which the relations originate.</p>
        <p>Our combined results in Tables 11 to 13 tell a similar story to our observations on the test set, where
the union performs very well, and the intersection has a very high micro precision.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this paper, we present CLEANR, a resource-efficient test-time system that combines existing systems to perform information extraction efficiently. Our system benefits from structured output and RAG approaches, demonstrating that fine-tuning may not be necessary when a strong enough model is available. The evaluated performance of CLEANR, however, indicates that we need to further improve the retrieval approach, especially the recall. The system, nevertheless, appears to have some merit, as its precision is high compared to other systems on the leaderboard.</p>
      <p>We, however, identify a few possible improvements for our model, namely:
• add more information to the system prompt, i.e., describe the task better and add the schema to the input, such that the model is not only constrained at the output but can better decide on the results,
• use more domain-specific models (like a BERT model trained specifically on PubMed data) for the retrieval,
• constrain the returned data, either manually using a heuristic afterwards, or by parsing the response during generation and eliminating results that may not fit, e.g., by semantic search. A straightforward approach could be to limit or extend the generated output sequence length, as we repair any “broken” JSON anyway, or even to extend the result by running the prompts multiple times or with a higher temperature,
• increase the model output to force the model to return more relations and thereby improve the recall,
• additionally, CLEANR does not implement any NER functionality, as the LM does not build upon any prior entities. The NER task, however, could be solved using a very similar approach.</p>
      <p>These improvements can be implemented through minor adjustments to the system, which could slightly enhance performance. We explore some of these suggestions in Appendix A.2, discussing them and possible reasons why they might fail or have some merit. A significant improvement could come from improved model performance, i.e., through a reasoning step allowing the model to “contemplate” the relations, or through more recent agentic approaches. However, little improvement can be made in Subtask 6.2.3, as the task requires the model to accurately pinpoint the text segment from which the result was obtained. A possible remedy for this issue could be further improving the structured output by only allowing valid pairs from the text, which might even be preselected using a different NER model.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the HEREDITARY Project, as part of the European Union’s Horizon
Europe research and innovation programme under grant agreement No GA 101137074.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling checking. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Appendix</title>
      <sec id="sec-10-1">
        <title>A.1. Model Constraints</title>
        <p>Our CLEANR system relies, at its core, on dynamically generated types from the GutBrainIE schema. This enables our system to perform two tasks at the same time:
• validate the input data to check whether it fits the schema,
• constrain the LM to the correct relations.</p>
        <p>We therefore provide the code to build our schema in Listing 1. The function requires the relations as a list of allowed combinations, enumerates all possibilities, and combines them in a single Enum type that is set as a field of the dynamic Pydantic type.</p>
        <p>Listing 1: Dynamic types generated from the relations. The listing was truncated in the source; the enumeration and model construction below follow the description above.</p>
        <p>from enum import Enum
from itertools import product
from pydantic import create_model

def build_model(relations=relations):
    possible_links = {}
    for relation in relations:
        heads = [clean_label(head) for head in relation["heads"]]
        tails = [clean_label(tail) for tail in relation["tails"]]
        predicates = [clean_label(pred) for pred in relation["predicate"]]
        # Enumerate every allowed head|predicate|tail combination.
        for h, p, t in product(heads, predicates, tails):
            possible_links[f"{h}_{p}_{t}"] = f"{h}|{p}|{t}"
    # One Enum over all allowed combinations, set as a field of the
    # dynamically created Pydantic type.
    LinkType = Enum("LinkType", possible_links)
    return create_model("Relation", link=(LinkType, ...))</p>
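        <p>Assuming the names above, the same dynamic type then serves both purposes; a hypothetical usage, with Bacteria|Interact|Drug in the schema and its reversed direction absent:</p>
        <p>Relation = build_model()
Relation.model_validate({"link": "Bacteria|Interact|Drug"})  # validates
Relation.model_validate({"link": "Drug|Interact|Bacteria"})  # raises ValidationError</p>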
      </sec>
      <sec id="sec-10-2">
        <title>A.2. Further Experiments</title>
        <p>We also conduct additional experiments with our approach using the small Hermes 3B model to investigate some of the possible improvements we suggest in Section 7 to address the weaknesses in our approach. We present them in Tables 14 to 16. These results indicate that our variations do not improve the scores, suggesting that we have either reached the limits of our small models or require further research and adjustments to our methodology. The additional, longer training for the model (indicated by LoRA+) did help the model achieve performance similar to that of the OpenAI models, beating them by only a small margin. This fine-tune of 3 epochs, however, took significantly longer than using the base model directly with our constrained output, and imposed a significant reduction in precision. The output loss is shown in Figure 3. We also employ a new embedding model, NeuML/pubmedbert-base-embeddings8, for the RAG embeddings, showing only minor improvements compared to our initial results. We also experimented with variations in output token lengths, including fewer allowed tokens, which resulted in slightly lower overall performance. Adding the possible entities and descriptions to the prompts also slightly reduced performance.</p>
        <p>These experiments suggest that our approach, in combination with our small models, cannot beat the specifically trained baseline. We did not attempt larger models, which could still offer improved performance, as the RAG4RE approach has been shown to do [4].</p>
        <p>We additionally explore the NER task in a limited setting in Table 17. These experiments yield
similarly poor performance, most likely due to the approach’s inability to accurately pinpoint the
correct locations of the entities in the input texts, and thus failing to extract the proper indices required
for validation. We address this shortcoming by extracting the indices from the text based on the
predicted text spans, with little apparent performance impact.</p>
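        <p>A minimal sketch of this index recovery (the function name is ours):</p>
        <p># Sketch: recover character offsets by locating the predicted span verbatim.
def span_to_indices(text, span):
    start = text.find(span)
    if start == -1:
        return None  # span not found verbatim; entity cannot be located
    return start, start + len(span)</p>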
        <sec id="sec-10-2-1">
          <title>A.2.1. Experimenting with the output lengths</title>
          <p>Further experiments include a study of the scores for capped outputs and ground truths, effectively calculating the micro averages for different k in Figure 4. These evaluations suggest that our method may initially return the best-effort results and does not generate too many relations at once, indicating that the model’s performance is at fault here, or that the model should output more results, which is also supported by the improved performance of our extended fine-tuning.</p>
          <p>8: https://huggingface.co/NeuML/pubmedbert-base-embeddings</p>
        </sec>
      </sec>
      <sec id="sec-10-3">
        <title>A.3. Graphwise collaboration</title>
        <p>We also collaborated with the Graphwise team to combine our results, taking both the intersection and the union between our result sets. The results of this collaboration can be found in Tables 18 to 20, matching our test results quite well. These results indicate that the LoRA fine-tuned models perform best in this combined setting. The union performs significantly better, suggesting that our model indeed produces a few very good results. This is even more evident when the precision of the intersection is investigated, reaching a score of 0.96 for Subtasks 6.2.1 and 6.2.2, which is significantly higher than any other model on the leaderboards.</p>
        <p>[Figure 3: output loss for the extended LoRA+ fine-tune. Figure 4: micro-averaged F1 for capped output lengths k = 1 to 8.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          , E. Cambria,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marttinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graphs: Representation, acquisition, and applications</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>33</volume>
          (
          <year>2022</year>
          )
          <fpage>494</fpage>
          -
          <lpage>514</lpage>
          . doi:
          <volume>10</volume>
          .1109/TNNLS.
          <year>2021</year>
          .
          <volume>3070843</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>