<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text2KGBench-LettrIA: A Refined Benchmark for Text2Graph Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julien Plu</string-name>
          <email>julien@lettria.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Moreno Escobar</string-name>
          <email>oscar@lettria.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edouard Trouillez</string-name>
          <email>edouard@lettria.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axelle Gapin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lisena</string-name>
          <email>pasquale.lisena@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thibault Ehrhart</string-name>
          <email>thibault.ehrhart@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LettrIA</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Recent advances in Large Language Models (LLMs) have catalyzed significant research into automated knowledge graph (KG) construction from text, a fundamental challenge at the intersection of natural language processing and semantic web technologies. However, the reliability of evaluating model performance is hindered by limitations in existing benchmarks like Text2KGBench, which exhibit shortcomings in data quality, ontological consistency, and structural design. To address these issues, this paper introduces Text2KGBench-LettrIA, a substantially revised and curated benchmark derived from the DBpedia-WebNLG portion of Text2KGBench. Our primary contributions include: (1) the systematic refinement of 19 domain ontologies to enforce hierarchical structure and formal typing; (2) a complete re-annotation of 4,860 sentences, yielding over 14,000 high-fidelity triples under a strict set of annotation guidelines; and (3) the introduction of an enriched data format with enhanced metadata to ensure reproducibility and support multifaceted evaluation. We demonstrate the utility of our benchmark by evaluating a suite of both proprietary and open-weights LLMs in zero-shot and fine-tuned settings, respectively. Our results reveal a key finding: smaller, fine-tuned open-weights models can achieve superior F1 accuracy compared to their larger, proprietary counterparts, underscoring the critical role of high-quality, schema-aligned training data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>• Annotation and Data Quality: The data annotations in the original benchmark were
inconsistent and unreliable. This was caused by a lack of standardization for entity names and literal
values, a failure to strictly limit annotations to textual evidence, and the presence of grammatical
errors in the source sentences.
• Structural and Technical: From a technical perspective, the original dataset was difficult to use
and lacked features essential for reproducibility. Its data structure was missing key information
and contained formatting errors, while the ontologies themselves were undocumented and used
an overly complicated URI scheme.</p>
<p>To address these shortcomings, this work makes the following primary contributions:
• We introduce Text2KGBench-LettrIA, a rigorously corrected and enriched benchmark for
ontology-guided KG construction. This new version rectifies annotation errors, ensures
ontological compliance, and improves overall data quality to facilitate more accurate and meaningful
model evaluation. The benchmark is available upon request to the authors.
• We conduct an extensive empirical evaluation of diverse language models, including
proprietary APIs and open-weights models, on Text2KGBench-LettrIA. Our findings reveal that
fine-tuned open models can consistently outperform larger, proprietary models in zero- or
few-shot settings, demonstrating their effectiveness for structured information extraction.</p>
      <p>The remainder of this paper is organized as follows. Section 2 reviews related work on KG construction
from text. Section 3 details our methodology for revising the benchmark. Section 4 presents our
experimental setup and comparative results. Finally, Section 5 concludes with a summary of our
findings and outlines directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of automatically constructing Knowledge Graphs (KGs) from unstructured text, commonly
known as Text-to-Knowledge-Graph (Text2KG), has become a central challenge in natural language
processing and semantic web research. This process facilitates the transformation of textual information
into structured, machine-readable knowledge representations. It is a composite task that typically
integrates sub-problems such as Named Entity Recognition (NER), Relation Extraction (RE), and Entity
Linking (EL), which are orchestrated within either pipeline or end-to-end architectures. For a
comprehensive formalization of the problem and an extensive literature review, we direct the reader to the
systematic survey by Regino et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
The growing interest in this field is evidenced by sustained community efforts, including the Text2KG
workshop series, held annually since 2022 and approaching its fifth edition in 2025 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the yearly
Knowledge Base Construction from Pre-trained Language Models (LM-KBC) challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
These efforts are supported by the development of standardized datasets. One of the earliest and
most influential is WebNLG [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which pairs textual descriptions with RDF-style triples. WebNLG
inspired subsequent work like TekGen [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which expanded the corpus with synthetically generated
data. More recently, Text2KGBench [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] established a benchmark to evaluate the generation of
ontology-compliant triples grounded in source text. However, as we will detail, Text2KGBench exhibits limitations
concerning data quality and ontological rigor, which directly motivates the development of our proposed
benchmark. Another significant contribution in this domain is the REBEL dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which was
specifically designed to advance the task of open relation extraction from unstructured text. REBEL
introduces a large-scale, fine-grained benchmark that captures a wide range of relations and entities,
enabling more comprehensive evaluations of models’ ability to extract structured knowledge from
natural language.
      </p>
      <p>
        Methodologies for relation extraction have evolved significantly. Early approaches progressed from
rule-based systems to feature-engineered machine learning and subsequently to deep learning
architectures. Seminal neural models introduced sequence labeling and multi-task learning frameworks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
More advanced architectures like Seq2RDF [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] later framed the task as a sequence-to-sequence problem
to translate natural language directly into RDF triples. The advent of transformer-based encoders led to
powerful models for joint entity and relation extraction [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A critical shortcoming of many of these
models, however, is their frequent lack of explicit integration with ontological constraints, limiting
their utility for constructing semantically coherent KGs.
      </p>
      <p>
        To address this gap, the paradigm of schema-aware extraction has emerged, where generated triples
must conform to a predefined ontology. Recent studies have explored leveraging external schema
constraints during training, for example through few-shot perspective transfer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or
knowledge-driven synthetic data generation for zero-shot extraction [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Others have investigated the use of
structured prompts or ontology-guided decoding to improve the alignment of LLM outputs with a target
schema. For instance, Ding et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed model collaboration strategies to mitigate hallucinations
and enhance recall.
      </p>
      <p>
        Large Language Models (LLMs) such as GPT-4 and Claude have demonstrated impressive in-context
learning capabilities for information extraction. Nonetheless, their application to Text2KG is hampered
by a propensity for factual hallucination and inconsistent adherence to structured output formats [
        <xref ref-type="bibr" rid="ref14 ref5">14, 5</xref>
        ].
While efforts to evaluate and mitigate these issues are ongoing, existing benchmarks often lack the
ontological precision required for a fair and rigorous assessment. The benchmark introduced in this
paper is specifically designed to fill this void.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Revision of Text2KGBench</title>
      <p>This section details the revision and re-annotation of the Text2KGBench benchmark, undertaken to
address critical limitations in its original version and enhance its utility for evaluating modern
text-to-graph models. Our efforts focused on two key areas: a comprehensive revision of the underlying
ontologies and a complete re-annotation of the corpus based on a new, rigorous set of guidelines. A team
of four experts specializing in knowledge representation and natural language processing conducted
both activities. The process began with an independent pass by each expert, followed by a reconciliation
phase in which disagreements were resolved through discussion and majority consensus on a single solution.
Once all individual annotations were complete, the team convened to review the entire set, discuss any
discrepancies, and reach a final consensus.</p>
      <sec id="sec-3-1">
        <title>3.1. Ontologies Refinement</title>
<p>The original Text2KGBench ontologies, while extensive, suffered from structural and semantic issues that
limited their precision. The benchmark was organized into 19 ontologies, one for each domain, but these lacked hierarchical
depth and formal consistency. We conducted a thorough revision to address these limitations, focusing
on improving their coherence, structural integrity, and semantic expressiveness.</p>
        <p>Semantic Coherence and Granularity A primary objective was to ensure each domain ontology
was self-contained and conceptually coherent. We systematically identified and pruned concepts and
relations not directly relevant to their specified domains. For example, within the Film ontology,
entities such as Club and Station, and relations like spokenIn, were removed as they are better
situated in other contexts. This curation ensures that each domain ontology accurately models its core
concepts, improving the benchmark’s overall focus. This decision reflects our internal practice: we
work with ontologies focused on a specific client domain or use case, and we do not want to
extract information unrelated to that domain or use case.</p>
        <p>To reduce ambiguity and improve clarity, we harmonized property names. For instance, the property
campus was renamed to address to more accurately reflect its semantic role, and staff was specified
as academicStaffSize for explicitness. Similarly, the generic location property was refined into
more specific relations such as city or country, depending on the context, thereby increasing the
precision of the knowledge graph.</p>
        <p>Structural and Formal Enhancements A significant structural enhancement was the
introduction of a formal class hierarchy using rdfs:subClassOf relationships. In the original flat structure,
University was an isolated class. It is now explicitly defined as a subclass of AcademicInstitution,
which itself is a subclass of Organization. This hierarchical structure is not merely a formal
improvement; it enables more nuanced evaluation metrics. For instance, we can now measure hierarchical
precision, rewarding a model for predicting a correct superclass (e.g. AcademicInstitution) even if
the specific subclass (University) is missed.</p>
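<p>The hierarchy-aware scoring idea described above can be sketched as follows. This is a minimal illustration only: the subclass table is a tiny excerpt, and the 0.5 partial-credit value is an assumption, not the benchmark's official weighting.</p>

```python
# Minimal sketch of hierarchy-aware scoring: a predicted class earns partial
# credit when it is a superclass of the gold class. The credit value (0.5)
# and the subclass table below are illustrative assumptions.

SUBCLASS_OF = {
    "University": "AcademicInstitution",
    "AcademicInstitution": "Organization",
}

def superclasses(cls):
    """Return the chain of superclasses of `cls`, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain

def hierarchical_match(predicted, gold):
    """1.0 for an exact match, 0.5 when `predicted` is a superclass of `gold`, else 0.0."""
    if predicted == gold:
        return 1.0
    if predicted in superclasses(gold):
        return 0.5
    return 0.0
```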
<p>Further, properties were rigorously typed as either ObjectProperty (linking two entities) or
DatatypeProperty (linking an entity to a literal value), with explicit domains and ranges defined for
each. Datatype ranges were specified using standard XML Schema types (e.g. xsd:string, xsd:date,
or xsd:integer), enforcing data consistency and aiding downstream processing. To improve usability,
we added rdfs:comment annotations for all properties and classes and simplified the URIs by removing
the intermediate /relations and /concepts path segments. The rdfs:comment annotations were
written by the authors in our own words to describe each class and property.</p>
        <p>Finally, to support reproducibility and tracking, the new ontology includes metadata for contributors
and is explicitly versioned as version 2.0 using owl:versionIRI. A comprehensive comparison of
these changes is presented in Table 4, in the Appendix.</p>
        <p>In the appendix, Table 2 presents an overview of the main statistics for each ontology in
Text2KGBench-LettrIA and Text2KGBench. The Text2KGBench-LettrIA dataset is significantly lighter,
with approximately 21.80% fewer classes and approximately 37.81% fewer properties. Additionally,
datatype properties are exclusively present in Text2KGBench-LettrIA.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Re-annotations Guidelines</title>
        <p>A robust benchmark requires annotation guidelines that are consistent, unambiguous, and
computationally tractable. We established a comprehensive rulebook for the re-annotation process to ensure
high-quality, reproducible data.</p>
        <p>Normalization of Literals To ensure uniformity, we normalized literal values. Dates are
standardized to the ISO 8601 format (yyyy-mm-dd). Ambiguous formats like xx/xx/xxxx are interpreted as
mm/dd/yyyy, a common default in digital systems; if the first value exceeds 12, it is interpreted as
dd/mm/yyyy. Partial dates (e.g. only a year, or only month plus year) are associated with the xsd:gYear or
xsd:gYearMonth datatypes. Durations are also standardized to the XSD notation (e.g. 20 minutes
is turned into PT20M).</p>
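<p>The normalization rules above can be sketched in a few lines of Python; the function names are illustrative, and only the cases described in the guidelines (ambiguous slash-separated dates and minute-valued durations) are handled.</p>

```python
import re

def normalize_date(raw):
    """Normalize an ambiguous xx/xx/xxxx date to ISO 8601 (yyyy-mm-dd).

    Per the guidelines: assume mm/dd/yyyy by default; if the first
    component exceeds 12 it must be the day, so read it as dd/mm/yyyy.
    """
    a, b, year = raw.split("/")
    if int(a) > 12:
        day, month = a, b      # dd/mm/yyyy
    else:
        month, day = a, b      # mm/dd/yyyy
    return f"{year}-{int(month):02d}-{int(day):02d}"

def normalize_duration(raw):
    """Turn '20 minutes' into the XSD duration notation 'PT20M'."""
    m = re.fullmatch(r"(\d+)\s*minutes?", raw.strip(), re.IGNORECASE)
    return f"PT{m.group(1)}M" if m else raw
```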
        <sec id="sec-3-2-1">
          <title>Entity and Relation Extraction</title>
<p>• Location Handling: Our guidelines for locations prioritize capturing geographical containment.</p>
          <p>When a text lists a hierarchy of locations (e.g. “Caen, Normandy, France”), we extract each as
distinct entity. We then generate isPartOf relations to model their relationship of inclusion
(e.g. Caen isPartOf Normandy, Normandy isPartOf France, and Caen isPartOf France). Even
though, we take the full string "Caen, Normandy, France" to define a location. For example,
Antoine livesAt "Caen, Normandy, France". Finally, definite articles are omitted from place
names (e.g., "the Philippines" becomes Philippines).
• Strict Adherence to Textual Evidence: Annotations are strictly confined to information
explicitly present in the source text, avoiding reliance on external world knowledge. For example,
in “Lettria was founded in Paris, France,” Paris is typed as Place. However, in “Lettria was
founded in the city of Paris, France,”, the explicit mention allows for the more specific type City.
This principle ensures that the benchmark evaluates a model’s ability to extract information from
the provided context alone. This rule ensures that the text-to-graph task can be solved relying on
the sole information in the benchmark.</p>
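<p>The geographical-containment rule can be sketched as follows: every place in a comma-separated hierarchy is made part of every place listed after it. This is a minimal sketch assuming a clean, innermost-to-outermost listing; the function name is illustrative.</p>

```python
from itertools import combinations

def containment_triples(location_string):
    """Expand 'Caen, Normandy, France' into isPartOf triples: each place
    is declared part of every place listed after it (transitive pairs included)."""
    places = [p.strip() for p in location_string.split(",")]
    return [(inner, "isPartOf", outer) for inner, outer in combinations(places, 2)]
```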
          <p>Entity Scoping
• Organization names: Corporate suffixes (“Inc.”, “Co.”) are preserved as part of the entity name
to maintain fidelity to the source text (e.g. Caterpillar Inc.).
• Pronoun Resolution: We resolve pronouns to their antecedent entity within the extracted
triple. For ambiguous pronouns like “which,” we employ a heuristic of selecting the immediately
preceding noun phrase as the antecedent. For example, in “...beef kway teow which comes from
the region of Indonesia,” the pronoun “which” is resolved to beef kway teow.
• Multiple Entities: When a single statement applies to multiple entities, we create a separate
triple for each. “Huseyin Butuner and Hilmi Guner designed...” yields two distinct designer
relations, one for each person.</p>
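<p>The multiple-entities rule above can be sketched as a one-per-entity expansion; the function name and the "Memorial" placeholder target are hypothetical, used only to illustrate the rule.</p>

```python
def expand_conjoined_subjects(entities, relation, target):
    """Apply the multiple-entities rule: a statement that applies to several
    entities yields one separate triple per entity."""
    return [(target, relation, entity) for entity in entities]
```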
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. The Resulting Benchmark: Curation and Structural Enhancement</title>
        <p>The culmination of the re-annotation process, guided by the revised ontology and the new annotation
principles, is a benchmark of significantly higher quality and consistency. The resulting dataset
comprises a total of 4,860 sentences, corresponding to 14,882 extracted triples.</p>
        <p>In addition to the primary re-annotation, the benchmark underwent a comprehensive data curation
and enhancement phase to address artifacts present in the original version and to enrich its structure
for more rigorous model evaluation. These post-processing enhancements are detailed as follows:
• Data Sanitization and Canonicalization: A systematic normalization process was applied to
entity and literal values to ensure uniformity and eliminate parsing inconsistencies. This included
several key transformations:
– Entity Name Normalization: Underscores used as word separators in entity
names were replaced with spaces to form canonical, human-readable identifiers (e.g.,
"AWH_Engineering_College" was corrected to "AWH Engineering College").
– Literal Value Cleaning: Superfluous quotation marks that erroneously encapsulated object
values in the original data were removed (e.g., {"obj": "\"Kuttikkattoor\""} was
corrected to {"obj": "Kuttikkattoor"}).
– Numeric Data Typing: String representations of numbers were parsed into their correct
numeric types (e.g., "2000" became 2000). Numerical values are stripped of punctuation;
for example, “18,527” is annotated as 18527 (distinguishing the cases in which the comma
was used as thousand or decimal separator).
– Textual Harmonization: Spelling inconsistencies and diacritical variations in names were
corrected to ensure a true reproduction of what is in the text (e.g., "Hüseyin Bütüner"
in the text is kept as it is and not turned into "Huseyin Butuner").
• Explicit Ontological Typing: To improve the formal alignment between data instances and
the ontology, each triple was enriched with new keys. The subType and objType fields now
explicitly declare the ontological class of the subject and the datatype of the object, respectively.
This structural addition is critical for enabling type-aware evaluation metrics and enforcing
semantic consistency.
• Corpus and Linguistic Refinement: The source text corpus itself was subject to a final review.</p>
        <p>Minor grammatical and punctuation errors were corrected to improve linguistic quality.</p>
        <p>The cumulative efect of these enhancements is illustrated in Figure 1 in the Appendix, which presents
a side-by-side comparison of a data entry before and after the revision process. Table 3 in the Appendix
presents a comparison between the original and new datasets. Text2KGBench-LettrIA maintains the
same number of sentences as Text2KGBench, while the number of triples varies, showing both additions
and reductions with respect to Text2KGBench.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation with LLMs</title>
<p>Our study evaluates the performance of contemporary Large Language Models (LLMs) on the
Text-to-Knowledge-Graph (Text2KG) task, which involves extracting knowledge graph triples from unstructured
text. The evaluation is conducted using the Text2KGBench-LettrIA benchmark. We assess two distinct
categories of models under different conditions.</p>
      <p>
        First, we assessed a comprehensive suite of proprietary models in a zero-shot setting, where models
perform the task without any specific fine-tuning. The evaluated models, grouped by provider, included
several from Anthropic, such as the Claude 3 family (Haiku, Sonnet, Opus) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the Claude 3.5 series
(Haiku, Sonnet V1, Sonnet V2), the Claude 3.7 Sonnet, and the Claude 4 series (Sonnet, Opus). From
Google, we evaluated the Gemini 2.0 family (Flash-Lite, Flash, Pro) and the Gemini 2.5 family (Flash-Lite,
Flash, Pro) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Our assessment also covered OpenAI’s GPT-4.1 series (Full, Mini, Nano) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and
GPT-4o series (Full, Mini) [
        <xref ref-type="bibr" rid="ref18">18</xref>
]. Finally, from Mistral AI, we included the Mistral Medium 2505 model<sup>1</sup>.
      </p>
      <p>
        In parallel, we fine-tuned and subsequently evaluated a selection of prominent open-weights models
to gauge their performance after task-specific adaptation. This set comprised Gemma 3 (4B-IT, 12B-IT,
27B-IT) [
        <xref ref-type="bibr" rid="ref19">19</xref>
]<sup>2</sup>, Mistral Small 3.2 (24B-Instruct)<sup>3</sup>, Phi-4 (14B) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and Qwen 3 in several parameter sizes
(0.6B, 1.7B, 4B, 8B, 14B, 32B) [
        <xref ref-type="bibr" rid="ref21">21</xref>
]. We selected these models because, at that time, they were the best
instruction-tuned pre-trained models on the Hugging Face leaderboard.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Fine-Tuning Methodology</title>
        <p>
          We employed a Supervised Fine-Tuning (SFT) methodology to adapt the selected Large Language
Models (LLMs) for the relation extraction task, utilizing the Unsloth<sup>4</sup> framework for efficient training.
The fine-tuning process is based on Low-Rank Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and involved providing each
model with an input prompt containing two components: (1) a natural language sentence and (2) a
compact representation of the relevant ontology. To mitigate the verbosity of the standard Turtle
syntax and ensure the input fits within the models’ context windows, we adopted a format inspired
by Manchester syntax for representing the ontology schema. The target output for the SFT process
was a JSON object containing the knowledge graph triples extracted from the sentence, mirroring the
ground-truth annotations in our dataset.
        </p>
<p>To assess model performance under different data conditions, we designed and evaluated three
distinct fine-tuning configurations:
Classic Models were fine-tuned on the complete, original training dataset. This configuration serves
as our performance baseline.</p>
        <p>Extended This configuration incorporates data augmentation. The original training set was
supplemented with synthetic data generated by the Gemini 2.5 Pro model. The objective of this
augmentation was to enrich the training data for each ontology, ensuring 500 training
examples per ontology, bringing the training set to 9500 examples in total.</p>
        <p>Generalization This configuration evaluates the models’ zero-shot generalization capabilities to
unseen ontologies using a leave-one-out strategy. Models were trained on a dataset comprising
18 of the 19 ontologies. The held-out ontology (the City ontology) was then used exclusively for
testing. The final test set for this scenario was composed of all examples (both original training
and test splits) associated with the unseen City ontology.
<sup>1</sup>Model details available at: https://mistral.ai/news/mistral-medium-3
<sup>2</sup>Model card: https://huggingface.co/google/gemma-3-12b-it
<sup>3</sup>Model card: https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
<sup>4</sup>https://unsloth.ai/</p>
<p>All the fine-tuning runs<sup>5</sup> for each model were conducted on an NVIDIA H100 GPU.</p>
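<p>As a plain-Python restatement of the reported hyper-parameters (LoRA rank 128, LoRA alpha 512, batch size 1, gradient accumulation 8, 3 epochs), the sketch below shows the effective optimizer batch size they imply. The dictionary keys follow common trainer naming conventions and are assumptions, not the exact Unsloth API.</p>

```python
# Reported fine-tuning hyper-parameters, restated as a config dict
# (key names are illustrative, not Unsloth's exact API).
LORA_CONFIG = {
    "r": 128,                            # LoRA rank
    "lora_alpha": 512,                   # LoRA scaling factor
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
}

# Effective batch size seen by the optimizer: micro-batch x accumulation steps.
effective_batch = (LORA_CONFIG["per_device_train_batch_size"]
                   * LORA_CONFIG["gradient_accumulation_steps"])
```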
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation</title>
        <p>To provide a multifaceted evaluation of our relation extraction approach, we introduce a suite of metrics
that extends beyond the traditional F1-score. Our methodology first categorizes the components of the
knowledge graph into four distinct types:
Entities (E) The classes that serve as the domain and range for object properties, or as the domain for
datatype properties.</p>
        <p>Attributes (A) The literal values that constitute the range of datatype properties.
Properties (P) The datatype properties that link entities to attributes.</p>
        <p>Relations (R) The object properties that link entities to other entities.</p>
        <p>Based on this categorization, we assess model performance across six key dimensions:
• F1-Score: The macro-averaged F1-score for the correct identification and classification of each
extracted entity, attribute, property, and relation.
• Ontological Fidelity: A measure to quantify hallucinations, defined as the generation of types,
properties, or relations that are not present in the reference ontology.
• Domain/Range Adherence: Assesses whether the model’s outputs respect the domain and
range constraints defined in the ontology for all properties (datatype properties) and relations
(object properties). This metric accounts for subclass hierarchies; for instance, if an ontology
specifies a domain of Place and the model predicts City, the prediction is considered valid
provided City is a subclass of Place.
• Structural Validity: Measures whether the generated output conforms to the required JSON
schema, ensuring it is well-formed and parseable.
• Latency: The average inference time in seconds required to generate a response, calculated
across all examples in the test set.
• Cost: The average monetary cost per query. For proprietary models, this is the API cost. For
open-weights models, we estimate the cost based on the hourly price of the required hardware
from a cloud provider (e.g., an OVH Cloud instance at 2.80 €/hour).</p>
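<p>The per-category F1 computation can be sketched as follows; this is a simplified, set-based illustration (treating each category's extracted items as a set per sentence), not necessarily the exact matching procedure used in our evaluation harness.</p>

```python
def f1(predicted, gold):
    """Set-based F1 between predicted and gold items of one category
    (entities, attributes, properties, or relations) for one sentence."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_sentence_pairs):
    """Macro-average the per-sentence F1 scores for one category."""
    scores = [f1(p, g) for p, g in per_sentence_pairs]
    return sum(scores) / len(scores)
```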
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Performance and Insights</title>
        <p>Performance was evaluated using three distinct fine-tuning configurations. The first two configurations
were tested on our "full benchmark," a revised and comprehensive version of the new benchmark. The
third configuration was subsequently tested on a single ontology in a "generalization" scenario. All
experiments involving closed models utilized the most recent, optimized prompt from our internal
text-to-graph production framework.</p>
        <p>[Table caption: per-model scores for Entities, Attributes, Properties, and Relations. The first part lists
closed-source models using a 1-shot prompting strategy; the second part presents results for open-weights models
after two fine-tuning variants: "Classic" (unmarked) and "Extended" (marked with "(ext.)").]</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Full Benchmark</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>Performance</title>
          <p>The most striking finding is the significant performance gap between the two groups.
Fine-tuned models operate in a different league, with most achieving an Entity F1 score exceeding
0.80. This underscores the immense power of specialization. The top performer, Mistral-Small-3.2
(ext.), achieved an outstanding Entity F1 of 0.8837, with other models from the Qwen3 and gemma-3
families clustering in the impressive 0.85–0.87 range. In contrast, the proprietary models, which
test general-purpose reasoning without task-specific training, top out with an Entity F1 below 0.70.
Within this group, a clear performance hierarchy emerges. gemini-2.5-pro stands out as the best
all-rounder, with consistently high F1 scores across all categories (E=0.6595, A=0.8762, P=0.8627, R=0.7076).
Other models act as high-performing specialists: claude-sonnet-4 excels at understanding
complex connections with the highest Relations score (R=0.7126), while gpt-4.1-mini-2025-04-14
is best at identifying discrete items (E=0.6866). Meanwhile, models like gemini-2.0-flash and
claude-3-haiku struggle with the task’s complexity, proving unsuitable for this type of detailed
extraction.</p>
          <p>Linear
5Fine-Tuning Hyper-Parameters: Lora Rank: 128 Lora Alpha: 512 Batch Size: 1 Gradient Accumulation: 8 Epochs: 3
Safety and Reliability Beyond raw performance, fine-tuning proves to be a profound method for
ensuring safety and reliability. Nearly all fine-tuned models achieved over 99% validly formatted outputs—
with several reaching a perfect 100%—demonstrating that specialization is an exceptionally effective
way to guarantee adherence to a specific output format. Furthermore, we observed an "extended effect"
in fine-tuned variants: these models often trade a slight dip in Entity F1 for improved scores in other
categories and, crucially, lower hallucination rates and better adherence to the ontology. This suggests
the extended fine-tuning process prioritizes overall robustness and safety. Among the proprietary models, the
top performers also demonstrate strong reliability. gemini-2.5-pro and claude-opus-4 lead in
producing validly formatted outputs (99.80% and 99.20%, respectively) and show superior adherence to
the ontology. However, safety is not a given in this category. While models like claude-3.7-sonnet
and gemini-2.5-pro boast extremely low hallucination scores, gpt-4.1-nano exhibits a catastrophic
failure with a hallucination precision of just 0.4698, making it a high risk for generating false information.
Efficiency The efficiency profiles of the two groups present starkly different trade-offs. For the
API-based proprietary models, the balance is between performance, latency, and cost-per-call. The
gemini-flash models are the fastest, with response times around 2 seconds, while the
powerful claude-opus-4 is the slowest at a substantial 37.4 seconds. A similar trade-off exists in cost:
gemini-2.0-flash-lite (0.0002¢) is one of the cheapest, whereas claude-opus-4 (0.1682¢) is by
far the most expensive, illustrating the classic balance between capability and operational cost. This
dynamic shifts entirely with the fine-tuned models, which run on dedicated local hardware. Latencies
are astonishingly low, with all models completing the task in under 0.02 seconds—orders of magnitude
faster than API calls. The trade-of here is the high, amortized cost of the fine-tuning process and
hosting the model on powerful GPU infrastructure. This cost scales directly with model size, making
larger models like gemma-3-27b and Qwen3-32B the most expensive to operate.</p>
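          <p>These cost figures imply a simple break-even calculation between pay-per-call APIs and self-hosted fine-tuned models. The sketch below uses the per-call prices reported above; the monthly GPU hosting cost is a hypothetical placeholder, not a measured figure from our experiments.</p>
          <preformat>
```python
import math

# Back-of-the-envelope break-even between API calls and self-hosted inference.
# Per-call API prices (in cents) come from the efficiency results above;
# the monthly hosting cost is a HYPOTHETICAL placeholder.
API_COST_CENTS_PER_CALL = {
    "gemini-2.0-flash-lite": 0.0002,
    "claude-opus-4": 0.1682,
}
GPU_HOSTING_CENTS_PER_MONTH = 150_000  # hypothetical: roughly 1,500 USD/month

def break_even_calls(api_cost_cents: float, hosting_cents: float) -> int:
    """Monthly call volume above which self-hosting is cheaper than the API."""
    return math.ceil(hosting_cents / api_cost_cents)

for model, cost in API_COST_CENTS_PER_CALL.items():
    print(model, break_even_calls(cost, GPU_HOSTING_CENTS_PER_MONTH))
```
          </preformat>
          <p>Under this hypothetical hosting cost, self-hosting pays off far sooner against expensive models such as claude-opus-4 than against cheap flash-tier APIs, matching the high-volume production argument made above.</p>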
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.2. Generalization Benchmark</title>
          <p>The Generalization Benchmark results are displayed in Table C in the Appendix.</p>
          <p>Robust Generalization to Unseen Ontologies The fine-tuned models demonstrate a remarkable
capacity for generalization, adeptly applying their learned skills to novel ontologies with only a minimal
drop in performance. A direct comparison reveals that the top-performing models maintain their
elite status even on unfamiliar schemas. For instance, gemma-3-12b-it achieves an outstanding
Entity F1 of 0.8376 on the generalization set, a marginal decrease from its 0.8606 score on the full
benchmark. Crucially, this level of performance significantly surpasses that of the best closed-source
models on the same generalization task, with gemma-3-12b-it outperforming the top proprietary
model, claude-sonnet-4 (0.7829), by a substantial margin. This robustness extends beyond raw F1
scores to safety and reliability; the fine-tuned models maintain their near-zero hallucination rates and
high adherence to ontological constraints (e.g., gemma-3-27b-it scores 0.9325 for relations respect),
with valid output rates remaining at or near 100%. This indicates that the fine-tuning process instills a
deep, transferable understanding of the text-to-graph task structure, creating models that are not only
specialized but also highly adaptable and reliable when faced with new, unseen challenges.</p>
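          <p>For reference, the Entity F1 scores compared here follow the standard set-based precision/recall formulation. The following is a minimal illustrative sketch, not the benchmark's actual scorer, which may additionally normalize entity labels:</p>
          <preformat>
```python
# Minimal set-based F1 for extracted entities, as commonly used in
# text-to-graph evaluation. Illustrative only.
def entity_f1(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over exact-match entity sets."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted.intersection(gold))
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"Berlin", "Germany", "Spree"}
pred = {"Berlin", "Germany", "Europe"}
print(round(entity_f1(pred, gold), 4))  # 0.6667
```
          </preformat>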
        </sec>
        <sec id="sec-4-3-4">
          <title>4.3.3. Lessons Learned</title>
          <p>This comprehensive benchmark reveals a clear and instructive dichotomy between specialized,
fine-tuned models and general-purpose, proprietary models, offering several key lessons for practitioners.</p>
          <p>First, specialization is paramount for peak performance and reliability. The fine-tuned
open-weights models operate in a separate, higher-performance tier, unambiguously demonstrating that
for complex, structured tasks like text-to-graph conversion, task-specific training is the most effective
strategy. This superiority is not confined to accuracy metrics like F1 scores; it extends crucially to
output reliability, where fine-tuned models achieve near-perfect adherence to formatting and ontological
constraints, effectively eliminating structural errors and minimizing hallucinations.</p>
          <p>Second, effective fine-tuning teaches generalization, not just memorization. A critical finding
is that fine-tuned models maintain their performance advantage even when confronted with entirely
unseen ontologies. Their ability to robustly generalize the underlying task structure surpasses even the
most advanced proprietary models on the same out-of-domain test set. This proves that the fine-tuning
process instills a deep, transferable understanding of the task’s logic, making it a viable strategy for
building adaptable and scalable systems.</p>
          <p>Finally, the choice between the two approaches hinges on a fundamental trade-off between
accessibility and efficiency. Proprietary models offer an invaluable, zero-setup solution for rapid
prototyping and tasks where the overhead of fine-tuning is prohibitive. Within this group, a clear
hierarchy exists, with models like gemini-2.5-pro and the claude-4 family providing a strong
baseline of general reasoning. However, this convenience comes at the cost of higher latency and
a pay-per-call model. In contrast, fine-tuned models represent a strategic investment. While they
require significant upfront and ongoing infrastructure costs for training and hosting, they deliver
inference speeds that are orders of magnitude faster and are economically superior for high-volume,
production-level applications, all while providing unparalleled performance and safety.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we introduced Text2KGBench-LettrIA, a rigorously revised benchmark for evaluating
ontology-guided Text-to-Knowledge-Graph systems. By systematically overhauling the
DBpedia-WebNLG portion of Text2KGBench, we addressed critical limitations in its ontological design, annotation
quality, and structural consistency. The resulting benchmark features 19 refined ontologies with enforced
hierarchical relationships and strict typing, alongside over 14,000 high-fidelity triples re-annotated
under stringent guidelines to ensure textual grounding and reproducibility. This work provides the
community with a resource that enables a more precise and nuanced evaluation of model capabilities in
structured knowledge extraction.</p>
      <p>Our experiments yield a significant finding: smaller, open-weights language models, when properly
fine-tuned on our high-quality benchmark, can outperform larger, proprietary models in terms of
F1-score for triple extraction. This result underscores the pivotal role that task-specific data quality
and model adaptation play in achieving state-of-the-art performance. Nevertheless, our analysis also
highlights a persistent challenge: even high-performing models exhibit a tendency to hallucinate or
deviate from ontological constraints, indicating that high accuracy on individual components does not
guarantee perfect schema adherence.</p>
      <p>Building on this work, we identify several key directions for future research.</p>
      <p>• Post-Hoc Alignment: The prevalence of schema violations and hallucinations, even after
supervised fine-tuning (SFT), suggests the need for a subsequent alignment phase. Investigating
reinforcement learning-based techniques such as Proximal Policy Optimization (PPO) or Direct
Preference Optimization (DPO) could further refine model outputs to improve ontological fidelity.
• Explainability and Reasoning: Future work could focus on developing a reasoning layer
atop the extraction models. Such a component would not only extract triples but also generate
explanations for its predictions, thereby increasing the transparency and trustworthiness of the
KG construction process.
• Context Window Extension: A current limitation of many open-weights models is their
relatively small context window compared to proprietary counterparts. Future experiments
should explore methods to extend the effective context size of fine-tuned models, enabling them
to process larger and more complex documents and ontologies.
• Ontology: The current ontologies contain only binary relations and therefore cannot describe
complex entities such as events. An improvement would be to introduce n-ary relations through
reification, yielding more realistic ontologies, and to test whether LLMs, even fine-tuned ones,
can properly handle such complex ontologies.</p>
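      <p>To illustrate the reification idea from the last point, an n-ary event can be decomposed into binary triples around an intermediate event node. The example below is a hypothetical sketch, not part of the benchmark:</p>
      <preformat>
```python
# Hypothetical reification of an n-ary "acquisition" event into binary
# triples around an intermediate event node, as envisaged for future
# ontology versions.
def reify_event(event_id: str, roles: dict) -> list:
    """Turn a {role: value} mapping into (event_id, role, value) binary triples."""
    return [(event_id, role, value) for role, value in roles.items()]

triples = reify_event("event_1", {
    "rdf:type": "Acquisition",
    "buyer": "CompanyA",
    "acquired": "CompanyB",
    "date": "2024-01-15",
})
print(len(triples))  # 4
```
      </preformat>
      <p>The single four-place event becomes four binary triples sharing the intermediate node, which is exactly the structure a binary-relation ontology cannot express directly.</p>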
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and LeChat by MistralAI in order to:
Grammar and spelling check; Paraphrase and reword. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the French Public Investment Bank (Bpifrance) i-Demo program within
the LettRAGraph project (Grant ID DOS0256163/00).</p>
    </sec>
    <sec id="sec-8">
      <title>A. Dataset Statistics</title>
      <p>[Table: per-ontology dataset statistics]</p>
    </sec>
    <sec id="sec-9">
      <title>B. Ontology and Annotation Comparison</title>
      <p>[Table: per-model scores]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] N. Mihindukulasooriya, S. Tiwari, C. F. Enguix, K. Lata, Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text, in: The Semantic Web - ISWC 2023: 22nd International Semantic Web Conference, Proceedings, Part II, Springer-Verlag, Berlin, Heidelberg, 2023, pp. 247-265. doi:10.1007/978-3-031-47243-5_14.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] O. Agarwal, H. Ge, S. Shakeri, R. Al-Rfou, Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 3554-3565. doi:10.18653/v1/2021.naacl-main.278.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] C. Gardent, A. Shimorina, S. Narayan, L. Perez-Beltrachini, Creating Training Corpora for NLG Micro-Planners, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 179-188. doi:10.18653/v1/P17-1017.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] A. G. Regino, A. Rossanez, R. da Silva Torres, J. C. dos Reis, A Systematic Literature Review on RDF Triple Generation from Natural Language Text, Semantic Web (2025).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] S. Tiwari, N. Mihindukulasooriya, F. Osborne, D. Kontokostas, J. D'Souza, M. Kejriwal, et al., Preface for the Third International Workshop on Knowledge Graph Generation from Text, in: Third International Workshop on Knowledge Graph Generation from Text. Data Quality meets Machine Learning and Knowledge Graphs, volume 3747, CEUR-WS, 2024, pp. 1-4.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] J. Kalo, T. Nguyen, S. Razniewski, B. Zhang, Preface: LM-KBC Challenge 2024, in: Joint Proceedings of the KBC-LM Workshop and the LM-KBC Challenge 2024, CEUR-WS.org, 2024, pp. 1-5.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] P.-L. Huguet Cabot, R. Navigli, REBEL: Relation Extraction By End-to-end Language generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021. URL: https://github.com/Babelscape/rebel/blob/main/docs/EMNLP_2021_REBEL__Camera_Ready_.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation Classification via Convolutional Deep Neural Network, in: J. Tsujii, J. Hajic (Eds.), Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 2014, pp. 2335-2344. URL: https://aclanthology.org/C14-1220/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Y. Liu, T. Zhang, Z. Liang, H. Ji, D. L. McGuinness, Seq2RDF: An end-to-end application for deriving triples from natural language text, in: Proceedings of the ISWC 2018 Posters &amp; Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International Semantic Web Conference (ISWC 2018), 2018.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] J. Wang, W. Lu, Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 1706-1721. URL: https://aclanthology.org/2020.emnlp-main.133/. doi:10.18653/v1/2020.emnlp-main.133.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] J. Fei, W. Zeng, X. Zhao, X. Li, W. Xiao, Few-Shot Relational Triple Extraction with Perspective Transfer Network, in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management, CIKM '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 488-498. doi:10.1145/3511808.3557323.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] L. He, H. Zhang, J. Liu, K. Sun, Q. Zhang, Zero-Shot Relation Triplet Extraction via Knowledge-Driven LLM Synthetic Data Generation, in: D.-S. Huang, Z. Si, C. Zhang (Eds.), Advanced Intelligent Computing Technology and Applications, Springer Nature, Singapore, 2024, pp. 329-340.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Z. Ding, W. Huang, J. Liang, Y. Xiao, D. Yang, Improving Recall of Large Language Models: A Model Collaboration Approach for Relational Triple Extraction, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Turin, Italy, 2024, pp. 8890-8901. URL: https://aclanthology.org/2024.lrec-main.778/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] A. Ananya, S. Tiwari, N. Mihindukulasooriya, T. Soru, Z. Xu, D. Moussallem, Towards Harnessing Large Language Models as Autonomous Agents for Semantic Triple Extraction from Unstructured Text, in: TEXT2KG 2024: Third International Workshop on Knowledge Graph Generation from Text, Hersonissos, Greece, 2024.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] Anthropic, The Claude 3 Model Family: Opus, Sonnet, Haiku, 2024. URL: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] G. Comanici, et al., Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, 2025. URL: https://arxiv.org/abs/2507.06261. arXiv:2507.06261.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] OpenAI, et al., GPT-4 Technical Report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] OpenAI, et al., GPT-4o System Card, 2024. URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] Gemma Team, et al., Gemma 3 Technical Report, 2025. URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hewett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Javaheripi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C. T.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>de Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saarikivi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <source>Phi-4 Technical Report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2412.08905. arXiv:2412.08905.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <source>Qwen3 Technical Report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.09388. arXiv:2505.09388.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-Rank Adaptation of Large Language Models</article-title>
          , in: International Conference on Learning Representations,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] Evaluated model identifiers: mistral-medium-2505, claude-sonnet-4, claude-3-opus, claude-3.5-sonnet-v2, claude-3-sonnet, claude-3.7-sonnet, gpt-4.1-mini-2025-04-14, gemini-2.5-pro, gpt-4.1-2025-04-14, gemini-2.5-flash-lite, claude-3.5-sonnet-v1, claude-3.5-haiku, gpt-4o-2024-11-20, gemini-2.0-flash-lite, claude-3-haiku, gpt-4o-mini-2024-07-18, gpt-4.1-nano-2025-04-14, gemini-2.5-flash, gemini-2.0-flash.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>