<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text2KGBench-LettrIA: A Refined Benchmark for Text2Graph Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julien Plu</string-name>
          <email>julien@lettria.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Moreno Escobar</string-name>
          <email>oscar@lettria.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edouard Trouillez</string-name>
          <email>edouard@lettria.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axelle Gapin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lisena</string-name>
          <email>pasquale.lisena@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thibault Ehrhart</string-name>
          <email>thibault.ehrhart@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LettrIA</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Recent advances in Large Language Models (LLMs) have catalyzed significant research into automated knowledge graph (KG) construction from text, a fundamental challenge at the intersection of natural language processing and semantic web technologies. However, the reliability of evaluating model performance is hindered by limitations in existing benchmarks like Text2KGBench, which exhibit shortcomings in data quality, ontological consistency, and structural design. To address these issues, this paper introduces Text2KGBench-LettrIA, a substantially revised and curated benchmark derived from the DBpedia-WebNLG portion of Text2KGBench. Our primary contributions include: (1) the systematic refinement of 19 domain ontologies to enforce hierarchical structure and formal typing; (2) a complete re-annotation of 4,860 sentences, yielding over 14,000 high-fidelity triples under a strict set of annotation guidelines; and (3) the introduction of an enriched data format with enhanced metadata to ensure reproducibility and support multifaceted evaluation. We demonstrate the utility of our benchmark by evaluating a suite of both proprietary and open-weights LLMs in zero-shot and fine-tuned settings, respectively. Our results reveal a key finding: smaller, fine-tuned open-weights models can achieve superior F1 accuracy compared to their larger, proprietary counterparts, underscoring the critical role of high-quality, schema-aligned training data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>• Annotation and Data Quality: The data annotations in the original benchmark were
inconsistent and unreliable. This was caused by a lack of standardization for entity names and literal
values, a failure to strictly limit annotations to textual evidence, and the presence of grammatical
errors in the source sentences.
• Structural and Technical: From a technical perspective, the original dataset was difficult to use
and lacked features essential for reproducibility. Its data structure was missing key information
and contained formatting errors, while the ontologies themselves were undocumented and used
an overly complicated URI scheme.</p>
<p>To address these shortcomings, this work makes the following primary contributions:
• We introduce Text2KGBench-LettrIA, a rigorously corrected and enriched benchmark for
ontology-guided KG construction. This new version rectifies annotation errors, ensures
ontological compliance, and improves overall data quality to facilitate more accurate and meaningful
model evaluation. The benchmark is available upon request to the authors.
• We conduct an extensive empirical evaluation of diverse language models, including
proprietary APIs and open-weights models, on Text2KGBench-LettrIA. Our findings reveal that
fine-tuned open models can consistently outperform larger, proprietary models in zero- or
few-shot settings, demonstrating their effectiveness for structured information extraction.</p>
      <p>The remainder of this paper is organized as follows. Section 2 reviews related work on KG construction
from text. Section 3 details our methodology for revising the benchmark. Section 4 presents our
experimental setup and comparative results. Finally, Section 5 concludes with a summary of our
findings and outlines directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of automatically constructing Knowledge Graphs (KGs) from unstructured text, commonly
known as Text-to-Knowledge-Graph (Text2KG), has become a central challenge in natural language
processing and semantic web research. This process facilitates the transformation of textual information
into structured, machine-readable knowledge representations. It is a composite task that typically
integrates sub-problems such as Named Entity Recognition (NER), Relation Extraction (RE), and Entity
Linking (EL), which are orchestrated within either pipeline or end-to-end architectures. For a
comprehensive formalization of the problem and an extensive literature review, we direct the reader to the
systematic survey by Regino et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
The growing interest in this field is evidenced by sustained community efforts, including the Text2KG
workshop series, held annually since 2022 and approaching its fifth edition in 2025 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the yearly
Knowledge Base Construction from Pre-trained Language Models (LM-KBC) challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
These efforts are supported by the development of standardized datasets. One of the earliest and
most influential is WebNLG [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which pairs textual descriptions with RDF-style triples. WebNLG
inspired subsequent work like TekGen [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which expanded the corpus with synthetically generated
data. More recently, Text2KGBench [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] established a benchmark to evaluate the generation of
ontology-compliant triples grounded in source text. However, as we will detail, Text2KGBench exhibits limitations
concerning data quality and ontological rigor, which directly motivates the development of our proposed
benchmark. Another significant contribution in this domain is the REBEL dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which was
specifically designed to advance the task of open relation extraction from unstructured text. REBEL
introduces a large-scale, fine-grained benchmark that captures a wide range of relations and entities,
enabling more comprehensive evaluations of models’ ability to extract structured knowledge from
natural language.
      </p>
      <p>
        Methodologies for relation extraction have evolved significantly. Early approaches progressed from
rule-based systems to feature-engineered machine learning and subsequently to deep learning
architectures. Seminal neural models introduced sequence labeling and multi-task learning frameworks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
More advanced architectures like Seq2RDF [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] later framed the task as a sequence-to-sequence problem
to translate natural language directly into RDF triples. The advent of transformer-based encoders led to
powerful models for joint entity and relation extraction [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A critical shortcoming of many of these
models, however, is their frequent lack of explicit integration with ontological constraints, limiting
their utility for constructing semantically coherent KGs.
      </p>
      <p>
        To address this gap, the paradigm of schema-aware extraction has emerged, where generated triples
must conform to a predefined ontology. Recent studies have explored leveraging external schema
constraints during training, for example through few-shot perspective transfer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or
knowledge-driven synthetic data generation for zero-shot extraction [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Others have investigated the use of
structured prompts or ontology-guided decoding to improve the alignment of LLM outputs with a target
schema. For instance, Ding et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed model collaboration strategies to mitigate hallucinations
and enhance recall.
      </p>
      <p>
        Large Language Models (LLMs) such as GPT-4 and Claude have demonstrated impressive in-context
learning capabilities for information extraction. Nonetheless, their application to Text2KG is hampered
by a propensity for factual hallucination and inconsistent adherence to structured output formats [
        <xref ref-type="bibr" rid="ref14 ref5">14, 5</xref>
        ].
While efforts to evaluate and mitigate these issues are ongoing, existing benchmarks often lack the
ontological precision required for a fair and rigorous assessment. The benchmark introduced in this
paper is specifically designed to fill this void.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Revision of Text2KGBench</title>
      <p>This section details the revision and re-annotation of the Text2KGBench benchmark, undertaken to
address critical limitations in its original version and enhance its utility for evaluating modern
text-to-graph models. Our efforts focused on two key areas: a comprehensive revision of the underlying
ontologies and a complete re-annotation of the corpus based on a new, rigorous set of guidelines. A team
of four experts specializing in knowledge representation and natural language processing conducted
both activities. The process began with an independent pass by each expert, followed by a reconciliation
phase in which disagreements were resolved through discussion and majority consensus on a single solution.
Once all individual annotations were complete, the team convened to review the entire set, discuss any
discrepancies, and reach a final consensus.</p>
      <sec id="sec-3-1">
        <title>3.1. Ontologies Refinement</title>
<p>The original Text2KGBench ontologies, while extensive, suffered from structural and semantic issues that
limited their precision. The benchmark was organized into 19 ontologies, one for each domain, but these lacked hierarchical
depth and formal consistency. We conducted a thorough revision to address these limitations, focusing
on improving their coherence, structural integrity, and semantic expressiveness.</p>
        <p>Semantic Coherence and Granularity A primary objective was to ensure each domain ontology
was self-contained and conceptually coherent. We systematically identified and pruned concepts and
relations not directly relevant to their specified domains. For example, within the Film ontology,
entities such as Club and Station, and relations like spokenIn, were removed as they are better
situated in other contexts. This curation ensures that each domain ontology accurately models its core
concepts, improving the benchmark’s overall focus. This decision reflects our internal practice: we
work with ontologies focused on a specific client domain or use case, and we do not want to
extract information unrelated to that domain or use case.</p>
        <p>To reduce ambiguity and improve clarity, we harmonized property names. For instance, the property
campus was renamed to address to more accurately reflect its semantic role, and staff was specified
as academicStaffSize for explicitness. Similarly, the generic location property was refined into
more specific relations such as city or country, depending on the context, thereby increasing the
precision of the knowledge graph.</p>
        <p>Structural and Formal Enhancements A significant structural enhancement was the
introduction of a formal class hierarchy using rdfs:subClassOf relationships. In the original flat structure,
University was an isolated class. It is now explicitly defined as a subclass of AcademicInstitution,
which itself is a subclass of Organization. This hierarchical structure is not merely a formal
improvement; it enables more nuanced evaluation metrics. For instance, we can now measure hierarchical
precision, rewarding a model for predicting a correct superclass (e.g. AcademicInstitution) even if
the specific subclass (University) is missed.</p>
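<p>The hierarchy-aware scoring idea described above can be sketched as follows. This is a minimal illustration only: the subclass table is a tiny excerpt, and the 0.5 partial-credit value is an assumption, not the benchmark's official weighting.</p>

```python
# Minimal sketch of hierarchy-aware scoring: a predicted class earns partial
# credit when it is a superclass of the gold class. The credit value (0.5)
# and the subclass table below are illustrative assumptions.

SUBCLASS_OF = {
    "University": "AcademicInstitution",
    "AcademicInstitution": "Organization",
}

def superclasses(cls):
    """Return the chain of superclasses of `cls`, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain

def hierarchical_match(predicted, gold):
    """1.0 for an exact match, 0.5 when `predicted` is a superclass of `gold`, else 0.0."""
    if predicted == gold:
        return 1.0
    if predicted in superclasses(gold):
        return 0.5
    return 0.0
```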
<p>Further, properties were rigorously typed as either ObjectProperty (linking two entities) or
DatatypeProperty (linking an entity to a literal value), with explicit domains and ranges defined for
each. Datatype ranges were specified using standard XML Schema types (e.g. xsd:string, xsd:date,
or xsd:integer), enforcing data consistency and aiding downstream processing. To improve usability,
we added rdfs:comment annotations for all properties and classes and simplified the URIs by removing
the intermediate /relations and /concepts path segments. The rdfs:comment annotations were
written by the authors in our own words to describe each class and property.</p>
        <p>Finally, to support reproducibility and tracking, the new ontology includes metadata for contributors
and is explicitly versioned as version 2.0 using owl:versionIRI. A comprehensive comparison of
these changes is presented in Table 4, in the Appendix.</p>
        <p>In the appendix, Table 2 presents an overview of the main statistics for each ontology in
Text2KGBench-LettrIA and Text2KGBench. The Text2KGBench-LettrIA dataset is significantly lighter,
with approximately 21.80% fewer classes and approximately 37.81% fewer properties. Additionally,
datatype properties are exclusively present in Text2KGBench-LettrIA.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Re-annotations Guidelines</title>
        <p>A robust benchmark requires annotation guidelines that are consistent, unambiguous, and
computationally tractable. We established a comprehensive rulebook for the re-annotation process to ensure
high-quality, reproducible data.</p>
        <p>Normalization of Literals To ensure uniformity, we normalized literal values. Dates are
standardized to the ISO 8601 format (yyyy-mm-dd). Ambiguous formats like xx/xx/xxxx are interpreted as
mm/dd/yyyy, a common default in digital systems; if the first value exceeds 12, it is interpreted as
dd/mm/yyyy. Partial dates (e.g. only a year, or only month plus year) are associated with the xsd:gYear or
xsd:gYearMonth datatypes. Durations are also standardized to the XSD notation (e.g. 20 minutes
is turned into PT20M).</p>
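<p>The normalization rules above can be sketched in a few lines of Python; the function names are illustrative, and only the cases described in the guidelines (ambiguous slash-separated dates and minute-valued durations) are handled.</p>

```python
import re

def normalize_date(raw):
    """Normalize an ambiguous xx/xx/xxxx date to ISO 8601 (yyyy-mm-dd).

    Per the guidelines: assume mm/dd/yyyy by default; if the first
    component exceeds 12 it must be the day, so read it as dd/mm/yyyy.
    """
    a, b, year = raw.split("/")
    if int(a) > 12:
        day, month = a, b      # dd/mm/yyyy
    else:
        month, day = a, b      # mm/dd/yyyy
    return f"{year}-{int(month):02d}-{int(day):02d}"

def normalize_duration(raw):
    """Turn '20 minutes' into the XSD duration notation 'PT20M'."""
    m = re.fullmatch(r"(\d+)\s*minutes?", raw.strip(), re.IGNORECASE)
    return f"PT{m.group(1)}M" if m else raw
```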
        <sec id="sec-3-2-1">
          <title>Entity and Relation Extraction</title>
<p>• Location Handling: Our guidelines for locations prioritize capturing geographical containment.</p>
          <p>When a text lists a hierarchy of locations (e.g. “Caen, Normandy, France”), we extract each as
distinct entity. We then generate isPartOf relations to model their relationship of inclusion
(e.g. Caen isPartOf Normandy, Normandy isPartOf France, and Caen isPartOf France). Even
though, we take the full string "Caen, Normandy, France" to define a location. For example,
Antoine livesAt "Caen, Normandy, France". Finally, definite articles are omitted from place
names (e.g., "the Philippines" becomes Philippines).
• Strict Adherence to Textual Evidence: Annotations are strictly confined to information
explicitly present in the source text, avoiding reliance on external world knowledge. For example,
in “Lettria was founded in Paris, France,” Paris is typed as Place. However, in “Lettria was
founded in the city of Paris, France,”, the explicit mention allows for the more specific type City.
This principle ensures that the benchmark evaluates a model’s ability to extract information from
the provided context alone. This rule ensures that the text-to-graph task can be solved relying on
the sole information in the benchmark.</p>
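<p>The geographical-containment rule can be sketched as follows: every place in a comma-separated hierarchy is made part of every place listed after it. This is a minimal sketch assuming a clean, innermost-to-outermost listing; the function name is illustrative.</p>

```python
from itertools import combinations

def containment_triples(location_string):
    """Expand 'Caen, Normandy, France' into isPartOf triples: each place
    is declared part of every place listed after it (transitive pairs included)."""
    places = [p.strip() for p in location_string.split(",")]
    return [(inner, "isPartOf", outer) for inner, outer in combinations(places, 2)]
```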
          <p>Entity Scoping
• Organization names: Corporate suffixes (“Inc.”, “Co.”) are preserved as part of the entity name
to maintain fidelity to the source text (e.g. Caterpillar Inc.).
• Pronoun Resolution: We resolve pronouns to their antecedent entity within the extracted
triple. For ambiguous pronouns like “which,” we employ a heuristic of selecting the immediately
preceding noun phrase as the antecedent. For example, in “...beef kway teow which comes from
the region of Indonesia,” the pronoun “which” is resolved to beef kway teow.
• Multiple Entities: When a single statement applies to multiple entities, we create a separate
triple for each. “Huseyin Butuner and Hilmi Guner designed...” yields two distinct designer
relations, one for each person.</p>
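<p>The multiple-entities rule above can be sketched as a one-per-entity expansion; the function name and the "Memorial" placeholder target are hypothetical, used only to illustrate the rule.</p>

```python
def expand_conjoined_subjects(entities, relation, target):
    """Apply the multiple-entities rule: a statement that applies to several
    entities yields one separate triple per entity."""
    return [(target, relation, entity) for entity in entities]
```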
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. The Resulting Benchmark: Curation and Structural Enhancement</title>
        <p>The culmination of the re-annotation process, guided by the revised ontology and the new annotation
principles, is a benchmark of significantly higher quality and consistency. The resulting dataset
comprises a total of 4,860 sentences, corresponding to 14,882 extracted triples.</p>
        <p>In addition to the primary re-annotation, the benchmark underwent a comprehensive data curation
and enhancement phase to address artifacts present in the original version and to enrich its structure
for more rigorous model evaluation. These post-processing enhancements are detailed as follows:
• Data Sanitization and Canonicalization: A systematic normalization process was applied to
entity and literal values to ensure uniformity and eliminate parsing inconsistencies. This included
several key transformations:
– Entity Name Normalization: Underscores used as word separators in entity
names were replaced with spaces to form canonical, human-readable identifiers (e.g.,
"AWH_Engineering_College" was corrected to "AWH Engineering College").
– Literal Value Cleaning: Superfluous quotation marks that erroneously encapsulated object
values in the original data were removed (e.g., {"obj": "\"Kuttikkattoor\""} was
corrected to {"obj": "Kuttikkattoor"}).
– Numeric Data Typing: String representations of numbers were parsed into their correct
numeric types (e.g., "2000" became 2000). Numerical values are stripped of punctuation;
for example, “18,527” is annotated as 18527 (distinguishing the cases in which the comma
was used as thousand or decimal separator).
– Textual Harmonization: Spelling inconsistencies and diacritical variations in names were
corrected to ensure a true reproduction of what is in the text (e.g., "Hüseyin Bütüner"
in the text is kept as it is and not turned into "Huseyin Butuner").
• Explicit Ontological Typing: To improve the formal alignment between data instances and
the ontology, each triple was enriched with new keys. The subType and objType fields now
explicitly declare the ontological class of the subject and the datatype of the object, respectively.
This structural addition is critical for enabling type-aware evaluation metrics and enforcing
semantic consistency.
• Corpus and Linguistic Refinement: The source text corpus itself was subject to a final review.</p>
        <p>Minor grammatical and punctuation errors were corrected to improve linguistic quality.</p>
        <p>The cumulative efect of these enhancements is illustrated in Figure 1 in the Appendix, which presents
a side-by-side comparison of a data entry before and after the revision process. Table 3 in the Appendix
presents a comparison between the original and new datasets. Text2KGBench-LettrIA maintains the
same number of sentences as Text2KGBench, while the number of triples varies, showing both additions
and reductions with respect to Text2KGBench.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation with LLMs</title>
<p>Our study evaluates the performance of contemporary Large Language Models (LLMs) on the
Text-to-Knowledge-Graph (Text2KG) task, which involves extracting knowledge graph triples from unstructured
text. The evaluation is conducted using the Text2KGBench-LettrIA benchmark. We assess two distinct
categories of models under different conditions.</p>
      <p>
        First, we assessed a comprehensive suite of proprietary models in a zero-shot setting, where models
perform the task without any specific fine-tuning. The evaluated models, grouped by provider, included
several from Anthropic, such as the Claude 3 family (Haiku, Sonnet, Opus) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the Claude 3.5 series
(Haiku, Sonnet V1, Sonnet V2), the Claude 3.7 Sonnet, and the Claude 4 series (Sonnet, Opus). From
Google, we evaluated the Gemini 2.0 family (Flash-Lite, Flash, Pro) and the Gemini 2.5 family (Flash-Lite,
Flash, Pro) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Our assessment also covered OpenAI’s GPT-4.1 series (Full, Mini, Nano) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and
GPT-4o series (Full, Mini) [
        <xref ref-type="bibr" rid="ref18">18</xref>
]. Finally, from Mistral AI, we included the Mistral Medium 2505 model<sup>1</sup>.
      </p>
      <p>
        In parallel, we fine-tuned and subsequently evaluated a selection of prominent open-weights models
to gauge their performance after task-specific adaptation. This set comprised Gemma 3 (4B-IT, 12B-IT,
27B-IT) [
        <xref ref-type="bibr" rid="ref19">19</xref>
]<sup>2</sup>, Mistral Small 3.2 (24B-Instruct)<sup>3</sup>, Phi-4 (14B) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and Qwen 3 in several parameter sizes
(0.6B, 1.7B, 4B, 8B, 14B, 32B) [
        <xref ref-type="bibr" rid="ref21">21</xref>
]. We selected these models because, at that time, they were the best
instruction-tuned pre-trained models on the Hugging Face leaderboard.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Fine-Tuning Methodology</title>
        <p>
          We employed a Supervised Fine-Tuning (SFT) methodology to adapt the selected Large Language
Models (LLMs) for the relation extraction task, utilizing the Unsloth<sup>4</sup> framework for efficient training.
The fine-tuning process is based on Low-Rank Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and involved providing each
model with an input prompt containing two components: (1) a natural language sentence and (2) a
compact representation of the relevant ontology. To mitigate the verbosity of the standard Turtle
syntax and ensure the input fits within the models’ context windows, we adopted a format inspired
by Manchester syntax for representing the ontology schema. The target output for the SFT process
was a JSON object containing the knowledge graph triples extracted from the sentence, mirroring the
ground-truth annotations in our dataset.
        </p>
<p>To assess model performance under different data conditions, we designed and evaluated three
distinct fine-tuning configurations:
Classic Models were fine-tuned on the complete, original training dataset. This configuration serves
as our performance baseline.</p>
        <p>Extended This configuration incorporates data augmentation. The original training set was
supplemented with synthetic data generated by the Gemini 2.5 Pro model. The objective of this
augmentation was to enrich the training data for each ontology, ensuring 500 training
examples per ontology, bringing the training set to 9500 examples in total.</p>
        <p>Generalization This configuration evaluates the models’ zero-shot generalization capabilities to
unseen ontologies using a leave-one-out strategy. Models were trained on a dataset comprising
18 of the 19 ontologies. The held-out ontology (the City ontology) was then used exclusively for
testing. The final test set for this scenario was composed of all examples (both original training
and test splits) associated with the unseen City ontology.
<sup>1</sup>Model details available at: https://mistral.ai/news/mistral-medium-3
<sup>2</sup>Model card: https://huggingface.co/google/gemma-3-12b-it
<sup>3</sup>Model card: https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
<sup>4</sup>https://unsloth.ai/</p>
<p>All the fine-tuning runs<sup>5</sup> for each model were conducted on an NVIDIA H100 GPU.</p>
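<p>As a plain-Python restatement of the reported hyper-parameters (LoRA rank 128, LoRA alpha 512, batch size 1, gradient accumulation 8, 3 epochs), the sketch below shows the effective optimizer batch size they imply. The dictionary keys follow common trainer naming conventions and are assumptions, not the exact Unsloth API.</p>

```python
# Reported fine-tuning hyper-parameters, restated as a config dict
# (key names are illustrative, not Unsloth's exact API).
LORA_CONFIG = {
    "r": 128,                            # LoRA rank
    "lora_alpha": 512,                   # LoRA scaling factor
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
}

# Effective batch size seen by the optimizer: micro-batch x accumulation steps.
effective_batch = (LORA_CONFIG["per_device_train_batch_size"]
                   * LORA_CONFIG["gradient_accumulation_steps"])
```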
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation</title>
        <p>To provide a multifaceted evaluation of our relation extraction approach, we introduce a suite of metrics
that extends beyond the traditional F1-score. Our methodology first categorizes the components of the
knowledge graph into four distinct types:
Entities (E) The classes that serve as the domain and range for object properties, or as the domain for
datatype properties.</p>
        <p>Attributes (A) The literal values that constitute the range of datatype properties.
Properties (P) The datatype properties that link entities to attributes.</p>
        <p>Relations (R) The object properties that link entities to other entities.</p>
        <p>Based on this categorization, we assess model performance across six key dimensions:
• F1-Score: The macro-averaged F1-score for the correct identification and classification of each
extracted entity, attribute, property, and relation.
• Ontological Fidelity: A measure to quantify hallucinations, defined as the generation of types,
properties, or relations that are not present in the reference ontology.
• Domain/Range Adherence: Assesses whether the model’s outputs respect the domain and
range constraints defined in the ontology for all properties (datatype properties) and relations
(object properties). This metric accounts for subclass hierarchies; for instance, if an ontology
specifies a domain of Place and the model predicts City, the prediction is considered valid
provided City is a subclass of Place.
• Structural Validity: Measures whether the generated output conforms to the required JSON
schema, ensuring it is well-formed and parseable.
• Latency: The average inference time in seconds required to generate a response, calculated
across all examples in the test set.
• Cost: The average monetary cost per query. For proprietary models, this is the API cost. For
open-weights models, we estimate the cost based on the hourly price of the required hardware
from a cloud provider (e.g., an OVH Cloud instance at 2.80 €/hour).</p>
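<p>The per-category F1 computation can be sketched as follows; this is a simplified, set-based illustration (treating each category's extracted items as a set per sentence), not necessarily the exact matching procedure used in our evaluation harness.</p>

```python
def f1(predicted, gold):
    """Set-based F1 between predicted and gold items of one category
    (entities, attributes, properties, or relations) for one sentence."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_sentence_pairs):
    """Macro-average the per-sentence F1 scores for one category."""
    scores = [f1(p, g) for p, g in per_sentence_pairs]
    return sum(scores) / len(scores)
```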
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Performance and Insights</title>
        <p>Performance was evaluated using three distinct fine-tuning configurations. The first two configurations
were tested on our "full benchmark," a revised and comprehensive version of the new benchmark. The
third configuration was subsequently tested on a single ontology in a "generalization" scenario. All
experiments involving closed models utilized the most recent, optimized prompt from our internal
text-to-graph production framework.</p>
        <p>[Table caption: per-model scores for Entities, Attributes, Properties, and Relations. The first part lists
closed-source models using a 1-shot prompting strategy; the second part presents results for open-weights models
after two fine-tuning variants: "Classic" (unmarked) and "Extended" (marked with "(ext.)").]</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Full Benchmark</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>Performance</title>
          <p>The most striking finding is the significant performance gap between the two groups.
Fine-tuned models operate in a different league, with most achieving an Entity F1 score exceeding
0.80. This underscores the immense power of specialization. The top performer, Mistral-Small-3.2
(ext.), achieved an outstanding Entity F1 of 0.8837, with other models from the Qwen3 and gemma-3
families clustering in the impressive 0.85–0.87 range. In contrast, the proprietary models, which
test general-purpose reasoning without task-specific training, top out with an Entity F1 below 0.70.
Within this group, a clear performance hierarchy emerges. gemini-2.5-pro stands out as the best
all-rounder, with consistently high F1 scores across all categories (E=0.6595, A=0.8762, P=0.8627, R=0.7076).
Other models act as high-performing specialists: claude-sonnet-4 excels at understanding
complex connections with the highest Relations score (R=0.7126), while gpt-4.1-mini-2025-04-14
is best at identifying discrete items (E=0.6866). Meanwhile, models like gemini-2.0-flash and
claude-3-haiku struggle with the task’s complexity, proving unsuitable for this type of detailed
extraction.</p>
          <p>Linear
5Fine-Tuning Hyper-Parameters: Lora Rank: 128 Lora Alpha: 512 Batch Size: 1 Gradient Accumulation: 8 Epochs: 3
Safety and Reliability Beyond raw performance, fine-tuning proves to be a profound method for
ensuring safety and reliability. Nearly all fine-tuned models achieved over 99% validly formatted outputs—
with several reaching a perfect 100%—demonstrating that specialization is an exceptionally effective
way to guarantee adherence to a specific output format. Furthermore, we observed an "extended effect"
in fine-tuned variants: these models often trade a slight dip in Entity F1 for improved scores in other
categories and, crucially, lower hallucination rates and better adherence to the ontology. This suggests
the extended fine-tuning process prioritizes overall robustness and safety. Among the proprietary models, the
top performers also demonstrate strong reliability. gemini-2.5-pro and claude-opus-4 lead in
producing validly formatted outputs (99.80% and 99.20%, respectively) and show superior adherence to
the ontology. However, safety is not a given in this category. While models like claude-3.7-sonnet
and gemini-2.5-pro boast extremely low hallucination scores, gpt-4.1-nano exhibits a catastrophic
failure with a hallucination precision of just 0.4698, making it a high risk for generating false information.
Efficiency The efficiency profiles of the two groups present starkly different trade-offs. For the
API-based proprietary models, the balance is between performance, latency, and cost-per-call. The
gemini-flash models are the fastest, with response times around 2 seconds, while the
powerful claude-opus-4 is the slowest at a substantial 37.4 seconds. A similar trade-off exists in cost:
gemini-2.0-flash-lite (0.0002¢) is one of the cheapest, whereas claude-opus-4 (0.1682¢) is by
far the most expensive, illustrating the classic balance between capability and operational cost. This
dynamic shifts entirely with the fine-tuned models, which run on dedicated local hardware. Latencies
are astonishingly low, with all models completing the task in under 0.02 seconds—orders of magnitude
faster than API calls. The trade-of here is the high, amortized cost of the fine-tuning process and
hosting the model on powerful GPU infrastructure. This cost scales directly with model size, making
larger models like gemma-3-27b and Qwen3-32B the most expensive to operate.</p>
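          <p>These cost figures imply a simple break-even calculation between pay-per-call APIs and self-hosted fine-tuned models. The sketch below uses the per-call prices reported above; the monthly GPU hosting cost is a hypothetical placeholder, not a measured figure from our experiments.</p>
          <preformat>
```python
import math

# Back-of-the-envelope break-even between API calls and self-hosted inference.
# Per-call API prices (in cents) come from the efficiency results above;
# the monthly hosting cost is a HYPOTHETICAL placeholder.
API_COST_CENTS_PER_CALL = {
    "gemini-2.0-flash-lite": 0.0002,
    "claude-opus-4": 0.1682,
}
GPU_HOSTING_CENTS_PER_MONTH = 150_000  # hypothetical: roughly 1,500 USD/month

def break_even_calls(api_cost_cents: float, hosting_cents: float) -> int:
    """Monthly call volume above which self-hosting is cheaper than the API."""
    return math.ceil(hosting_cents / api_cost_cents)

for model, cost in API_COST_CENTS_PER_CALL.items():
    print(model, break_even_calls(cost, GPU_HOSTING_CENTS_PER_MONTH))
```
          </preformat>
          <p>Under this hypothetical hosting cost, self-hosting pays off far sooner against expensive models such as claude-opus-4 than against cheap flash-tier APIs, matching the high-volume production argument made above.</p>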
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.2. Generalization Benchmark</title>
          <p>The Generalization Benchmark results are displayed in Table C in the Appendix.</p>
          <p>Robust Generalization to Unseen Ontologies The fine-tuned models demonstrate a remarkable
capacity for generalization, adeptly applying their learned skills to novel ontologies with only a minimal
drop in performance. A direct comparison reveals that the top-performing models maintain their
elite status even on unfamiliar schemas. For instance, gemma-3-12b-it achieves an outstanding
Entity F1 of 0.8376 on the generalization set, a marginal decrease from its 0.8606 score on the full
benchmark. Crucially, this level of performance significantly surpasses that of the best closed-source
models on the same generalization task, with gemma-3-12b-it outperforming the top proprietary
model, claude-sonnet-4 (0.7829), by a substantial margin. This robustness extends beyond raw F1
scores to safety and reliability; the fine-tuned models maintain their near-zero hallucination rates and
high adherence to ontological constraints (e.g., gemma-3-27b-it scores 0.9325 for relations respect),
with valid output rates remaining at or near 100%. This indicates that the fine-tuning process instills a
deep, transferable understanding of the text-to-graph task structure, creating models that are not only
specialized but also highly adaptable and reliable when faced with new, unseen challenges.</p>
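          <p>For reference, the Entity F1 scores compared here follow the standard set-based precision/recall formulation. The following is a minimal illustrative sketch, not the benchmark's actual scorer, which may additionally normalize entity labels:</p>
          <preformat>
```python
# Minimal set-based F1 for extracted entities, as commonly used in
# text-to-graph evaluation. Illustrative only.
def entity_f1(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over exact-match entity sets."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted.intersection(gold))
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"Berlin", "Germany", "Spree"}
pred = {"Berlin", "Germany", "Europe"}
print(round(entity_f1(pred, gold), 4))  # 0.6667
```
          </preformat>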
        </sec>
        <sec id="sec-4-3-4">
          <title>4.3.3. Lessons Learned</title>
          <p>This comprehensive benchmark reveals a clear and instructive dichotomy between specialized,
fine-tuned models and general-purpose, proprietary models, offering several key lessons for practitioners.</p>
          <p>First, specialization is paramount for peak performance and reliability. The fine-tuned
open-weights models operate in a separate, higher-performance tier, unambiguously demonstrating that
for complex, structured tasks like text-to-graph conversion, task-specific training is the most effective
strategy. This superiority is not confined to accuracy metrics like F1 scores; it extends crucially to
output reliability, where fine-tuned models achieve near-perfect adherence to formatting and ontological
constraints, effectively eliminating structural errors and minimizing hallucinations.</p>
          <p>Second, effective fine-tuning teaches generalization, not just memorization. A critical finding
is that fine-tuned models maintain their performance advantage even when confronted with entirely
unseen ontologies. Their ability to robustly generalize the underlying task structure surpasses even the
most advanced proprietary models on the same out-of-domain test set. This proves that the fine-tuning
process instills a deep, transferable understanding of the task’s logic, making it a viable strategy for
building adaptable and scalable systems.</p>
          <p>Finally, the choice between the two approaches hinges on a fundamental trade-off between
accessibility and efficiency. Proprietary models offer an invaluable, zero-setup solution for rapid
prototyping and tasks where the overhead of fine-tuning is prohibitive. Within this group, a clear
hierarchy exists, with models like gemini-2.5-pro and the claude-4 family providing a strong
baseline of general reasoning. However, this convenience comes at the cost of higher latency and
a pay-per-call model. In contrast, fine-tuned models represent a strategic investment. While they
require significant upfront and ongoing infrastructure costs for training and hosting, they deliver
inference speeds that are orders of magnitude faster and are economically superior for high-volume,
production-level applications, all while providing unparalleled performance and safety.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we introduced Text2KGBench-LettrIA, a rigorously revised benchmark for evaluating
ontology-guided Text-to-Knowledge-Graph systems. By systematically overhauling the
DBpedia-WebNLG portion of Text2KGBench, we addressed critical limitations in its ontological design, annotation
quality, and structural consistency. The resulting benchmark features 19 refined ontologies with enforced
hierarchical relationships and strict typing, alongside over 14,000 high-fidelity triples re-annotated
under stringent guidelines to ensure textual grounding and reproducibility. This work provides the
community with a resource that enables a more precise and nuanced evaluation of model capabilities in
structured knowledge extraction.</p>
      <p>Our experiments yield a significant finding: smaller, open-weights language models, when properly
fine-tuned on our high-quality benchmark, can outperform larger, proprietary models in terms of
F1-score for triple extraction. This result underscores the pivotal role that task-specific data quality
and model adaptation play in achieving state-of-the-art performance. Nevertheless, our analysis also
highlights a persistent challenge: even high-performing models exhibit a tendency to hallucinate or
deviate from ontological constraints, indicating that high accuracy on individual components does not
guarantee perfect schema adherence.</p>
      <p>Building on this work, we identify several key directions for future research.</p>
      <p>• Post-Hoc Alignment: The prevalence of schema violations and hallucinations, even after
supervised fine-tuning (SFT), suggests the need for a subsequent alignment phase. Investigating
reinforcement learning-based techniques such as Proximal Policy Optimization (PPO) or Direct
Preference Optimization (DPO) could further refine model outputs to improve ontological fidelity.
• Explainability and Reasoning: Future work could focus on developing a reasoning layer
atop the extraction models. Such a component would not only extract triples but also generate
explanations for its predictions, thereby increasing the transparency and trustworthiness of the
KG construction process.
• Context Window Extension: A current limitation of many open-weights models is their
relatively small context window compared to proprietary counterparts. Future experiments
should explore methods to extend the effective context size of fine-tuned models, enabling them
to process larger and more complex documents and ontologies.
• Ontology: The current ontologies contain only binary relations and therefore cannot describe
complex entities such as events. An improvement would be to introduce n-ary relations through
reification, yielding more realistic ontologies, and to test whether LLMs, even fine-tuned ones,
can properly handle such complex ontologies.</p>
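      <p>To illustrate the reification idea from the last point, an n-ary event can be decomposed into binary triples around an intermediate event node. The example below is a hypothetical sketch, not part of the benchmark:</p>
      <preformat>
```python
# Hypothetical reification of an n-ary "acquisition" event into binary
# triples around an intermediate event node, as envisaged for future
# ontology versions.
def reify_event(event_id: str, roles: dict) -> list:
    """Turn a {role: value} mapping into (event_id, role, value) binary triples."""
    return [(event_id, role, value) for role, value in roles.items()]

triples = reify_event("event_1", {
    "rdf:type": "Acquisition",
    "buyer": "CompanyA",
    "acquired": "CompanyB",
    "date": "2024-01-15",
})
print(len(triples))  # 4
```
      </preformat>
      <p>The single four-place event becomes four binary triples sharing the intermediate node, which is exactly the structure a binary-relation ontology cannot express directly.</p>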
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and LeChat by MistralAI in order to:
Grammar and spelling check; Paraphrase and reword. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the French Public Investment Bank (Bpifrance) i-Demo program within
the LettRAGraph project (Grant ID DOS0256163/00).</p>
    </sec>
    <sec id="sec-8">
      <title>A. Dataset Statistics</title>
      <p>[Table: per-ontology dataset statistics]</p>
    </sec>
    <sec id="sec-9">
      <title>B. Ontology and Annotation Comparison</title>
      <p>[Table: per-model scores]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] N. Mihindukulasooriya, S. Tiwari, C. F. Enguix, K. Lata, Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text, in: The Semantic Web - ISWC 2023: 22nd International Semantic Web Conference, Proceedings, Part II, Springer-Verlag, Berlin, Heidelberg, 2023, pp. 247-265. doi:10.1007/978-3-031-47243-5_14.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] O. Agarwal, H. Ge, S. Shakeri, R. Al-Rfou, Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 3554-3565. doi:10.18653/v1/2021.naacl-main.278.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] C. Gardent, A. Shimorina, S. Narayan, L. Perez-Beltrachini, Creating Training Corpora for NLG Micro-Planners, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 179-188. doi:10.18653/v1/P17-1017.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] A. G. Regino, A. Rossanez, R. da Silva Torres, J. C. dos Reis, A Systematic Literature Review on RDF Triple Generation from Natural Language Text, Semantic Web (2025).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] S. Tiwari, N. Mihindukulasooriya, F. Osborne, D. Kontokostas, J. D'Souza, M. Kejriwal, et al., Preface for the Third International Workshop on Knowledge Graph Generation from Text, in: Third International Workshop on Knowledge Graph Generation from Text. Data Quality meets Machine Learning and Knowledge Graphs, volume 3747, CEUR-WS, 2024, pp. 1-4.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] J. Kalo, T. Nguyen, S. Razniewski, B. Zhang, Preface: LM-KBC Challenge 2024, in: Joint Proceedings of the KBC-LM Workshop and the LM-KBC Challenge 2024, CEUR-WS.org, 2024, pp. 1-5.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] P.-L. Huguet Cabot, R. Navigli, REBEL: Relation Extraction By End-to-end Language generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021. URL: https://github.com/Babelscape/rebel/blob/main/docs/EMNLP_2021_REBEL__Camera_Ready_.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation Classification via Convolutional Deep Neural Network, in: J. Tsujii, J. Hajic (Eds.), Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 2014, pp. 2335-2344. URL: https://aclanthology.org/C14-1220/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Y. Liu, T. Zhang, Z. Liang, H. Ji, D. L. McGuinness, Seq2RDF: An end-to-end application for deriving triples from natural language text, in: Proceedings of the ISWC 2018 Posters &amp; Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International Semantic Web Conference (ISWC 2018), 2018.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] J. Wang, W. Lu, Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 1706-1721. URL: https://aclanthology.org/2020.emnlp-main.133/. doi:10.18653/v1/2020.emnlp-main.133.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] J. Fei, W. Zeng, X. Zhao, X. Li, W. Xiao, Few-Shot Relational Triple Extraction with Perspective Transfer Network, in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management, CIKM '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 488-498. doi:10.1145/3511808.3557323.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] L. He, H. Zhang, J. Liu, K. Sun, Q. Zhang, Zero-Shot Relation Triplet Extraction via Knowledge-Driven LLM Synthetic Data Generation, in: D.-S. Huang, Z. Si, C. Zhang (Eds.), Advanced Intelligent Computing Technology and Applications, Springer Nature, Singapore, 2024, pp. 329-340.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Z. Ding, W. Huang, J. Liang, Y. Xiao, D. Yang, Improving Recall of Large Language Models: A Model Collaboration Approach for Relational Triple Extraction, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Turin, Italy, 2024, pp. 8890-8901. URL: https://aclanthology.org/2024.lrec-main.778/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] A. Ananya, S. Tiwari, N. Mihindukulasooriya, T. Soru, Z. Xu, D. Moussallem, Towards Harnessing Large Language Models as Autonomous Agents for Semantic Triple Extraction from Unstructured Text, in: TEXT2KG 2024: Third International Workshop on Knowledge Graph Generation from Text, Hersonissos, Greece, 2024.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] Anthropic, The Claude 3 Model Family: Opus, Sonnet, Haiku, 2024. URL: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] G. Comanici, et al., Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, 2025. URL: https://arxiv.org/abs/2507.06261. arXiv:2507.06261.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] OpenAI, et al., GPT-4 Technical Report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] OpenAI, et al., GPT-4o System Card, 2024. URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] Gemma Team, et al., Gemma 3 Technical Report, 2025. URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hewett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Javaheripi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C. T.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>de Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saarikivi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <source>Phi-4 Technical Report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2412.08905. arXiv:2412.08905.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <source>Qwen3 Technical Report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.09388. arXiv:2505.09388.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-Rank Adaptation of Large Language Models</article-title>
          , in: International Conference on Learning Representations,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] Evaluated model identifiers: mistral-medium-2505, claude-sonnet-4, claude-3-opus, claude-3.5-sonnet-v2, claude-3-sonnet, claude-3.7-sonnet, gpt-4.1-mini-2025-04-14, gemini-2.5-pro, gpt-4.1-2025-04-14, gemini-2.5-flash-lite, claude-3.5-sonnet-v1, claude-3.5-haiku, gpt-4o-2024-11-20, gemini-2.0-flash-lite, claude-3-haiku, gpt-4o-mini-2024-07-18, gpt-4.1-nano-2025-04-14, gemini-2.5-flash, gemini-2.0-flash.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>