Fine-Tuning vs. Prompting: Evaluating the Knowledge Graph Construction with LLMs

Hussam Ghanem hussam.ghanem@u-bourgogne.fr ICB UMR 6306 CNRS Université de Bourgogne

21000 Dijon France

DAVI The Humanizers

Puteaux France

Christophe Cruz christophe.cruz@u-bourgogne.fr ICB UMR 6306 CNRS Université de Bourgogne

21000 Dijon France

ISSN 1613-0073

Keywords: Text-to-Knowledge Graph, Large Language Models, Zero-Shot Prompting, Few-Shot Prompting, Fine-Tuning

This paper explores Text-to-Knowledge Graph (T2KG) construction, assessing Zero-Shot Prompting (ZSP), Few-Shot Prompting (FSP), and Fine-Tuning (FT) methods with Large Language Models (LLMs). Through comprehensive experimentation with Llama2, Mistral, and Starling, we highlight the strengths of FT, emphasize the role of dataset size, and introduce nuanced evaluation metrics. Promising perspectives include synonym-aware metric refinement and data augmentation with LLMs. The study contributes valuable insights to KG construction methodologies, setting the stage for further advancements.

Introduction

The term "knowledge graph" has been around since 1972, but its current definition can be traced back to Google in 2012. This was followed by similar announcements from companies such as Airbnb, Amazon, eBay, Facebook, IBM, LinkedIn, Microsoft, and Uber, among others, leading to increased adoption of knowledge graphs (KGs) across various industries. As a result, academic research in this field has surged in recent years, with an increasing number of scientific publications on KGs [1]. These graphs use a graph-based data model to effectively manage, integrate, and extract valuable insights from large and diverse datasets [2].

KGs serve as repositories for structured knowledge, organized into a collection of triples, denoted as 𝐾𝐺 = {(ℎ, 𝑟, 𝑡)} ⊆ 𝐸 × 𝑅 × 𝐸, where 𝐸 represents the set of entities and 𝑅 represents the set of relations [1]. Within a graph, nodes represent various levels, entities, or concepts. These nodes encompass diverse types, such as person, book, or city, and are interconnected by relationships such as located in, lives in, or works with. The essence of a KG emerges when it incorporates multiple types of relationships rather than being confined to a single type. The overarching structure of a KG constitutes a network of entities, featuring their semantic types, properties, and interconnections. Thus, constructing a KG requires information about entities (along with their types and properties) and the semantic relationships that bind them. To extract entities and relationships, practitioners often turn to NLP tasks such as Named Entity Recognition (NER), Coreference Resolution (CR), and Relation Extraction (RE).
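The triple-based definition above can be made concrete with a minimal sketch. The entity and relation names below are illustrative only and do not come from the datasets used in this study.

```python
# Toy triples (h, r, t); all names are illustrative assumptions.
triples = {
    ("Alice", "lives in", "Dijon"),
    ("Alice", "works with", "Bob"),
    ("Dijon", "located in", "France"),
}

# E: every head or tail that occurs; R: every relation label that occurs.
entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
relations = {r for _, r, _ in triples}

# Adjacency view of the same graph: head -> list of (relation, tail).
adjacency = {}
for h, r, t in sorted(triples):
    adjacency.setdefault(h, []).append((r, t))

# By construction, every triple is an element of E x R x E.
assert all(h in entities and r in relations and t in entities
           for h, r, t in triples)
```

The graph here uses three distinct relation types, which is what distinguishes a KG from a plain single-relation network.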

KGs play a crucial role in organizing complex information across diverse applications, such as question answering, recommendation, and semantic search. However, constructing them remains an ongoing challenge, particularly because the primary sources of knowledge are embedded in unstructured textual data such as press articles, emails, and scientific journals. This challenge can be addressed by adopting an information extraction approach, sometimes implemented as a pipeline. It involves taking textual inputs, processing them using Natural Language Processing (NLP) techniques, and leveraging the acquired knowledge to construct or enhance the KG.

If we envision the Text-to-Knowledge Graph (T2KG) construction task as a black box, the input is textual data and the output is a knowledge graph. This can be approached through methods that directly convert text into a graph or by implementing NLP tasks in two ways: 1) through an information extraction pipeline that runs the tasks mentioned above independently, or 2) through an end-to-end approach, also known as joint prediction, using Large Language Models (LLMs), for example. LLMs and KGs can enhance each other: LLMs can assist in the construction of KGs, and conversely, KGs can be used to validate the outputs of LLMs or to provide explanations for them [3]. LLMs can be adapted to the KG construction task (T2KG) through various approaches, such as fine-tuning (FT) [4], zero-shot prompting (ZSP) [5], or few-shot prompting (FSP) [6] with a limited number of examples. Each of these approaches has its pros and cons with respect to performance, computational resources, training time, domain adaptation, and the training data required.

In-context learning, as discussed by [7], coupled with prompt design, involves instructing a model to execute a new task by presenting it with only a few demonstrations of input-output pairs at inference time. Instruction fine-tuning methods, exemplified by InstructGPT [8] and Reinforcement Learning from Human Feedback (RLHF) [9], markedly enhance the model's ability to comprehend and follow a diverse range of written instructions. Numerous LLMs have been introduced in the last year, as highlighted by [3], particularly ChatGPT-like [10] models, including GPT-3 [11], LLaMA [12], BLOOM [13], PaLM [14], Mistral [15], Starling [16], and Zephyr [17]. These models can be readily repurposed for KG construction from text by employing a prompt design that incorporates instructions and contextual information.

This study does not entail a comparison with traditional methods of constructing KGs; rather, it delves into the developments and challenges associated with KG construction methodologies and aims to provide a formal evaluation of the T2KG task. Specifically, we focus on the use of LLMs and explore the three approaches mentioned above: zero-shot prompting, few-shot prompting, and fine-tuning (Fig. 1). Each of these approaches addresses specific challenges, contributing significantly to the evolution of T2KG construction techniques.

The present study is organized as follows. Section 2 presents a comprehensive overview of the current state-of-the-art approaches for Text-to-KG (T2KG) construction. In Section 3, we present the general architecture of our proposed implementation (method), with datasets, metrics, and experiments. Section 4 then encapsulates the findings and discussions, presenting the culmination of results. Finally, Section 5 critically examines the strengths and limitations of these techniques.

Background

This section discusses the current state of research on knowledge graph construction using LLMs. Three main approaches are identified: zero-shot, few-shot, and fine-tuning. Each approach has its own challenges, such as maintaining accuracy without specific training data or ensuring the robustness of models in diverse real-world scenarios. The evaluation metrics used to assess the quality of constructed KGs, including semantic consistency and linguistic coherence, are also discussed.

Figure 1 illustrates the black-box joint prediction of the T2KG construction process using LLMs. It shows how two French examples on the left are converted into an expected result (KG) on the right using the ZSP, FSP, or FT approaches with LLMs.

Zero Shot

Zero-shot methods enable KG construction without task-specific training data, leveraging the inherent capabilities of large language models. [18] introduce an approach that uses LLMs for knowledge graph construction, employing iterative zero-shot prompting for scalable and flexible KG construction. [19] evaluate the performance of LLMs, specifically GPT-4 and ChatGPT, in KG construction and reasoning tasks, introducing the Virtual Knowledge Extraction task and the VINE dataset, but they do not consider open-source LLMs such as LLaMA [12]. [20] assess ChatGPT's abilities in information extraction tasks, identifying overconfidence as an issue and releasing annotated datasets. [21] tackle zero-shot information extraction using ChatGPT, achieving impressive results in entity-relation triple extraction. [22] propose a method for Knowledge Graph Construction (KGC) using an analogy-based approach, demonstrating superior performance on Wikidata. [23] address the limitations of existing generative knowledge graph construction methods by leveraging large generative language models trained on structured data. Most of these approaches share the same limitation: they rely on large, closed LLMs such as ChatGPT or GPT-4 for this task. Challenges in this area include maintaining accuracy without specific training data and addressing nuanced relationships between entities in untrained domains.

Few Shot

Few-shot methods focus on constructing KGs with limited training examples, aiming to achieve accurate knowledge representation with minimal data. [6] introduce PiVe, a framework enhancing the graph-based generative capabilities of LLMs, in which a verifier module is responsible for checking the LLM's results over multiple iterations. [24] explore the potential of LLMs for knowledge graph completion, treating triples as text sequences and using LLM responses for predictions. [25] automate the process of generating structured knowledge graphs from natural language text using foundation models. [26] present OpenBG, an open business knowledge graph derived from Alibaba Group, containing 2.6 billion triples with over 88 million entities. [27] explore the integration of LLMs with semantic technologies for reasoning and inference. [28] investigate the application of LLMs to relation labeling for e-commerce Knowledge Graphs (KGs). Like ZSP approaches, FSP approaches rely on large, closed LLMs such as ChatGPT or GPT-4 [10] for this task. Challenges in this area include achieving high accuracy with minimal training data and ensuring the robustness of models in diverse real-world scenarios.

Fine-Tuning

Fine-tuning methods involve adapting pre-trained language models to specific knowledge domains, enhancing their capabilities for constructing KGs tailored to particular contexts. [4] present a case study automating KG construction for compliance using BERT-based models. This study emphasizes the importance of machine learning models in interpreting rules for compliance automation. [29] propose an approach for knowledge extraction and analysis from biomedical clinical notes, using the BERT model and a Conditional Random Field layer, showcasing the effectiveness of BERT models for building structured biomedical knowledge graphs. [30] propose Knowledge Graph-Enhanced Large Language Models (KGLLMs), enhancing LLMs with KGs for improved factual reasoning capabilities. These FT-based approaches do not use recent generations of LLMs, in particular decoder-only LLMs such as Llama and Mistral. Challenges in this domain include ensuring the scalability, interpretability, and robustness of fine-tuned models across diverse knowledge domains.

Evaluation metrics

As we employ LLMs to construct KGs, and given that LLMs function as Natural Language Generation (NLG) models, it becomes imperative to discuss NLG criteria. In NLG, two criteria [31] are used to assess the quality of the produced answers (triples in our context).

The first criterion is semantic consistency, or semantic fidelity, which quantifies the fidelity of the produced data with respect to the input data. The most common indicators are:

• Hallucination: The presence of information (facts) in the generated text that is absent from the input data. In our scenario, hallucination exists if the generated triples (GT) contain triples not present in the ground-truth triples (ET), that is, a triple T is in GT but not in ET.

• Omission: The absence from the generated text of one of the pieces of information (facts) in the input. In our case, omission occurs if a triple is present in ET but not in GT.

• Redundancy: The repetition of information in the generated text. In our case, redundancy exists if a triple appears more than once in GT.

• Accuracy: A lack of accuracy is manifested by the modification of information, such as the inversion of the subject and the direct object in the generated text. Accuracy increases when there is an exact match between ET and GT.

• Ordering: The sequence of information differs from that of the input data. In our case, the ordering of GT is not considered.
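As an illustration of these indicators at the triple level, the sketch below compares GT with ET using exact string matching. It is a simplified, set-based reading of the definitions above, not the graph-based procedure used later in the evaluation; the function name and the toy triples are assumptions introduced for the example.

```python
from collections import Counter


def semantic_indicators(generated, expected):
    """Compare generated triples (GT) with ground-truth triples (ET).

    Hypothetical helper: exact string matching, triple order ignored.
    """
    gt = [tuple(t) for t in generated]
    et = {tuple(t) for t in expected}
    gt_set = set(gt)
    return {
        "hallucinated": gt_set - et,   # in GT but not in ET
        "omitted": et - gt_set,        # in ET but not in GT
        "redundant": {t for t, n in Counter(gt).items() if n > 1},
        "exact_match": gt_set == et,
    }


# Toy data (illustrative only).
ET = [["Dijon", "located in", "France"], ["Alice", "lives in", "Dijon"]]
GT = [
    ["Dijon", "located in", "France"],
    ["Dijon", "located in", "France"],   # repeated triple
    ["Dijon", "lives in", "Alice"],      # subject and object inverted
]
report = semantic_indicators(GT, ET)
```

Under exact matching, the inverted triple counts both as a hallucination (it is not in ET) and as the cause of an omission (the correct triple is missing from GT).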

The second criterion is linguistic coherence, or output fluency, which evaluates the fluidity of the generated text and its linguistic constructions: the segmentation of the text into sentences, the use of anaphoric pronouns to reference entities, and the linguistic correctness of the sentences. However, we do not take this second criterion into account in our evaluation.

In their experiments, [3] calculated three hallucination metrics (subject hallucination, relation hallucination, and object hallucination) using certain preprocessing steps such as stemming. They used the ground-truth ontology alongside the ground-truth test sentence to determine whether an entity or relation is present in the text. However, a limitation could arise when there is a disparity in entities or relations between the ground-truth ontology and the ground-truth test sentence. If the generated triples contain entities or relations not present in the ground-truth text, even if they exist in the ground-truth ontology, this is considered a hallucination.

The authors of [6] evaluate their experiments using several evaluation metrics, including Triple Match F1 (T-F1), Graph Match F1 (G-F1), G-BERTScore (G-BS) from [32], which extends BERTScore [33] to graph matching, and Graph Edit Distance (GED) from [34]. The GED metric measures the distance between the predicted graph and the ground-truth graph, which is equivalent to computing the number of edit operations (addition, deletion, or replacement of nodes and edges) needed to transform the predicted graph into a graph identical to the ground-truth graph. However, it does not provide the specific path of these operations, so the exact number of operations of each kind cannot be derived from it. To adhere to the semantic consistency criterion, we use the terms "omission" and "hallucination" in place of "addition" and "deletion," respectively.

Propositions

This section describes our approach to evaluating the quality of generated KGs. We explain how we use evaluation metrics such as T-F1, G-F1, G-BS, GED, Bleu-F1 [35], and ROUGE-F1 [36] to assess the quality of the generated KGs in comparison to ground-truth KGs. Additionally, we discuss the use of the Optimal Edit Paths (OEP) metric to determine the precise number of operations required to transform the predicted graph into an identical representation of the ground-truth graph. This metric serves as a basis for calculating omissions and hallucinations in the generated graphs. We employ examples from the WebNLG+2020 dataset [37] for testing with the ZSP and FSP techniques, and we use the training dataset of WebNLG+2020 to train LLMs using the FT technique. The following subsections discuss each phase in detail.

Overall experimentation's process

We leverage the WebNLG+2020 dataset, specifically the version curated by [6]. Their preparation of graphs as lists of triples proves beneficial for evaluation purposes. We use these lists and employ NetworkX [38] to transform them back into graphs, facilitating evaluations on the resulting graphs. This step is instrumental in applying ZSP, FSP, and FT with LLMs on this dataset.

Figure 2 illustrates the different stages of our experimentation process, including data preparation, model selection, training, validation, and evaluation. The process begins with data preparation, where the WebNLG+2020 dataset is preprocessed and split into training, validation, and test sets. Next, the learning type is selected, and different models are trained using the training set. The trained models are then assessed on the validation set. Finally, the best-performing model is selected and evaluated on the test set to estimate its generalization ability.

Prompting learning

During this phase, we employ the ZSP and FSP techniques on LLMs to evaluate their proficiency in extracting triples (i.e., constructing the KG). Applying these techniques involves merging examples from the test dataset of WebNLG+2020 with our adapted prompt. Our prompt is modified to provide contextual guidance to the LLMs, facilitating the effective extraction of triples, without the inclusion of a supporting ontology description of the kind demonstrated in [3]. The specific prompts used for ZSP and FSP are illustrated in Fig. 3(a) and Fig. 3(b), respectively.

For ZSP, we began with the methodology outlined in [6], initiating our prompt with the directive "Transform the text into a semantic graph." However, we enhanced this prompt with additional sentences.

For FSP, we executed 7-shot learning. The rationale behind 7-shot learning is that the maximum KG size in WebNLG+2020 is 7 triples. Consequently, we fed our prompt with 7 examples of varying sizes: example 1 with size 1, example 2 with size 2, example 3 with size 3, and so forth. Figure 3(b) depicts a prompt containing two examples.
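The way a zero-shot or 7-shot prompt is assembled can be sketched as follows. Only the instruction sentence comes from the text above; the "Text:" and "Semantic graph:" field labels, the function names, and the placeholder shots are assumptions made for illustration, and the additional guidance sentences of the refined prompt are not reproduced because they are not given here.

```python
import json

# Directive quoted in the text; everything else in the layout is assumed.
INSTRUCTION = "Transform the text into a semantic graph."


def format_example(text, triples=None):
    """Render one block; leave the graph open when triples is None."""
    block = f"Text: {text}\nSemantic graph:"
    if triples is not None:
        block += " " + json.dumps(triples)
    return block


def build_prompt(text, shots=()):
    """Zero-shot when shots is empty, few-shot otherwise."""
    parts = [INSTRUCTION]
    parts += [format_example(shot_text, kg) for shot_text, kg in shots]
    parts.append(format_example(text))  # the query, to be completed by the LLM
    return "\n\n".join(parts)


# Seven placeholder shots whose graphs have 1, 2, ..., 7 triples.
shots = [
    (f"sentence {k}", [["s", f"r{i}", f"o{i}"] for i in range(k)])
    for k in range(1, 8)
]
zero_shot_prompt = build_prompt("input sentence")
seven_shot_prompt = build_prompt("input sentence", shots)
```

Ending the prompt on an open "Semantic graph:" field nudges the model to answer with a list of triples in the same format as the demonstrations.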

To demonstrate the efficacy of our refined prompt (including additional sentences), we conducted zero-shot experiments on ChatGPT [10], comparing the outcomes with those of [6]. Our results consistently reveal that our prompt yields more coherent answers in terms of structure.

Finetuning

If the initial results from ZSP and FSP on LLMs prove reasonable, we proceed to the FT phase. This phase aims to provide the LLMs with more specific context and knowledge related to the task of extracting triples within the domains covered by the WebNLG+2020 dataset. Following example (a) in Fig. 3, for each line of the training dataset we pass the input text together with the corresponding KG (the list of triples) in the FT prompt. For this phase, we employ QLoRA [39], a methodology that integrates quantization [40] and Low-Rank Adapters (LoRA) [41]. The LLM is loaded with 4-bit precision using bitsandbytes [42], and the training process incorporates LoRA through the PEFT (Parameter-Efficient Fine-Tuning) library [43] provided by Hugging Face.
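For readers less familiar with LoRA, the low-rank update learned by the adapters can be written as below. This is the standard formulation from the LoRA literature rather than a detail reported in this study, and the dimensions and rank used in the worked count are illustrative assumptions, since the hyperparameters are not stated here.

```latex
\[
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
\]
% W_0 is frozen (stored in 4-bit precision under QLoRA); only A and B are trained.
% Assumed example: d = k = 4096, r = 16
%   trainable parameters: r (d + k) = 16 * 8192   = 131,072
%   frozen parameters:    d k       = 4096 * 4096 = 16,777,216
%   ratio:                131,072 / 16,777,216    ~ 0.78 %
```

Under these assumed values, the adapter trains fewer than 1% of the parameters of the weight matrix it modifies, which is what makes fine-tuning 7B and 13B models tractable on modest hardware.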

Postprocessing

Given our focus on KG construction, our evaluation process involves assessing the generated KGs against ground-truth KGs. To facilitate this evaluation, we apply a cleaning process to the LLM output. This involves transforming the graphs generated by LLMs into organized lists of triples, which are subsequently written to text files.

The transformation is executed through rule-based processing. This step removes corrupted text (anything outside the lists of triples) from the full text generated by the LLMs in the preceding step. The output is then presented as a list of lists of triples, which simplifies our evaluation process. This format proves especially effective when calculating metrics such as G-F1, GED, and OEP, as we will see in more detail in Section 3.5.

A potential problem arises when instructing LLMs to produce lists of triples (KGs), as there may be instances where the generated text lacks the desired structure. In such cases, we address this issue by substituting the generated text with an empty list of triples, represented as '[["","",""]]', allowing us to effectively evaluate omissions. However, this approach tends to underestimate hallucinations compared to the actual occurrences.
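A minimal sketch of such a rule-based cleaning step is shown below. The exact rules used in the experiments are not listed in the text, so the regular expression, the validation checks, and the function name are assumptions; only the fallback value `[["","",""]]` comes from the description above.

```python
import ast
import re

EMPTY_KG = [["", "", ""]]  # placeholder used when no valid structure is found


def extract_triples(generated_text):
    """Return the list of triples embedded in an LLM answer, else EMPTY_KG."""
    # Keep only the span from the first "[[" to the last "]]".
    match = re.search(r"\[\s*\[.*\]\s*\]", generated_text, flags=re.DOTALL)
    if match is None:
        return EMPTY_KG
    try:
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        return EMPTY_KG
    if not isinstance(parsed, list) or not parsed:
        return EMPTY_KG
    triples = []
    for item in parsed:
        is_triple = (
            isinstance(item, (list, tuple))
            and len(item) == 3
            and all(isinstance(part, str) for part in item)
        )
        if not is_triple:
            return EMPTY_KG  # reject the whole answer if any item is malformed
        triples.append([part.strip() for part in item])
    return triples


noisy = 'Sure, here is the graph: [["A", "r1", "B"], ["A", "r2", "C"]] Hope it helps.'
cleaned = extract_triples(noisy)
fallback = extract_triples("I could not find any relation in this text.")
```

Rejecting the whole answer on the first malformed item is one possible design choice; a more lenient variant could keep the well-formed triples and drop only the broken ones, which would change the measured omission and hallucination rates.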

Experiment's evaluation

For assessing the quality of the generated graphs in comparison to ground-truth graphs, we adopt evaluation metrics as employed in [6]. These metrics encompass T-F1, G-F1, G-BS [32], and GED [34]. Additionally, we incorporate the Optimal Edit Paths (OEP) metric, a tool aiding in the calculation of omissions and hallucinations within the generated graphs.

Our evaluation procedure aligns with the methodology outlined in [6], particularly in the computation of GED and G-F1. This involves constructing directed graphs from lists of triples, referred to as linearized graphs, utilizing NetworkX [38].

In contrast to [3], our methodology does not rely on the ground-truth test sentence or on an ontology. As previously mentioned, we instead assess omissions and hallucinations in the generated graphs using the OEP metric. Unlike the global edit distance provided by GED, OEP gives the precise path of the edit, enabling the exact quantification of omissions and hallucinations, either in absolute terms or as a percentage across the entire test dataset.

For example, in the node path labeled 'a)' in Fig. 4, we observe 2 omissions, while the edge path in Fig. 4(a) exhibits 1 hallucination. In our evaluation, the global hallucination (respectively omission) metric over all graphs is incremented whenever at least one hallucination (respectively omission) is found in a generated graph. This approach ensures a comprehensive assessment of the presence of omissions and hallucinations across the entirety of the generated graphs.
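The counting of omissions and hallucinations from an optimal edit path can be sketched with NetworkX's `optimal_edit_paths`, as below. This is an illustration under stated assumptions, not the exact evaluation script: the `label` attribute, the helper names, the use of the first optimal path only, and the separate reporting of label substitutions (whose attribution is not specified in the text) are all choices made for this example.

```python
import networkx as nx


def triples_to_digraph(triples):
    """Directed graph whose nodes and edges carry a 'label' attribute."""
    graph = nx.DiGraph()
    for head, relation, tail in triples:
        graph.add_node(head, label=head)
        graph.add_node(tail, label=tail)
        graph.add_edge(head, tail, label=relation)
    return graph


def same_label(attrs_a, attrs_b):
    return attrs_a["label"] == attrs_b["label"]


def count_edits(predicted, gold):
    """Count edit operations needed to turn the predicted graph into the gold one."""
    g_pred, g_gold = triples_to_digraph(predicted), triples_to_digraph(gold)
    paths, cost = nx.optimal_edit_paths(
        g_pred, g_gold, node_match=same_label, edge_match=same_label
    )
    node_path, edge_path = paths[0]  # assumption: inspect the first optimal path
    counts = {"hallucinations": 0, "omissions": 0, "substitutions": 0}
    for (src, dst), view_pred, view_gold in (
        [(op, g_pred.nodes, g_gold.nodes) for op in node_path]
        + [(op, g_pred.edges, g_gold.edges) for op in edge_path]
    ):
        if dst is None:        # deleted from the prediction: hallucination
            counts["hallucinations"] += 1
        elif src is None:      # inserted to reach the gold graph: omission
            counts["omissions"] += 1
        elif view_pred[src]["label"] != view_gold[dst]["label"]:
            counts["substitutions"] += 1
    return counts, cost


gold = [["A", "r1", "B"], ["A", "r2", "C"]]
missing_one = [["A", "r1", "B"]]
extra_one = gold + [["A", "r3", "D"]]
omission_counts, omission_cost = count_edits(missing_one, gold)
hallucination_counts, hallucination_cost = count_edits(extra_one, gold)
```

In the first call one node and one edge must be inserted, and in the second one node and one edge must be deleted. Exact graph edit distance is exponential in the worst case, which remains manageable here because WebNLG+2020 graphs contain at most 7 triples.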

As mentioned earlier, the evaluation of the three methods is conducted using examples sourced from the test dataset of WebNLG+2020. The primary goal is to enhance G-F1, T-F1, G-BS, Bleu-F1, and ROUGE-F1 metrics, while reducing GED, Hallucination, and Omission.

Mathematical representation of the used metrics

We mathematically represent the used metrics as follows:

Graph Matching (G-F1). Let $Mch$ be the number of matches between predicted and gold graphs, and let $ToGraphs$ be the total number of predicted graphs. The accuracy for entire graph matches, $ACC_{graph}$, is then:

\[ ACC_{graph} = \frac{Mch}{ToGraphs} \]

Triple Matching (T-F1). The F1 score for triple matches is calculated as follows:

\[ T\text{-}F1 = \frac{2 \times TP}{2 \times TP + FP + FN} \]

where:

• TP is the number of true positive triple matches.

• FP is the number of false positive triple matches.

• FN is the number of false negative triple matches.
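The two formulas above can be sketched in a few lines of Python. The sketch assumes exact string matching of triples and treats two graphs as matching when their triple sets are identical, which is a simplification of graph matching on NetworkX graphs; the function names and toy data are introduced for illustration.

```python
def triple_f1(predicted, gold):
    """F1 over exact triple matches for one sample: 2TP / (2TP + FP + FN)."""
    pred = {tuple(t) for t in predicted}
    ref = {tuple(t) for t in gold}
    tp = len(pred & ref)  # triples present in both graphs
    fp = len(pred - ref)  # predicted triples absent from the gold graph
    fn = len(ref - pred)  # gold triples absent from the prediction
    denominator = 2 * tp + fp + fn
    # Convention assumed here: two empty graphs score 0.0.
    return 2 * tp / denominator if denominator else 0.0


def graph_accuracy(predicted_graphs, gold_graphs):
    """ACC_graph = Mch / ToGraphs, a match meaning identical triple sets."""
    matches = sum(
        {tuple(t) for t in pred} == {tuple(t) for t in ref}
        for pred, ref in zip(predicted_graphs, gold_graphs)
    )
    return matches / len(predicted_graphs)


# Toy test set of two graphs (illustrative only).
gold_graphs = [[["A", "r1", "B"], ["A", "r2", "C"]], [["D", "r3", "E"]]]
pred_graphs = [[["A", "r1", "B"], ["A", "r2", "X"]], [["D", "r3", "E"]]]

per_sample = [triple_f1(p, g) for p, g in zip(pred_graphs, gold_graphs)]
mean_t_f1 = sum(per_sample) / len(per_sample)   # averaged over samples
acc_graph = graph_accuracy(pred_graphs, gold_graphs)
```

For the first sample, TP = 1, FP = 1, and FN = 1, giving 2 / (2 + 1 + 1) = 0.5; the second sample is an exact match, so the averaged T-F1 is 0.75 and the graph accuracy is 0.5.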

Graph Edit Distance (GED). The following equation computes the GED between two given graphs:

\[ GED(g_1, g_2) = \min_{(e_1, \ldots, e_k) \in \gamma(g_1, g_2)} \sum_{i=1}^{k} c(e_i) \]

where:

• $\gamma(g_1, g_2)$ is the set of edit paths that transform $g_1$ into $g_2$.

• $\sum_{i=1}^{k} c(e_i)$ is the sum of the costs of the individual edit operations $e_i$ in the selected edit path. The cost function $c(e_i)$ measures the cost of each edit operation. The objective is to find the edit path with the minimum total cost, which represents the least amount of transformation required to convert $g_1$ into $g_2$.

In our experiments, we calculate the overall GED as follows:

\[ overall\_ged = \frac{1}{N} \sum_{i=1}^{N} GED_i \]

where:

• $N$ is the total number of graphs.

• $GED_i$ is the graph edit distance for the $i$-th graph.

Graph BERTScore (G-BS). G-BS treats a graph as a set of edges and solves a matching problem that finds the best alignment between the edges of the predicted graph and those of the ground-truth graph. Each edge is considered as a sentence, and BERTScore is used to calculate the score between a pair of predicted and ground-truth edges. Based on the best alignment and the overall matching score, the computed F1 score is used as the final G-BERTScore. Considering $x_i$ as a reference token (entity or relation) and $\hat{x}_j$ as a generated token (entity or relation), the complete score matches each token in $x$ to a token in $\hat{x}$ to compute recall, and each token in $\hat{x}$ to a token in $x$ to compute precision. Greedy matching is used to maximize the matching similarity score, where each token is matched to the most similar token in the other graph. Precision and recall are then combined to compute an F1 measure. For a reference $x$ and candidate $\hat{x}$, the recall, precision, and F1 scores are:

\[ R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j, \qquad P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j, \qquad F1_{BERT} = \frac{2 \cdot P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}} \]
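The greedy matching behind these three scores can be illustrated with a small pure-Python sketch. It assumes the token embeddings are already L2-normalized, so that the inner product equals cosine similarity, and it uses toy two-dimensional vectors rather than real BERT outputs; the function names are introduced for this example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def greedy_prf(reference_vecs, candidate_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    recall = sum(
        max(dot(x, x_hat) for x_hat in candidate_vecs) for x in reference_vecs
    ) / len(reference_vecs)
    precision = sum(
        max(dot(x, x_hat) for x in reference_vecs) for x_hat in candidate_vecs
    ) / len(candidate_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy unit vectors standing in for embeddings (not produced by BERT).
reference = [(1.0, 0.0), (0.0, 1.0)]   # two reference tokens
candidate = [(1.0, 0.0)]               # one generated token
precision, recall, f1 = greedy_prf(reference, candidate)
```

Here the single generated token matches the first reference token perfectly, so precision is 1.0, while the second reference token finds no similar counterpart, so recall is 0.5 and F1 is 2/3.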

Bleu-F1 Score ($F1_{Bleu}$). Let $C_{gen}$ be the count of 4-grams in the generated graph, $C_{ref}$ the count of 4-grams in the reference graph, and $C_{match}$ the count of matching 4-grams in both texts:

\[ P_{Bleu} = \frac{C_{match}}{C_{gen}}, \qquad R_{Bleu} = \frac{C_{match}}{C_{ref}}, \qquad F1_{Bleu} = \frac{2 \times P_{Bleu} \times R_{Bleu}}{P_{Bleu} + R_{Bleu}} \]

ROUGE-F1 Score ($F1_{ROUGE}$). In our experiments, we calculate the F1 score for ROUGE-2 (bigrams), as given by the following equations:

\[ P_{ROUGE} = \frac{|bigram_{cand.} \cap bigram_{ref.}|}{|bigram_{cand.}|}, \qquad R_{ROUGE} = \frac{|bigram_{cand.} \cap bigram_{ref.}|}{|bigram_{ref.}|}, \qquad F1_{ROUGE} = \frac{2 \cdot R_{ROUGE} \cdot P_{ROUGE}}{R_{ROUGE} + P_{ROUGE}} \]
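Both scores share the same n-gram overlap structure, which the sketch below implements for a generic n. It follows the simplified definitions given above (a single n-gram order, no brevity penalty) rather than the full BLEU or ROUGE toolkits, and the whitespace tokenization, function names, and toy strings are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(candidate, reference, n):
    """F1 of n-gram overlap: n=4 for the Bleu-F1 above, n=2 for ROUGE-2 F1."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    matched = sum((cand & ref).values())  # clipped count of shared n-grams
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy linearized graphs (illustrative only).
reference_text = "A r1 B C D"
candidate_text = "A r1 B D"
rouge2_f1 = ngram_f1(candidate_text, reference_text, n=2)
bleu4_f1 = ngram_f1(candidate_text, reference_text, n=4)
```

With these strings, 2 of the candidate's 3 bigrams and 2 of the reference's 4 bigrams match, giving a bigram F1 of 4/7, while no 4-gram is shared, so the 4-gram F1 is 0. The contrast shows how sensitive the 4-gram variant is to a single missing token in short linearized graphs.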

Hallucination and Omission.

As mentioned before, we calculate hallucination and omission using OEP, which gives the optimal edit paths between the gold and predicted graphs. Each edit operation $e_i$ in OEP represents an action required to transform the predicted graph into the gold graph.

• Hallucination: An edit operation $e_i$ is considered a hallucination if it deletes an entity or a relation that exists in the predicted graph but is not present in the gold graph. In our work, we report the overall hallucination rate $Hall.$, given by the following equation:

\[ Hall. = \frac{hall}{ToGrs} \]

where $hall$ is the number of graphs with at least one hallucination, and $ToGrs$ is the total number of generated graphs.

• Omission: An edit operation $e_i$ is considered an omission if it adds an entity or a relation that exists in the gold graph but is missing from the predicted graph. As for hallucination, we calculate the overall omission rate $Omis.$, given by the following equation:

\[ Omis. = \frac{omiss}{ToGrs} \]

where $omiss$ is the number of graphs with at least one omission.
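Given per-graph counts of hallucinating and omitting edit operations, the two dataset-level rates reduce to a simple proportion. The sketch below assumes such counts are already available (for instance from an edit-path analysis); the function name and the numbers are illustrative.

```python
def graph_rate(per_graph_counts):
    """Fraction of graphs with at least one flagged edit operation."""
    flagged = sum(1 for count in per_graph_counts if count >= 1)
    return flagged / len(per_graph_counts)


# Hypothetical counts for a test set of four generated graphs.
hallucinations_per_graph = [0, 2, 0, 1]
omissions_per_graph = [1, 0, 0, 0]

hall_rate = graph_rate(hallucinations_per_graph)  # 2 of 4 graphs
omis_rate = graph_rate(omissions_per_graph)       # 1 of 4 graphs
```

Because a graph counts once no matter how many operations it contains, these rates measure how widespread the problem is across the test set, not how severe it is within a single graph.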

Experiments

This section provides insights into the LLMs utilized in our study for ZSP, FSP, or FT, followed by the presentation of our experimental results.

We first provide a brief overview of the LLMs used in our experiments. Our selection criteria focused on small, open-source, and easily accessible LLMs. All models were sourced from the Hugging Face platform.

• Llama 2 [12] is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. In our experiments, we deploy the 7B and 13B pretrained models, which have been converted to the Hugging Face Transformers format.

• Introduced by [15], Mistral-7B-v0.1 is a pretrained generative text model featuring 7 billion parameters. Notably, Mistral-7B-v0.1 exhibits superior performance to Llama 2 13B across all benchmark tests in their experiments.

• In the work presented by [16], Starling-7B is introduced as an open LLM trained through Reinforcement Learning from AI Feedback (RLAIF). This model leverages the GPT-4 labeled ranking dataset, berkeley-nest/Nectar, and employs a novel reward training and policy tuning pipeline.

In our review of the state of the art, we observed that, apart from [3], which incorporates hallucination evaluation in their experiments, other studies primarily focus on metrics such as precision, recall, F1 score, triple matching, or graph matching. In our approach to evaluating experiments, we also consider hallucination and omission through a linguistic lens.

Upon examining Table 1, we observe the superior performance of the FT method compared to ZSP and FSP for the T2KG construction task. Of particular interest is the finding that Llama2-7b is an exception: applying ZSP to the fine-tuned Llama2-7b yields worse performance than applying FSP to the original Llama2-7b. Overall, this table provides a clear view of the relative performance of each method, highlighting the strengths and limitations of each approach for T2KG construction.

Furthermore, it is evident that better results are achieved by providing more examples (more shots) to the same model, whether original or fine-tuned. The results underscore the positive correlation between the quantity of examples and the model's performance. Comparing the fine-tuned Mistral and fine-tuned Starling, they exhibit similar performance when given 7 shots, surpassing the two Llama2 models by a significant margin. The standout performer with ZSP on the fine-tuned LLM is Mistral, showcasing a considerable lead over other LLMs, including Starling. To corroborate these findings, future versions of our study plan to assess our fine-tuned models using an alternative dataset with diverse domains.

In Table 1, Hall. denotes hallucinations and Omis. denotes omissions.

Taking into account our strategy of introducing an empty graph when LLMs fail to produce triples, we note that although Llama2-13b with ZSP exhibits among the least favorable results across all metrics, it displays minimal hallucinations. It is therefore crucial to recognize that the model with the fewest hallucinations is not necessarily the most suitable choice. To overcome this limitation in our evaluation metric, we aim to improve it by considering the prevalence of empty graphs in the generated results before assessing them against ground-truth graphs.

The G-BS consistently remains high, indicating that LLMs frequently generate text with words (entities or relations) very similar to those in the ground-truth graphs. Among the models, the fine-tuned Starling with 7 shots achieves the highest G-F1, which considers the graph as a whole and measures how many generated graphs are exactly identical to the ground truth; it thus generates approximately 36% of graphs identical to the ground truth. For various metrics, the fine-tuned Mistral with 7 shots performs exceptionally well, particularly in T-F1, where F1 scores are computed for all test samples and averaged to obtain the final Triple Match F1 score. Additionally, it excels in metrics such as Omis., F1-Bleu, and F1-Rouge. F1-Bleu and F1-Rouge are n-gram-based metrics encompassing precision (Bleu), recall (Rouge), and F-score (Bleu and Rouge). These metrics could potentially yield even better results if synonyms of entities or relations were considered as exact matches.

The authors of [6] conduct evaluations using WebNLG+2020. Consequently, we adopt their approach (PiVe) as a baseline for comparison with our experiments. Upon analyzing the results, it becomes evident that nearly all fine-tuned LLMs outperform PiVe, which is applied to both ChatGPT and GPT-4, as mentioned before.

In Table 2, we present the evaluation results of original LLMs with 7 shots and fine-tuned LLMs with zero-shot and 7 shots on the KELM-sub dataset prepared by [6], building upon [44]. It is crucial to note that the experiments used the same prompts as previously described. The results in Table 2 indicate that our fine-tuned LLMs perform less effectively than the original LLMs with 7 shots. Furthermore, all LLMs' results on KELM-sub are inferior to those on WebNLG+2020. This disparity can be attributed to the presence of different relation types, some of which are expressed differently in KELM, using synonyms not considered by the current metrics. To address this, future versions of this work aim to refine the metrics to accommodate synonyms of entities and relations.

We also observe that the evaluation of PiVe on KELM-sub yields better results, since its few-shot experiments leverage examples from the KELM-sub training dataset, providing LLMs with insights into certain relation types.

One future experiment will be to use examples from KELM-sub in few-shot prompts to investigate whether the generalization issue stems from the WebNLG domains, from the relation types, or from prompts that need to be improved so that the model disregards the relation types provided by the examples.

Conclusion and perspectives

This study delves into the Text-to-Knowledge Graph (T2KG) construction task, exploring the efficacy of three distinct approaches: Zero-Shot Prompting (ZSP), Few-Shot Prompting (FSP), and Fine-Tuning (FT) of Large Language Models (LLMs). Our comprehensive experimentation, employing models such as Llama2, Mistral, and Starling, sheds light on the strengths and limitations of each approach. The results demonstrate the remarkable performance of the FT method, particularly when compared to ZSP and FSP across various models. Notably, the fine-tuned Llama2-7b with ZSP gave worse results than the original Llama2-7b with FSP. Additionally, the positive correlation between the quantity of examples and model performance underscores the significance of dataset size in training. An essential part of our study involves the evaluation metrics employed to assess the generated graphs. In particular, we introduced nuanced considerations for refining these metrics to measure hallucination and omission in the generated graphs, offering valuable insights into the fidelity of the constructed knowledge graphs.

Looking forward, there are promising perspectives for further enhancement. One is to refine the evaluation metrics to accommodate synonyms of entities or relations in generated graphs, employing advanced methods or tools for synonym detection. Another is to leverage LLMs for data augmentation in the T2KG construction task. During experimentation, LLMs, particularly Starling, showed the ability to continue their generated output for T2KG by proposing new texts alongside the corresponding KGs (triples).

Figure 1: T2KG Task

Figure 2: Overall experimentation process

Figure 3: Prompting examples

Figure 4: Results examples

The 7-shot experiments sourced examples from the WebNLG+2020 training dataset. These new experiments aim to assess the generalization ability of original LLMs with 7 shots and fine-tuned LLMs with zero-shot and 7 shots across diverse domains in the T2KG construction task.

Table 1: Comparison of performance metrics and models

| Model | G-F1 | T-F1 | G-BS | GED | F1-Bleu | F1-Rouge | Hall. | Omis. |
|---|---|---|---|---|---|---|---|---|
| PiVE | 14.00 | 18.57 | 89.82 | 11.22 | - | - | - | - |
| Mistral-0 | 2.30 | 0.00 | 77.87 | 15.93 | 54.97 | 55.15 | 20.63 | 31.48 |
| Mistral-7 | 18.72 | 28.44 | 87.54 | 10.13 | 55.09 | 63.94 | 17.88 | 21.14 |
| Mistral-FT-0 | 31.93 | 44.08 | 86.89 | 8.25 | 63.88 | 69.08 | 13.55 | 18.27 |
| Mistral-FT-7 | 34.68 | 49.11 | 91.99 | 6.69 | 71.78 | 77.43 | 15.01 | 14.45 |
| Starling-0 | 5.23 | 7.83 | 86.29 | 13.35 | 34.64 | 14.61 | 17.48 | 33.24 |
| Starling-7 | 21.30 | 33.77 | 90.41 | 8.96 | 60.47 | 69.34 | 17.31 | 14.61 |
| Starling-FT-0 | 21.47 | 28.29 | 72.86 | 11.87 | 44.07 | 47.69 | 10.17 | 42.78 |
| Starling-FT-7 | 35.69 | 48.49 | 91.95 | 6.60 | 71.51 | 76.67 | 11.35 | 18.27 |
| Llama2-7b-0 | 0.00 | 0.46 | 54.20 | 18.29 | 20.23 | 17.98 | 4.83 | 81.53 |
| Llama2-7b-7 | 11.80 | 20.88 | 82.78 | 12.66 | 45.48 | 54.29 | 20.74 | 30.02 |
| Llama2-7b-FT-0 | 3.82 | 15.41 | 59.19 | 15.78 | 16.82 | 17.95 | 6.07 | 79.20 |
| Llama2-7b-FT-7 | 18.77 | 32.63 | 87.19 | 10.16 | 58.48 | 66.35 | 25.24 | 18.66 |
| Llama2-13b-0 | 0.00 | 0.79 | 57.42 | 17.79 | 20.50 | 18.23 | 4.78 | 81.23 |
| Llama2-13b-7 | 13.49 | 23.99 | 84.89 | 11.59 | 50.18 | 58.71 | 26.36 | 19.06 |
| Llama2-13b-FT-0 | 20.52 | 32.18 | 75.88 | 11.38 | 46.53 | 50.78 | 11.64 | 39.63 |
| Llama2-13b-FT-7 | 23.55 | 37.29 | 88.77 | 8.94 | 63.26 | 70.12 | 23.55 | 16.19 |

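Table 2: Results on KELM-sub

| Model | G-F1 | T-F1 | G-BS | GED | F1-Bleu | F1-Rouge | Hall. | Omis. |
|---|---|---|---|---|---|---|---|---|
| PiVE | 23.11 | 7.50 | 87.70 | 11.35 | - | - | - | - |
| Mistral-7 | 5.61 | 10.89 | 71.29 | 14.28 | 56.56 | 61.11 | 2.33 | 77.33 |
| Mistral-FT-0 | 2.28 | 8.02 | 69.29 | 14.92 | 24.24 | 35.70 | 2.06 | 77.22 |
| Mistral-FT-7 | 2.83 | 8.73 | 68.55 | 14.54 | 26.35 | 38.76 | 1.78 | 78.17 |
| Starling-7 | 5.61 | 13.82 | 83.16 | 12.85 | 65.79 | 71.20 | 5.33 | 59.44 |
| Starling-FT-0 | 2.00 | 5.76 | 64.87 | 16.51 | 17.64 | 24.29 | 0.72 | 79.39 |
| Starling-FT-7 | 3.11 | 9.82 | 67.79 | 14.53 | 27.37 | 39.49 | | 78.67 |
| Llama2-7b-7 | 5.06 | 6.20 | 67.49 | 15.55 | 52.18 | 56.71 | 2.28 | 76.83 |
| Llama2-7b-FT-0 | 0.22 | 1.71 | 58.85 | 18.84 | 6.54 | 7.81 | 0.56 | 80.28 |
| Llama2-7b-FT-7 | 5.28 | 8.33 | 67.29 | 15.09 | 26.86 | 38.75 | 3.67 | 75.33 |
| Llama2-13b-7 | 5.17 | 7.82 | 71.66 | 15.12 | 55.39 | 60.06 | 3.44 | 75.56 |
| Llama2-13b-FT-0 | 1.72 | 7.73 | 63.37 | 15.80 | 20.59 | 29.53 | 1.56 | 79.44 |
| Llama2-13b-FT-7 | 4.50 | 8.63 | 67.44 | 14.81 | 26.33 | 38.09 | 2.06 | 77.22 |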

NetworkX - optimal edit paths: https://networkx.org/documentation/stable/index.html

Hugging Face: https://huggingface.co/

Acknowledgments

The authors thank the French company DAVI (Davi The Humanizers, Puteaux, France) for their support, and the French government for funding under the France Relance plan.

References

[1] A. Hogan, E. Blomqvist, M. Cochez, C. Amato, G. D. Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, Knowledge graphs, ACM Computing Surveys (CSUR) 54 (2021).
[2] N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson, J. Taylor, Industry-scale knowledge graphs: Lessons and challenges: Five diverse technology companies show how it's done, Queue 17 (2019).
[3] N. Mihindukulasooriya, S. Tiwari, C. F. Enguix, K. Lata, Text2kgbench: A benchmark for ontology-driven knowledge graph generation from text, in: International Semantic Web Conference, Springer, 2023.
[4] V. Ershov, A case study for compliance as code with graphs and language models: Public release of the regulatory knowledge graph, arXiv preprint arXiv:2302.01842 (2023).
[5] J. H. Caufield, H. Hegde, V. Emonet, N. L. Harris, M. P. Joachimiak, N. Matentzoglu, H. Kim, S. A. Moxon, J. T. Reese, M. A. Haendel, Structured prompt interrogation and recursive extraction of semantics (spires): A method for populating knowledge bases using zero-shot learning, arXiv preprint arXiv:2304.02711 (2023).
[6] J. Han, N. Collier, W. Buntine, E. Shareghi, Pive: Prompting with iterative verification improving graph-based generative capability of llms, arXiv preprint arXiv:2305.12392 (2023).
[7] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, L. Zettlemoyer, Rethinking the role of demonstrations: What makes in-context learning work?, arXiv preprint arXiv:2202.12837 (2022).
[8] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022).
[9] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, P. F. Christiano, Learning to summarize with human feedback, Advances in Neural Information Processing Systems 33 (2020).
[10] OpenAI, Gpt-4 technical report, arXiv 2303.08774 (2023).
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020).
[12] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[13] B. Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, Bloom: A 176b-parameter open-access multilingual language model, arXiv preprint arXiv:2211.05100 (2022).
[14] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, Palm: Scaling language modeling with pathways, Journal of Machine Learning Research 24 (2023).
[15] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, Mistral 7b, arXiv preprint arXiv:2310.06825 (2023).
[16] B. Zhu, E. Frick, T. Wu, H. Zhu, J. Jiao, Starling-7b: Improving llm helpfulness & harmlessness with rlaif, 2023.
[17] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. Werra, C. Fourrier, N. Habib, Zephyr: Direct distillation of lm alignment, arXiv preprint arXiv:2310.16944 (2023).
[18] S. Carta, A. Giuliani, L. Piano, A. S. Podda, L. Pompianu, S. G. Tiddia, Iterative zero-shot llm prompting for knowledge graph construction, arXiv preprint arXiv:2307.01128 (2023).
[19] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities, arXiv preprint arXiv:2305.13168 (2023).
[20] B. Li, G. Fang, Y. Yang, Q. Wang, W. Ye, W. Zhao, S. Zhang, Evaluating chatgpt's information extraction capabilities: An assessment of performance, explainability, calibration, and faithfulness, arXiv preprint arXiv:2304.11633 (2023).
[21] X. Wei, X. Cui, N. Cheng, X. Wang, X. Zhang, S. Huang, P. Xie, J. Xu, Y. Chen, M. Zhang, Zero-shot information extraction via chatting with chatgpt, arXiv preprint arXiv:2302.10205 (2023).
[22] L. Jarnac, M. Couceiro, P. Monnin, Relevant entity selection: Knowledge graph bootstrapping via zero-shot analogical pruning, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023.
[23] Z. Bi, J. Chen, Y. Jiang, F. Xiong, W. Guo, H. Chen, N. Zhang, Codekgc: Code language model for generative knowledge graph construction, ACM Transactions on Asian and Low-Resource Language Information Processing 23 (2024).
[24] L. Yao, J. Peng, C. Mao, Y. Luo, Exploring large language models for knowledge graph completion, arXiv preprint arXiv:2308.13916 (2023).
[25] H. Khorashadizadeh, N. Mihindukulasooriya, S. Tiwari, J. Groppe, S. Groppe, Exploring in-context learning capabilities of foundation models for generating knowledge graphs from text, arXiv preprint arXiv:2305.08804 (2023).
[26] S. Deng, C. Wang, Z. Li, N. Zhang, Z. Dai, H. Chen, F. Xiong, M. Yan, Q. Chen, M. Chen, Construction and applications of billion-scale pre-trained multimodal business knowledge graph, in: 2023 IEEE 39th International Conference on Data Engineering (ICDE), IEEE, 2023.
[27] M. Trajanoska, R. Stojanov, D. Trajanov, Enhancing knowledge graph construction using large language models, arXiv preprint arXiv:2305.04676 (2023).
[28] J. Chen, L. Ma, X. Li, N. Thakurdesai, J. Xu, J. H. Cho, K. Nag, E. Korpeoglu, S. Kumar, K. Achan, Knowledge graph completion models are few-shot learners: An empirical study of relation labeling in e-commerce with llms, arXiv preprint arXiv:2305.09858 (2023).
[29] A. Harnoune, M. Rhanoui, M. Mikram, S. Yousfi, Z. Elkaimbillah, B. El Asri, Bert based clinical knowledge extraction for biomedical knowledge graph construction and analysis, Computer Methods and Programs in Biomedicine Update 1 (2021) 100042.
[30] L. Yang, H. Chen, Z. Li, X. Ding, X. Wu, Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling, arXiv preprint arXiv:2306.11489 (2023).
[31] T. C. Ferreira, C. Van Der Lee, E. Van Miltenburg, E. Krahmer, Neural data-to-text generation: A comparison between pipeline and end-to-end architectures, arXiv preprint arXiv:1908.09022 (2019).
[32] S. Saha, P. Yadav, L. Bauer, M. Bansal, Explagraphs: An explanation graph generation task for structured commonsense reasoning, arXiv preprint arXiv:2104.07644 (2021).
[33] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, arXiv preprint arXiv:1904.09675 (2019).
[34] Z. Abu-Aisheh, R. Raveaux, J.-Y. Ramel, P. Martineau, An exact graph edit distance algorithm for solving pattern recognition problems, in: 4th International Conference on Pattern Recognition Applications and Methods 2015, 2015.
[35] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[36] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004.
[37] C. Gardent, A. Shimorina, S. Narayan, L. Perez-Beltrachini, The webnlg challenge: Generating text from rdf data, in: Proceedings of the 10th International Conference on Natural Language Generation, 2017.
[38] A. Hagberg, P. Swart, D. Chult, Exploring network structure, dynamics, and function using NetworkX, Technical Report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2008.
[39] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized llms, Advances in Neural Information Processing Systems 36 (2024).
[40] X. Zhang, S. Liu, R. Zhang, C. Liu, D. Huang, S. Zhou, J. Guo, Y. Kang, Q. Guo, Z. Du, Adaptive precision training: Quantify back propagation in neural networks with fixed-point numbers, arXiv preprint arXiv:1911.00361 (2019).
[41] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[42] Y. Belkada, T. Dettmers, A. Pagnoni, S. Gugger, S. Mangrulkar, Making llms even more accessible with bitsandbytes, 4-bit quantization and qlora, 2023.
[43] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, PEFT: State-of-the-art parameter-efficient fine-tuning methods, 2022.
[44] O. Agarwal, H. Ge, S. Shakeri, R. Al-Rfou, Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training, arXiv preprint arXiv:2010.12688 (2020).