<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fine-Tuning vs. Prompting: Evaluating the Knowledge Graph Construction with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hussam Ghanem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christophe Cruz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAVI The Humanizers</institution>
          ,
          <addr-line>Puteaux</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ICB, UMR 6306, CNRS, Université de Bourgogne</institution>
          ,
          <addr-line>21000 Dijon</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper explores Text-to-Knowledge Graph (T2KG) construction, assessing Zero-Shot Prompting (ZSP), Few-Shot Prompting (FSP), and Fine-Tuning (FT) methods with Large Language Models (LLMs). Through comprehensive experimentation with Llama2, Mistral, and Starling, we highlight the strengths of FT, emphasize dataset size's role, and introduce nuanced evaluation metrics. Promising perspectives include synonym-aware metric refinement and data augmentation with LLMs. The study contributes valuable insights to KG construction methodologies, setting the stage for further advancements. The term "knowledge graph" has been around since 1972, but its current definition can be traced back to Google in 2012. This was followed by similar announcements from companies such as Airbnb, Amazon, eBay, Facebook, IBM, LinkedIn, Microsoft, and Uber, among others, leading to an increase in the adoption of Knowledge Graphs (KGs) by various industries. As a result, academic research in this field has seen a surge in recent years, with an increasing number of scientific publications on KGs [1]. These graphs utilize a graph-based data model to effectively manage, integrate, and extract valuable insights from large and diverse datasets [2]. KGs serve as repositories for structured knowledge, organized into a collection of triples G = {(h, r, t)} ⊆ E × R × E, where E represents the set of entities and R represents the set of relations [1]. Within a graph, nodes represent entities or concepts. These nodes encompass diverse types, including person, book, or city, and are interconnected by relationships such as located in, lives in, or works with. The essence of a KG emerges when it incorporates multiple types of relationships rather than being confined to a single type. The overarching structure of a KG constitutes a network of entities, featuring their semantic types, properties, and interconnections. Thus, constructing a KG necessitates information about entities, their types and properties, and the semantic relationships that bind them.</p>
      </abstract>
      <kwd-group>
        <kwd>Text-to-Knowledge Graph</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Zero-Shot Prompting</kwd>
        <kwd>Few-Shot Prompting</kwd>
        <kwd>Fine-Tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Constructing a KG thus requires information about entities (along with their types and properties)
and the semantic relationships that bind them. For the extraction of entities and relationships,
practitioners often turn to NLP tasks like Named Entity Recognition (NER), Coreference
Resolution (CR), and Relation Extraction (RE).</p>
      <p>KGs play a crucial role in organizing complex information across diverse domains, such as
question answering, recommendations, semantic search, etc. However, the ongoing challenge
persists in constructing them, particularly as the primary sources of knowledge are embedded
in unstructured textual data such as press articles, emails, and scientific journals. This challenge
can be addressed by adopting an information extraction approach, sometimes implemented as a
pipeline. It involves taking textual inputs, processing them using Natural Language Processing
(NLP) techniques, and leveraging the acquired knowledge to construct or enhance the KG.</p>
      <p>
        If we envision the Text-to-Knowledge Graph (T2KG) construction task as a black box, the
input is textual data, and the output is a knowledge graph. Achieving this can be approached
through methods that directly convert text into a graph or by implementing NLP tasks in
two ways: 1) through an information extraction pipeline incorporating the mentioned tasks
independently, or 2) by adopting an end-to-end approach, also known as joint prediction, using
Large Language Models (LLMs) for example. In the realm of LLMs and KGs, their mutual
enhancement is evident. LLMs can assist in the construction of KGs. Conversely, KGs can be
employed to validate outputs from LLMs or provide explanations for them [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. LLMs can be
adapted to KG construction task (T2KG) through various approaches, such as fine-tuning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
(FT), zero-shot prompting [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (ZSP), or few-shot prompting (FSP) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with a limited number of
examples. Each of these approaches has its pros and cons with respect to performance,
computational resources, training time, domain adaptation, and the training data required.
      </p>
      <p>
        In-context learning, as discussed by [
        <xref ref-type="bibr" rid="ref7">7</xref>
         ], coupled with prompt design, involves instructing a
model to perform a new task by presenting it with only a few demonstrations of input-output
pairs during inference. Instruction fine-tuning methods, exemplified by InstructGPT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
Reinforcement Learning from Human Feedback (RLHF) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], markedly enhance the model’s
ability to comprehend and follow a diverse range of written instructions. Numerous LLMs have
been introduced in the last year, as highlighted by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], particularly within the ChatGPT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] like
models, which includes GPT-3 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], LLaMA [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], BLOOM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], PaLM [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Mistral [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Starling
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Zephyr [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. These models can be readily repurposed for KG construction from text
by employing a prompt design that incorporates instructions and contextual information.
      </p>
      <p>This study does not entail a comparison with traditional methods of constructing KGs; rather,
it delves into the developments and challenges associated with KG construction methodologies,
aiming to provide a formal evaluation of the T2KG task. Specifically, we focus on the utilization
of LLMs and explore the three approaches mentioned before: Zero-Shot, Few-Shot, and
Fine-Tuning (Fig. 1). Each of these approaches addresses specific challenges, contributing significantly
to the evolution of T2KG construction techniques.</p>
      <p>The present study is organized as follows: Section 2 presents a comprehensive overview of
current state-of-the-art approaches for Text-to-KG (T2KG) construction. Section 3 presents
the general architecture of our proposed implementation (method), with datasets,
metrics, and experiments. Section 4 then encapsulates the findings and discussions, presenting
the culmination of results. Finally, Section 5 critically examines the strengths and limitations of
these techniques.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>The current state of research on knowledge graph construction using LLMs is discussed. Three
main approaches are identified: Zero-Shot, Few-Shot, and Fine-Tuning. Each approach has its
own challenges, such as maintaining accuracy without specific training data or ensuring the
robustness of models in diverse real-world scenarios. Evaluation metrics used to assess the
quality of constructed KGs are also discussed, including semantic consistency and linguistic
coherence. This section highlights the methods and metrics used to construct KGs and evaluate the results.</p>
      <p>Figure 1 illustrates the black-box joint prediction of the T2KG construction process using
LLMs. It demonstrates how two French examples on the left are converted into an expected
result (KG) on the right using ZSP, FSP or FT approaches with LLMs.</p>
      <sec id="sec-2-1">
        <title>2.1. Zero Shot</title>
        <p>
          Zero Shot methods enable KG construction without task-specific training data, leveraging the
inherent capabilities of large language models. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] introduce an innovative approach using
large language models (LLMs) for knowledge graph construction, employing iterative
zero-shot prompting for scalable and flexible KG construction. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] evaluate the performance of
LLMs, specifically GPT-4 and ChatGPT, in KG construction and reasoning tasks, introducing
the Virtual Knowledge Extraction task and the VINE dataset, but they do not consider
open-source LLMs such as LLaMA [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] assess ChatGPT’s abilities in information extraction
tasks, identifying overconfidence as an issue and releasing annotated datasets. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] tackle
zero-shot information extraction using ChatGPT, achieving impressive results in entity relation
triple extraction. [22] propose a method for Knowledge Graph Construction (KGC) using an
analogy-based approach, demonstrating superior performance on Wikidata. [23] address the
limitations of existing generative knowledge graph construction methods by leveraging large
generative language models trained on structured data. Most of these approaches share the
same limitation: they rely on closed, very large LLMs such as ChatGPT or GPT-4 for this
task. Challenges in this area include maintaining accuracy without specific training data and
addressing nuanced relationships between entities in untrained domains.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Few Shot</title>
        <p>
          Few Shot methods focus on constructing KGs with limited training examples, aiming to achieve
accurate knowledge representation with minimal data. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] introduce PiVe, a framework
enhancing the graph-based generative capabilities of LLMs, together with a verifier module
responsible for iteratively checking and correcting the LLM outputs. [24] explore the potential
of LLMs for knowledge graph completion, treating triples as text sequences and utilizing LLM
responses for predictions. [25] automate the process of generating structured knowledge graphs
from natural language text using foundation models. [26] present OpenBG, an open business
knowledge graph derived from Alibaba Group, containing 2.6 billion triples with over 88 million
entities. [27] explore the integration of LLMs with semantic technologies for reasoning and
inference. [28] investigate LLMs’ application in relation labeling for e-commerce Knowledge
Graphs (KGs). Like ZSP approaches, FSP approaches rely on closed, very large LLMs such as ChatGPT or
GPT-4 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for this task. Challenges in this area include achieving high accuracy with minimal
training data and ensuring the robustness of models in diverse real-world scenarios.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Fine-Tuning</title>
        <p>
          Fine-Tuning methods involve adapting pre-trained language models to specific knowledge
domains, enhancing their capabilities for constructing KGs tailored to particular contexts. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
present a case study automating KG construction for compliance using BERT-based models.
This study emphasizes the importance of machine learning models in interpreting rules for
compliance automation. [29] propose an approach for knowledge extraction and analysis from
biomedical clinical notes, utilizing the BERT model and a Conditional Random Field layer,
showcasing the effectiveness of leveraging BERT models for structured biomedical knowledge
graphs. [30] propose Knowledge Graph-Enhanced Large Language Models (KGLLMs), enhancing
LLMs with KGs for improved factual reasoning capabilities. These FT approaches, however, do
not use the newer generation of LLMs, in particular decoder-only LLMs such as Llama and Mistral.
Challenges in this domain include ensuring the scalability, interpretability, and robustness of
fine-tuned models across diverse knowledge domains.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Evaluation metrics</title>
        <p>As we employ LLMs to construct KGs, and given that LLMs function as Natural Language
Generation (NLG) models, it becomes imperative to discuss NLG criteria. In NLG, two criteria
[31] are used to assess the quality of the produced answers (triples in our context).</p>
        <p>The first criterion is semantic consistency, or Semantic Fidelity, which quantifies the fidelity
of the generated data against the input data. The most common indicators are:
• Hallucination: the presence of information (facts) in the generated text that is absent
from the input data. In our scenario, hallucination exists if the generated triples (GT)
contain triples not present in the ground-truth triples (ET) (T in GT and not in ET);
• Omission: the omission of a piece of information (a fact) in the generated text. In our
case, omission occurs if a triple is present in ET but not in GT;
• Redundancy: the repetition of information in the generated text. In our case, redundancy
exists if a triple appears more than once in GT;
• Accuracy: a lack of accuracy manifests as the modification of information, such as the
inversion of the subject and the object in the generated text. Accuracy increases if there
is an exact match between ET and GT;
• Ordering: it occurs when the sequence of information differs from the input data. In our
case, the ordering of GT is not considered.</p>
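To make these indicators concrete, here is a minimal sketch (not the paper's implementation) that counts triple-level hallucinations, omissions, and redundancies, assuming GT and ET are given as lists of (head, relation, tail) tuples and matching is exact:

```python
from collections import Counter

def fidelity_indicators(generated, ground_truth):
    """Count triple-level hallucinations, omissions, and redundancies.

    `generated` (GT) and `ground_truth` (ET) are lists of
    (head, relation, tail) tuples; matching is exact string match.
    """
    gen_counts = Counter(generated)
    gold_set = set(ground_truth)
    # hallucination: distinct generated triples absent from the ground truth
    hallucination = sum(1 for t in gen_counts if t not in gold_set)
    # omission: ground-truth triples missing from the generated set
    omission = sum(1 for t in gold_set if t not in gen_counts)
    # redundancy: extra repetitions of generated triples
    redundancy = sum(c - 1 for c in gen_counts.values() if c > 1)
    return hallucination, omission, redundancy

# Hypothetical example: one hallucinated triple, one omitted triple,
# one redundant repetition.
et = [("Alice", "livesIn", "Dijon"), ("Dijon", "locatedIn", "France")]
gt = [("Alice", "livesIn", "Dijon"), ("Alice", "livesIn", "Dijon"),
      ("Alice", "worksFor", "DAVI")]
```

The counts here are per-graph; the paper's overall metrics aggregate such counts over the whole test set.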
        <p>The second criterion is linguistic coherence, or Output Fluency, which evaluates the fluidity
and linguistic construction of the generated text: the segmentation of the text into different
sentences, the use of anaphoric pronouns to reference entities, and the production of linguistically
correct sentences. However, our evaluation does not take this second criterion into account.</p>
        <p>
          In their experiments, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] calculated three hallucination metrics - subject hallucination, relation
hallucination, and object hallucination - using certain preprocessing steps such as stemming.
They used the ground truth ontology alongside the ground truth test sentence to determine if
an entity or relation is present in the text. However, a limitation could arise when there is a
disparity in entities or relations between the ground truth ontology and the ground truth test
sentence. If the generated triples contain entities or relations not present in the ground truth
text, even if they exist in the ground truth ontology, it will be considered a hallucination.
        </p>
        <p>
          The authors of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] evaluate their experiments using several evaluation metrics, including
Triple Match F1 (T-F1), Graph Match F1 (G-F1), G-BERTScore (G-BS) from [32] which extends
BertScore [33] for graph matching, and Graph Edit Distance (GED) from [34]. The GED metric
measures the distance between the predicted graph and the ground-truth graph, which is
equivalent to computing the number of edit operations (addition, deletion, or replacement of
nodes and edges) needed to transform the predicted graph into a graph that is identical to the
ground-truth graph, but it does not provide a specific path for these operations to calculate the
exact number of operations. To adhere to the semantic consistency criterion, we use the terms
"omission" and "hallucination" in place of "addition" and "deletion," respectively.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Propositions</title>
      <p>This section describes our approach to evaluate the quality of generated KGs. We explain how
we use evaluation metrics such as T-F1, G-F1, G-BS, GED, Bleu-F1 [35] and ROUGE-F1 [36]
to assess the quality of the generated KGs in comparison to ground-truth KGs. Additionally,
we discuss the use of the Optimal Edit Paths (OEP) metric (NetworkX optimal edit paths:
https://networkx.org/documentation/stable/index.html) to determine the precise number of
operations required to transform the predicted graph into an identical representation of the
ground-truth graph. This metric serves as a basis for calculating omissions and hallucinations
in the generated graphs. We employ examples from the WebNLG+2020 dataset [37] for testing
with ZSP and FSP techniques. Additionally, we utilize the training dataset of WebNLG+2020 to
train LLMs using the FT technique. Subsequent subsections delve into a detailed discussion of
each phase.</p>
      <sec id="sec-3-1">
        <title>3.1. Overall experimentation’s process</title>
        <p>
          We leverage the WebNLG+2020 dataset, specifically the version curated by [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Their preparation
of graphs in lists of triples proves beneficial for evaluation purposes. We utilize these lists and
employ NetworkX [38] to transform them back into graphs, facilitating evaluations on the
resultant graphs. This step is instrumental in performing ZSP, FSP, and FT LLMs on this dataset.
        </p>
        <p>Figure 2 illustrates the different stages of our experimentation process, including data
preparation, model selection, training, validation, and evaluation. The process begins with data
preparation, where the WebNLG dataset is preprocessed and split into training, validation,
and test sets. Next, the learning type is selected, and different models are trained using the
training set. The trained models are then evaluated on the validation set to assess their
performance. Finally, the best-performing model is selected and validated on the test set to
estimate its generalization ability.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompting learning</title>
        <p>
          During this phase, we employ the ZSP and FSP techniques on LLMs to evaluate their proficiency
in extracting triples (i.e., construction of the KG). The application of these techniques involves
merging examples from the test dataset of WebNLG+2020 with our adapted prompt. Our prompt
is strategically modified to provide contextual guidance to the LLMs, facilitating the effective
extraction of triples, without the inclusion of a support ontology description, as demonstrated
in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The specific prompts used for ZSP and FSP are illustrated in Fig 3(a) and Fig 3(b), respectively.
        </p>
        <p>
          In our approach for ZSP, we began with the methodology outlined in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], initiating our
prompt with the directive "Transform the text into a semantic graph." However, we enhanced
this prompt by incorporating additional sentences tailored for our LLMs, as illustrated in Fig.
3(a).
        </p>
        <p>For FSP, we performed 7-shot learning, since the maximum KG size in WebNLG+2020 is 7
triples. Consequently, we fed our prompt with 7 examples of varying sizes: example 1 with size 1,
example 2 with size 2, example 3 with size 3, and so forth. In Figure 3-b, we depict a prompt
containing two examples.</p>
        <p>
          To demonstrate the efficacy of our refined prompt (including additional sentences), we
conducted zero-shot experiments on ChatGPT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], comparing the outcomes with those of
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Our results consistently reveal that our prompt yields more coherent answers in terms of
structure.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Finetuning</title>
        <p>If the initial results from the ZSP and FSP on LLMs prove reasonable, we proceed to the FT
phase. This phase aims to provide the LLMs with a more specific context and knowledge related
to the task of extracting triples within the domains covered by the WebNLG+2020 dataset. Using
the example "a)" illustrated in Fig 3, we pass to the FT prompt, for each line of the
training dataset, the input text and the corresponding KG (the list of triples). For this phase
(FT), we employ QLoRA [39], a methodology that integrates quantization [40] and Low-Rank
Adapters (LoRA) [41]. The LLM is loaded with 4-bit precision using bitsandbytes [42], and the
training process incorporates LoRA through the PEFT library (Parameter-Efficient Fine-Tuning)
[43] provided by Hugging Face.</p>
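As an illustration of this setup, the following is a minimal QLoRA configuration sketch; the model name, LoRA hyperparameters, and target modules are illustrative assumptions, not our exact training configuration:

```python
# Illustrative QLoRA setup sketch (hyperparameters are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit loading via bitsandbytes [42]
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",       # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters via the PEFT library [43]
lora_config = LoraConfig(
    r=16,                               # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapters are trainable
```

The quantized base model stays frozen; training then proceeds with a standard causal-LM objective over the prompt/KG pairs described above.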
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Postprocessing</title>
        <p>Given our focus on KG construction, our evaluation process involves assessing the generated
KGs against ground-truth KGs. To facilitate this evaluation, we apply a cleaning process to the
LLM outputs. This involves transforming the graphs generated by LLMs into organized lists of
triples, subsequently transferred to textual documents.</p>
        <p>The transformation is executed through rule-based processing. This step is applied to remove
corrupted text (outside the lists of triples) from the whole text generated by LLMs in the
preceding step. The output is then presented in a list of lists of triples format, optimizing our
evaluation process. This approach proves especially effective when calculating metrics such as
G-F1, GED, and OEP, as we will see in more detail in Section 3.5.</p>
        <p>A potential problem arises when instructing LLMs to produce lists of triples (KGs), as there
may be instances where the generated text lacks the desired structure. In such cases, we
address this issue by substituting the generated text with an empty list of triples, represented
as ’[["","",""]]’, allowing us to effectively evaluate omissions. However, this approach tends to
underestimate hallucinations compared to the actual occurrences.</p>
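A minimal sketch of such rule-based cleaning follows; the regular expression and the assumed output format (quoted JSON-like triple lists) are illustrative, not the exact rules we use:

```python
import re

EMPTY_GRAPH = [["", "", ""]]  # fallback when no triple list is found

def extract_triples(llm_output):
    """Extract [subject, relation, object] triples from raw LLM text,
    discarding surrounding corrupted text; fall back to an empty triple
    list so omissions can still be evaluated."""
    found = re.findall(
        r'\[\s*"([^"]*)"\s*,\s*"([^"]*)"\s*,\s*"([^"]*)"\s*\]',
        llm_output,
    )
    if not found:
        return EMPTY_GRAPH
    return [list(t) for t in found]

# Hypothetical raw LLM output with chatter around the triple list.
raw = ('Sure! Here is the graph: [["Alice", "livesIn", "Dijon"], '
       '["Dijon", "locatedIn", "France"]] Hope this helps.')
triples = extract_triples(raw)
```

As noted above, the empty-graph fallback lets omissions be counted but tends to underestimate hallucinations.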
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Experiment’s evaluation</title>
        <p>
          For assessing the quality of the generated graphs in comparison to ground-truth graphs, we
adopt evaluation metrics as employed in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. These metrics encompass T-F1, G-F1, G-BS [32],
and GED [34]. Additionally, we incorporate the Optimal Edit Paths (OEP) metric, a tool aiding
in the calculation of omissions and hallucinations within the generated graphs.
        </p>
        <p>
          Our evaluation procedure aligns with the methodology outlined in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], particularly in the
computation of GED and G-F1. This involves constructing directed graphs from lists of triples,
referred to as linearized graphs, utilizing NetworkX [38].
        </p>
        <p>
          In contrast to [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], our methodology diverges by not relying on the ground truth test sentence
of an ontology. As previously mentioned, we opt for a distinct approach wherein we assess
omissions and hallucinations in the generated graphs using the OEP metric. Unlike the global
edit distance provided by GED, OEP gives the precise path of the edit, enabling the exact
quantification of omissions and hallucinations, either in absolute terms or as a percentage across
the entire test dataset.
        </p>
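To illustrate how OEP yields these counts, here is a minimal NetworkX sketch; the example graphs are hypothetical, and matching on node names and edge labels is an assumption about the cost configuration:

```python
import networkx as nx

def graph_from_triples(triples):
    """Build a directed graph from (head, relation, tail) triples."""
    g = nx.DiGraph()
    for h, r, t in triples:
        g.add_node(h, name=h)
        g.add_node(t, name=t)
        g.add_edge(h, t, label=r)
    return g

gold = graph_from_triples([("Alice", "worksFor", "DAVI"),
                           ("DAVI", "locatedIn", "Puteaux")])
pred = graph_from_triples([("Alice", "worksFor", "DAVI"),
                           ("Alice", "livesIn", "Paris")])

# Optimal edit paths transforming the predicted graph into the gold one:
# deletions on the predicted side count as hallucinations, insertions as
# omissions.
paths, cost = nx.optimal_edit_paths(
    pred, gold,
    node_match=lambda a, b: a["name"] == b["name"],
    edge_match=lambda a, b: a["label"] == b["label"],
)
node_path, edge_path = paths[0]
edge_hallucinations = sum(1 for e_pred, e_gold in edge_path if e_gold is None)
edge_omissions = sum(1 for e_pred, e_gold in edge_path if e_pred is None)
```

Here the spurious `livesIn` edge is a hallucination and the missing `locatedIn` edge an omission; aggregating such per-graph flags over the test set gives the overall metrics of Section 3.6.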
        <p>For example, in the illustrated nodes path labeled ’a)’ in Fig 4-(b), we observe 2 omissions,
while the edges path in Fig 4-(a) exhibits 1 hallucination. In our evaluation, the criterion for
incrementing the global hallucination metric for all graphs is set at finding &gt;=1 hallucinations
or 1 omission in a generated graph. This approach ensures a comprehensive assessment of the
presence of omissions and hallucinations across the entirety of the generated graphs.</p>
        <p>As mentioned earlier, the evaluation of the three methods is conducted using examples
sourced from the test dataset of WebNLG+2020. The primary goal is to enhance G-F1, T-F1,
G-BS, Bleu-F1, and ROUGE-F1 metrics, while reducing GED, Hallucination, and Omission.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Mathematical representation of the used metrics</title>
        <p>We mathematically represent the used metrics as follows.</p>
        <p>Graph Matching (G-F1). Let n_match be the number of exact matches between predicted and gold
graphs, and let n_total be the total number of predicted graphs. Then the accuracy for
entire-graph matches is:
acc_graph = n_match / n_total</p>
        <p>Triple Matching (T-F1). The F1 score for triple matches is calculated as:
T-F1 = (2 × TP) / (2 × TP + FP + FN)
where:
• TP is the number of true positive triple matches;
• FP is the number of false positive triple matches;
• FN is the number of false negative triple matches.</p>
        <p>Graph Edit Distance (GED). The following equation calculates the GED between two given
graphs:
GED(G1, G2) = min over (e1, ..., ek) ∈ P(G1, G2) of Σ_{i=1..k} c(ei)
where:
• GED(G1, G2) represents the graph edit distance between the two graphs G1 and G2;
• the minimum is taken over all possible edit paths e1, ..., ek in the set P(G1, G2) of edit
paths that transform G1 into G2;
• Σ_{i=1..k} c(ei) is the sum of the costs of the individual edit operations in the selected
path; the cost function c(ei) measures the cost of each edit operation. The objective is to
find the edit path with the minimum total cost, which represents the least amount of
transformation required to convert G1 into G2.
In our experiments, we calculate the overall GED, computed as:
overall_ged = (1 / N) × Σ_{i=1..N} GED_i
where N is the total number of graphs and GED_i is the graph edit distance for the i-th graph.</p>
        <p>Graph BERTScore (G-BS). G-BS takes graphs as sets of edges and solves a matching problem
that finds the best alignment between the edges of the predicted graph and those of the
ground-truth graph. Each edge is considered as a sentence, and BERTScore is used to calculate the
score between a pair of predicted and ground-truth edges. Based on the best alignment and the
overall matching score, the computed F1 score is used as the final G-BERTScore. Considering x
as a reference token (entity or relation) and x̂ as a generated token (entity or relation), the
complete score matches each token in x to a token in x̂ to compute recall, and each token in x̂
to a token in x to compute precision. A greedy matching is used to maximize the matching
similarity score, where each token is matched to the most similar token in the other graph. Then
precision and recall are combined to compute an F1 measure. For a reference x and candidate x̂,
the recall, precision, and F1 scores are:
R_BERT = (1 / |x|) × Σ_{xi ∈ x} max_{x̂j ∈ x̂} xi⊤x̂j
P_BERT = (1 / |x̂|) × Σ_{x̂j ∈ x̂} max_{xi ∈ x} xi⊤x̂j
F1_BERT = 2 · P_BERT · R_BERT / (P_BERT + R_BERT)</p>
        <p>Bleu-F1 Score. Let n_candidate be the count of 4-grams in the generated graph, n_reference
the count of 4-grams in the reference graph, and n_match the count of matching 4-grams in both.
Then:
precision = n_match / n_candidate
recall = n_match / n_reference
Bleu-F1 = (2 × precision × recall) / (precision + recall)</p>
        <p>ROUGE-F1 Score. In our experiments, we calculate the F1-score for Rouge-2 (bigrams):
precision = |bigrams(generated) ∩ bigrams(reference)| / |bigrams(generated)|
recall = |bigrams(generated) ∩ bigrams(reference)| / |bigrams(reference)|
ROUGE-F1 = (2 × precision × recall) / (precision + recall)</p>
        <p>Hallucination and Omission. As mentioned before, we calculate hallucination and omission
using OEP, the optimal edit paths between the gold and predicted graphs. Each edit operation
ei in OEP represents an action required to transform the predicted graph into the gold graph.
• Hallucination: an edit operation ei is considered a hallucination if it involves adding an
entity or a relation that is not present in the gold graph but exists in the predicted graph.
We take into account the overall hallucination Hall., represented by the following equation:
Hall. = n_hall / n_total
where n_hall is the number of graphs with hallucination and n_total is the total number of
generated graphs;
• Omission: an edit operation ei is considered an omission if it involves deleting an entity
or a relation that exists in the gold graph but is missing from the predicted graph. As with
hallucination, we calculate the overall omission Om.:
Om. = n_om / n_total
where n_om is the number of graphs with omission.</p>
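The triple- and graph-level scores above can be sketched in a few lines; this is an illustrative implementation with exact string matching, not the evaluation harness used in our experiments:

```python
def t_f1(pred_triples, gold_triples):
    """Triple Match F1 over exact matches between predicted and gold triples."""
    pred, gold = set(pred_triples), set(gold_triples)
    tp = len(pred & gold)   # true positives
    fp = len(pred - gold)   # false positives
    fn = len(gold - pred)   # false negatives
    return 2 * tp / (2 * tp + fp + fn) if (tp or fp or fn) else 1.0

def graph_match_accuracy(pred_graphs, gold_graphs):
    """Fraction of predicted graphs that exactly match their gold graph."""
    matches = sum(set(p) == set(g) for p, g in zip(pred_graphs, gold_graphs))
    return matches / len(pred_graphs)

# Hypothetical example: one correct triple, one hallucinated, one omitted.
gold = [("Alice", "worksFor", "DAVI"), ("DAVI", "locatedIn", "Puteaux")]
pred = [("Alice", "worksFor", "DAVI"), ("Alice", "livesIn", "Paris")]
```

With one shared triple out of two on each side, TP = 1, FP = 1, FN = 1, giving T-F1 = 0.5 for this pair.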
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section provides insights into the LLMs utilized in our study for ZSP, FSP, or FT, followed
by the presentation of our experimental results.</p>
      <p>
        In this section, we provide a brief overview of the LLMs utilized in our experiments. Our
selection criteria focused on employing small, open-source, and easily accessible LLMs. All
models were sourced from the HuggingFace platform:
• Llama 2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a collection of pretrained and fine-tuned generative text models ranging
in scale from 7 billion to 70 billion parameters. In our experiments, we deploy the 7B and
13B pretrained models, which have been converted to the Hugging Face Transformers
format.
• Introduced by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Mistral-7B-v0.1 is a pretrained generative text model featuring 7
billion parameters. Notably, Mistral-7B-v0.1 exhibits superior performance to Llama 2
13B across all benchmark tests in their experiments.
• In the work presented by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], Starling-7B is introduced as an open LLM trained through
Reinforcement Learning from AI Feedback (RLAIF). This model leverages the GPT-4
labeled ranking dataset, berkeley-nest/Nectar, and employs a novel reward training and
policy tuning pipeline.
      </p>
      <p>
        In our review of the state-of-the-art, we observed that, apart from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which incorporates
hallucination evaluation in their experiments, other studies primarily focus on metrics such as
precision, recall, F1 score, triple matching, or graph matching. In our approach to evaluating
experiments, we also consider hallucination and omission through a linguistic lens.
      </p>
      <p>Upon examining Table 1, we observe the superior performance of the FT method compared
to ZSP and FSP for the T2KG construction task. Of particular interest is the exception of
Llama2-7b: applying ZSP to the fine-tuned Llama2-7b results in worse performance than
FSP on the original Llama2-7b. Overall, this table provides a clear visualization
of the relative performance of each method, highlighting the strengths and limitations of each
approach for T2KG construction.</p>
      <p>Furthermore, it is evident that better results are achieved by providing more examples (more
shots) to the same model, whether original or fine-tuned. The results underscore the positive
correlation between the quantity of examples and the model’s performance. Comparing the
fine-tuned Mistral and fine-tuned Starling, they exhibit similar performance when given 7 shots,
surpassing the two Llama2 models by a significant margin. The standout performer with ZSP
on the fine-tuned LLM is Mistral, showcasing a considerable lead over other LLMs, including
Starling. To corroborate these findings, future versions of our study plan to assess our fine-tuned
models using an alternative dataset with diverse domains.</p>
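      <p>For illustration, a k-shot prompt of the kind used in these experiments can be assembled from (text, triples) training pairs; the instruction wording below is a hypothetical sketch, not the exact prompt used in this work.</p>

```python
# Hypothetical sketch of assembling a few-shot T2KG prompt from training
# pairs; the instruction line is illustrative, not the paper's exact prompt.
def build_prompt(examples, query_text):
    parts = ["Extract (head, relation, tail) triples from the text."]
    for text, triples in examples:  # e.g. pairs drawn from the WebNLG+2020 training set
        triple_str = " ".join(f"({h}, {r}, {t})" for h, r, t in triples)
        parts.append(f"Text: {text}\nTriples: {triple_str}")
    parts.append(f"Text: {query_text}\nTriples:")  # model completes the triples
    return "\n\n".join(parts)
```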
      <p>As depicted in Figure 2, Hall. represents Hallucinations, while Omis. denotes Omissions.</p>
      <p>Taking into account our strategy of introducing an empty graph when LLMs fail to produce
triples, we note that even though Llama2-13b with ZSP exhibits the least favorable results
across all metrics, it displays minimal hallucinations. Nonetheless, it is crucial to recognize that
the model with the fewest hallucinations may not necessarily be the most suitable choice. To
overcome this limitation in our evaluation metric, we aim to improve it by considering the
prevalence of empty graphs in the generated results before assessing them against ground truth
graphs.</p>
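      <p>The proposed improvement, reporting the share of empty generated graphs before comparing against ground truth, can be sketched as follows (an assumption-level illustration, not the paper's implementation):</p>

```python
# Sketch (assumption): report the prevalence of empty predicted graphs so
# that low hallucination counts caused by empty outputs can be discounted.
def empty_graph_rate(predicted_graphs):
    empty = sum(1 for g in predicted_graphs if len(g) == 0)
    return empty / len(predicted_graphs)
```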
      <p>The G-BS consistently remains high, indicating that LLMs frequently generate text with
words (entities or relations) very similar to those in the ground truth graphs. Among the models,
the fine-tuned Starling with 7 shots achieves the highest G-F1, which focuses on the entirety of
the graph and evaluates how many graphs are produced exactly the same, suggesting that it
accurately generates approximately 36% of graphs identical to the ground truth. For various
metrics, the fine-tuned Mistral with 7 shots performs exceptionally well, particularly in T-F1,
where F1 scores are computed for all test samples and averaged for the final Triple Match F1
score. Additionally, it excels in metrics such as "Omis.," F1-Bleu, and F1-Rouge. F1-Bleu and
F1-Rouge represent n-gram-based metrics encompassing precision (Bleu), recall (Rouge), and
F-score (Bleu and Rouge). These metrics could potentially yield even better results if synonyms
of entities or relations were considered exact matches.</p>
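      <p>The Triple Match F1 (T-F1) described above, a per-sample F1 over exact triple matches averaged across the test set, can be sketched as follows (an assumption-level reimplementation, not the paper's exact scorer):</p>

```python
# Sketch of the per-sample Triple Match F1 averaged over all test samples,
# assuming exact matching of (head, relation, tail) triples.
def triple_f1(pred, gold):
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0  # both empty counts as a perfect match
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def t_f1(predicted_graphs, gold_graphs):
    scores = [triple_f1(p, g) for p, g in zip(predicted_graphs, gold_graphs)]
    return sum(scores) / len(scores)
```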
      <p>
        The authors in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] conduct evaluations using WebNLG+2020. Consequently, we adopt their
approach (PiVE) as a baseline for comparison with our experiments. Upon analyzing the results,
it becomes evident that nearly all fine-tuned LLMs outperform PiVE, which is applied on both
ChatGPT and GPT-4 as mentioned before.
      </p>
      <p>
        In Table 2, we present the evaluation results of original LLMs with 7 shots and fine-tuned
LLMs with zero-shot and 7 shots on the KELM-sub dataset prepared by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], building upon [44].
It’s crucial to note that the experiments utilized the same prompts as previously described. The
7-shot experiments sourced examples from the WebNLG+2020 training dataset. These new
experiments aim to assess the generalization ability of original LLMs with 7 shots and fine-tuned
LLMs with zero-shot and 7 shots across diverse domains in the T2KG construction task.
      </p>
      <p>The results in Table 2 indicate that our fine-tuned LLMs perform less effectively than the
original LLMs with 7 shots. Furthermore, all LLMs’ results on KELM-sub are inferior to those on
WebNLG+2020. This disparity can be attributed to the presence of different relation types, where
some types are expressed differently in KELM, using synonyms not considered in the current
metrics. To address this, our forthcoming versions aim to refine the metrics to accommodate
synonyms in entities and relations.</p>
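      <p>A minimal form of this planned synonym-aware refinement could canonicalize relation labels through a synonym table before exact-match scoring; the table below is illustrative only, not one used in our experiments.</p>

```python
# Hypothetical sketch: map synonymous relation labels to a canonical form
# before exact-match scoring; the synonym table is illustrative only.
RELATION_SYNONYMS = {
    "located in": "location",
    "situated in": "location",
    "birth place": "place of birth",
}

def normalize(triples):
    # Replace each relation by its canonical label when a synonym is known.
    return {(h, RELATION_SYNONYMS.get(r, r), t) for h, r, t in triples}
```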
      <p>We also observe that the evaluation of PiVE on KELM-sub yields better results, as it leverages
examples from the KELM-sub training dataset in its few-shot experiments, providing LLMs
with insights into certain relation types.</p>
      <p>One of the future experimentations will be to use examples from KELM-sub for few-shot
prompts to investigate whether the generalization issue stems from WebNLG domains, relation
types, or prompts that need improvement to disregard the relation types provided by the
examples.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and perspectives</title>
      <p>This study delves into the Text-to-Knowledge Graph (T2KG) construction task, exploring the
efficacy of three distinct approaches: Zero-Shot Prompting (ZSP), Few-Shot Prompting (FSP),
and Fine-Tuning (FT) of Large Language Models (LLMs). Our comprehensive experimentation,
employing models such as Llama2, Mistral, and Starling, sheds light on the strengths and
limitations of each approach. The results demonstrate the remarkable performance of the
FT method, particularly when compared to ZSP and FSP across various models. Notably,
the fine-tuned Llama2-7b with ZSP gave worse results than FSP with the original Llama2.
Additionally, the positive correlation between the quantity of examples and model performance
underscores the significance of dataset size in training. An essential part of our study involves
the evaluation metrics employed to assess the generated graphs. In particular, we introduced
nuanced considerations for refining these metrics to measure hallucination and omission in
the generated graphs, offering valuable insights into the fidelity of the constructed knowledge
graphs.</p>
      <p>Looking forward, there are promising perspectives for further enhancement. One is
refining evaluation metrics to accommodate synonyms of entities or relations in generated
graphs, employing advanced methods or tools for synonym detection. Furthermore, leveraging
LLMs for data augmentation in the T2KG construction task shows promise. Notably, during
experimentation, LLMs, particularly Starling, exhibited the ability to provide continuity in
generated results for T2KG, proposing texts alongside corresponding KGs (triples).</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors thank the French company DAVI (Davi The Humanizers, Puteaux, France) for their
support, and the French government for the France Relance funding plan.</p>
      <p>et al., Zero-shot information extraction via chatting with chatgpt, arXiv preprint
arXiv:2302.10205 (2023).
[22] L. Jarnac, M. Couceiro, P. Monnin, Relevant entity selection: Knowledge graph
bootstrapping via zero-shot analogical pruning, in: Proceedings of the 32nd ACM International
Conference on Information and Knowledge Management, 2023, pp. 934–944.
[23] Z. Bi, J. Chen, Y. Jiang, F. Xiong, W. Guo, H. Chen, N. Zhang, Codekgc: Code language
model for generative knowledge graph construction, ACM Transactions on Asian and
Low-Resource Language Information Processing 23 (2024) 1–16.
[24] L. Yao, J. Peng, C. Mao, Y. Luo, Exploring large language models for knowledge graph
completion, arXiv preprint arXiv:2308.13916 (2023).
[25] H. Khorashadizadeh, N. Mihindukulasooriya, S. Tiwari, J. Groppe, S. Groppe, Exploring
in-context learning capabilities of foundation models for generating knowledge graphs
from text, arXiv preprint arXiv:2305.08804 (2023).
[26] S. Deng, C. Wang, Z. Li, N. Zhang, Z. Dai, H. Chen, F. Xiong, M. Yan, Q. Chen, M. Chen,
et al., Construction and applications of billion-scale pre-trained multimodal business
knowledge graph, in: 2023 IEEE 39th International Conference on Data Engineering
(ICDE), IEEE, 2023, pp. 2988–3002.
[27] M. Trajanoska, R. Stojanov, D. Trajanov, Enhancing knowledge graph construction using
large language models, arXiv preprint arXiv:2305.04676 (2023).
[28] J. Chen, L. Ma, X. Li, N. Thakurdesai, J. Xu, J. H. Cho, K. Nag, E. Korpeoglu, S. Kumar,
K. Achan, Knowledge graph completion models are few-shot learners: An empirical study
of relation labeling in e-commerce with llms, arXiv preprint arXiv:2305.09858 (2023).
[29] A. Harnoune, M. Rhanoui, M. Mikram, S. Yousfi, Z. Elkaimbillah, B. El Asri, Bert based
clinical knowledge extraction for biomedical knowledge graph construction and analysis,
Computer Methods and Programs in Biomedicine Update 1 (2021) 100042.
[30] L. Yang, H. Chen, Z. Li, X. Ding, X. Wu, Chatgpt is not enough: Enhancing large
language models with knowledge graphs for fact-aware language modeling, arXiv preprint
arXiv:2306.11489 (2023).
[31] T. C. Ferreira, C. van der Lee, E. Van Miltenburg, E. Krahmer, Neural data-to-text
generation: A comparison between pipeline and end-to-end architectures, arXiv preprint
arXiv:1908.09022 (2019).
[32] S. Saha, P. Yadav, L. Bauer, M. Bansal, Explagraphs: An explanation graph generation task
for structured commonsense reasoning, arXiv preprint arXiv:2104.07644 (2021).
[33] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text
generation with bert, arXiv preprint arXiv:1904.09675 (2019).
[34] Z. Abu-Aisheh, R. Raveaux, J.-Y. Ramel, P. Martineau, An exact graph edit distance
algorithm for solving pattern recognition problems, in: 4th International Conference on
Pattern Recognition Applications and Methods 2015, 2015.
[35] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of
machine translation, in: Proceedings of the 40th annual meeting of the Association for
Computational Linguistics, 2002, pp. 311–318.
[36] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization
branches out, 2004, pp. 74–81.
[37] C. Gardent, A. Shimorina, S. Narayan, L. Perez-Beltrachini, The webnlg challenge:
Generating text from rdf data, in: Proceedings of the 10th international conference on natural
language generation, 2017, pp. 124–133.
[38] A. Hagberg, P. Swart, D. S Chult, Exploring network structure, dynamics, and function
using NetworkX, Technical Report, Los Alamos National Lab.(LANL), Los Alamos, NM
(United States), 2008.
[39] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of
quantized llms, Advances in Neural Information Processing Systems 36 (2024).
[40] X. Zhang, S. Liu, R. Zhang, C. Liu, D. Huang, S. Zhou, J. Guo, Y. Kang, Q. Guo, Z. Du,
et al., Adaptive precision training: Quantify back propagation in neural networks with
fixed-point numbers, arXiv preprint arXiv:1911.00361 (2019).
[41] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank
adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[42] Y. Belkada, T. Dettmers, A. Pagnoni, S. Gugger, S. Mangrulkar, Making llms even more
accessible with bitsandbytes, 4-bit quantization and qlora (2023).
[43] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, PEFT:
State-of-the-art parameter-efficient fine-tuning methods (2022).
[44] O. Agarwal, H. Ge, S. Shakeri, R. Al-Rfou, Knowledge graph based synthetic
corpus generation for knowledge-enhanced language model pre-training, arXiv preprint
arXiv:2010.12688 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 54</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Industry-scale knowledge graphs: Lessons and challenges: Five diverse technology companies show how it's done</article-title>
          ,
          <source>Queue</source>
          <volume>17</volume>
          (
          <year>2019</year>
          )
          <fpage>48</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Enguix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lata</surname>
          </string-name>
          ,
          <article-title>Text2kgbench: A benchmark for ontology-driven knowledge graph generation from text</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2023</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ershov</surname>
          </string-name>
          ,
          <article-title>A case study for compliance as code with graphs and language models: Public release of the regulatory knowledge graph</article-title>
          ,
          <source>arXiv preprint arXiv:2302</source>
          .
          <year>01842</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Caufield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Emonet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Joachimiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matentzoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Moxon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Reese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Haendel</surname>
          </string-name>
          , et al.,
          <article-title>Structured prompt interrogation and recursive extraction of semantics (spires): A method for populating knowledge bases using zero-shot learning</article-title>
          ,
          <source>arXiv preprint arXiv:2304.02711</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Buntine</surname>
          </string-name>
          , E. Shareghi,
          <article-title>Pive: Prompting with iterative verification improving graph-based generative capability of llms</article-title>
          ,
          <source>arXiv preprint arXiv:2305.12392</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          , L. Zettlemoyer,
          <article-title>Rethinking the role of demonstrations: What makes in-context learning work?</article-title>
          ,
          <source>arXiv preprint arXiv:2202.12837</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Stiennon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <article-title>Learning to summarize with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>3008</fpage>
          -
          <lpage>3021</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>Gpt-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Workshop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Akiki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ilić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hesslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Castagné</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Luccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yvon</surname>
          </string-name>
          , et al.,
          <article-title>Bloom: A 176b-parameter open-access multilingual language model</article-title>
          ,
          <source>arXiv preprint arXiv:2211.05100</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , et al.,
          <article-title>Palm: Scaling language modeling with pathways</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. d. l.</given-names>
            <surname>Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7b</article-title>
          ,
          <source>arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <article-title>Starling-7b: Improving llm helpfulness &amp; harmlessness with rlaif</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beeching</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habib</surname>
          </string-name>
          , et al.,
          <article-title>Zephyr: Direct distillation of lm alignment</article-title>
          ,
          <source>arXiv preprint arXiv:2310.16944</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Carta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giuliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Piano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Podda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pompianu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Tiddia</surname>
          </string-name>
          ,
          <article-title>Iterative zero-shot llm prompting for knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2307.01128</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities</article-title>
          ,
          <source>arXiv preprint arXiv:2305.13168</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Evaluating chatgpt's information extraction capabilities: An assessment of performance, explainability, calibration, and faithfulness</article-title>
          ,
          <source>arXiv preprint arXiv:2304.11633</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>