=Paper=
{{Paper
|id=Vol-3747/paper1
|storemode=property
|title=Towards Harnessing Large Language Models as Autonomous Agents for Semantic Triple Extraction from Unstructured Text
|pdfUrl=https://ceur-ws.org/Vol-3747/text2kg_paper1.pdf
|volume=Vol-3747
|authors=Ananya Ananya,Sanju Tiwari,Nandana Mihindukulasooriya,Tommaso Soru,Ziwei Xu,Diego Moussallem
|dblpUrl=https://dblp.org/rec/conf/text2kg/Ananya24
}}
==Towards Harnessing Large Language Models as Autonomous Agents for Semantic Triple Extraction from Unstructured Text==
Ananya Ananya¹·∗, Sanju Tiwari², Nandana Mihindukulasooriya³, Tommaso Soru⁴, Ziwei Xu⁵ and Diego Moussallem⁶

¹ Indian Institute of Technology, Bhilai, India
² BVICAM, New Delhi, India & UAT, Mexico
³ IBM Research, New York, United States
⁴ Serendipity AI Ltd, London, United Kingdom
⁵ National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
⁶ Jusbrasil, Salvador, Brazil
Abstract
The use of Large Language Models as autonomous agents interacting with tools has been shown to improve performance on several tasks, from code generation to API calling and sequencing. This paper proposes a
framework for using Large Language Models as autonomous agents for the task of Knowledge Graph
construction from unstructured text. Specifically, it focuses on triple extraction, which involves identify-
ing entities and their relationships from text to construct a Knowledge Graph. Our novel framework
“Auto-KG agent” incorporates two relation extraction tools, REBEL and KnowGL, in conjunction with
Large Language Models. Experimental results on the CONLL04 dataset demonstrate that while multi-tool approaches face challenges such as hallucination, LLM-based agents show promise in mitigating biases, identifying major events, and handling negations and modalities, thereby enhancing extraction accuracy, particularly
for complex linguistic structures. The impetus for this research is to overcome the current limitations
of existing systems for Knowledge Graph construction and propose a roadmap for developing a robust
framework capable of handling the intricacies of natural language with minimal human interference.
The paper also discusses future directions, such as emulating Large Language Model training using
reinforcement learning with human feedback, incorporating query decomposition, and integrating a
re-ranking module. Through this research, the authors aim to set a new direction for future endeavours in
building advanced, reliable systems for knowledge extraction. Overall, this work highlights the potential
of LLM-based agents for knowledge graph construction and proposes a framework for harnessing their
capabilities.
Keywords
Triple extraction, Knowledge Graph, Knowledge Graph Construction, LLM Agents, Reasoning, Handling
modalities and negations, Mitigating biases
TEXT2KG 2024: Third International Workshop on Knowledge Graph Generation from Text, May 26-30, 2024, co-located
with Extended Semantic Web Conference (ESWC), Hersonissos, Greece
∗ Corresponding author.
Email: ananyah@iitbhilai.ac.in (A. Ananya); tiwarisanju18@ieee.org (S. Tiwari); nandana@ibm.com (N. Mihindukulasooriya); tom@tommaso-soru.it (T. Soru); xuxiaowei23@hotmail.com (Z. Xu); diegomoussallem@gmail.com (D. Moussallem)
ORCID: 0009-0002-2431-2511 (A. Ananya); 0000-0003-1707-4842 (N. Mihindukulasooriya)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction
The advent of Knowledge Graphs has revolutionized the way we represent and utilize informa-
tion in the digital age. By structuring data as triples—consisting of a head entity, a relationship,
and a tail entity (h, r, t) — Knowledge Graphs provide a semantic framework to describe the
varied and countless entities and their interrelations in the objective world. This structured
approach to data organization underpins intelligent applications and has garnered significant at-
tention in both academic and industrial spheres due to its potent semantic processing capabilities
and open organizational structure [1].
In the field of Natural Language Processing, extracting relational facts from text is crucial.
Understanding the semantic relationships between entities in unstructured text helps convert
raw data into structured formats. This structured data is extremely valuable for several tasks,
such as building and enhancing Knowledge Bases. These bases are essential for powering
applications that rely on knowledge [2].
In the realm of information extraction, frameworks like REBEL [2] and KnowGL [3] have
emerged as powerful tools for converting unstructured text into structured relational data. These
frameworks leverage the advancements in machine learning and natural language processing to
perform tasks that traditionally required separate models for Named Entity Recognition (NER)
and Relation Classification (RC). REBEL, which stands for Relation Extraction By End-to-end
Language Generation, utilizes an autoregressive sequence-to-sequence (seq2seq) model, specifi-
cally a BART-large model, to extract relationships between entities in a text. The architecture of
REBEL is designed to represent relations as a linearized sequence that includes entity mentions,
labels, types, and the relation label. Similarly, KnowGL is a comprehensive framework that
aims to transform natural language text into structured data that aligns with the schema of a
Knowledge Graph like Wikidata. KnowGL consists of three main components: “Knowledge
Generation”, “Fact Ranking”, and “Linking to Wikidata”. We focus on the Knowledge Generation
component of the KnowGL framework. The Knowledge Generation component uses fine-tuned,
pre-trained language models to identify entity mentions and generate facts, including entity
labels, types, and relationships.
Despite their innovative approaches, REBEL and KnowGL have certain limitations. The
performance of these models is heavily influenced by the quality of their pre-training data.
Biases or inaccuracies present in the training datasets can propagate through the models,
affecting the accuracy of the extracted relations. Furthermore, the ability of these frameworks to
generalize across various domains and text types hinges on the extent to which the pre-trained
language models are fine-tuned or further pre-trained on domain-specific datasets. While Large
Language Models inherently possess a broad understanding of language, their performance
in specialized contexts improves significantly with targeted fine-tuning. Additionally, these
systems may struggle with complex sentence structures and fail to identify all relevant major
events, particularly in sentences laden with modalities or multiple clauses.
To address these challenges, we introduce a novel framework that synergizes the capabilities
of REBEL and KnowGL with the nuanced understanding of Large Language Models. Large
Language Models demonstrate remarkable efficacy in processing sentences with modalities
and complex structures, leading to accurate event identification and triple extraction. This
integration not only enhances event detection but also aids in mitigating biases inherent in
training data, ensuring a more comprehensive extraction of triples. Apart from introducing a
novel framework, we also aim to answer the following research questions:
RQ1: How effective are Large Language Models at mitigating biases, present in the datasets used to train information extraction tools, when extracting triples?
RQ2: To what extent do Large Language Models accurately handle modalities and negations in
natural language, and how does this capability affect the quality of triple extraction?
RQ3: Can Large Language Models enhance the identification of events within unstructured
text, thereby improving the accuracy and completeness of triple extraction?
RQ4: How well do Large Language Models generalize across different datasets without the
need for extensive training or fine-tuning, particularly in the context of triple extraction
for knowledge graph construction?
RQ5: What is the impact of using multiple tools versus a single tool on the performance of
triple extraction?
The impetus for our research is to overcome the current limitations of existing systems and
chart a course for the development of a robust framework capable of handling the intricacies of
natural language. Our experiments are designed with this objective in mind and are conducted
with the resources available to us. Through this research, we aim to set a new direction for
future endeavors in building advanced, reliable systems for knowledge extraction and reasoning.
2. Related Work
Using Large Language Models as autonomous agents has become increasingly popular in recent
research. Large Language Models possess advanced reasoning abilities and skills in utilizing
tools, making them well-suited for autonomous operations. They excel in tasks like acquiring
knowledge, understanding instructions, generalizing information, planning, and reasoning,
showcasing their potential for autonomous tasks [4]. However, Large Language Models do have limitations, such as difficulty performing arithmetic operations and staying up to date with the latest information, which cannot be fully addressed through simple fine-tuning alone. This highlights
the need for designing autonomous agent frameworks that can complement LLMs by integrating
external data and supplementary tools [5].
This section covers existing studies on LLM-based agents for Knowledge Graph generation from text. Jiang et al. [6] introduced KG-Agent, an autonomous framework based on Large Language Models, designed to enable a small Large Language Model to independently make decisions throughout the reasoning process over Knowledge Graphs until completion. Within KG-Agent, a Large Language Model is combined with a versatile toolbox, a knowledge memory system and a KG-based executor. Jiang et al. [7] also introduced StructGPT, an Iterative Reading-then-Reasoning (IRR) framework aimed at addressing question-answering tasks over structured data. In this framework, specialized interfaces are designed to acquire relevant information from structured data, allowing Large Language Models to focus on reasoning over the acquired information. This research also introduced a procedure termed invoking-linearization-generation to help Large Language Models reason over structured data through the provided interfaces. Zhu et al. [8] explored Large Language Models for Knowledge Graph construction with reasoning and introduced an innovative approach called AutoKG, which utilizes multiple agents to efficiently handle both Knowledge Graph construction and reasoning tasks. Several LLM-based agents are compared in Table 1.
| Methodology | Description | Tasks | Dataset |
|---|---|---|---|
| Tool Learning [9] | In-context demonstration and generation regulation | Tool manipulation, multi-tool usage | ToolBench [9] |
| Instruction tuning [10] | Learning on high-quality instruction datasets | Tool manipulation, multi-tool usage | ToolBench [9] |
| Instruction Tuning with Human Curriculum [11] | Instruction data mimicking human learning progression | Reasoning and knowledge-based tasks | CORGI [11] |
| ReAct [12] | Prompting LLMs for decision making | Reasoning algorithm | HotPotQA [13], FEVER [14] |
| DFSDT [10] | Prompting LLMs for decision making | Reasoning algorithm | ToolBench [9] |
| CoT [15] | Prompting LLMs for decision making | Reasoning algorithm | GSM8K [16], SVAMP [17], ASDiv [18], AQuA, MAWPS [19] |
| AutoGPT [20] | Online decision making | Any decision-making task | ALFRED [21], DAgger [22] |
| WebGPT [23] | Text-based web browsing environment | Long-form QA | ELI5 [24] |
| MMREACT [25] | A system integrating ChatGPT with a pool of vision experts | Multimodal reasoning and action | Self |
| ProgPrompt [26] | Prompts with program-like specifications of the available actions and objects in an environment | Generating situated robot task plans | Self |
| LLM-ARK [27] | KG reasoning agent | Conversational reasoning on Knowledge Graphs; predictions on KG paths as a decision-making task | OpenDialKG [28] |
| KG-Agent [29] | Enables a small LLM to actively make decisions until finishing the reasoning process over KGs | Improving reasoning ability and complex QA | WebQSP [30], CWQ [31], GrailQA [32] |
| KnowAgent [33] | Enhances the planning capabilities of LLMs by incorporating explicit action knowledge from KGs | A knowledgeable self-learning strategy for path planning | HotPotQA [13] |

Table 1: A comparison of work that augments LLMs with tool usage
3. Background
In the construction of LLM agents, an LLM acts as the primary controller or “brain,” orchestrating
the sequence of operations required to accomplish a task or respond to a user query. These LLM
agents may require additional modules like planning, memory, and tool utilization to enhance
their functionality [34]. To activate the LLM component, a prompt template containing essential
operational details and tool access specifications is utilized. While not mandatory, agents can
be characterized or given a persona to define their role. This profiling information is typically
embedded within the prompt and may include details such as role description, personality traits,
social characteristics, and other demographic attributes [35].
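The prompt-template idea described above can be sketched as follows. This is a hypothetical illustration: the template text, persona and tool descriptions are our own placeholders, not the paper's actual system prompt (which is given in its Appendix A).

```python
# Hypothetical agent prompt template: operational details, tool access
# specifications, and a persona, as described in the text.
AGENT_PROMPT_TEMPLATE = """\
You are {persona}.
Your task: {task_description}

You have access to the following tools:
{tool_specs}

Return only the extracted triples as a JSON list of
{{"head": ..., "relation": ..., "tail": ...}} objects.
"""

def build_prompt(persona, task_description, tools):
    # Render one line per tool: "- name: what it does".
    tool_specs = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return AGENT_PROMPT_TEMPLATE.format(
        persona=persona,
        task_description=task_description,
        tool_specs=tool_specs,
    )

prompt = build_prompt(
    persona="a knowledge-graph construction agent",
    task_description="extract (head, relation, tail) triples from the input text",
    tools={"REBEL": "end-to-end relation extraction",
           "KnowGL": "fact generation aligned with Wikidata"},
)
```

A persona and tool list rendered this way can be swapped per task without changing the agent loop itself.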
Our research focuses on extracting triples from unstructured text to facilitate Knowledge
Graph construction. Triple extraction entails identifying and extracting structured information
in the form of triples, which comprise a subject entity, a relation or predicate, and an object
entity. This process is crucial in converting text into a structured format suitable for tasks like
knowledge representation and information retrieval in natural language processing.
Relation extraction is another vital task we explore, involving the identification and extraction
of semantic relationships between entities mentioned in unstructured text. These relationships,
such as “is married to” or “works for”, capture meaningful associations and are represented as
triples for downstream applications like Knowledge Graph construction.
REBEL [2] REBEL, which stands for Relation Extraction By End-to-end Language generation, is an extraction technique for pulling relationship details out of raw text.
REBEL uses an autoregressive sequence-to-sequence (seq2seq) model, a class of models well suited to both generating text and understanding natural language. The core of REBEL is a seq2seq model based on BART [36]. It represents relations between
entities in the input text as a linearized sequence following a specific schema involving entity
mentions, labels, types, and the relation label. REBEL uses BART-large as the base model
which is first pre-trained on a large distantly supervised dataset called REBEL which was
created by extracting over 800K training instances with 220 relation types from Wikipedia
abstracts aligned with Wikidata facts. The pre-trained BART is then fine-tuned on this REBEL
dataset to maximize the likelihood of generating the correct linearized triplet representation
given the input text. REBEL demonstrates several advantages: it frames relation extraction in an end-to-end manner, can extract open-ended relation types, allows quick fine-tuning on
new datasets across domains, and achieves state-of-the-art performance on multiple relation
extraction benchmarks while being simpler than prior complex pipeline approaches.
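As a concrete illustration of the linearized schema, the following sketch parses a REBEL-style output sequence into (head, relation, tail) triples. It assumes the `<triplet>`/`<subj>`/`<obj>` marker format used by the publicly released REBEL model; the code is our own simplified reconstruction, not the authors' implementation.

```python
import re

def parse_rebel_output(seq: str):
    """Parse REBEL's linearized output into (head, relation, tail) triples.

    REBEL emits sequences such as
      "<triplet> Florence <subj> Italy <obj> located in"
    i.e. the head entity, then the tail entity after <subj>, then the
    relation label after <obj>.
    """
    triples = []
    # Each <triplet> marker starts one fact.
    for chunk in seq.split("<triplet>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        m = re.match(r"(.+?)<subj>(.+?)<obj>(.+)", chunk)
        if m:
            head, tail, rel = (part.strip() for part in m.groups())
            triples.append((head, rel, tail))
    return triples
```

Sequences with several facts simply repeat the `<triplet>` block, so the same parser handles multi-triple outputs.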
KnowGL [3] KnowGL is a comprehensive framework designed to convert natural language
text into structured relational data that aligns with the schema of a Knowledge Graph like
Wikidata. This framework comprises three key components: “Knowledge Generation”, “Fact
Ranking”, and “Linking to Wikidata”. The Knowledge Generation aspect focuses on extracting
facts by fine-tuning pre-trained language models to identify entity mentions and generate
sets of facts including entity labels, types, and relationships. Fact Ranking involves parsing
generated sequences to create a ranked list of distinct facts based on scores assigned to each fact.
Lastly, Linking to Wikidata facilitates retrieving Wikidata IDs associated with the generated
semantic annotations. By enabling the conversion of text into Wikidata statements in JSON
format, KnowGL demonstrates the potential of pre-trained language models for generating
structured data from text, offering an alternative to traditional information extraction pipelines.
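The Knowledge Generation output can be illustrated with a small parser. The fact format assumed here, `[(mention#label#type)|relation|(mention#label#type)]` with multiple facts separated by `$`, follows the publicly documented KnowGL output; the code itself is our own sketch rather than part of the KnowGL framework.

```python
def parse_knowgl_output(seq: str):
    """Parse KnowGL-style output into fact dictionaries.

    Assumes facts separated by '$', each shaped as
    [(mention#label#type)|relation|(mention#label#type)].
    """
    facts = []
    for raw in seq.split("$"):
        raw = raw.strip().strip("[]")
        if not raw:
            continue
        subj_part, relation, obj_part = raw.split("|")
        s_mention, s_label, s_type = subj_part.strip("()").split("#")
        o_mention, o_label, o_type = obj_part.strip("()").split("#")
        facts.append({
            "subject": {"mention": s_mention, "label": s_label, "type": s_type},
            "relation": relation,
            "object": {"mention": o_mention, "label": o_label, "type": o_type},
        })
    return facts
```

Each parsed fact keeps the entity labels and types alongside the relation, which is what the Fact Ranking and Wikidata-linking stages consume downstream.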
4. System Architecture
The goal of our framework is to facilitate automatic triple extraction from text inputs. This
framework is designed as a multi-tool system utilizing Large Language Models to execute the
task of triple extraction.
Figure 1 outlines a framework for training a Large Language Model (LLM) using a method
referred to as RLHF, which stands for Reinforcement Learning from Human Feedback. The
flowchart is divided into two main sections:
1. The Large Language Model training procedure using RLHF [37]
2. RLHF LLM based autonomous agents for triple extraction for Knowledge Graph construction
LLM Training Procedure Using RLHF
The training procedure begins with raw text to pretrain the Large Language Model. This
pre-training step is where the model learns from a large corpus of text to understand language
patterns and structures.
After pretraining, the model becomes a “Pretrained LLM” which is then subjected to “Su-
pervised Fine-tuning” using Demonstration data. Demonstration data consists of prompts and
response pairs. This step involves training the model on specific examples to perform certain
tasks and understand particular domains better. This yields an instruction-following chatbot of low quality.
The fine-tuned model, referred to as the “SFT LLM”, is then used in conjunction with “Human Preference Data” to train a “Reward Model”. The human preference data consists of prompts with winner and loser tuples. This reward model evaluates the outputs of the LLM and provides
feedback on its performance.
The feedback from the reward model is used in Reinforcement Learning, where the Large
Language Model is further trained using RLHF (Reinforcement Learning with Human Feedback)
training prompts to improve its outputs based on human preferences and feedback. These RLHF training prompts likewise consist of a prompt together with winner and loser tuples. In the end we get a high-quality instruction-following RLHF LLM [38]. The first half of Figure 1 serves as the visual representation of this procedure.
RLHF LLM Based Autonomous Agents for Triple Extraction for KG Construction
The bottom section of the flowchart shows the application of the trained RLHF LLM for a
specific task: Autonomous agents for triple extraction to construct Knowledge Graph.
A user inputs a complex query, which is then decomposed by the query-decomposition LLM. The decomposed query is processed by the “RLHF LLM”, which interacts with a Tool DB (tool database); here the tools in the Tool DB are “REBEL” and “KnowGL”. The RLHF LLM then re-ranks the triples extracted from the complex query, selecting the most relevant and accurate ones based on how often each triple occurs in the Knowledge Graph. The re-ranked output is then used in the construction of a Knowledge Graph. The system prompt for the RLHF LLM is provided in Appendix A.
Together, this forms a comprehensive framework for training a Large Language Model using human feedback and then applying the model to extract triples for building a Knowledge Graph.
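The occurrence-count re-ranking described above can be sketched as follows. This is a minimal illustration of the idea under our own assumptions, not the framework's actual implementation.

```python
from collections import Counter

def rerank_triples(candidate_triples, kg_triples):
    """Rank candidate triples by how often each already occurs in the KG.

    candidate_triples: (head, relation, tail) tuples, possibly duplicated
    across tools; kg_triples: triples already present in the graph.
    """
    kg_counts = Counter(kg_triples)
    # De-duplicate while preserving first-seen order.
    unique = list(dict.fromkeys(candidate_triples))
    # Sort by KG occurrence count, highest first; Python's sort is stable,
    # so ties keep their original order.
    return sorted(unique, key=lambda t: kg_counts[t], reverse=True)
```

Triples already well attested in the graph float to the top, while unseen candidates are kept at the end rather than discarded outright.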
We implement the “Auto-KG Agent” framework to facilitate automatic triple extraction from
text inputs. This framework is designed as a multi-tool system utilizing Large Language Models to extract triples. We utilise REBEL and KnowGL as tools for triple extraction (relation along with entities); more such frameworks can be added as tools to the Tool DB. The Large Language Model is asked to return the extracted triples in JSON format. The diagram presented in the second half of Figure 1 serves as the visual representation for this section.
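A minimal sketch of such an extensible Tool DB is shown below. It is illustrative only: the real tools would wrap calls to the REBEL and KnowGL models, whereas here any callable returning triples can be registered.

```python
class ToolDB:
    """Minimal sketch of the Tool DB: named extractors that each return
    (head, relation, tail) triples for a sentence. New frameworks are
    added simply by registering another callable."""

    def __init__(self):
        self._tools = {}

    def register(self, name, fn):
        self._tools[name] = fn

    def extract(self, text, tool_names):
        # Merge triples from every requested tool, dropping duplicates
        # while preserving first-seen order.
        merged = []
        for name in tool_names:
            for triple in self._tools[name](text):
                if triple not in merged:
                    merged.append(triple)
        return merged
```

Running both tools and merging their outputs is exactly the multi-tool setting evaluated later; running a single registered tool gives the single-tool setting.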
The current system comprises the second half of Figure 1, without the query-decomposition Large Language Model and without training the Large Language Model using RLHF. At present we incorporate only the Tool DB, an LLM without RLHF (guided by a system prompt alone), and re-ranking of triples based on the length of the extracted relation. Section 7 discusses incorporating the remaining modules into the “Auto-KG Agent” framework.

Figure 1: System architecture for utilising Large Language Models as autonomous agents enabling tools for Knowledge Graph construction from unstructured text
5. Preliminary Experimental Setup
Dataset We evaluate our system’s performance on the CONLL04 dataset [39], which comprises
sentences extracted from news articles. Each sentence is annotated with four entity types
(person, organization, location, and other) and five relation types (kill, work for, organization
based in, live in, and located in). Our evaluation focuses on the test split consisting of 288
instances [40], which serves as the ground truth, and compares the performance of our model against the REBEL model. The dataset statistics are described in Table 2.
| Dataset | Entity Types | Relation Types | Train | Validation | Test |
|---|---|---|---|---|---|
| CONLL04 | 4 | 5 | 1,290 (922) | 343 (231) | 422 (288) |

Table 2: Dataset statistics for CONLL04 (number of instances in brackets, number of triples outside the brackets)
Evaluation Metrics The evaluation process compares the predicted triples extracted from test
data with the ground truth triples. Each instance in both datasets is represented as a dictionary,
with a unique identifier and a set of triples. Each sentence has an object corresponding to it
which stores the triple.
To calculate the true positives (correctly predicted triples), we iterate through each instance
in the ground truth data. For each instance, we check if the corresponding instance exists in
the predicted data. If it does, we find the intersection of the triples in the ground truth and
predicted data, which gives us the number of correct predictions (true positives).
Additionally, we count the extra predictions made by the model, that is, triples not present in the ground truth. However, we do not compute scores for them, as that would require human evaluation.
After counting the true positives, we compute both micro and macro scores for
precision, recall, and F1 score. Micro scores consider the total number of triples in the entire
dataset for calculating precision, recall, and F1 score, while macro scores average these metrics
across each instance in the dataset.
Overall, this evaluation process enables us to assess the performance of the triple extraction
model by quantifying its precision, recall, and F1 score, considering both individual instances
and the entire dataset.
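The micro/macro computation described above can be sketched as follows. This is our own illustration of standard micro- versus macro-averaging over exact-match triples, not the evaluation script from the repository.

```python
def micro_macro_scores(ground_truth, predictions):
    """Compute micro and macro precision/recall/F1 for triple extraction.

    Both arguments map an instance id to a set of (head, relation, tail)
    triples. Micro scores pool true-positive/prediction/gold counts over
    the whole dataset; macro scores average per-instance metrics.
    """
    def prf(tp, n_pred, n_gold):
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_gold if n_gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return (p, r, f1)

    total_tp = total_pred = total_gold = 0
    per_instance = []
    for inst_id, gold in ground_truth.items():
        pred = predictions.get(inst_id, set())
        tp = len(gold & pred)  # strict match on (head, relation, tail)
        total_tp += tp
        total_pred += len(pred)
        total_gold += len(gold)
        per_instance.append(prf(tp, len(pred), len(gold)))

    micro = prf(total_tp, total_pred, total_gold)
    n = len(per_instance)
    macro = tuple(sum(v) / n for v in zip(*per_instance)) if n else (0.0, 0.0, 0.0)
    return {"micro": micro, "macro": macro}
```

Instances missing from the predictions contribute zero true positives but still count toward recall, which is what makes the strict evaluation strict.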
We performed a strict evaluation in which each extracted triple is counted as correct only if its head entity, tail entity and relation all match the ground truth. The counts of unique relations extracted by the different frameworks are as follows:
• REBEL - 68 relations mapped to 5 relations
• KnowGL - 58 relations mapped to 5 relations
• REBEL + KnowGL - 90 relations mapped to 5 relations
Triple Extraction Tools REBEL and KnowGL frameworks are used as triple extraction tools
in our experiment. In our evaluation of the REBEL model on the CONLL04 test dataset, we
encountered a diverse set of 68 unique relations extracted by the model. To align these with
the CONLL04 dataset’s five predefined relations, we undertook a manual mapping process.
This process was guided by semantic similarity and contextual relevance, ensuring that each
extracted relation was correctly associated with one of the canonical relations such as ‘killed
by’, ‘residence’, ‘location’, ‘headquarters location’, and ‘employer’, as originally formatted in the
REBEL paper. The necessity for this manual mapping arose because the REBEL model, trained on multiple datasets, identified relations beyond the scope of the CONLL04 dataset, requiring careful consideration to maintain semantic integrity.
Similarly, KnowGL yielded 54 unique extracted relations, and we followed the same mapping procedure for them. Figure 2 shows the distribution of occurrence counts for the five relation types under the different experiment design settings.
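The mapping step can be sketched as follows. The dictionary entries are hypothetical examples of individual mappings (chosen for illustration), not the full 68-relation table used in the experiments.

```python
# Hypothetical excerpt of the manual mapping from model-specific relation
# labels to the five canonical relation labels used for CONLL04.
RELATION_MAP = {
    "residence": "residence",
    "country of citizenship": "residence",       # illustrative choice
    "located in the administrative territorial entity": "location",
    "country": "location",                       # illustrative choice
    "headquarters location": "headquarters location",
    "employer": "employer",
    "killed by": "killed by",
}

def normalise_triples(triples, relation_map):
    """Keep only triples whose relation maps to a canonical label."""
    out = []
    for head, rel, tail in triples:
        canonical = relation_map.get(rel)
        if canonical is not None:
            out.append((head, canonical, tail))
    return out
```

Relations outside the map are dropped before scoring, mirroring how out-of-scope extractions are handled in the strict evaluation.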
For details, readers can refer to Appendix A. The code for the experiments is available on GitHub: https://github.com/Ananyaiitbhilai/Text2Triple-LLM-Agent
Figure 2: Count of Types of Relations for different Files
RLHF LLM In our experimental setup, we use Large Language Models without any fine-tuning
or instruction tuning. However, in future work we plan to use RLHF LLMs to orchestrate tool execution for triple extraction. Specifically, we used two open-source Large Language Models for benchmarking: “Gemma” and “Mistral”.
Gemma [41] is a family of lightweight LLMs built from the same research and technology
Google used to create the Gemini models. Gemma models are available in two sizes, 2 billion
and 7 billion parameters. These models are trained on up to 6T tokens of primarily English web
documents, mathematics, and code, using a transformer architecture with enhancements like
Multi-Query Attention, RoPE Embeddings, GeGLU Activations, and advanced normalization
techniques. We use the 2B variant.
Mistral-7B-Instruct-v0.2 [42] Large Language Model (LLM) is an improved instruct fine-
tuned version of Mistral-7B-Instruct-v0.1. The Mistral-7B-v0.1 Large Language Model is a
pretrained generative text model with 7 billion parameters. Mistral-7B-v0.1 outperforms Llama
2 13B.
For additional experimental set-up, refer to Appendix 3.
Table 3 summarises the number of relations and instances for the different experiment design settings.
In Table 4, REBEL refers to all triples extracted by REBEL. REBEL (subset Mistral) refers to removing all triples that could not be extracted because of hallucination in the Mistral LLM (triples returned as strings or in other formats rather than the expected JSON); those triples are removed from the ground truth before evaluation is carried out. The same applies to REBEL (subset Gemma). Here, tools like KnowGL and REBEL are used as a single tool in conjunction with an LLM (Mistral or Gemma), and REBEL together with KnowGL as a multi-tool in conjunction with an LLM (Mistral or Gemma).
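The subset construction above hinges on detecting responses that are not in the expected JSON format. A minimal sketch of such a check, assuming the expected format is a JSON list of head/relation/tail objects (our own illustration):

```python
import json

def parse_llm_response(raw: str):
    """Try to read an LLM response as a JSON list of triples.

    Returns (triples, hallucinated): a response that is not valid JSON in
    the expected shape is flagged as a hallucination and excluded from
    scoring, mirroring the subset construction described in the text.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return [], True  # plain-string or otherwise malformed response
    if not isinstance(data, list):
        return [], True
    triples = []
    for item in data:
        if isinstance(item, dict) and {"head", "relation", "tail"} <= item.keys():
            triples.append((item["head"], item["relation"], item["tail"]))
        else:
            return [], True  # wrong element shape also counts as hallucination
    return triples, False
```

Counting the flagged responses per model is what produces the hallucination totals reported in Table 5.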
| Description | Number of Instances | Number of Relations |
|---|---|---|
| Full ground truth dataset with 5 relation types | 288 | 5 |
| Subset of ground truth for Gemma | 187 | 5 |
| Subset of ground truth for Mistral | 262 | 5 |
| Predictions extracted from REBEL | 288 | 68 |
| Predictions from Mistral | 268 | - |
| Predictions from Gemma | 187 | - |
| For comparison with Mistral | 262 | 67 |
| For comparison with Gemma | 187 | 56 |
| Predictions from Mistral without hallucination | 262 | 66 |
| Predictions from Gemma without hallucination | 187 | 56 |
| For comparison of REBEL with same subset as Mistral, clustered | 262 | 5 |
| For comparison of REBEL with same subset as Gemma, clustered | 187 | 5 |
| Mistral predictions, clustered | 262 | 5 |
| Gemma predictions, clustered | 187 | 5 |
| REBEL predictions, clustered | 288 | 5 |

Table 3: A summary of statistics for different experiment design settings
| Model | Micro P | Micro R | Micro F1 | Macro P | Macro R | Macro F1 | Extras |
|---|---|---|---|---|---|---|---|
| REBEL | 0.16 | 0.16 | 0.16 | 0.18 | 0.18 | 0.18 | 356 |
| REBEL (subset Mistral) | 0.15 | 0.15 | 0.15 | 0.17 | 0.17 | 0.17 | 330 |
| REBEL (subset Gemma) | 0.18 | 0.19 | 0.18 | 0.21 | 0.21 | 0.21 | 235 |
| MISTRAL (single-tool REBEL) | 0.15 | 0.15 | 0.15 | 0.17 | 0.17 | 0.17 | 323 |
| GEMMA (single-tool REBEL) | 0.15 | 0.16 | 0.16 | 0.18 | 0.18 | 0.18 | 241 |
| KNOWGL | 0.05 | 0.05 | 0.05 | 0.07 | 0.06 | 0.07 | 372 |
| Mistral (KNOWGL + REBEL multi-tool) | 0.05 | 0.07 | 0.06 | 0.08 | 0.09 | 0.08 | 533 |

Table 4: Precision, recall and F1 scores (rounded to 2 decimal places)
From Table 5 it can be observed that Gemma produced a significant number of hallucinations, accounting for 99 out of 288 total responses. This higher incidence of hallucination in Gemma was primarily attributed to incorrect JSON format returned by the model. Conversely, Mistral exhibited a lower occurrence of hallucination, with only 26 out of 288 total responses affected. Single-tool here refers to REBEL alone, and multi-tool to REBEL together with KnowGL. “Hallucination” refers to the phenomenon of the model generating text that is incorrect, nonsensical, or not grounded in reality.
| Model | Total Responses | Number of Hallucinations |
|---|---|---|
| Gemma | 288 | 99 |
| Mistral (single-tool) | 288 | 26 |
| Mistral (multi-tool) | 288 | 42 |

Table 5: Comparison of hallucination occurrences (specifically, not giving the response in the expected JSON format)
6. Results
We investigated the performance of multiple tools versus a single tool for relation extraction and observed a notable decline in scores with multi-tool usage, suggesting that single-tool approaches may yield better results, as shown in Table 4. We attribute this drop to increased hallucination, which is more prevalent when multiple tools are employed. However, single-tool usage also presented challenges, as occasionally the returned format did not align with the one specified in the system prompt. Moreover, it can also be observed from Table 4 that REBEL with a Large Language Model and REBEL alone have almost the same performance. This is because REBEL and KnowGL are used only as tools to trigger the action of extracting relations, and since KnowGL and REBEL share the same architecture, they exhibit similar biases.
Our findings underscore the need for integrating Large Language Models with extraction
tools to harness their full potential. While the tools exhibit shortcomings in certain contexts,
Large Language Models offer a complementary approach, particularly in mitigating biases and
enhancing extraction accuracy. By empowering Large Language Models to engage in more
nuanced planning or decomposing the query, we anticipate significant improvements in relation
extraction performance.
Gemma’s higher scores stem from a larger number of its responses being generated as strings rather than in the expected JSON format, as shown in Table 5. Consequently, these non-formatted responses are removed, leaving fewer triples available for evaluation compared to Mistral. This contributes to Gemma’s higher scores on the triple extraction task.
To answer the aforementioned research questions, we took examples containing multiple events, complex clauses, negations and modalities, and manually evaluated the correctness of the triples extracted by the pre-existing tools and by our Auto-KG agent.
Event identification and mitigating biases in Triple Extraction Our investigation uncov-
ered instances of flawed relation extraction within the REBEL and KnowGL tools. In a sample
sentence, “While Marie Curie and Albert Einstein conducted groundbreaking experiments
in their laboratories at the University of Paris, Leonardo da Vinci’s sketches of Renaissance
architecture in the bustling streets of Florence sparked inspiration across Italy,” REBEL solely
identified the triple “Florence, located in, Italy”. However, this sentence encompasses two
distinct events: “The experimentation conducted by Marie Curie and Albert Einstein at the
University of Paris”, and “the inspiration sparked by Leonardo da Vinci’s sketches across Italy”.
REBEL’s oversight stemmed from its inclination towards location-centric relations, influenced
by biases within the training data.
In contrast, our Auto-KG agent showed promise in mitigating such limitations. It accurately extracted triples from the sentence, capturing nuanced relations such as “Marie Curie,
experimented at, University of Paris”, “Albert Einstein, experimented at, University of Paris”,
“Leonardo da Vinci’s sketches, located in, Florence”, and “Leonardo da Vinci’s sketches, sparked
inspiration in, Italy”. This underscores the Auto-KG agent’s proficiency in comprehending
complex linguistic structures and discerning meaningful relations, showcasing their potential
for enhancing relation extraction accuracy.
Negation Handling Discrepancies in Triple Extraction In our comparative analysis,
we also observed a notable discrepancy in the handling of negation between REBEL and
KnowGL tools, as opposed to our Auto-KG agent in triple extraction tasks. Both REBEL and
KnowGL demonstrated limitations in effectively managing negation cues within text, resulting
in erroneous extraction of triples. Conversely, our tool exhibited robust performance in negation
handling, yielding more accurate triple extractions even in the presence of negation cues.
Consider the sentence “Fado does not work at IIT”. REBEL and KnowGL erroneously extracted
a triple indicating that Fado works at IIT, failing to account for the negation. In contrast, our
tool discerned the negation cue “not” and adjusted the extracted triple to reflect the absence of
the stated relationship, thereby accurately capturing the intended semantics of the sentence.
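As an illustration, a minimal post-processing check of the kind an agent can apply is sketched below; the cue list and the token-window scope are simplifying assumptions, and robust negation resolution would require syntactic analysis.

```python
# Naive negation filter: drop a triple when a negation cue appears between
# the head and tail mentions in the source sentence. This is a simplified
# sketch; real negation scope resolution needs a parser.
NEGATION_CUES = {"not", "never", "no", "n't"}

def negated(sentence: str, head: str, tail: str) -> bool:
    tokens = sentence.lower().split()
    try:
        start = tokens.index(head.lower().split()[-1])
        end = tokens.index(tail.lower().split()[0], start)
    except ValueError:
        return False  # mention not found; make no claim about negation
    return any(t.strip(".,") in NEGATION_CUES for t in tokens[start:end])

sentence = "Fado does not work at IIT"
triple = ("Fado", "works at", "IIT")
keep = not negated(sentence, triple[0], triple[2])  # keep is False here
```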
This discrepancy underscores the nuanced understanding of language exhibited by Large
Language Models, enabling them to effectively navigate linguistic complexities such as negation
cues in triple extraction tasks. It highlights the potential of leveraging LLM-based approaches to
enhance the accuracy and reliability of triple extraction processes in natural language processing
applications.
Generalising well on various datasets We observed a significant disparity in the performance of
seq2seq-based approaches such as REBEL and KnowGL when trained or fine-tuned on specific datasets.
While these models exhibit impressive performance within the confines of their training data,
they demonstrate limited generalization capabilities beyond the dataset they were trained on.
Conversely, Large Language Models showcase remarkable generalization prowess even without
explicit training on a particular dataset. This discrepancy underscores the inherent adaptability
and robustness of LLMs, enabling them to effectively handle diverse datasets and tasks without
the need for extensive training or fine-tuning.
7. Future Directions
In future work, our focus lies on emulating the Large Language Model training methodology
using Reinforcement Learning from Human Feedback (RLHF), as detailed in Section 4.
Additionally, we intend to incorporate a query-decomposition LLM to partition complex user
queries into sub-queries, facilitating more precise event identification and subsequent triple
extraction.
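A minimal sketch of such a decomposition step follows; the prompt wording, the `llm` callable, and the stubbed model output are purely illustrative assumptions.

```python
# Sketch of the planned query-decomposition step. `llm` stands in for any
# chat-completion callable; here it is stubbed to illustrate the data flow.
DECOMPOSE_PROMPT = (
    "Split the following sentence into independent clauses, one per line, "
    "each describing a single event:\n{sentence}"
)

def decompose(sentence: str, llm) -> list[str]:
    response = llm(DECOMPOSE_PROMPT.format(sentence=sentence))
    return [line.strip() for line in response.splitlines() if line.strip()]

# Stubbed model output, for illustration only.
fake_llm = lambda prompt: (
    "Marie Curie and Albert Einstein experimented at the University of Paris\n"
    "Leonardo da Vinci's sketches sparked inspiration across Italy\n"
)
sub_queries = decompose("While Marie Curie and Albert Einstein ...", fake_llm)
```

Each sub-query would then be passed to the extraction tools independently, so that no event is shadowed by a more salient one in the same sentence.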
Furthermore, our proposed future work entails synergizing LLMs with multiple extraction
tools to enhance their generalization capabilities across diverse datasets without requiring
explicit training. This approach holds potential to surpass the performance of seq2seq models
such as REBEL and KnowGL.
Moreover, we aim to integrate a re-ranking module into our framework. This module will
prioritize all extracted triples based on their confidence levels, ensuring a more refined and
accurate output.
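The planned re-ranking module could take a shape like the following sketch; the candidate triples, confidence values, and threshold are hypothetical.

```python
# Sketch of the planned re-ranking module: merge candidate triples from
# several extractors and order them by confidence before emitting the
# final set. A threshold discards low-confidence candidates.
triples = [
    {"triple": ("Florence", "located in", "Italy"),
     "source": "REBEL", "confidence": 0.91},
    {"triple": ("Marie Curie", "experimented at", "University of Paris"),
     "source": "agent", "confidence": 0.78},
    {"triple": ("Fado", "works at", "IIT"),
     "source": "KnowGL", "confidence": 0.35},
]

def rerank(candidates, threshold=0.5):
    ranked = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
    return [c for c in ranked if c["confidence"] >= threshold]

final = rerank(triples)  # highest-confidence triples first, weakest dropped
```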
We also aim to develop a diverse dataset that encompasses a wide range of relations and
includes a variety of sentence structures. This dataset is intended to serve as a robust benchmark
for evaluating performance in triple extraction tasks.
8. Limitations and Conclusion
The paper presents a novel framework that integrates Large Language Models (LLMs) with
existing tools like REBEL and KnowGL for the task of triple extraction from unstructured text
to construct knowledge graphs. The proposed framework aims to leverage the strengths of
LLMs in understanding complex linguistic structures, handling modalities and negations, and
mitigating biases inherent in training data. The experimental results on the CONLL04 dataset
indicate that while multi-tool approaches face challenges such as hallucination, the integration
of LLMs shows promising results in enhancing extraction accuracy.
Our research work has certain limitations:
1. Limited LLM Models Evaluated: The experiments were confined to using the Gemma
(2B parameters) and Mistral-7B LLMs. The performance of other large language models like
LaMDA or models with higher parameter counts (e.g., GPT-4) remains unexplored. Future work
could extend these experiments to a broader range of LLM architectures and sizes to provide a
more comprehensive evaluation.
2. Limited Task Coverage: The current study focused on a specific task: triple extraction
for knowledge graph construction. However, knowledge graph construction and reasoning
encompass a wide range of tasks, and the performance of LLMs on other tasks, such as entity
linking, relation classification, or multi-hop reasoning, remains unexplored. Future research
could extend the evaluation to a broader set of tasks to provide a more comprehensive under-
standing of Large Language Model capabilities in the context of knowledge graph construction
and reasoning.
3. Limited Evaluation Dataset: The paper evaluates the proposed framework on the
CONLL04 dataset, which comprises sentences extracted from news articles with a limited set of
entity types and relation types. This dataset may not fully represent the diversity and complexity
of real-world text, potentially limiting the generalizability of the findings to other domains and
contexts. In future, the evaluation can be extended to other datasets.
4. Reliance on Manual Mapping: The paper mentions that manual mapping was required
to align the relations extracted by REBEL and KnowGL with the canonical relations in the
CONLL04 dataset. This manual intervention introduces potential biases and inconsistencies, as
the mapping process may not be entirely objective or scalable across larger datasets or domains.
The authors acknowledge these limitations and express their anticipation for future research
opportunities that would allow them to further explore these areas and provide a more com-
prehensive evaluation of LLM capabilities in the context of knowledge graph construction and
reasoning. The research sets a new direction for future work in building advanced, reliable
systems for knowledge extraction and reasoning. It highlights the potential of LLM-based agents
for knowledge graph construction and proposes a comprehensive framework for harnessing
their capabilities.
A. Appendix: Additional Details
Additional details about the system and parameters for the preliminary experimental set-up:
• The context size (n_ctx) is the maximum number of tokens that the model can account
for when processing a response; this includes both the prompt and the response itself. In
our case, the context size was set to 2048.
• The maximum number of tokens to generate was 2000 in our case. If max_tokens ≤ 0 or
None, the number of tokens to generate is unlimited and bounded only by n_ctx.
• The average inference time per context/sentence on the CONLL04 test dataset for extracting
triples in conjunction with LLMs was 25 seconds.
• The temperature was set to 0.
• The gguf files for Mistral and Gemma were run locally on a Mac M1.
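These settings can be collected in one configuration sketch; the helper below mirrors the max_tokens rule stated above and is illustrative rather than the exact library behaviour.

```python
# Decoding settings used in the preliminary experiments, as a sketch.
SETTINGS = {
    "n_ctx": 2048,       # max tokens the model attends to (prompt + response)
    "max_tokens": 2000,  # generation cap per response
    "temperature": 0,    # greedy decoding for reproducible extractions
}

def effective_max_tokens(max_tokens, n_ctx, prompt_tokens):
    """If max_tokens <= 0 or None, generation is bounded only by the
    remaining context window; otherwise by the smaller of the two."""
    budget = n_ctx - prompt_tokens
    if max_tokens is None or max_tokens <= 0:
        return budget
    return min(max_tokens, budget)
```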
The system prompt is shown in Figure 3.
Relation Mappings (canonical key → extracted relations mapped onto it)

employer → derivative work, inception, instance of, owned by, owner of, part of, participant, participant in, performer, twinned administrative body, occupation, field of this occupation, member of political party, work location, language used, participant in, participant, owner of, owned by, member of, notable work, instance of, interested in, office held by head of government, chief executive officer, educated at, subclass of, part of, office held by head of state, chairperson, executive body, industry, officeholder, position held, practiced by, language of work or name, director / manager, employer, field of work, language of work or name, notable work, occupation, member of, member of political party, officeholder, operator, position held, educated at, founded by, product or material produced, subsidiary, work location, author, office held by head of government, used by, uses, candidacy in election, candidate, chairperson, head of government

headquarters location → headquarters location, twinned administrative body, applies to jurisdiction, legislative body, military branch, contains administrative territorial entity, parent organization, operating area, legislative body, contains administrative territorial entity, headquarters location, located in the administrative territorial entity, ethnic group, language used, military branch, parent organization, applies to jurisdiction

killed by → cause of death, perpetrator, convicted of, killed by, place of death, facet of, date of death, main subject, place of death, facet of, significant event

location → location, capital, continent, located in time zone, shares border with, mountain range, located in or next to body of water, candidate, significant place, spouse, place of publication, target, country, located in or next to body of water, location, mouth of the watercourse, point in time, capital, capital of, shares border with, tributary, diplomatic relation, place of publication, spouse

residence → place of birth, based on, country of citizenship, date of birth, has part, number of participants, history of topic, place of birth, country of origin, has quality, significant event, occupant, relative, residence
Figure 3: System Prompt