<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards Harnessing Large Language Models as Autonomous Agents for Semantic Triple Extraction from Unstructured Text</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ananya</forename><surname>Ananya</surname></persName>
							<email>ananyah@iitbhilai.ac.in</email>
							<affiliation key="aff0">
								<orgName type="institution">Indian Institute of Technology</orgName>
								<address>
									<settlement>Bhilai</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sanju</forename><surname>Tiwari</surname></persName>
							<email>tiwarisanju18@ieee.org</email>
							<affiliation key="aff1">
								<orgName type="department">BVICAM</orgName>
								<orgName type="institution">&amp; UAT</orgName>
								<address>
									<settlement>New Delhi</settlement>
									<country>India, Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nandana</forename><surname>Mihindukulasooriya</surname></persName>
							<email>nandana@ibm.com</email>
							<affiliation key="aff2">
								<orgName type="institution">IBM Research</orgName>
								<address>
									<settlement>New York</settlement>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tommaso</forename><surname>Soru</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">Serendipity AI Ltd</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ziwei</forename><surname>Xu</surname></persName>
							<affiliation key="aff4">
								<orgName type="institution">National Institute of Advanced Industrial Science and Technology</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Diego</forename><surname>Moussallem</surname></persName>
							<email>diegomoussallem@gmail.com</email>
							<affiliation key="aff5">
								<address>
									<settlement>Jusbrasil, Salvador</settlement>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards Harnessing Large Language Models as Autonomous Agents for Semantic Triple Extraction from Unstructured Text</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0D4081F4D9E2EEA7F1B9C29C2C1ABAAB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Triple extraction</term>
					<term>Knowledge Graph</term>
					<term>Knowledge Graph Construction</term>
					<term>LLM Agents</term>
					<term>Reasoning</term>
					<term>Handling modalities and negations</term>
					<term>Mitigating biases</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The use of Large Language Models as autonomous agents interacting with tools has been shown to improve performance on several tasks, from code generation to API calling and sequencing. This paper proposes a framework for using Large Language Models as autonomous agents for the task of Knowledge Graph construction from unstructured text. Specifically, it focuses on triple extraction, which involves identifying entities and their relationships from text to construct a Knowledge Graph. Our novel framework "Auto-KG agent" incorporates two relation extraction tools, REBEL and KnowGL, in conjunction with Large Language Models. Experimental results on the CONLL04 dataset demonstrate that while multi-tool approaches face challenges like hallucination, LLM-based agents show promise in mitigating biases, identifying major events, and handling negations and modalities, thus enhancing extraction accuracy, particularly for complex linguistic structures. The impetus for this research is to overcome the current limitations of existing systems for Knowledge Graph construction and propose a roadmap for developing a robust framework capable of handling the intricacies of natural language with minimal human interference. The paper also discusses future directions, such as emulating Large Language Model training using reinforcement learning with human feedback, incorporating query decomposition, and integrating a re-ranking module. Through this research, the authors aim to set a new direction for future endeavours in building advanced, reliable systems for knowledge extraction. Overall, this work highlights the potential of LLM-based agents for knowledge graph construction and proposes a framework for harnessing their capabilities.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The advent of Knowledge Graphs has revolutionized the way we represent and utilize information in the digital age. By structuring data as triples, each consisting of a head entity, a relationship, and a tail entity (h, r, t), Knowledge Graphs provide a semantic framework to describe the varied and countless entities and their interrelations in the objective world. This structured approach to data organization underpins intelligent applications and has garnered significant attention in both academic and industrial spheres due to its potent semantic processing capabilities and open organizational structure <ref type="bibr" target="#b0">[1]</ref>.</p><p>In the field of Natural Language Processing, extracting relational facts from text is crucial. Understanding the semantic relationships between entities in unstructured text helps convert raw data into structured formats. This structured data is extremely valuable for several tasks, such as building and enhancing Knowledge Bases, which are essential for powering knowledge-driven applications <ref type="bibr" target="#b1">[2]</ref>.</p><p>In the realm of information extraction, frameworks like REBEL <ref type="bibr" target="#b1">[2]</ref> and KnowGL <ref type="bibr" target="#b2">[3]</ref> have emerged as powerful tools for converting unstructured text into structured relational data. These frameworks leverage advances in machine learning and natural language processing to perform tasks that traditionally required separate models for Named Entity Recognition (NER) and Relation Classification (RC). REBEL, which stands for Relation Extraction By End-to-end Language Generation, utilizes an autoregressive sequence-to-sequence (seq2seq) model, specifically a BART-large model, to extract relationships between entities in a text. 
The architecture of REBEL is designed to represent relations as a linearized sequence that includes entity mentions, labels, types, and the relation label. Similarly, KnowGL is a comprehensive framework that aims to transform natural language text into structured data that aligns with the schema of a Knowledge Graph like Wikidata. KnowGL consists of three main components: "Knowledge Generation", "Fact Ranking", and "Linking to Wikidata". We focus on the Knowledge Generation component, which uses fine-tuned pre-trained language models to identify entity mentions and generate facts, including entity labels, types, and relationships.</p><p>Despite their innovative approaches, REBEL and KnowGL have certain limitations. The performance of these models is heavily influenced by the quality of their pre-training data. Biases or inaccuracies present in the training datasets can propagate through the models, affecting the accuracy of the extracted relations. Furthermore, the ability of these frameworks to generalize across various domains and text types hinges on the extent to which the pre-trained language models are fine-tuned or further pre-trained on domain-specific datasets. While Large Language Models inherently possess a broad understanding of language, their performance in specialized contexts improves significantly with targeted fine-tuning. Additionally, these systems may struggle with complex sentence structures and fail to identify all relevant major events, particularly in sentences laden with modalities or multiple clauses.</p><p>To address these challenges, we introduce a novel framework that synergizes the capabilities of REBEL and KnowGL with the nuanced understanding of Large Language Models. Large Language Models demonstrate remarkable efficacy in processing sentences with modalities and complex structures, leading to accurate event identification and triple extraction. 
This integration not only enhances event detection but also aids in mitigating biases inherent in training data, ensuring a more comprehensive extraction of triples. Apart from introducing a novel framework, we also aim to answer research questions concerning major event identification, the handling of negations and modalities, and the mitigation of biases in triple extraction. The impetus for our research is to overcome the current limitations of existing systems and chart a course for the development of a robust framework capable of handling the intricacies of natural language. Our experiments are designed with this objective in mind and are conducted with the resources available to us. Through this research, we aim to set a new direction for future endeavors in building advanced, reliable systems for knowledge extraction and reasoning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Using Large Language Models as autonomous agents has become increasingly popular in recent research. Large Language Models possess advanced reasoning abilities and skills in utilizing tools, making them well-suited for autonomous operations. They excel in tasks like acquiring knowledge, understanding instructions, generalizing information, planning, and reasoning, showcasing their potential for autonomous tasks <ref type="bibr" target="#b3">[4]</ref>. However, Large Language Models do have limitations, such as performing arithmetic operations and staying updated with the latest information, which cannot be fully addressed through simple fine-tuning alone. This highlights the need for designing autonomous agent frameworks that complement LLMs by integrating external data and supplementary tools <ref type="bibr" target="#b4">[5]</ref>.</p><p>This section covers existing studies on LLM-based agents for Knowledge Graph generation from text. Jiang et al. <ref type="bibr" target="#b5">[6]</ref> introduced KG-Agent, an autonomous framework based on Large Language Models, designed to enable a small Large Language Model to independently make decisions throughout the reasoning process over Knowledge Graphs until completion. Within KG-Agent, a Large Language Model is combined with a versatile toolbox, a knowledge memory system and a KG-based executor. Jiang et al. <ref type="bibr" target="#b6">[7]</ref> also introduced the StructGPT tool, an Iterative Reading-then-Reasoning (IRR) framework aimed at addressing question-answering tasks over structured data. In this framework, specialized interfaces are designed to acquire relevant information from structured data, allowing Large Language Models to focus on the reasoning tasks based on the acquired information. 
This research also introduced an invoking-linearization-generation procedure to support Large Language Models in reasoning over structured data with the help of the provided interfaces. Zhu et al. <ref type="bibr" target="#b7">[8]</ref> explore Large Language Models for Knowledge Graph construction with reasoning and introduce an innovative approach called AutoKG, which utilizes multiple agents to efficiently handle both Knowledge Graph construction and reasoning tasks. Several such LLM-based agents are shown in Table <ref type="table">1</ref>.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 1: A comparison of work which augments LLMs with tool usage</head><table><row><cell>Methodology</cell><cell>Description</cell><cell>Tasks</cell><cell>Dataset</cell></row><row><cell>Tool Learning <ref type="bibr" target="#b8">[9]</ref></cell><cell>In-context demonstration and generation regulation</cell><cell>Tool manipulation, multi-tool usage</cell><cell>ToolBench <ref type="bibr" target="#b8">[9]</ref></cell></row><row><cell>Instruction tuning <ref type="bibr" target="#b9">[10]</ref></cell><cell>Learning on high-quality instruction datasets</cell><cell>Tool manipulation, multi-tool usage</cell><cell>ToolBench <ref type="bibr" target="#b8">[9]</ref></cell></row><row><cell>Instruction Tuning with Human Curriculum <ref type="bibr" target="#b10">[11]</ref></cell><cell>Instruction data mimicking human learning progression</cell><cell>Reasoning and knowledge-based tasks</cell><cell>CORGI <ref type="bibr" target="#b10">[11]</ref></cell></row><row><cell>ReAct <ref type="bibr" target="#b11">[12]</ref></cell><cell>Prompting LLMs for decision making</cell><cell>Reasoning algorithm</cell><cell>HotPotQA <ref type="bibr" target="#b12">[13]</ref>, FEVER [14]</cell></row><row><cell>DFSDT <ref type="bibr" target="#b9">[10]</ref></cell><cell>Prompting LLMs for decision making</cell><cell>Reasoning algorithm</cell><cell>ToolBench <ref type="bibr" target="#b8">[9]</ref></cell></row><row><cell>CoT <ref type="bibr" target="#b14">[15]</ref></cell><cell>Prompting LLMs for decision making</cell><cell>Reasoning algorithm</cell><cell>GSM8K <ref type="bibr" target="#b15">[16]</ref>, SVAMP <ref type="bibr" target="#b16">[17]</ref>, ASDiv <ref type="bibr" target="#b17">[18]</ref>, AQuA, MAWPS <ref type="bibr" target="#b18">[19]</ref></cell></row><row><cell>AutoGPT <ref type="bibr" target="#b19">[20]</ref></cell><cell>Online decision making</cell><cell>Any decision-making task</cell><cell>ALFRED <ref type="bibr" target="#b20">[21]</ref>, DAgger <ref type="bibr" target="#b21">[22]</ref></cell></row><row><cell>WebGPT <ref type="bibr" target="#b22">[23]</ref></cell><cell>Text-based web browsing environment</cell><cell>Long-form QA</cell><cell>ELI5 [24]</cell></row><row><cell>MMREACT <ref type="bibr" target="#b24">[25]</ref></cell><cell>A system integrating ChatGPT with a pool of vision experts</cell><cell>Multimodal reasoning and action</cell><cell>Self</cell></row><row><cell>ProgPrompt <ref type="bibr" target="#b25">[26]</ref></cell><cell>Prompt with program-like specifications of the available actions and objects in an environment</cell><cell>Generate situated robot task plans</cell><cell>Self</cell></row><row><cell>LLM-ARK <ref type="bibr" target="#b26">[27]</ref></cell><cell>KG reasoning agent</cell><cell>Conversational reasoning on Knowledge Graphs; predictions on KG paths as a decision-making task</cell><cell>OpenDialKG <ref type="bibr" target="#b27">[28]</ref></cell></row><row><cell>KG-Agent <ref type="bibr" target="#b28">[29]</ref></cell><cell>Enables a small LLM to actively make decisions until finishing the reasoning process over KGs</cell><cell>Improve the reasoning ability and complex QA</cell><cell>WebQSP <ref type="bibr" target="#b29">[30]</ref>, CWQ <ref type="bibr" target="#b30">[31]</ref>, GrailQA <ref type="bibr" target="#b31">[32]</ref></cell></row><row><cell>KnowAgent <ref type="bibr" target="#b32">[33]</ref></cell><cell>Enhances the planning capabilities of LLMs by incorporating explicit action knowledge from KGs</cell><cell>A knowledgeable self-learning strategy for path planning</cell><cell>HotPotQA <ref type="bibr" target="#b12">[13]</ref></cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background</head><p>In the construction of LLM agents, an LLM acts as the primary controller or "brain," orchestrating the sequence of operations required to accomplish a task or respond to a user query. These LLM agents may require additional modules like planning, memory, and tool utilization to enhance their functionality <ref type="bibr" target="#b33">[34]</ref>. To activate the LLM component, a prompt template containing essential operational details and tool access specifications is utilized. While not mandatory, agents can be characterized or given a persona to define their role. This profiling information is typically embedded within the prompt and may include details such as role description, personality traits, social characteristics, and other demographic attributes <ref type="bibr" target="#b34">[35]</ref>.</p><p>Our research focuses on extracting triples from unstructured text to facilitate Knowledge Graph construction. Triple extraction entails identifying and extracting structured information in the form of triples, which comprise a subject entity, a relation or predicate, and an object entity. This process is crucial in converting text into a structured format suitable for tasks like knowledge representation and information retrieval in natural language processing.</p><p>Relation extraction is another vital task we explore, involving the identification and extraction of semantic relationships between entities mentioned in unstructured text. 
These relationships, such as "is married to" or "works for", capture meaningful associations and are represented as triples for downstream applications like Knowledge Graph construction.</p><p>REBEL <ref type="bibr" target="#b1">[2]</ref> REBEL (Relation Extraction By End-to-end Language generation) is a technique for extracting relational information from raw text. REBEL uses an autoregressive sequence-to-sequence (seq2seq) model; such models are well suited to generating text as well as understanding natural language. The core of REBEL is a seq2seq model based on BART <ref type="bibr" target="#b35">[36]</ref>. It represents relations between entities in the input text as a linearized sequence following a specific schema involving entity mentions, labels, types, and the relation label. REBEL uses BART-large as the base model, which is first pre-trained on a large distantly supervised dataset, also called REBEL, created by extracting over 800K training instances with 220 relation types from Wikipedia abstracts aligned with Wikidata facts. The pre-trained BART is then fine-tuned on this REBEL dataset to maximize the likelihood of generating the correct linearized triplet representation given the input text. REBEL demonstrates several advantages: it frames relation extraction in an end-to-end manner, can extract open-ended relation types, allows quick fine-tuning on new datasets across domains, and achieves state-of-the-art performance on multiple relation extraction benchmarks while being simpler than prior complex pipeline approaches.</p><p>KnowGL <ref type="bibr" target="#b2">[3]</ref> KnowGL is a comprehensive framework designed to convert natural language text into structured relational data that aligns with the schema of a Knowledge Graph like Wikidata. This framework comprises three key components: "Knowledge Generation", "Fact Ranking", and "Linking to Wikidata". 
The Knowledge Generation aspect focuses on extracting facts by fine-tuning pre-trained language models to identify entity mentions and generate sets of facts including entity labels, types, and relationships. Fact Ranking involves parsing generated sequences to create a ranked list of distinct facts based on scores assigned to each fact. Lastly, Linking to Wikidata facilitates retrieving Wikidata IDs associated with the generated semantic annotations. By enabling the conversion of text into Wikidata statements in JSON format, KnowGL demonstrates the potential of pre-trained language models for generating structured data from text, offering an alternative to traditional information extraction pipelines.</p></div>
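The linearized-sequence representation can be made concrete with a short sketch. The following Python function is our own illustration, not code from the REBEL or KnowGL authors; it assumes the marker schema described on the REBEL model card, where `<triplet>` opens a triple, `<subj>` marks the start of the tail entity, and `<obj>` marks the start of the relation label:

```python
def extract_triplets(linearized):
    """Decode a REBEL-style linearized sequence into (head, relation, tail)
    triples. Assumes the schema: <triplet> head <subj> tail <obj> relation,
    repeated once per triple."""
    for tok in ("<s>", "</s>", "<pad>"):          # strip decoder special tokens
        linearized = linearized.replace(tok, "")
    triples, head, tail, rel, mode = [], [], [], [], None
    for token in linearized.split():
        if token == "<triplet>":                  # a new triple starts: flush the last one
            if head and rel:
                triples.append((" ".join(head), " ".join(rel), " ".join(tail)))
            head, tail, rel, mode = [], [], [], "head"
        elif token == "<subj>":                   # tail-entity span follows
            mode = "tail"
        elif token == "<obj>":                    # relation-label span follows
            mode = "rel"
        elif mode == "head":
            head.append(token)
        elif mode == "tail":
            tail.append(token)
        elif mode == "rel":
            rel.append(token)
    if head and rel:                              # flush the final triple
        triples.append((" ".join(head), " ".join(rel), " ".join(tail)))
    return triples
```

For example, the decoder output `<s><triplet> Florence <subj> Italy <obj> located in</s>` would be parsed into the single triple ("Florence", "located in", "Italy").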
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">System Architecture</head><p>The goal of our framework is to facilitate automatic triple extraction from text inputs. The framework is designed as a multi-tool system utilizing Large Language Models to execute the task of triple extraction.</p><p>Figure <ref type="figure" target="#fig_1">1</ref> outlines a framework for training a Large Language Model (LLM) using Reinforcement Learning from Human Feedback (RLHF). The flowchart is divided into two main sections:</p><p>1. The Large Language Model training procedure using RLHF <ref type="bibr" target="#b36">[37]</ref></p><p>2. RLHF LLM based autonomous agents for triple extraction for Knowledge Graph construction</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLM Training Procedure Using RLHF</head><p>The training procedure begins with raw text used to pretrain the Large Language Model. In this pre-training step the model learns language patterns and structures from a large corpus of text.</p><p>After pretraining, the model becomes a "Pretrained LLM", which is then subjected to "Supervised Fine-tuning" using demonstration data. Demonstration data consists of prompt-response pairs. This step trains the model on specific examples to perform certain tasks and understand particular domains better, yielding a low-quality instruction-following chatbot.</p><p>The fine-tuned model, referred to as the "SFT LLM", is then used in conjunction with "Human Preference Data" to train a "Reward Model". The human preference data consist of prompts together with winner and loser tuples. This reward model evaluates the outputs of the LLM and provides feedback on its performance.</p><p>The feedback from the reward model is used in Reinforcement Learning, where the Large Language Model is further trained using RLHF (Reinforcement Learning with Human Feedback) training prompts to improve its outputs based on human preferences and feedback. The RLHF training prompts likewise consist of prompts with winner and loser tuples. In the end we obtain a high-quality instruction-following RLHF LLM <ref type="bibr" target="#b38">[38]</ref>. The diagram presented in the first half of Figure <ref type="figure" target="#fig_1">1</ref> serves as the visual representation for this procedure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RLHF LLM Based Autonomous Agents for Triple Extraction for KG Construction</head><p>The bottom section of the flowchart shows the application of the trained RLHF LLM for a specific task: autonomous agents for triple extraction to construct a Knowledge Graph.</p><p>A user inputs a complex query, which is then decomposed by the query-decomposition LLM. The decomposed query is processed by the "RLHF LLM", which interacts with a Tool DB (Tool Database). Here the tools in the Tool DB are "REBEL" and "KnowGL".</p><p>The RLHF LLM then re-ranks the triples extracted from the complex query, selecting the most relevant and accurate triples based on their counts of occurrence in the Knowledge Graph. The re-ranked triples form the final output, which is used in the construction of a Knowledge Graph. The system prompt for the RLHF LLM is provided in Appendix A.</p><p>Overall, this is a comprehensive framework for training a Large Language Model using human feedback and then applying this model to extract triples for building a Knowledge Graph.</p><p>We implement the "Auto-KG Agent" framework to facilitate automatic triple extraction from text inputs. The framework is designed as a multi-tool system utilizing Large Language Models to extract triples. We utilise REBEL and KnowGL as tools for triple extraction (relations along with entities); more such frameworks can be added as tools in the Tool DB. The Large Language Model is asked to return the entities in JSON format. The diagram presented in the second half of Figure <ref type="figure" target="#fig_1">1</ref> serves as the visual representation for this section.</p><p>The current system comprises the second half of Figure <ref type="figure" target="#fig_1">1</ref>, without the query-decomposition Large Language Model and without Large Language Model training using RLHF. As of now we only incorporate the Tool DB, an LLM without RLHF (only a system prompt), and re-ranking of triples based on the length of the extracted relation. Section 7 discusses incorporating the other modules into the "Auto-KG Agent" framework.</p></div>
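As a rough sketch of the agent loop described above, the snippet below uses hypothetical stand-in extractors in place of the real REBEL and KnowGL models (the function names and outputs here are illustrative, not from our implementation) and re-ranks the merged triples by how many tools independently produced them, as a simple proxy for occurrence counts:

```python
from collections import Counter

# Hypothetical stand-ins for the tools in the Tool DB; in the actual
# system these would wrap the REBEL and KnowGL extraction models.
def rebel_extract(text):
    return [("Florence", "located in", "Italy")]

def knowgl_extract(text):
    return [("Florence", "located in", "Italy"), ("Fado", "works at", "IIT")]

TOOL_DB = {"REBEL": rebel_extract, "KnowGL": knowgl_extract}

def auto_kg_agent(query):
    """Invoke every tool in the Tool DB on the query, then re-rank the
    merged (head, relation, tail) triples by how many tools produced them:
    a triple extracted by several tools is more likely to be correct."""
    counts = Counter()
    for tool in TOOL_DB.values():
        counts.update(tool(query))
    # most_common() sorts by count, highest first
    return [triple for triple, _ in counts.most_common()]
```

With these stand-ins, the triple produced by both tools ranks above the one produced by only one of them.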
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Preliminary Experimental Setup</head><p>Dataset We evaluate our system's performance on the CONLL04 dataset <ref type="bibr" target="#b39">[39]</ref>, which comprises sentences extracted from news articles. Each sentence is annotated with four entity types (person, organization, location, and other) and five relation types (kill, work for, organization based in, live in, and located in). Our evaluation focuses on the test split consisting of 288 instances <ref type="bibr" target="#b40">[40]</ref>, which serves as the ground truth, and compares the performance of our model against the REBEL model. The dataset statistics are described in Table <ref type="table" target="#tab_0">2</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation Metrics</head><p>The evaluation process compares the predicted triples extracted from the test data with the ground truth triples. Each instance in both datasets is represented as a dictionary, with a unique identifier and a set of triples; each sentence has a corresponding object that stores its triples.</p><p>To calculate the true positives (correctly predicted triples), we iterate through each instance in the ground truth data. For each instance, we check whether the corresponding instance exists in the predicted data. If it does, we take the intersection of the triples in the ground truth and predicted data, which gives us the number of correct predictions (true positives).</p><p>Additionally, we calculate the number of extra predictions made by the model, that is, the count of triples not present in the ground truth. However, we do not score them, as that would require human evaluation.</p><p>After calculating the count of true positives, we compute both micro and macro scores for precision, recall, and F1 score. Micro scores consider the total number of triples in the entire dataset when calculating precision, recall, and F1 score, while macro scores average these metrics across the instances in the dataset.</p><p>Overall, this evaluation process enables us to assess the performance of the triple extraction model by quantifying its precision, recall, and F1 score, considering both individual instances and the entire dataset.</p><p>We perform a strict evaluation in which each extracted triple as a whole is checked against the corresponding head entity, tail entity, and relation in the ground truth. 
Following are the counts of unique relations extracted by the different frameworks:</p><p>• REBEL: 68 relations mapped to 5 relations • KnowGL: 58 relations mapped to 5 relations • REBEL + KnowGL: 90 relations mapped to 5 relations</p><p>Triple Extraction Tools The REBEL and KnowGL frameworks are used as triple extraction tools in our experiment. In our evaluation of the REBEL model on the CONLL04 test dataset, we encountered a diverse set of 68 unique relations extracted by the model. To align these with the CONLL04 dataset's five predefined relations, we undertook a manual mapping process guided by semantic similarity and contextual relevance, ensuring that each extracted relation was correctly associated with one of the canonical relations 'killed by', 'residence', 'location', 'headquarters location', and 'employer', as originally formatted in the REBEL paper. The necessity for this manual mapping arose from the fact that the REBEL model, trained on multiple datasets, identified relations beyond the scope of the CONLL04 dataset, requiring careful consideration to maintain semantic integrity. Similarly, KnowGL had 54 unique relations extracted, and we followed the same mapping procedure for it. Figure <ref type="figure" target="#fig_2">2</ref> shows the distribution based on the number of occurrences of the 5 types of relations extracted under the different experiment design settings.</p><p>For details, readers can refer to Appendix A. The code for the experiments is available on GitHub<ref type="foot" target="#foot_0">1</ref>. Models for benchmarking: "Gemma" and "Mistral" Gemma <ref type="bibr" target="#b41">[41]</ref> is a family of lightweight LLMs built from the same research and technology Google used to create the Gemini models. Gemma models are available in two sizes, 2 billion and 7 billion parameters. 
These models are trained on up to 6T tokens of primarily English web documents, mathematics, and code, using a transformer architecture with enhancements like Multi-Query Attention, RoPE Embeddings, GeGLU Activations, and advanced normalization techniques. We use the 2B variant.</p><p>The Mistral-7B-Instruct-v0.2 <ref type="bibr" target="#b42">[42]</ref> Large Language Model (LLM) is an improved instruction-fine-tuned version of Mistral-7B-Instruct-v0.1. The Mistral-7B-v0.1 Large Language Model is a pretrained generative text model with 7 billion parameters; Mistral-7B-v0.1 outperforms Llama 2 13B.</p><p>For additional experimental set-up, refer to Appendix 3. Table <ref type="table" target="#tab_2">3</ref> illustrates the number of relations and instances for the different experiment design settings.</p><p>In Table <ref type="table" target="#tab_3">4</ref>, REBEL refers to all triples extracted by REBEL, while REBEL (subset Mistral) refers to the setting in which all triples that could not be extracted because of hallucination in the Mistral LLM (responses returned as strings or in other formats rather than the expected JSON) are removed from the ground truth before evaluation is carried out. The same applies to REBEL (subset Gemma). Here, tools such as KnowGL and REBEL are used as a single tool in conjunction with an LLM (Mistral or Gemma), and REBEL and KnowGL together as a multi-tool in conjunction with an LLM (Mistral or Gemma). From Table <ref type="table" target="#tab_4">5</ref> it can be observed that for Gemma, a significant number of hallucinations occurred, accounting for 99 out of 288 total responses. This higher incidence of hallucination in Gemma was primarily attributed to incorrect JSON format returned by the model. Conversely, Mistral exhibited a lower occurrence of hallucination, with only 26 out of 288 total responses displaying such phenomena. The single-tool setting here refers to REBEL, and the multi-tool setting has both REBEL and KnowGL. 
"Hallucination" refers to a phenomenon where the model generates text that is incorrect, nonsensical, or not real. </p></div>
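The strict micro/macro scoring described above can be sketched in a few lines of Python. This is a minimal illustration of the procedure, assuming each dataset is a dict mapping instance identifiers to sets of (head, relation, tail) tuples:

```python
def evaluate(gold, pred):
    """Strict micro/macro precision, recall and F1 over triple sets.

    A predicted triple counts as a true positive only if its head,
    relation and tail all match the ground truth exactly."""
    tp = n_gold = n_pred = 0
    macro_p = macro_r = 0.0
    for sid, gold_triples in gold.items():
        pred_triples = pred.get(sid, set())
        correct = len(gold_triples & pred_triples)   # set intersection = true positives
        tp += correct
        n_gold += len(gold_triples)
        n_pred += len(pred_triples)
        macro_p += correct / len(pred_triples) if pred_triples else 0.0
        macro_r += correct / len(gold_triples) if gold_triples else 0.0
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0
    # Micro: pooled over all triples; macro: averaged per instance.
    micro_p = tp / n_pred if n_pred else 0.0
    micro_r = tp / n_gold if n_gold else 0.0
    n = len(gold) or 1
    macro_p, macro_r = macro_p / n, macro_r / n
    return {"micro": (micro_p, micro_r, f1(micro_p, micro_r)),
            "macro": (macro_p, macro_r, f1(macro_p, macro_r))}
```

Extra predictions absent from the ground truth lower precision but, as noted above, are not scored separately, since judging them would require human evaluation.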
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results</head><p>We investigated the performance of multiple tools versus single tools for relation extraction and observed a notable decline in scores with multi-tool usage, suggesting that single-tool approaches may yield better results, as shown in Table <ref type="table" target="#tab_3">4</ref>. We attribute this drop to hallucination, which was more prevalent when employing multiple tools. However, single-tool usage also presented challenges, as occasionally the returned format did not align with the one specified in the system prompt. Moreover, from Table <ref type="table" target="#tab_3">4</ref> it can also be observed that REBEL with a Large Language Model and REBEL alone have almost the same performance. This is because REBEL and KnowGL are used as tools to trigger the action of extracting the relations, and both have the same architecture and thus similar biases.</p><p>Our findings underscore the need for integrating Large Language Models with extraction tools to harness their full potential. While these tools exhibit shortcomings in certain contexts, Large Language Models offer a complementary approach, particularly in mitigating biases and enhancing extraction accuracy. By empowering Large Language Models to engage in more nuanced planning or to decompose the query, we anticipate significant improvements in relation extraction performance.</p><p>Gemma's higher scores are due to a larger number of responses being generated as strings rather than in the expected JSON format, as shown in Table <ref type="table" target="#tab_4">5</ref>. Consequently, these non-conforming responses are removed, resulting in fewer triples available for evaluation compared to Mistral. 
This phenomenon contributes to Gemma's higher scores in triple extraction tasks compared to Mistral.</p><p>To answer our aforementioned Research Questions, we selected examples containing multiple events, complex clauses, negation, and modalities, and then manually carried out a human evaluation of the correctness of the triples extracted by the pre-existing tools and our Auto-KG agent.</p><p>Event identification and mitigating biases in Triple Extraction Our investigation uncovered instances of flawed relation extraction in the REBEL and KnowGL tools. In the sample sentence, "While Marie Curie and Albert Einstein conducted groundbreaking experiments in their laboratories at the University of Paris, Leonardo da Vinci's sketches of Renaissance architecture in the bustling streets of Florence sparked inspiration across Italy, " REBEL identified only the triple "Florence, located in, Italy". However, this sentence encompasses two distinct events: the experimentation conducted by Marie Curie and Albert Einstein at the University of Paris, and the inspiration sparked by Leonardo da Vinci's sketches across Italy. REBEL's oversight stems from its inclination towards location-centric relations, influenced by biases in its training data.</p><p>In contrast, our Auto-KG agent showed promise in mitigating such limitations. It accurately extracted triples from the sentence, capturing nuanced relations such as "Marie Curie, experimented at, University of Paris", "Albert Einstein, experimented at, University of Paris", "Leonardo da Vinci's sketches, located in, Florence", and "Leonardo da Vinci's sketches, sparked inspiration in, Italy". This underscores the Auto-KG agent's proficiency in comprehending complex linguistic structures and discerning meaningful relations, and its potential for enhancing relation extraction accuracy.</p></div>
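The micro- and macro-averaged scores reported in Table 4 can be computed over exact triple matches. The following is a minimal sketch under our own assumptions (macro averaging grouped by relation type, exact string matching); it is an illustration, not the authors' evaluation script.

```python
from collections import defaultdict

def micro_macro_f1(gold, pred):
    """Micro and macro precision/recall/F1 for exact triple matches.

    gold, pred: sets of (subject, relation, object) triples.
    Macro scores average per-relation scores (grouping assumed by us).
    """
    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Micro: pool all triples before scoring.
    micro = prf(len(gold & pred), len(pred - gold), len(gold - pred))

    # Macro: score each relation type separately, then average.
    by_rel = defaultdict(lambda: [set(), set()])
    for t in gold:
        by_rel[t[1]][0].add(t)
    for t in pred:
        by_rel[t[1]][1].add(t)
    scores = [prf(len(g & p), len(p - g), len(g - p)) for g, p in by_rel.values()]
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    return micro, macro
```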
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Negation Handling Discrepancies in Triple Extraction</head><p>In our comparative analysis, we also observed a notable discrepancy in the handling of negation between the REBEL and KnowGL tools and our Auto-KG agent in triple extraction tasks. Both REBEL and KnowGL demonstrated limitations in effectively managing negation cues within text, resulting in erroneous extraction of triples. Conversely, our tool exhibited robust performance in negation handling, yielding more accurate triple extractions even in the presence of negation cues.</p><p>We considered the sentence "Fado does not work at IIT". REBEL and KnowGL erroneously extract a triple indicating that Fado works at IIT, failing to account for the negation. In contrast, our tool was adept at discerning the negation cue "not" and appropriately adjusting the extracted triple to reflect the absence of the stated relationship, thereby accurately capturing the intended semantics of the sentence.</p><p>This discrepancy underscores the nuanced understanding of language exhibited by Large Language Models, enabling them to effectively navigate linguistic complexities such as negation cues in triple extraction tasks. It highlights the potential of leveraging LLM-based approaches to enhance the accuracy and reliability of triple extraction processes in natural language processing applications.</p><p>Generalising well on various datasets We observed a significant disparity in the performance of seq2seq-based approaches such as REBEL and KnowGL when they are trained or fine-tuned on specific datasets. While these models exhibit impressive performance within the confines of their training data, they demonstrate limited generalization capabilities beyond the dataset they were trained on. Conversely, Large Language Models showcase remarkable generalization prowess even without explicit training on a particular dataset. 
This discrepancy underscores the inherent adaptability and robustness of LLMs, enabling them to effectively handle diverse datasets and tasks without the need for extensive training or fine-tuning.</p></div>
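The negation failure mode discussed above can be illustrated with a deliberately naive post-filter: a cue-based check that discards triples extracted from sentences containing negation markers. This is our own toy illustration of the problem (real negation handling requires scope resolution, which is exactly what the LLM-based agent provides); none of these names come from the paper's code.

```python
# Simple surface-level negation cues (an assumption for illustration only).
NEGATION_CUES = {"not", "no", "never", "n't"}

def drop_negated(triples, sentence):
    """Naive post-filter: discard all extracted triples when the source
    sentence contains a negation cue. This over-filters (it cannot tell
    which relation is negated), illustrating why cue spotting alone is
    insufficient and scope-aware handling is needed."""
    tokens = sentence.lower().replace("n't", " n't").split()
    if any(cue in tokens for cue in NEGATION_CUES):
        return []
    return triples
```

For "Fado does not work at IIT", this filter at least avoids emitting the erroneous "Fado, works at, IIT" triple, at the cost of also dropping any correct triples from negated sentences.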
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Future Directions</head><p>In future work, we will focus on emulating the Large Language Model training methodology using Reinforcement Learning with Human Feedback (RLHF), as detailed in Section 4.</p><p>Additionally, we intend to incorporate a query-decomposition LLM to partition complex user queries into sub-queries, facilitating more precise event identification and subsequent triple extraction.</p><p>Furthermore, our proposed future work entails synergizing LLMs with multiple extraction tools to enhance their generalization capabilities across diverse datasets without requiring explicit training. This approach holds potential to surpass the performance of seq2seq models such as REBEL and KnowGL.</p><p>Moreover, we aim to integrate a re-ranking module into our framework. This module will prioritize all extracted triples based on their confidence levels, ensuring a more refined and accurate output.</p><p>We also aim to develop a diverse dataset that encompasses a wide range of relations and includes a variety of sentence structures. This dataset is intended to serve as a robust benchmark for evaluating performance in triple extraction tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>RQ1:</head><label></label><figDesc>How effective are Large Language Models in mitigating biases for extracting triples which are present in the datasets used for training information extraction tools? RQ2: To what extent do Large Language Models accurately handle modalities and negations in natural language, and how does this capability affect the quality of triple extraction? RQ3: Can Large Language Models enhance the identification of events within unstructured text, thereby improving the accuracy and completeness of triple extraction? 
RQ4: How well do Large Language Models generalize across different datasets without the need for extensive training or fine-tuning, particularly in the context of triple extraction for knowledge graph construction? RQ5: What is the impact of using multiple tools versus a single tool on the performance of triple extraction?</figDesc></figure>
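The re-ranking module proposed in Future Directions, which prioritizes extracted triples by confidence, could take a shape like the following. This is a hedged sketch under our own assumptions (confidence scores supplied by the extraction tool; the 0.5 threshold is a hypothetical default), not the authors' planned implementation.

```python
def rerank(triples_with_conf, threshold=0.5):
    """Sort extracted triples by confidence (descending) and drop those
    below a threshold.

    triples_with_conf: iterable of ((subject, relation, object), confidence)
    pairs. Confidence values are assumed to come from the extraction tool;
    the threshold default is a placeholder."""
    kept = [(t, c) for t, c in triples_with_conf if c >= threshold]
    return sorted(kept, key=lambda tc: tc[1], reverse=True)
```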
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: System architecture for utilising Large Language Models as autonomous agents enabling tools for Knowledge Graph from unstructured text</figDesc><graphic coords="7,89.29,84.19,416.69,237.78" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Count of Types of Relations for different Files</figDesc><graphic coords="9,89.29,202.35,125.01,93.26" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Dataset statistics for CONLL04 (the number of instances is in brackets; the number of triples is outside the brackets)</figDesc><table><row><cell></cell><cell cols="2">Entity Types Relation Types</cell><cell>Train</cell><cell>Validation</cell><cell>Test</cell></row><row><cell>CONLL04</cell><cell>4</cell><cell>5</cell><cell>1,290 (922)</cell><cell>343 (231)</cell><cell>422 (288)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>A summary of statistics for different experiment design settings.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Precision, Recall and F1 Scores (rounded to 2 decimal places)</figDesc><table><row><cell>Model</cell><cell></cell><cell>Micro</cell><cell></cell><cell></cell><cell>Macro</cell><cell>Extras</cell></row><row><cell></cell><cell>P</cell><cell>R</cell><cell>F1</cell><cell>P</cell><cell>R</cell><cell>F1</cell></row><row><cell>REBEL</cell><cell cols="6">0.16 0.16 0.16 0.18 0.18 0.18 356</cell></row><row><cell>REBEL (subset Mistral)</cell><cell cols="6">0.15 0.15 0.15 0.17 0.17 0.17 330</cell></row><row><cell>REBEL (subset Gemma)</cell><cell cols="6">0.18 0.19 0.18 0.21 0.21 0.21 235</cell></row><row><cell>MISTRAL (single-tool REBEL)</cell><cell cols="6">0.15 0.15 0.15 0.17 0.17 0.17 323</cell></row><row><cell>GEMMA (single-tool REBEL)</cell><cell cols="6">0.15 0.16 0.16 0.18 0.18 0.18 241</cell></row><row><cell>KNOWGL</cell><cell cols="6">0.05 0.05 0.05 0.07 0.06 0.07 372</cell></row><row><cell cols="7">Mistral (KNOWGL + REBEL multi-tool) 0.05 0.07 0.06 0.08 0.09 0.08 533</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Comparison of Hallucination Occurrences (specifically, responses not returned in the expected JSON format)</figDesc><table><row><cell>Model</cell><cell cols="2">Total Responses Number of Hallucinations</cell></row><row><cell>Gemma</cell><cell>288</cell><cell>99</cell></row><row><cell>Mistral (single-tool)</cell><cell>288</cell><cell>26</cell></row><row><cell>Mistral (multi-tool)</cell><cell>288</cell><cell>42</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/Ananyaiitbhilai/Text2Triple-LLM-Agent</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Limitations and Conclusion</head><p>The paper presents a novel framework that integrates Large Language Models (LLMs) with existing tools like REBEL and KnowGL for the task of triple extraction from unstructured text to construct knowledge graphs. The proposed framework aims to leverage the strengths of LLMs in understanding complex linguistic structures, handling modalities and negations, and mitigating biases inherent in training data. The experimental results on the CONLL04 dataset indicate that while multi-tool approaches face challenges such as hallucination, the integration of LLMs shows promising results in enhancing extraction accuracy.</p><p>There are certain limitations of our research work: 1. Limited LLM Models Evaluated: The experiments were confined to using the Gemma (2B parameters) and Mistral-7B LLMs. The performance of other large language models like LaMDA or models with higher parameter counts (e.g., GPT-4) remains unexplored. Future work could extend these experiments to a broader range of LLM architectures and sizes to provide a more comprehensive evaluation.</p><p>2. Limited Task Coverage: The current study focused on a specific task: triple extraction for knowledge graph construction. However, knowledge graph construction and reasoning encompass a wide range of tasks, and the performance of LLMs on other tasks, such as entity linking, relation classification, or multi-hop reasoning, remains unexplored. Future research could extend the evaluation to a broader set of tasks to provide a more comprehensive understanding of Large Language Model capabilities in the context of knowledge graph construction and reasoning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Limited Evaluation Dataset:</head><p>The paper evaluates the proposed framework on the CONLL04 dataset, which comprises sentences extracted from news articles with a limited set of entity types and relation types. This dataset may not fully represent the diversity and complexity of real-world text, potentially limiting the generalizability of the findings to other domains and contexts. In future work, evaluation can be carried out on other datasets. 4. Reliance on Manual Mapping: The paper mentions that manual mapping was required to align the relations extracted by REBEL and KnowGL with the canonical relations in the CONLL04 dataset. This manual intervention introduces potential biases and inconsistencies, as the mapping process may not be entirely objective or scalable across larger datasets or domains.</p><p>The authors acknowledge these limitations and express their anticipation for future research opportunities that would allow them to further explore these areas and provide a more comprehensive evaluation of LLM capabilities in the context of knowledge graph construction and reasoning. The research sets a new direction for future work in building advanced, reliable systems for knowledge extraction and reasoning. It highlights the potential of LLM-based agents for knowledge graph construction and proposes a comprehensive framework for harnessing their capabilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Appendix: Additional Details</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Additional details about System and parameters for the preliminary Experimental set-up</head><p>• The context size (n_ctx) is the maximum number of tokens that the model can account for when processing a response; this includes both the prompt and the response itself. In our case the context size was set to 2048.</p><p>• The maximum number of tokens to generate is 2000 in our case. If max_tokens ≤ 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.</p><p>• Average inference time per context/sentence for the CONLL04 test dataset when extracting triples in conjunction with LLMs was 25 seconds.</p><p>• Temperature was set to 0.</p><p>• The gguf files for Mistral and Gemma were run locally on a Mac M1.</p><p>The system prompt is shown in Figure <ref type="figure">3</ref>.  </p></div>			</div>
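The settings listed above map naturally onto llama-cpp-python, the usual way to run gguf files locally. The sketch below is our assumption about the invocation (the paper does not show its code); the model path is a placeholder, and only the parameter values (n_ctx=2048, max_tokens=2000, temperature=0) come from the appendix.

```python
# Parameter values taken from the appendix; the invocation itself is assumed.
GENERATION_PARAMS = {
    "n_ctx": 2048,       # max tokens accounted for (prompt + response)
    "max_tokens": 2000,  # upper bound on generated tokens
    "temperature": 0.0,  # greedy (deterministic) decoding
}

def load_model(model_path: str):
    """Instantiate a local gguf model with the appendix's context size.

    Requires llama-cpp-python; model_path is a placeholder, not a real file
    from the paper.
    """
    from llama_cpp import Llama
    return Llama(model_path=model_path, n_ctx=GENERATION_PARAMS["n_ctx"])

# Hypothetical usage (not executed here; the gguf filename is illustrative):
# llm = load_model("mistral-7b-instruct-v0.2.Q4_K_M.gguf")
# out = llm(prompt,
#           max_tokens=GENERATION_PARAMS["max_tokens"],
#           temperature=GENERATION_PARAMS["temperature"])
```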
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Triple trustworthiness measurement for knowledge graph</title>
		<author>
			<persName><forename type="first">S</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3308558.3313586</idno>
		<ptr target="http://dx.doi.org/10.1145/3308558.3313586" />
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>WWW &apos;19</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">REBEL: Relation extraction by end-to-end language generation</title>
		<author>
			<persName><forename type="first">P.-L</forename><surname>Huguet Cabot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-emnlp.204</idno>
		<ptr target="https://aclanthology.org/2021.findings-emnlp.204" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting><address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2370" to="2381" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Rossiello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F M</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mihindukulasooriya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Cornec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Gliozzo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.13952</idno>
		<title level="m">Knowgl: Knowledge generation and linking from text</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A survey on large language model based autonomous agents</title>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Naveed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">U</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Saqib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Anwar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Usman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Akhtar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Barnes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mian</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.06435</idno>
		<title level="m">A comprehensive overview of large language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.11163</idno>
		<title level="m">Kg-agent: An efficient autonomous agent framework for complex reasoning over knowledge graph</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.09645</idno>
		<title level="m">Structgpt: A general framework for large language model to reason over structured data</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.13168</idno>
		<title level="m">Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.16504</idno>
		<title level="m">On the tool manipulation capability of open-source large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gerstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.16789</idno>
		<title level="m">Toolllm: Facilitating large language models to master 16000+ real-world apis</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Instruction tuning with human curriculum</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Yoo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.09518</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Shafran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.03629</idno>
		<title level="m">React: Synergizing reasoning and acting in language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1809.09600</idno>
		<title level="m">Hotpotqa: A dataset for diverse, explainable multi-hop question answering</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Christodoulopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mittal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.05355</idno>
		<title level="m">Fever: a large-scale dataset for fact extraction and verification</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ichter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2201.11903</idno>
		<title level="m">Chain-of-thought prompting elicits reasoning in large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Cobbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kosaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bavarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Plappert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tworek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nakano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2110.14168</idno>
		<title level="m">Training verifiers to solve math word problems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Are NLP models really able to solve simple math word problems?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhattamishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.07191</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A diverse corpus for evaluating and developing English math word problem solvers</title>
		<author>
			<persName><forename type="first">S.-Y</forename><surname>Miao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-Y</forename><surname>Su</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.92</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.92" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="975" to="984" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">MAWPS: A math word problem repository</title>
		<author>
			<persName><forename type="first">R</forename><surname>Koncel-Kedziorski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Amini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kushman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N16-1136</idno>
		<ptr target="https://aclanthology.org/N16-1136" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Knight</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Nenkova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Rambow</surname></persName>
		</editor>
		<meeting>the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics<address><addrLine>San Diego, California</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1152" to="1157" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.02224</idno>
		<title level="m">Auto-GPT for online decision making: Benchmarks and additional opinions</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Shridhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thomason</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bisk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mottaghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fox</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.01734</idno>
		<title level="m">ALFRED: A benchmark for interpreting grounded instructions for everyday tasks</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">A reduction of imitation learning and structured prediction to no-regret online learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Bagnell</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1011.0686</idno>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Nakano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Balaji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kosaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Saunders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cobbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Eloundou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Button</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Knight</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.09332</idno>
		<title level="m">WebGPT: Browser-assisted question-answering with human feedback</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grangier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.09190</idno>
		<title level="m">ELI5: Long form question answering</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Azarnasab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.11381</idno>
		<title level="m">MM-ReAct: Prompting ChatGPT for multimodal reasoning and action</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Blukis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mousavian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tremblay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thomason</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garg</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.11302</idno>
		<title level="m">ProgPrompt: Generating situated robot task plans using large language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Evaluating and enhancing large language models for conversational reasoning on knowledge graphs</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.11282</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Moon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subba</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1081</idno>
		<ptr target="https://aclanthology.org/P19-1081" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Traum</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</editor>
		<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="845" to="854" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.11163</idno>
		<title level="m">KG-Agent: An efficient autonomous agent framework for complex reasoning over knowledge graph</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">The value of semantic parse labeling for knowledge base question answering</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Richardson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Meek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Suh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016</title>
				<meeting>the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016<address><addrLine>Berlin, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>The Association for Computer Linguistics</publisher>
			<date type="published" when="2016">August 7-12, 2016. 2016</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
	<note>Short Papers</note>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">The web as a knowledge-base for answering complex questions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Talmor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Berant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Walker</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Ji</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Stent</surname></persName>
		</editor>
		<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018<address><addrLine>New Orleans, Louisiana, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">June 1-6, 2018. 2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="641" to="651" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Beyond I.I.D.: three levels of generalization for question answering on knowledge bases</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vanni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">M</forename><surname>Sadler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Su</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WWW &apos;21: The Web Conference 2021, Virtual Event</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Najork</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zia</surname></persName>
		</editor>
		<meeting><address><addrLine>Ljubljana, Slovenia</addrLine></address></meeting>
		<imprint>
			<publisher>ACM / IW3C2</publisher>
			<date type="published" when="2021">April 19-23, 2021. 2021</date>
			<biblScope unit="page" from="3477" to="3488" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.03101</idno>
		<title level="m">KnowAgent: Knowledge-augmented planning for LLM-based agents</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Karpas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Abend</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lieber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ratner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shoham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Levine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Leyton-Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Muhlgay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rozen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Shachaf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shalev-Shwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shashua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tenenholtz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.00445</idno>
		<title level="m">MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Weng</surname></persName>
		</author>
		<ptr target="https://lilianweng.github.io/posts/2023-06-23-agent/" />
		<title level="m">LLM-powered autonomous agents</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>lilianweng.github.io</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.13461</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Blecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Ferrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cucurull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Esiobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartshorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kerkez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khabsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kloumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korenev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liskovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Molybog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poulton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reizenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rungta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schelten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Kuan</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojnic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Payne</surname></persName>
		</author>
		<ptr target="https://www.width.ai/post/reinforcement-learning-from-human-feedback" />
		<title level="m">Fine-tuning open LLMs with reinforcement learning from human feedback</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">A linear programming formulation for global inference in natural language tasks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-2401" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004</title>
				<meeting>the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004<address><addrLine>Boston, Massachusetts, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Table filling multi-task recurrent neural network for joint entity and relation extraction</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Andrassy</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/C16-1239" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Matsumoto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Prasad</surname></persName>
		</editor>
		<meeting>COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers<address><addrLine>Osaka, Japan</addrLine></address></meeting>
		<imprint>
			<publisher>The COLING 2016 Organizing Committee</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2537" to="2547" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<orgName type="collaboration">Gemma Team</orgName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mesnard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hardin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dadashi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhupatiraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pathak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sifre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rivière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Kale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Love</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tafti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hussenot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chowdhery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Botev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Castro-Ros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Slone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Héliou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tacchetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bulanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paterson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tsai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shahriari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Choquette-Choo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Crepy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ippolito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Buchatskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Noland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tucker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G.-C</forename><surname>Muraru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rozhdestvenskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Michalewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Tenney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Grishchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Austin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Keeling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Labanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Lespiau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Stanway</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brennan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ferret</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mao-Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Millican</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">L</forename><surname>Sjoesund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dixon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mikuła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wirth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sharman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chinaev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Thain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bachem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Wahltinez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yotov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Sessa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chaabouni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Comanescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Anil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mcilroy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mullins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Borgeaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Girgin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Douglas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pandya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shakeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>De</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Klimenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hennigan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Feinberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Stokowiec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hui Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Warkentin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Peran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Giang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Farabet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hassabis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Barral</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Collins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fiedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Senter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Andreev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kenealy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.08295</idno>
		<title level="m">Gemma: Open models based on gemini research and technology</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>De Las Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">R</forename><surname>Lavaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">E</forename><surname>Sayed</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06825</idno>
	</analytic>
	<monogr>
		<title level="m">Mistral 7B</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
