<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3442381.3450043</article-id>
      <title-group>
        <article-title>Validation of Knowledge Graph Completion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Weihang Zhang</string-name>
          <email>w.zhang21@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ovidiu Serban</string-name>
          <email>o.serban@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Institute, Imperial College</institution>
          <addr-line>Exhibition Rd, South Kensington, London SW7 2BX</addr-line>
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KGCW'25: 6th International Workshop on Knowledge Graph Construction</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>1737</fpage>
      <lpage>1748</lpage>
      <abstract>
        <p>In the past decade, Knowledge Graph Completion (KGC), a task aimed at discovering missing knowledge within a knowledge graph by predicting missing links between entities, has garnered significant attention from researchers. Despite significant algorithmic advances, even state-of-the-art KGC models produce predictions with unavoidable inaccuracies. The crucial yet underexplored task of validating these predictions before integration remains predominantly manual, creating a substantial bottleneck in knowledge graph curation pipelines. Unvalidated KGC predictions can propagate errors to downstream knowledge-intensive applications, such as fact-checking, potentially compromising reliability. To address this challenge, we propose a novel workflow leveraging Large Language Models (LLMs) that mimics human validation processes by systematically evaluating KGC predictions made by existing systems through dual verification: analyzing and reranking plausible KGC predictions via internal graph evidence and external knowledge sources. The proposed method leverages token probabilities from LLMs to quantify model confidence and, therefore, provides reranking results for KGC predictions made by existing methods. Our work presents the first large-scale empirical evaluation of using LLMs to post-process KGC predictions on standard benchmarks. Experimental results demonstrate that our approach can significantly reduce human validation efforts while maintaining high factual accuracy standards in knowledge graph curation.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph Completion</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Knowledge Graph Curation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge graphs (KGs) have become essential for organizing structured information across numerous
applications, from search engines and recommendation systems to complex reasoning tasks like factual
verification. As KGs scale, one of the most critical challenges in the KG curation process is addressing their
inherent incompleteness. Knowledge Graph Completion (KGC) has emerged as a promising approach
to address this challenge by systematically predicting missing facts. However, despite significant
theoretical advances [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ] in KGC research, a substantial gap remains between these algorithmic
improvements and their practical implementation in real-world knowledge graphs: while substantial
research efforts have focused on developing increasingly sophisticated algorithms to improve KGC
prediction accuracy, comparatively little attention has been given to what happens after these predictions
are generated and how they can be integrated into the KG curation process. This oversight is problematic
because even state-of-the-art KGC models produce imperfect predictions that, if directly integrated into
knowledge graphs, can propagate errors to downstream applications. For instance, when knowledge
graphs serve as evidence sources for fact-checking systems, erroneous KGC predictions can lead to
incorrect verification outcomes, undermining the reliability of the entire pipeline. The validation and
selection of KGC predictions is a crucial yet underexplored step in the knowledge graph curation
lifecycle. This process relies heavily on manual human validation — a resource-intensive approach
that can become increasingly unsustainable as KGs scale in size and complexity [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The fundamental
challenge stems from the tension between two competing evaluation paradigms: the closed-world
assumption traditionally used in KGC and link prediction model evaluation, which assumes all missing
facts to be incorrect, and the open-world assumption that better reflects reality by acknowledging
that knowledge graphs are inherently incomplete and that missing facts should not automatically be
treated as incorrect. Although the open-world assumption aligns more closely with real-world scenarios,
the closed-world assumption is far more commonly used in knowledge graph completion evaluation
frameworks, primarily because it simplifies the automation of the evaluation process without requiring
human effort. The extensive human efforts required to correctly evaluate the KGC predictions under
open-world assumptions against real-world facts are discussed in several recent works [
        <xref ref-type="bibr" rid="ref5 ref6">6, 5</xref>
        ].
      </p>
      <p>As knowledge graphs continue to expand, there is an urgent need for approaches that can automate
or semi-automate the validation process while maintaining high standards of factual accuracy. The
emergence of Large Language Models (LLMs), with their remarkable capabilities for reasoning, inference,
and natural language understanding, presents a promising direction for addressing this challenge. LLMs
have demonstrated the ability to reason about factual information and access parametric knowledge
and external resources through techniques like Retrieval-Augmented Generation (RAG). This paper
proposes a novel framework that leverages LLMs as automated agents for validating and ranking KGC
predictions. Our approach mimics the workflow of human annotators by evaluating predicted triplets
based on internal graph evidence (leveraging existing knowledge within the KG) and external references
(utilizing broader knowledge sources via search engines). This dual-validation mechanism helps ensure
that only high-confidence, factually accurate predictions are integrated into the knowledge graph,
thus maintaining its integrity while reducing the human effort required for curation. Additionally, the
proposed approach leverages next-token probability distributions from LLMs to quantify validation
confidence, enabling systematic comparison across multiple candidates for the same KGC task.</p>
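      <p>As a minimal sketch of the confidence-quantification idea, assuming the LLM API exposes log-probabilities
for the first generated token (the function name and the numeric values below are illustrative stand-ins, not
the authors' implementation), the verdict probabilities can be normalised as follows:</p>
      <preformat>
```python
import math

# Illustrative sketch: turn hypothetical first-token log-probabilities
# (as an LLM API might return them) into a normalised confidence
# distribution over the three verdict labels.
def judgement_confidence(first_token_logprobs):
    """Normalise first-token log-probabilities into a distribution over verdicts."""
    labels = ("Correct", "Incorrect", "NEI")
    weights = {label: math.exp(first_token_logprobs.get(label, -1e9)) for label in labels}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

# Hypothetical log-probabilities for one predicted triplet
probs = judgement_confidence({"Correct": -0.2, "Incorrect": -2.1, "NEI": -3.0})
```
      </preformat>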
      <p>
        Currently, few studies explore reducing or replacing human effort in the KGC task,
with the notable exception of KGValidator [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which utilises a RAG framework with LLMs to conduct
validations of KGC predictions under the open-world setting. KGValidator approaches the validation
of KGC predictions as a classification task, verifying each prediction against external sources such
as web search results or reference knowledge graphs. Although this paper employs a similar setup
to reduce human effort in post-processing KGC predictions, it differs from KGValidator in several
key aspects. First, alongside external evidence, the proposed method incorporates relevant graphical
evidence from within the KG, mimicking the human tendency to prioritize internal sources before using
external knowledge sources to validate a predicted triplet. Second, rather than framing validation as a
classification problem, the proposed method treats it as a reranking task. Here, the top-k candidates
from KGC predictions are retrieved and reranked by LLMs. This reranking approach allows for
broader-scale evaluation and makes model improvements directly comparable using traditional KGC evaluation metrics.
Lastly, LLMs in the proposed method are prompted to validate each prediction, with the next-token
probabilities used to quantify the confidence scores for each judgment. These scores are then used to
reorder the top-k candidates, providing a more intuitive view of intermediate results when used to aid
human annotators.
      </p>
      <p>Overall, this paper makes the following contributions to the field:
• We identify and address a critical gap in the real-world KGC pipeline by focusing on the
underexplored post-prediction reranking and validation process.
• We propose a novel LLM-based framework for automating the validation and reranking of KGC
predictions using internal and external knowledge sources, mimicking human workflow.
• Our work presents the first large-scale empirical evaluation of using LLMs to post-process KGC
predictions on standard benchmarks, demonstrating that our approach can significantly reduce
the required human validation effort.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews related work in knowledge
graph completion and recent advances in language models for the KGC task. Section 3 details our
proposed methodology for LLM-augmented reranking of KGC predictions. Section 4 presents our
experimental setup and results. Finally, Section 5 discusses implications, limitations, and directions for
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Knowledge Graph Completion</title>
        <p>
          Over the past decades, the terms knowledge graph completion, link prediction, and sometimes
knowledge graph reasoning have often been used interchangeably by researchers when referring to the task
that aims to address the sparsity and incompleteness of knowledge graphs. These decades have also
seen rapid development and iteration of KGC methods. The predominant approaches in
KGC research rely on representation learning techniques, where entities and relations are encoded into
low-dimensional embedding spaces. Translation-based models such as TransE [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] represent relations as
translations in the embedding space, while semantic matching models like RESCAL [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] employ tensor
factorization to capture multi-relational data. These two works have spawned numerous subsequent
studies [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref7 ref8 ref9">7, 8, 9, 10, 11, 12, 13</xref>
          ] addressing their inherent limitations. The emergence of graph neural
networks catalyzed significant advancements in KGC approaches. Researchers have leveraged Graph
Convolutional Networks (GCNs) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and their variants as sophisticated encoders to capture complex
knowledge graph structures within representation spaces. Notable contributions such as R-GCN [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ],
CompGCN [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and KGAT [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] have demonstrated superior performance on knowledge graph
completion tasks by effectively modelling multi-hop neighbourhood information and relational patterns.
More recently, researchers [16, 17] have also explored KGC approaches that utilize the collective knowledge of
multiple KGs to aid the curation process.
        </p>
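        <p>To make the translation-based idea concrete, the following is a toy sketch of the TransE scoring
intuition (relation as a translation, h + r ≈ t); the 3-dimensional vectors are made up for illustration
and are not trained embeddings:</p>
        <preformat>
```python
import math

# Toy sketch of TransE scoring: a triple (h, r, t) is plausible when the
# translated head embedding (h + r) lands near the tail embedding t.
def transe_score(h, r, t):
    """Negative L2 distance between (h + r) and t; higher means more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Made-up embeddings, chosen so that paris + located_in lands on france
paris = [0.1, 0.2, 0.3]
located_in = [0.4, 0.1, 0.0]
france = [0.5, 0.3, 0.3]
plausible = transe_score(paris, located_in, france)
implausible = transe_score(paris, located_in, [9.0, 9.0, 9.0])
```
        </preformat>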
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Language Model aided Knowledge Graph completion</title>
        <p>
          Rapid developments in pre-trained and large language models have given researchers great tools for
capturing semantic meaning and patterns in natural language. As a result, these advancements motivated
the development of various methods for incorporating the language models to enhance the knowledge
graph completion task. One of the earliest attempts was KG-Bert [18], in which triplets in KGs are
treated as textual sequences. Several works further extended the concept of KG-Bert. For example, Wang
et al. [19] proposed a structure-augmented text representation (StAR) model, which uses a Siamese-style
textual encoder to learn two contextualized representations for a single triplet. Lovelace and Rosé [20]
introduce a supervised embedding extraction method to best leverage entity embeddings for the KGC
task. With the rapid development of LLMs, Zhu et al. [21] conduct comprehensive evaluations of LLMs
for KG construction and reasoning, showing that LLMs perform close to state-of-the-art models across
both tasks. While the research community has made remarkable progress in developing increasingly
sophisticated KGC methods, the predominant focus has remained on improving prediction accuracy
through algorithmic innovations. However, despite these advances, KGC predictions exhibit significant
imperfections that limit their direct applicability in real-world knowledge graph curation. To our
knowledge, KGValidator [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is one of the notable exceptions that attempts to use LLMs to automate
the validation process of the KGC predictions. A retrieval-augmented generation workflow against the
open-source knowledge base and web search is proposed to reduce human efforts in the post-processing
step of KGC predictions.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>As illustrated in Figure 1, the proposed method first generates predicted triplets and integrates internal
knowledge graph evidence and external references to rerank the predicted triplets. To mimic the real-life
situation, this study generates the predicted triplets using traditional embedding-based KGC methods.
For each KGC task (e.g. head prediction or tail prediction), the top-k predicted entities are included
for the subsequent LLM reranking workflow. Including traditional KGC methods effectively
reduces the search space for the LLM reranking workflow. However, this also creates a reliance on the
quality of the conventional KGC method, as the reranking process can only be beneficial if the correct
prediction exists in the initial top-k predicted entities for a KGC task.</p>
      <p>Overall, to rerank the potential candidates for a KGC task, the retrieval process consists of three main
steps: 1) retrieval of internal graphical evidence, 2) retrieval of external references via the DuckDuckGo
search engine, and 3) reranking KGC predictions based on LLM token probabilities.</p>
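      <p>The three steps above can be sketched as the following loop. Every name here is a placeholder:
the judge function stands in for the LLM call of the reranking step, and internal evidence retrieval
is reduced to a dictionary lookup, so this is an illustration of the control flow only, not the authors'
implementation:</p>
      <preformat>
```python
# Sketch of the three-step reranking workflow for one KGC query (h, r, ?).
def rerank_candidates(query_head, query_relation, candidates, kg_by_relation, judge):
    scored = []
    for entity in candidates:
        triplet = (query_head, query_relation, entity)
        internal = kg_by_relation.get(query_relation, [])  # step 1: internal graph evidence
        external = []                                      # step 2: web search snippets would go here
        scored.append((entity, judge(triplet, internal, external)))
    # step 3: order candidates by LLM confidence, highest first
    return [entity for entity, score in sorted(scored, key=lambda pair: -pair[1])]
```
      </preformat>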
      <sec id="sec-3-1">
        <title>3.1. Retrieval of internal graphical evidence</title>
        <p>Drawing inspiration from how humans validate predicted triplets, the first step involves retrieving
internal evidence directly from the knowledge graph. For each predicted triple, three types of retrieval
are performed:
• First, for each triplet of form (h, r, t), graph walks are initiated from the head entity h to the tail
entity t, or vice versa, with a maximum walk length of 3. These walks explore the intermediate
nodes and edges that connect h and t within the knowledge graph. For example, for a triplet
(Paris, locatedIn, France), a valid graph traversal path could be (Paris, partOf, Île-de-France),
(Île-de-France, locatedIn, France). The resulting paths represent potential reasoning chains that
may explain or support the predicted relationship. These paths capture indirect associations
and contextual information, which can help validate the plausibility of the predicted triplet by
highlighting logical or meaningful connections between the entities.
• Second, all triplets in the knowledge graph are encoded and stored in an embedding index. For
each predicted triplet, dense retrieval is performed to identify and retrieve semantically similar
triplets. This dense retrieval process is designed to uncover similar patterns not directly connected
to the triplets, providing additional context or supporting evidence for the prediction.
• Lastly, for each predicted triplet, all triplets in the knowledge graph that share the same relation r
are also retrieved. These examples offer additional context for the predicate, providing insights
into its typical usage and helping to prevent misinterpretation, especially when the relation has
multiple or ambiguous meanings. For example, in FB15K-237 [22], a widely-adopted KGC dataset,
the relation “place” is sometimes used in a self-referential triplet, such as (’Glen Cove’, ’place’,
’Glen Cove’). The triplet indicates that Glen Cove is categorized as a place. Such self-referential
examples may initially seem confusing for human annotators, but they are technically accurate.
Depending on the definitions of relations in different knowledge graphs, similar relations can
occasionally lead to ambiguity. The proposed method retrieves example triples corresponding to
the target relation to address this issue, providing more contextual examples for the subsequent
LLM prompting.</p>
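        <p>The graph-walk retrieval of the first step can be sketched as a bounded breadth-first path
enumeration; the tiny knowledge graph below reuses the Paris example from the text, and the function
is an illustrative sketch rather than the authors' implementation:</p>
        <preformat>
```python
from collections import defaultdict

# Enumerate relation paths of length up to max_len between head and tail,
# mirroring the graph-walk retrieval described above.
def paths_between(triples, head, tail, max_len=3):
    adj = defaultdict(list)
    for h, r, t in triples:
        adj[h].append((r, t))
    results, frontier = [], [(head, [])]
    for _ in range(max_len):
        next_frontier = []
        for node, path in frontier:
            for rel, nxt in adj[node]:
                step = path + [(node, rel, nxt)]
                if nxt == tail:
                    results.append(step)       # found a supporting reasoning chain
                else:
                    next_frontier.append((nxt, step))
        frontier = next_frontier
    return results

kg = [("Paris", "partOf", "Île-de-France"),
      ("Île-de-France", "locatedIn", "France")]
paths = paths_between(kg, "Paris", "France")
```
        </preformat>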
        <p>Example prompt
You are given a target knowledge graph triplet in the form (head, relation, tail). Your task is to judge whether
this triplet is factually correct based on the given information and common sense.</p>
        <p>To help you with the judgment, we provide some additional information that may be helpful to your judgment:
1) existing triplets in the knowledge graph have the same relation, so you can see if the target triplet follows the
same pattern.
2) existing triplets in the knowledge graph that may be relevant to the target triplet.
3) additional context that may be relevant from a search engine to validate the target triplet.
Remember your task is to judge the factuality of the target triplet based on the provided information and your
internal knowledge. Use common sense where applicable. Always start your answer with correct, incorrect
or NEI(not enough info) to indicate your judgment of the factuality of the target triplet and give a short
one-sentence reason to justify your judgement.
== Provided information ==
KG triplets with same relation: [(’John Wood’, ’nationality’, ’England’), (’Mark Twain’, ’nationality’, ’United
States of America’), (’Lars von Trier’, ’nationality’, ’Denmark-GB’), (’Sridevi’, ’nationality’, ’India’), (’Brad Grey’,
’nationality’, ’United States of America’)]
Existing triplets: [[(’Stevie Wonder’, ’currency’, ’United States Dollar’), (’United Kingdom’, ’currency’, ’United
States Dollar’)], [(’Stevie Wonder’, ’currency’, ’United States Dollar’), (’United Kingdom’, ’adjustmentcurrency’,
’United States Dollar’)], [(’Elton John’, ’awardnominee’, ’Stevie Wonder’), (’Elton John’, ’nationality’, ’United
Kingdom’)], (’Stevie Nicks’, ’nationality’, ’United States of America’), (’Stevie Ray Vaughan’, ’nationality’,
’United States of America’), (’Stevie Wonder’, ’profession’, ’Singer-songwriter-GB’), (’Stevie Wonder’, ’profession’,
’Actor-GB’), (’Stevie Wonder’, ’origin’, ’Detroit’)]
Additional context: Wonder told the BBC that gaining Ghanaian nationality on his birthday was an “amazing
thing”. The superstar was born and bred in the US state of Michigan but has long had an affinity for Ghana - a ...
American music icon Stevie Wonder has been granted Ghanaian citizenship. In a visit to Accra, Ghana’s capital,
in which he was joined by his family, the artist heard a speech from the country’s ...</p>
        <p>Ghana was the first Black African country south of the Sahara to achieve independence from colonial rule in
1957, britannica.com said. Blind from infancy, Wonder was born in Saginaw as Steveland ...</p>
        <p>Target triplet: (’Stevie Wonder’, ’nationality’, ’United Kingdom’)</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval of external reference</title>
        <p>
          In addition to internal references derived from patterns and reasoning paths within the knowledge
graph, external references play a crucial role in validating and ranking predicted triples, particularly
when internal knowledge is insufficient. Previous studies [
          <xref ref-type="bibr" rid="ref6">23, 6</xref>
          ] have shown that predicted triples, while
deemed incorrect under the closed-world assumption commonly used in automated KGC evaluations,
may be correct under the open-world assumption. Incorporating external references from outside the
knowledge graph can effectively address this discrepancy, providing additional validation during the
reranking process and improving the accuracy of predicted triples. In the process, external evidence
is retrieved using the DuckDuckGo search engine. The queries are constructed based on the simple
concatenations of the predicted triplet’s head, relation, and tail (e.g. ‘Paris partOf Île-de-France’).
DuckDuckGo is selected for its privacy-preserving search capabilities, which provide broad and unbiased
retrieval from diverse external sources. The top results, including documents, web pages, and knowledge
snippets, are collected and processed to supplement the validation process.
        </p>
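          <p>The query construction described above can be sketched as a simple concatenation. The search
call itself is left commented out: it needs network access and assumes the third-party
duckduckgo_search package, which is a tooling assumption for illustration, not the authors' exact
client:</p>
          <preformat>
```python
# Build the external-evidence query as a plain concatenation of the
# predicted triplet's head, relation, and tail, as described above.
def build_query(triplet):
    head, relation, tail = triplet
    return " ".join([head, relation, tail])

query = build_query(("Paris", "partOf", "Île-de-France"))

# Assumed tooling (not run here): retrieve the top snippets for the query.
# from duckduckgo_search import DDGS
# snippets = [result["body"] for result in DDGS().text(query, max_results=3)]
```
          </preformat>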
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Reranking with LLMs</title>
        <p>After retrieving internal graphical evidence and external references, the next step is constructing
a structured prompt for the LLM. This prompt includes the original KGC prediction alongside the
supporting or contradicting evidence from internal and external sources. The prompt aims to refine the
ranking of predictions for each KGC query, prioritising the most accurate results. While LLMs have
already shown impressive performance with various reasoning tasks, even under the zero-shot settings,
there has been limited success on the important text ranking problem using off-the-shelf LLMs [24].
Several existing studies [24, 25, 26] have focused on improving the document ranking problems by
formulating the task with different prompting strategies. The two main formulations of the ranking task
are list-wise ranking and point-wise ranking. In list-wise ranking, the ranking candidates are evaluated
together, requiring the inclusion of all candidates’ contextual information within the prompt for the
LLM. While this method has shown effectiveness in previous studies [24, 25], it often leads to conflicting
or irrelevant outputs, particularly when using moderate-sized LLMs. For ranking KGC predictions,
list-wise ranking poses a significant challenge, as it necessitates an exceptionally long context to account
for all candidates, potentially exceeding the input length limit of some models and causing confusion
or hallucination for the LLMs. Pair-wise ranking with LLMs [26] addresses the long-context issue
inherent in list-wise ranking by comparing candidates in pairs. However, it can be computationally
expensive, as the model may need to evaluate numerous permutations of candidate pairs. Even with
optimized sorting and selection strategies, pair-wise ranking can still incur significant computational
costs, making it less efficient for tasks with large candidate sets. For ranking KGC predictions, pair-wise
ranking becomes significantly less efficient due to the large number of predictions typically generated
by algorithms, requiring an impractical number of pairwise comparisons.</p>
        <p>This paper proposes a method for ranking KGC predictions by prompting the LLM with each predicted
triplet while specifying the LLM agent to start the response with the judgement of “Correct”, “Incorrect”
or “NEI” (Not Enough Information). The token probability of the first generated token is then used
to quantify and rank the candidates for each KGC query. Table 1 provides an illustrative example of
the designed prompt. In the designed prompt, instructions are explicitly added for the LLM to begin
by providing a natural language judgment on the correctness of the KGC predicted triplet, given the
retrieved evidence. Instead of simply providing a ranked output, the LLM first assesses the prediction,
either affirming it as correct or flagging it as incorrect, based on the evidence provided. This step
allows the model to engage in reasoning and explain its decision, similar to how a human evaluator
might approach the task. The probability of the first generated token, P(token₁), is recorded as a
quantification of the LLM’s internal assessment and is used for ranking.</p>
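        <p>A minimal sketch of this ranking rule follows: each candidate's verdict and first-token probability
are combined into a sortable score. The exact combination rule below (positive for "Correct", zero for
"NEI", negative for "Incorrect") and the numeric values are illustrative assumptions, not the authors'
specification:</p>
        <preformat>
```python
# Assumed scoring rule: combine the verdict and the first-token probability
# into one number so candidates can be sorted.
def ranking_score(verdict, first_token_prob):
    if verdict == "Correct":
        return first_token_prob      # confident "Correct" ranks highest
    if verdict == "NEI":
        return 0.0                   # undecided candidates sit in the middle
    return -first_token_prob         # confident "Incorrect" ranks lowest

# Hypothetical (entity, verdict, first-token probability) judgments
judged = [("United Kingdom", "Incorrect", 0.9),
          ("United States of America", "Correct", 0.8),
          ("Ghana", "Correct", 0.6)]
reranked = sorted(judged, key=lambda item: -ranking_score(item[1], item[2]))
```
        </preformat>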
        <p>
          For KGC tasks like head prediction (?, r, t) and tail prediction (h, r, ?), traditional deep-learning-based
KGC methods are used to generate a ranking of candidates. The proposed LLM-based reranking
method takes the top 10 candidates for each query, retrieves relevant reference information for each
candidate, and applies the prompting strategy to rerank them based on the retrieved evidence. This
effectively reduces the search space for the LLM reranking process. Unlike prior work [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] that focuses
on individually validating each predicted triplet, this paper formulates the task as a reranking problem.
There are two key reasons for this choice. First, this paper positions the LLM as an efficient tool to aid human
decision-making rather than a full replacement for human expertise, which we argue is more appropriate
for processes where accuracy is highly sensitive and critical. Human evaluators can concentrate on the
most promising predictions by leveraging the LLM to quickly organize and rerank the top candidates
generated by existing systems. Second, maintaining the ranking framework for KGC predictions allows
for a more seamless evaluation of the LLM’s impact, as standard ranking metrics can still be applied to
assess the effectiveness of the reranking process.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Basic Settings</title>
        <p>The experiments in this paper are designed to evaluate the effectiveness of the proposed LLM reranking
process. Standard ranking metrics commonly applied in KGC tasks are used for comparison. Traditional
embedding-based KGC methods are first employed in the experiments to generate baseline results. From
these, the top 10 candidate predictions for each KGC query are selected and subsequently reranked using
the proposed LLM reranking process. The performance before and after reranking is then reported,
demonstrating the impact of the LLM reranking method. Since only the top 10 predicted candidates
from the KGC tasks are selected for reranking, metrics such as Mean Rank (MR) and Mean Reciprocal
Rank (MRR) are no longer applicable for evaluation. Instead, Hit@1 and Hit@3 are reported in the
results table to reflect performance.</p>
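        <p>For reference, the Hit@k metric used in the evaluation is the fraction of queries whose correct
entity appears among the top-k (re)ranked candidates; a minimal sketch with toy rankings:</p>
        <preformat>
```python
# Hit@k: fraction of queries whose gold entity appears in the top-k candidates.
def hits_at_k(ranked_lists, gold_answers, k):
    hits = sum(1 for ranking, gold in zip(ranked_lists, gold_answers)
               if gold in ranking[:k])
    return hits / len(gold_answers)

# Toy example: two queries with their ranked candidates and gold answers
rankings = [["France", "Germany", "Spain"], ["United Kingdom", "USA", "Ghana"]]
gold = ["France", "USA"]
hit1 = hits_at_k(rankings, gold, 1)  # 0.5: only the first query hits at rank 1
hit3 = hits_at_k(rankings, gold, 3)  # 1.0: both gold entities are in the top 3
```
        </preformat>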
        <p>The experiments are conducted on two datasets commonly used for knowledge graph completion
tasks. The first dataset, FB15K-237 [22], is a subset of the Freebase knowledge graph, containing 237
different relations. It is widely adopted as a standard benchmarking dataset for evaluating various
KGC methods under the closed-world assumption. The second dataset, InferWiki [27], was designed to
enhance the evaluation of KGC methods under the open-world assumption. It includes human-annotated
test triplets that indicate whether the triples can be verified through inferential patterns within the
KG or by referencing external open-world sources. Positively labelled test triplets are selected as the
test set for the experiments, as these annotations are thoroughly verified using available open-world
evidence. In automatic evaluations, biases can occur when unknown facts in the knowledge graphs are
assumed to be incorrect. The annotations in the InferWiki dataset help minimize this bias, simulating
an evaluation in a real open-world setting.</p>
        <p>The TransE and RotatE models are chosen as the baseline KGC methods. For each dataset used in
the experiments, pipelines from the Pykeen library [28] are used to train and evaluate the traditional
embedding-based KGC approaches. The traditional KGC models are trained for 10 epochs with an
embedding size of 100, and their performances on the tail prediction task serve as the baselines. The
top ten candidates for each tail prediction query are preserved for further reranking using an LLM.
GPT-3.5 and GPT-4o-mini are selected as the LLMs for this evaluation. In the LLM reranking process,
reasoning paths of up to length 2 are used as internal KG references, while the top three results from
the DuckDuckGo search API are selected as external references.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>Tables 3 and 4 present the performance of LLM reranking when different types of evidence are
provided, with TransE and RotatE serving as the original KGC methods, respectively. In the tables, full
reference refers to when internal graphical reference and external reference from search engines are
available for the LLM. In both tables, the LLM reranking process achieves the highest performance
across all datasets, significantly improving upon the original predictions made by existing KGC methods.
The more recent LLM at the time of the experiment, GPT-4o-mini, also outperforms its predecessor,
GPT-3.5, across all tasks, highlighting how the ongoing evolution of LLMs can lead to continuous
improvements in this reranking task. Furthermore, on both datasets, the original KGC method RotatE
offers stronger baselines than TransE. The reranking results for RotatE are also significantly higher,
primarily due to the superior quality of its initial pool of candidate predictions. Notably, since the LLM
reranking process only has access to the top-10 candidates from the existing predictions, the theoretical
upper bound for Hit@1 and Hit@3 metrics after reranking is exactly the Hit@10 performance of the
original KGC method. With this in mind, the LLM reranking with full evidence, especially GPT-4o-mini,
comes reasonably close to the theoretical best for Hit@3 when both internal and external references
are provided. This observation demonstrates the potential of the proposed reranking process to reduce
human effort significantly.</p>
        <p>Additionally, it can be observed from Tables 3 and 4 that LLM reranking generally performs
better when only internal graphical references are available compared to when relying solely on external
references from a search engine. The only exception is GPT-4o-mini on InferWiki64K, with RotatE
as the original KGC method. While a more comprehensive external reference search might improve
results, the comparison still supports our initial intuition that internal graphical evidence plays a
crucial role in validating and ranking KGC predictions. Interestingly, when relatively earlier LLMs
like GPT-3.5 are used, reranking using only external references results in worse performance than the
original predictions. This can be attributed to two main factors. First, the quality of search results from
the DuckDuckGo search API is not always reliable due to the relatively basic search strategy. When
LLMs are provided solely with external references, they may struggle to make accurate judgments
without clear or relevant evidence, leading to confusion. Second, the automatic ranking evaluation
operates under the closed-world assumption, which treats unobserved facts as incorrect. External
references from web searches can introduce open-world knowledge, causing LLMs to rank specific
triplets higher based on external evidence. While this behaviour aligns with how LLMs handle
open-world information, it leads to predictions being considered incorrect due to the closed-world evaluation
criterion. Interestingly, similar effects are not observed on the InferWiki64K dataset. This may be
because the test set of InferWiki64K has been validated by human annotators against open-world
knowledge. As a result, for each query pattern of (h, r, ?) and (?, r, t), most of the correct triplets are
already included in the dataset, creating a scenario that approximates open-world conditions even when
evaluated under closed-world assumptions.</p>
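The interaction between open-world reranking and closed-world evaluation can be sketched in a few lines. The function and values below are a minimal illustration under assumed data, not the paper's evaluation code: in the standard filtered ranking protocol, other known-true answers are skipped, but an unobserved candidate promoted above the gold answer still worsens its rank, even when that candidate is true in the open world.

```python
def filtered_rank(ranked, target, known_true):
    """Closed-world ('filtered') rank of the gold answer: other answers already
    known to be true in the KG are skipped, but any unobserved candidate above
    the target still pushes its rank down."""
    rank = 1
    for cand in ranked:
        if cand == target:
            return rank
        if cand not in known_true:  # unobserved => treated as incorrect
            rank += 1
    return None  # gold answer absent from the candidate list

# Hypothetical LLM reranking for (Parkersburg, timezones, ?) that favours the
# open-world-plausible Central Time Zone.
reranked = ["Central Time Zone", "Eastern Time Zone", "GMT"]

# FB15k-237-style KG: only the Eastern Time Zone triple is observed, so the
# promoted Central Time Zone penalises the metric.
rank_sparse = filtered_rank(reranked, "Eastern Time Zone", {"Eastern Time Zone"})

# InferWiki64K-style KG: most correct answers are already in the dataset, so
# the same reranking is no longer penalised.
rank_dense = filtered_rank(reranked, "Eastern Time Zone",
                           {"Eastern Time Zone", "Central Time Zone"})
```

Here `rank_sparse` is 2 while `rank_dense` is 1, which matches the observation that a human-validated, near-complete test set approximates open-world conditions even under closed-world scoring.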
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Reranking case studies</title>
        <p>Table 5 presents several examples of LLM reranking outputs. These include the original KGC queries,
the top 10 candidate predictions from existing methods, and the LLM-reranked outputs and their
respective probabilities. For the first query (Parkersburg, timezones, ?), the correct test triplet is
(Parkersburg, timezones, Eastern Time Zone), where Parkersburg refers to the town in West Virginia,
which is in the Eastern time zone. However, the original KGC embedding method incorrectly ranked
the Greenwich Mean Time Zone as the top candidate. During the reranking process, the LLM assigned
a confidence score of 0.98 to the correct Eastern Time Zone, placing it second behind the Central Time
Zone, which received a 0.99 confidence score. While the correct candidate’s rank did not improve, it
can be observed that under the open-world assumption, the Central Time Zone may also be considered
valid, as Parkersburg could refer to the town in Illinois, which is in the Central Time Zone. At the
same time, the incorrect prediction of the Greenwich Mean Time Zone is ranked much lower by the
reranking process.</p>
        <p>For the second query (Joe Shuster, placeofdeath, ?), the correct test triplet is (Joe Shuster, placeofdeath,
Los Angeles). The LLM reranking process successfully ranks Los Angeles as the top candidate, whereas
the original embedding method struggles to distinguish between various U.S. locations. Interestingly,
Malibu and New York City are also assigned relatively high probabilities. While Malibu is reasonable
given its proximity to Los Angeles, New York City’s high ranking is likely due to it being a notable
place where Joe Shuster lived during his lifetime. The observation highlights the importance of treating
the post-processing of KGC predictions as a ranking task instead of individual validation tasks, as
binary validation may lead to incorrect conclusions when dealing with partially correct predictions.
For the third query (A.I. Artificial Intelligence, genre, ?), the correct genre is Science Fiction in the test
set. However, beyond the context of the knowledge graph, Drama is also a valid genre as the movie is
often classified as a Science Fiction Drama film. The LLM reranking process correctly ranks the two as
top candidates with identical probabilities. The final query (University of Ottawa, major-fieldofstudy,
?) represents a classic one-to-many relationship in the knowledge graph, resulting in multiple valid
candidates. In the FB15k-237 dataset, only Chemical Engineering, Computer Science, and Mathematics
are listed. However, the search results suggest that the University of Ottawa offers relevant majors
across all top-ranked candidates. The LLM ranks all majors with high probabilities when external
references are provided. Interestingly, the LLM assigns lower probabilities to broader terms like Science
and Engineering, likely because they refer to fields of study rather than specific majors. This observation
suggests that the LLM’s decision-making process closely aligns with that of human annotators in this
context.</p>
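The ranking-versus-binary-validation argument above can be illustrated with a small sketch. The confidence values are invented for illustration, not the actual model outputs reported in Table 5: a fixed accept/reject threshold treats every passing candidate as equally correct, while ranking still isolates the single best answer among partially correct predictions.

```python
def rerank_by_confidence(candidates, scores):
    """Order candidates by LLM-assigned confidence, highest first."""
    return sorted(candidates, key=lambda c: scores[c], reverse=True)

def binary_validate(candidates, scores, threshold=0.5):
    """Accept or reject each candidate independently against a threshold."""
    return {c: scores[c] >= threshold for c in candidates}

# Hypothetical confidences for (Joe Shuster, placeofdeath, ?).
scores = {"Los Angeles": 0.95, "New York City": 0.85,
          "Malibu": 0.80, "Cleveland": 0.10}
cands = list(scores)

ranking = rerank_by_confidence(cands, scores)  # "Los Angeles" comes first
verdicts = binary_validate(cands, scores)      # Malibu and NYC also "pass"
```

Under binary validation, the plausible-but-wrong Malibu and New York City triples are accepted alongside the gold answer, whereas the ranked view still places Los Angeles unambiguously on top.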
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitation &amp; Conclusion</title>
      <p>This paper demonstrates the effectiveness of using large language models as a post-processing step
to rerank knowledge graph completion predictions. By leveraging LLMs’ advanced contextual
understanding and their next-token probabilities, the proposed method can refine the initial predictions
generated by traditional KGC models. Experimental results on existing KGC benchmarking datasets
have shown that LLMs substantially improve the rankings of potential candidates. Using the LLM in the
post-processing phase means that future knowledge discovery systems could achieve higher accuracy
and completeness without requiring significant manual intervention. This has broad implications for
automating knowledge graph curation and completion, as it reduces the reliance on human oversight
to vet predictions, making the system more scalable. While the results presented in this paper show
promise, several limitations must be acknowledged. The success of the LLM-based reranking process is
still dependent on the initial quality of KGC predictions, and while the LLM enhances accuracy, it is not
infallible. This approach is analogous to existing information retrieval systems, where an initial dense
retrieval phase is typically followed by a reranking stage with cross-encoders. In these systems, the
initial retrieval aims to capture a broad set of relevant results, which is crucial for the overall success
of the process. A limited number of failure cases can also be observed, especially when the LLMs are
presented with ambiguous references or triplets. The imperfect results also demonstrate that at the
current stage, using the LLM as a reranking tool to help reduce human effort presents a more promising
path than directly using the LLM to replace human validation of KGC predictions.</p>
      <p>Overall, using LLMs to rerank and validate KGC predictions is an important step forward in developing
robust and accurate knowledge mining in KGs. The experimental results demonstrate that the
LLM reranking system can reduce human effort. As LLMs continue to advance, they will likely play
an increasingly critical role in KG curation pipelines, further minimizing human intervention and
enhancing the quality and scalability of knowledge graphs for their downstream applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research is funded by the Royal Bank of Canada’s Wealth Management Strategy, Products &amp; Digital
Investing team.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-3.5 for: Grammar and spelling checks.
After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
  </back>
</article>