<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>HyP-KGRAG: Hypothetical Path-Based Knowledge Graph Retrieval Augmented Generation with DeepSeek</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhaotai Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <email>harald.sack@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Genet Asefa Gesese</string-name>
          <email>genet-asefa.gesese@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Knowledge Graph, Large Language Models, Question Answering, Retrieval Augmented Generation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) often produce hallucinated or factually incorrect responses in domain-specific question answering (QA) tasks. To address this limitation, this work explores the integration of Knowledge Graphs (KGs) with Retrieval-Augmented Generation (RAG) as a strategy to improve factual accuracy and multi-hop evidence selection. Specifically, the benefits of using structured information from a KG to enhance the QA performance of the DeepSeek model are investigated. A novel framework, HyP-KGRAG, is introduced in which KG triples are retrieved via hypothetical paths and refined through an LLM-based denoising module. Experimental results on the materials science MSE-KG dataset show that HyP-KGRAG significantly improves QA performance over standalone DeepSeek and the other baseline models, achieving a ROUGE-1 F1 score of 0.532 and an SBERT similarity of 0.629.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large language models (LLMs) have attracted significant attention due to their remarkable capabilities
in various natural language processing (NLP) tasks, including question answering (QA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], text
summarization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and machine translation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], driven by massive data, parameters, and computational
resources. However, their parametric knowledge is limited by training data timeliness and domain
coverage, leading to hallucinations, i.e., factually incorrect or ungrounded outputs. Traditional solutions
to mitigate hallucinations typically rely on retraining or fine-tuning LLMs with updated data, a process
that is computationally costly and inflexible. An alternative and more efficient approach is
Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which dynamically incorporates external knowledge sources, such as
documents or knowledge graphs (KGs), into the generation process. RAG improves factual accuracy and
relevance without necessitating costly retraining, thereby overcoming some limitations of standalone
LLMs. KGs, with their structured representation of information as triples, comprising a head entity, a
relation, and a tail entity, play a vital role in enhancing RAG systems. They provide rich, explicit, and
semantically organized knowledge, which can be leveraged to improve reasoning, factual consistency,
and explainability in QA tasks.
      </p>
      <p>This work focuses on addressing the following research question: How can KGs be effectively leveraged
within a RAG framework to reduce hallucinations and enhance domain-specific QA performance in LLMs
such as DeepSeek?</p>
      <p>The key contributions of this work are as follows:
• HyP-KGRAG framework: a novel RAG-based framework that retrieves KG triples via
hypothetical paths and leverages LLMs for denoising and answer generation.
• Experiments on a materials science KG, demonstrating that HyP-KGRAG significantly enhances
domain-specific QA performance in terms of accuracy and semantic coherence.
• Ablation studies that validate the benefits of triple-structured retrieval and the denoising module
in achieving a better balance between precision and recall.</p>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Preliminaries</title>
        <p>
          DeepSeek-V3. DeepSeek-V3 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a Mixture-of-Experts (MoE) language model with 671B total
parameters, designed with Multi-head Latent Attention and DeepSeekMoE architectures for efficient
training and inference, achieving performance competitive with leading closed-source models while
maintaining a modest compute footprint.
        </p>
        <p>
          RAG. RAG [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a framework that integrates parametric memory, typically a pre-trained
sequence-to-sequence language model, with non-parametric memory in the form of a dense index over external
knowledge sources such as Wikipedia. This setup allows the model to retrieve relevant information
dynamically during generation, improving performance on knowledge-intensive tasks. RAG enhances
factual accuracy, diversity, and specificity of responses compared to models relying solely on internal
parametric knowledge.
        </p>
        <p>
          BM25. BM25 (Best Matching 25) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a probabilistic information retrieval model that improves upon
traditional TF-IDF by introducing term frequency saturation and document length normalization. It is
widely used in modern search engines (e.g., Elasticsearch) due to its robust performance in ranking
documents based on query relevance.
        </p>
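        <p>As a minimal sketch of the two ideas named above, term frequency saturation and document length normalization, the following Python function scores documents against a query with the common Okapi BM25 formula. The parameter values k1 = 1.5 and b = 0.75, the whitespace tokenizer, the smoothed IDF variant, and the example triples are illustrative assumptions, not settings or data reported in this work.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25.

    k1 controls term-frequency saturation and b controls document-length
    normalization; both defaults are common choices, assumed here.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        length_norm = 1 - b + b * len(toks) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF (the "+ 1" keeps the value non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # The fraction approaches (k1 + 1) as tf grows: saturation.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up verbalized triples, used only for illustration.
docs = [
    "Nano2023 published by Alice",
    "Alice affiliation MIT",
    "Lecture5 cites Nano2023",
]
scores = bm25_scores("who published Nano2023", docs)
```
]]></preformat>
        <p>In this toy example the first document matches two query terms, the third matches one, and the second matches none, so the scores are ordered accordingly.</p>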
        <p>
          BGE-M3. BGE-M3 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], developed by the Beijing Academy of Artificial Intelligence (BAAI) and released
in early 2024, is a general-purpose text embedding model that produces 1024-dimensional
dense representations and supports cross-task and cross-lingual semantic matching. As a flagship model
in the BGE series, it targets three dimensions: multi-linguality, multi-functionality (dense, sparse,
and multi-vector retrieval), and multi-granularity (inputs ranging from short sentences to long documents).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Related Work</title>
        <p>
          The integration of KGs into RAG has emerged as a promising direction for mitigating
hallucination in LLMs. Several recent works have proposed methods for incorporating structured
knowledge into the RAG pipeline, moving beyond traditional text-based retrieval. KAPING [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
introduces a zero-shot KG-RAG framework that retrieves and linearizes relevant KG triples into natural
language prompts without requiring model fine-tuning. While this method improves zero-shot
performance, it does not leverage the rich structural and multi-hop relationships inherent in KGs. To address
this, CommunityKG-RAG [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] utilizes community detection via the Louvain algorithm and aligns KG
clusters with queries using BERT embeddings. This structure-aware approach significantly improves
fact-checking and contextual relevance in zero-shot settings.
        </p>
        <p>
          For more complex query understanding and summarization tasks, GraphRAG [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] constructs KGs
from input documents and performs topic-oriented clustering using the Leiden algorithm. Although
effective for global summarization, its performance on traditional QA tasks remains limited.
LightRAG [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] enhances retrieval efficiency and adaptability through graph-structured text indexing and
a two-layer retrieval process, enabling improved comprehension of both entity-level and topic-level
semantics. Reasoning over KGs has also gained attention. RoG [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] presents a
“planning-retrieval-reasoning” framework that explicitly models reasoning paths over KGs to guide LLMs, resulting in
more explainable outputs. In a related effort, GNN-RAG [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] integrates Graph Neural Networks into
the RAG pipeline to perform multi-hop reasoning over dense subgraphs, improving answer accuracy in
complex KGQA scenarios.
        </p>
        <p>
          KG2RAG [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] further extends retrieval by combining semantic similarity-based chunk selection with
m-hop KG traversal. A graph-structured context organization module filters and arranges retrieved
chunks into coherent passages for LLM input, enhancing both coverage and relevance. In professional
domains, KAG [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] unifies LLMs and KGs via a hybrid reasoning engine and mutual indexing, enabling
logical reasoning and high semantic fidelity in settings like government and medical QA. KGMistral [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]
proposes a hybrid framework that augments the Mistral 7B model with rule-based SPARQL queries. It
supports one-hop and two-hop question patterns by aligning predefined templates with the logic of
user queries. Although effective in scientific domains, KGMistral’s reliance on static templates may
limit its flexibility across diverse question types.
        </p>
        <p>
          In contrast to existing methods, the proposed HyP-KGRAG framework introduces a novel hypothetical
path-based retrieval approach that leverages structured knowledge from KGs and incorporates
LLM-based denoising to enhance QA performance. While it is inspired by HyDE [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], which generates
hypothetical documents using instruction-following LLMs such as InstructGPT, HyDE does not utilize
KGs. HyP-KGRAG uniquely grounds its hypothetical paths in KG triples, enabling more semantically
aligned and factually grounded multi-hop evidence retrieval for domain-specific QA tasks. Moreover,
while most prior KG-augmented RAG approaches utilize various LLMs such as GPT and Mistral, this
work specifically focuses on integrating KG-based retrieval with the DeepSeek-V3 model. Unlike
general-purpose RAG systems, HyP-KGRAG is tailored to address hallucination and domain adaptation
challenges observed in DeepSeek during domain-specific QA.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. HyP-KGRAG</title>
      <p>This section introduces HyP-KGRAG (Hypothetical Path-Based KG Retrieval Augmented Generation
with DeepSeek), a novel framework designed to enhance the retrieval quality and multi-hop evidence
selection of LLMs, with a focus on DeepSeek without loss of generality, through KG integration. As
illustrated in Figure 1, HyP-KGRAG contains three main components: offline processing, hypothetical
path construction, and LLM-as-Judge denoising and generation.</p>
      <sec id="sec-3-1">
        <title>3.1. Offline Processing</title>
        <p>In the offline processing phase, HyP-KGRAG performs two key preparatory tasks: KG verbalization
and index construction. These steps transform structured knowledge into a format that enables fast
and semantically meaningful retrieval during inference.</p>
        <p>
          KG Verbalization. This step involves converting structured triples (i.e., subject, predicate, object)
into natural language documents. Unlike prior approaches that employ LLMs to generate free-form
verbalization of triples, HyP-KGRAG uses a linear verbalization strategy, as described in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In this
method, each triple is preserved in its original structure by directly concatenating the subject, predicate,
and object. This technique preserves the explicit relationships in the KG while minimizing noise from
generative paraphrasing. As confirmed by our ablation study in Section 4.5, such linear verbalization
consistently outperforms LLM-generated paraphrases in downstream QA tasks.
        </p>
        <p>
          Index Construction. Following verbalization, the triples are encoded and indexed for retrieval. For
semantic retrieval, the BGE-M3 embedding model is employed to encode verbalized triples into dense
vector representations, supporting multi-granularity encoding. The resulting embeddings are indexed
using FAISS [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] to enable efficient approximate nearest neighbor (ANN) search based on inner product
similarity during inference.
        </p>
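        <p>The two offline steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: toy_embed is a hypothetical hashed bag-of-words encoder standing in for BGE-M3, InnerProductIndex is a brute-force search standing in for the FAISS index, and the example triples are made up.</p>
        <preformat><![CDATA[
```python
import hashlib
import math


def verbalize(triple):
    """Linear verbalization: concatenate subject, predicate, and object."""
    subject, predicate, obj = triple
    return f"{subject} {predicate} {obj}"


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a BGE-M3 encoder: hashed bag of words,
    L2-normalized so that the inner product equals cosine similarity."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InnerProductIndex:
    """Brute-force inner-product search, standing in for a FAISS index."""

    def __init__(self):
        self.vectors = []
        self.triples = []

    def add(self, triple):
        self.triples.append(triple)
        self.vectors.append(toy_embed(verbalize(triple)))

    def search(self, query_text, k=3):
        query_vec = toy_embed(query_text)
        scored = [
            (sum(q * v for q, v in zip(query_vec, vec)), triple)
            for vec, triple in zip(self.vectors, self.triples)
        ]
        # Stable sort: ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Made-up triples, used only for illustration.
index = InnerProductIndex()
for t in [
    ("Alice", "published", "Nano2023"),
    ("Alice", "affiliation", "MIT"),
    ("Lecture5", "cites", "Nano2023"),
]:
    index.add(t)

best_score, best_triple = index.search("Alice published Nano2023", k=1)[0]
```
]]></preformat>
        <p>Because the vectors are normalized, a query that is identical to a verbalized triple reaches an inner product of 1, which is the maximum possible score.</p>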
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hypothetical Path Construction</title>
        <p>As depicted in Figure 1, HyP-KGRAG uses an LLM (specifically DeepSeek-V3) to identify potential
paths underlying the query; these hypothetical paths are structured as triples. The hypothetical triples
undergo the same verbalization process as described in Section 3.1. Subsequently, BGE-M3 is employed
to encode the verbalized text into vector representations. FAISS is then used to retrieve the semantically closest
factual triples from the offline-constructed vector database, with similarity measured via the inner
product of vectors. To construct complete retrieval paths for the query, HyP-KGRAG uses class
information to fill in missing components, whether they are target answers or intermediate nodes,
ensuring coherence and completeness of the paths. A class represents an abstract collection of entities
(e.g., Person, Researcher, Organization), making it well-suited as a bridge in incomplete paths.</p>
        <p>Effective multi-hop evidence retrieval over KGs requires generating coherent paths that span one or
more hops. Accordingly, user questions are categorized into two types:</p>
        <p>Single-hop Questions. These involve a direct relation between two entities, requiring only one
retrieval step to obtain the supporting evidence for the answer. These questions are relatively
straightforward, as they rely on a single factual assertion represented as a triple in the KG. For example,
the question “Who published the dataset ’Nano2023’?” requires identifying an entity (likely a person
or organization) directly connected to the dataset via the published relation. Although the question is single-hop,
HyP-KGRAG still constructs hypothetical paths in both directions to support more robust retrieval.
Class-constrained hypothetical paths include:</p>
        <p>
          <disp-formula><tex-math><![CDATA[\text{Person} \xrightarrow{\text{published}} \text{Nano2023}]]></tex-math></disp-formula>
          or
          <disp-formula><tex-math><![CDATA[\text{Nano2023} \xrightarrow{\text{published by}} \text{Person}]]></tex-math></disp-formula>
        </p>
        <p>Multi-hop Questions. These involve more complex chains of entities and relations, requiring
multi-hop evidence retrieval to gather the information needed to answer the question. For instance, the
question “Which researcher from MIT published the dataset ’Nano2023’ cited in ’Advanced Materials
Lecture 5’?” combines affiliation, authorship, and citation relationships across entities. Here, answering
requires traversing a sequence of links. Two possible hypothetical paths are:</p>
        <p>
          <disp-formula><tex-math><![CDATA[\text{MIT} \xleftarrow{\text{affiliation}} \text{Researcher} \xrightarrow{\text{published}} \text{Nano2023} \xleftarrow{\text{cites}} \text{Advanced Materials Lecture 5}]]></tex-math></disp-formula>
          or, alternatively,
          <disp-formula><tex-math><![CDATA[\text{Advanced Materials Lecture 5} \xrightarrow{\text{cites}} \text{Nano2023} \xrightarrow{\text{published by}} \text{MIT} \xrightarrow{\text{employs}} \text{Researcher}]]></tex-math></disp-formula>
        </p>
        <p>HyP-KGRAG dynamically adjusts the path length based on question complexity. To support the
generation of such paths, a structured prompt is defined to guide the LLM. This prompt instructs the
model to generate step-by-step logical triples for knowledge-grounded retrieval. The full prompt is
available in our GitHub repository<sup>1</sup>.</p>
        <p>Once a query’s hypothetical paths are generated, each triple is used to perform semantic similarity
search against the offline-constructed vector database to retrieve relevant factual triples. After
deduplication, the retrieved factual triples represent the preliminary result set and are passed on to the
denoising phase (see Section 3.3).</p>
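        <p>The retrieval step described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in this work: token_overlap is a hypothetical Jaccard similarity standing in for BGE-M3 embeddings with FAISS inner-product search, the value k = 2 is chosen only for the example, and the entities Alice, Bob, and KIT are made up. The hypothetical path uses the class name Researcher for the unknown node.</p>
        <preformat><![CDATA[
```python
def verbalize(triple):
    """Linear verbalization: subject, predicate, object joined by spaces."""
    return " ".join(triple)


def token_overlap(a, b):
    """Hypothetical stand-in for embedding similarity: Jaccard token overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve_for_path(hypothetical_path, kg_triples, k=2, similarity=token_overlap):
    """Retrieve the top-k factual triples for each hypothetical triple,
    then deduplicate while preserving first-seen order."""
    seen, results = set(), []
    for hyp in hypothetical_path:
        hyp_text = verbalize(hyp)
        ranked = sorted(
            kg_triples,
            key=lambda fact: similarity(hyp_text, verbalize(fact)),
            reverse=True,
        )
        for fact in ranked[:k]:
            if fact not in seen:
                seen.add(fact)
                results.append(fact)
    return results


# Made-up factual triples, used only for illustration.
kg = [
    ("Alice", "affiliation", "MIT"),
    ("Alice", "published", "Nano2023"),
    ("Advanced Materials Lecture 5", "cites", "Nano2023"),
    ("Bob", "affiliation", "KIT"),
]
# Class names such as "Researcher" fill the unknown nodes of the path.
path = [
    ("Researcher", "affiliation", "MIT"),
    ("Researcher", "published", "Nano2023"),
    ("Advanced Materials Lecture 5", "cites", "Nano2023"),
]
candidates = retrieve_for_path(path, kg, k=2)
```
]]></preformat>
        <p>In this toy run the third hypothetical triple retrieves only facts that were already found, so deduplication leaves four candidates. One of them, the triple about Bob, is unrelated to the question, which is the kind of noise the subsequent denoising phase is meant to remove.</p>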
      </sec>
      <sec id="sec-3-3">
        <title>3.3. LLM-as-Judge Denoising and Generation</title>
        <p>Adding all retrieved triples directly into the LLM’s input can be problematic due to input size limitations
and the potential inclusion of irrelevant or noisy triples retrieved via vector similarity. To address this,
HyP-KGRAG incorporates an LLM-as-Judge denoising mechanism inspired by the multi-dimensional
evaluation strategies used in GraphRAG and LightRAG. In this process, the LLM scores each candidate
triple based on the following three criteria, filtering out low-quality or off-topic information before
answer generation.</p>
        <p>• Directness: how clearly a triple indicates the answer to a query. Triples with relevant
entities and relations that directly point to the answer should receive higher scores.
Scoring: 2 = direct mention of relevant entities/relations; 1 = indirect reference.</p>
        <p>• Entity clarity: the precision and unambiguity of the entities within a triple, which strongly
influences its overall quality.
Scoring: 2 = explicit entities; 1 = ambiguous references.</p>
        <p>• Context relevance: the LLM’s assessment of whether the answer thoroughly addresses
all aspects and details of the question.
Scoring: 2 = core relevance; 1 = peripheral relevance.</p>
        <p>Only triples with a total score of 3 or higher are retained. Triples with ambiguous entities or lacking
contextual grounding are excluded, ensuring that only the most relevant and precise information
supports the generation process. The full prompt guiding this filtering process is available in our
GitHub repository: https://github.com/Zhaotai924/HyP-KGRAG.</p>
        <p>After denoising, the filtered and retrieved triples, serving as core contextual information, are combined
with the original query and input into the LLM (DeepSeek-V3) to generate a response.</p>
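        <p>The filtering rule can be sketched as follows. This is a minimal illustration under stated assumptions: toy_judge is a hypothetical rule-based stand-in for the LLM judge (the actual prompt is in the repository), the entities Alice, Bob, and KIT are made up, and the sketch assumes that a criterion that is not met at all receives a score of 0, a case the scoring rubric above does not spell out. The retention threshold of 3 is taken from the text.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class JudgeScores:
    directness: int         # 2 = direct mention, 1 = indirect reference
    entity_clarity: int     # 2 = explicit entities, 1 = ambiguous references
    context_relevance: int  # 2 = core relevance, 1 = peripheral relevance

    @property
    def total(self):
        return self.directness + self.entity_clarity + self.context_relevance


def denoise(triples, judge, threshold=3):
    """Keep only triples whose summed judge score reaches the threshold."""
    return [t for t in triples if judge(t).total >= threshold]


def toy_judge(question):
    """Hypothetical rule-based stand-in for the LLM judge.

    Assumption: a criterion that is not met at all scores 0.
    """
    q_tokens = set(question.lower().replace("?", "").replace("'", "").split())

    def judge(triple):
        subject, predicate, obj = triple
        hits = sum(part.lower() in q_tokens for part in (subject, predicate, obj))
        directness = min(hits, 2)
        entity_clarity = 2 if subject and obj else 1
        context_relevance = 2 if predicate.lower() in q_tokens else min(hits, 1)
        return JudgeScores(directness, entity_clarity, context_relevance)

    return judge


question = "Who published the dataset 'Nano2023'?"
# Made-up candidate triples, used only for illustration.
candidates = [
    ("Alice", "published", "Nano2023"),
    ("Bob", "affiliation", "KIT"),
    ("Advanced Materials Lecture 5", "cites", "Nano2023"),
]
kept = denoise(candidates, toy_judge(question))
```
]]></preformat>
        <p>Here the triple about Bob scores 2 in total and is dropped, while the other two triples score 6 and 4 and are passed on to answer generation.</p>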
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section, the experiments conducted to evaluate the performance of the proposed approach
are presented. The source code and the datasets are publicly available at
https://github.com/Zhaotai924/HyP-KGRAG.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>The dataset used for the experiments is MSE-KG<sup>2</sup> (Material Science and Engineering KG), which represents heterogeneous research data
within the NFDI-MatWerk<sup>3</sup> community and beyond. The KG
covers three main domains: i) research community structure, including researchers, projects, universities,
and institutions; ii) scientific infrastructure, such as software, workflows, instruments, facilities, and
educational materials; and iii) research data, spanning repositories, publications, datasets, and reference
data. The graph consists of 5,153 triples, 1,533 entities, and 109 distinct relations. To evaluate the
proposed approach, 37 competency questions were used, 13 of which are complex multi-hop
questions.</p>
        <p><sup>1</sup> https://github.com/Zhaotai924/HyP-KGRAG</p>
        <p><sup>2</sup> https://demo.fiz-karlsruhe.de/matwerk/</p>
        <p><sup>3</sup> https://nfdi-matwerk.de/</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baseline Models</title>
        <p>HyP-KGRAG is compared against four baselines:
• DeepSeek-V3: This model operates without any information from the KG, relying
exclusively on the parametric knowledge of DeepSeek-V3.
• KGMistral: This model uses SPARQL to extract information from a KG to improve the QA
performance of the Mistral model.
• BM25 Retrieval + Pre-Retrieval Query Processing: This model first verbalizes the triples and
performs sentence-level and paragraph-level sparse retrieval using BM25. It can
also use an LLM to process the original query to facilitate keyword-based matching in BM25.
• BGE-M3 Retrieval + Pre-Retrieval Query Processing: This model first verbalizes the triples
and performs sentence-level and paragraph-level dense retrieval using BGE-M3.
It can also use an LLM to process the original query to facilitate semantic matching in BGE-M3.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics and Experiment Setting</title>
        <p>
          The evaluation uses text-matching metrics at three distinct levels: ROUGE-1, ROUGE-2, and
ROUGE-L, to provide a systematic assessment of lexical coverage (unigram), phrase structure (bigram),
and semantic coherence (longest common subsequence, LCS). Each ROUGE metric is reported as Recall,
Precision, and F1 score, which capture ground-truth coverage, generation accuracy, and the global balance
between the two, respectively. In addition, BLEU and SBERT similarity measures are used. BLEU provides stricter n-gram
precision evaluation, which is especially valuable for measuring phrase-level fidelity. The SBERT
similarity measure is the cosine similarity computed between sentence embeddings from the
lightweight model “all-MiniLM-L6-v2” [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ], which offers a strong balance of semantic representation
ability and computational efficiency.
        </p>
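        <p>To make the reported quantities concrete, the following minimal sketch computes ROUGE-N precision, recall, and F1 from clipped n-gram overlap, together with the cosine similarity applied to sentence embeddings. It is an illustration only: the lowercased whitespace tokenizer and the example sentences are assumptions, and standard ROUGE toolkits apply additional normalization, so their scores may differ slightly.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # per-n-gram minimum count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up sentences, used only for illustration.
generated = "alice published the dataset"
reference = "the dataset was published by alice"
p1, r1, f1 = rouge_n(generated, reference, n=1)
p2, r2, f2 = rouge_n(generated, reference, n=2)
```
]]></preformat>
        <p>In this example every generated unigram appears in the reference, so ROUGE-1 precision is 1 while recall is 4/6. Only one of the three generated bigrams is shared, which illustrates why ROUGE-2 is the stricter measure.</p>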
        <p>For the hyperparameter settings of retrieving document chunks or triples, without loss of generality,
all retrievals are configured to return the top-k most relevant document chunks or triples for each query
(whether rewritten or not), where k ranges from 10 to 100, sorted by similarity score. These retrieved
documents are then fed into the LLM to generate answers using a standardized prompt structure. The
DeepSeek-V3 model is the LLM chosen for the experiments conducted in this work.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Results</title>
        <p>As shown in Table 1, HyP-KGRAG consistently outperforms the baseline models across all
precision-focused metrics. It achieves the highest F1 scores for ROUGE-1 (0.532), ROUGE-2 (0.408), and ROUGE-L
(0.491), along with leading precision scores of 0.569, 0.453, and 0.528, respectively. HyP-KGRAG also achieves
the top BLEU score (0.231), reflecting its strong capability in producing lexically accurate and concise
answers. These results are further visualized in Figure 2, where HyP-KGRAG demonstrates superior
performance in generating highly accurate and lexically precise answers, significantly outperforming
other models in minimizing errors and aligning with reference outputs. While BGE-M3 demonstrates
competitive recall, HyP-KGRAG’s precision-centric design ensures superior reliability in scenarios
demanding factual correctness. For instance, its ROUGE-1 precision (0.569) surpasses BGE-M3’s (0.473)
by about 20% in relative terms, reflecting stricter adherence to relevant content. Similarly, its BLEU score (0.231) outperforms
BGE-M3’s (0.200), further validating its lexical alignment strength.</p>
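        <p>As a worked check, the relative gains over BGE-M3 follow directly from the scores reported in this section:
          <disp-formula><tex-math><![CDATA[
\frac{0.569 - 0.473}{0.473} \approx 0.203, \qquad \frac{0.231 - 0.200}{0.200} = 0.155
]]></tex-math></disp-formula>
          That is, a relative gain of roughly 20% in ROUGE-1 precision and 15.5% in BLEU.</p>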
        <p>Compared to the earlier KGMistral7B baseline, which integrates SPARQL-based KG retrieval with
Mistral-7B generation, HyP-KGRAG achieves substantial improvements, particularly in precision and
semantic coherence. Meanwhile, the standalone DeepSeek-V3 model struggles severely on
domain-specific datasets (e.g., BLEU=0.002, SBERT=0.283), underscoring its limitations in the absence of prior
knowledge or retrieval augmentation.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Ablation Studies</title>
        <p>This section presents the ablation studies in detail. First, the proposed approach is evaluated by
comparing the linear-verbalization technique with LLM-based verbalization. Then, the impact of the
denoising module on the overall performance of HyP-KGRAG is assessed.</p>
        <sec id="sec-4-5-1">
          <title>4.5.1. Verbalization in Hypothetical Path Generation</title>
          <p>HyP-KGRAG encodes both the KG and the hypothetical paths using structured triples, represented
through a linear verbalization technique as described in Section 3.1. To evaluate the impact of this design
choice, an ablation study is conducted in which the linear verbalization is replaced with sentence-level
natural language expressions generated by an LLM. Note that sentence-level verbalization is chosen
here because it performed better than paragraph-level verbalization for
the baseline models.</p>
          <p>As shown in Table 2, the results demonstrate that triple-based hypothetical path matching, i.e.,
applying linear verbalization, achieves significantly higher retrieval accuracy than its natural-language
counterpart. This difference highlights the advantages of structured representations in RAG
systems: triples offer compact, unambiguous, and noise-resistant semantic cues that are better aligned
with the query intent.</p>
          <p>In contrast, natural language representations, while richer in context, introduce higher variability and
redundancy. This leads to less effective matching during retrieval, particularly in precision-sensitive
tasks. Overall, the results support the effectiveness of the linear verbalization approach in closing the
semantic gap between queries and structured knowledge, thus reinforcing its role as a core design
choice in HyP-KGRAG.</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>4.5.2. Effect of Denoising Removal</title>
          <p>This ablation study evaluates the impact of removing the LLM-based denoising mechanism from
HyP-KGRAG. As shown in Table 3, denoising plays a critical role in enhancing both retrieval precision and
the overall quality of generated responses. Across all precision-related metrics, denoising consistently
leads to notable improvements, indicating that noisy or irrelevant triples are effectively filtered out,
allowing for more targeted and accurate content generation. One of the most noticeable improvements
is the BLEU score, which increases by 16.5%, confirming a stronger lexical alignment between the
generated output and the ground truth. This suggests that denoising enhances the model’s ability to
focus on essential content and reduce spurious or off-topic information.</p>
          <p>However, this gain in precision comes with a slight trade-off. Specifically, ROUGE-1 recall drops
by 4.2%, pointing to a potential loss of some relevant details due to the aggressive filtering of context.
Despite this, the F1 scores across ROUGE and BLEU remain consistently positive, demonstrating an
effective balance between precision and recall. Importantly, the semantic coherence of the outputs
remains intact. The SBERT similarity score shows negligible change (from 0.598 to 0.596), indicating
that the high-level semantic alignment between the generated response and the reference answer is
preserved even when some lexical content is pruned. In summary, this analysis confirms that the
denoising mechanism in HyP-KGRAG serves as a powerful tool for improving precision and refining
generation quality, without compromising the underlying conceptual integrity of the output.</p>
          <p>[Table: results by RAG system (DeepSeek-V3, KGMistral7B [<xref ref-type="bibr" rid="ref16">16</xref>], BM25, BGE-M3, HyP-KGRAG), each with F1, Precision, and Recall.]</p>
          <p>[Table: metrics ROUGE-1, ROUGE-2, and ROUGE-L (F1, Precision, Recall), BLEU, and SBERT Similarity.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <sec id="sec-5-1">
        <title>5.1. Summary</title>
        <p>This work introduces HyP-KGRAG, a knowledge-grounded RAG framework that redefines how
structured information from KGs is utilized in domain-specific question answering. By directly operating on
triples, rather than relying on potentially noisy natural language verbalizations, HyP-KGRAG avoids
linguistic ambiguities and better captures the intent. The incorporation of hypothetical path generation
allows for dynamic multi-hop evidence retrieval, while the hybrid retrieval strategy (BM25 + BGE-M3)
ensures high-recall access to relevant knowledge. Additionally, the LLM-as-judge denoising module
significantly boosts answer quality by filtering irrelevant or noisy triples prior to generation.</p>
        <p>Experimental results on the MSE-KG dataset validate the effectiveness of our design, showing strong
gains in both factual precision (e.g., ROUGE-1 F1: 0.532) and semantic coherence (SBERT similarity:
0.629) over established baselines. These findings highlight the promise of structured, triple-based RAG
approaches in enhancing reliability and accuracy in knowledge-intensive tasks.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Limitations</title>
        <p>Key limitations include: i) the system’s effectiveness still depends significantly on the reasoning
capabilities of the underlying LLM and ii) the current evaluation is restricted to a single dataset, i.e.,
MSE-KG, which may limit generalizability.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Future Work</title>
        <p>Future work will focus on improving the scalability and generalizability of HyP-KGRAG by extending
evaluations to diverse multi-hop KGQA benchmarks. This will allow for a more comprehensive
assessment of the model’s robustness and applicability. Additionally, the integration of dynamic path
retrieval mechanisms that more closely reflect ground-truth graph structures could further enhance
multi-hop evidence retrieval quality. Another promising direction is the incorporation of multimodal
knowledge through visual-language KGs and image-attributed triples, enabling image-augmented RAG
(iRAG) models that improve contextual grounding and reduce hallucinations. The development of
agentic capabilities where the system can autonomously select and sequence retrieval, rewriting, and
verification modules may also lead to more adaptive and reliable performance. While the current
evaluation primarily focuses on addressing hallucination challenges in domain-specific QA with the
DeepSeek LLM, future work will involve more extensive experiments comparing HyP-KGRAG with
state-of-the-art models such as GraphRAG and LightRAG to more rigorously assess its comparative
performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gottweis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sayres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wulczyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Amin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Pfohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cole-Lewis</surname>
          </string-name>
          , et al.,
          <article-title>Toward expert-level medical question answering with large language models</article-title>
          ,
          <source>Nature Medicine</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McKeown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Benchmarking large language models for news summarization</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>39</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <article-title>Document-level machine translation with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2304.02210</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ruan</surname>
          </string-name>
          , et al.,
          <article-title>Deepseek-v3 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2412.19437</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , et al.,
          <article-title>The probabilistic relevance framework: Bm25 and beyond</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Bge m3-embedding: Multi-lingual, multifunctionality, multi-granularity text embeddings through self-knowledge distillation</article-title>
          ,
          <source>arXiv preprint arXiv:2402.03216</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Aji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saffari</surname>
          </string-name>
          ,
          <article-title>Knowledge-augmented language model prompting for zero-shot knowledge graph question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2306.04136</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Communitykg-rag: Leveraging community structures in knowledge graphs for advanced retrieval-augmented generation in fact-checking</article-title>
          ,
          <source>arXiv preprint arXiv:2408.08535</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Edge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Trinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Truitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metropolitansky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Ness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>From local to global: A graph rag approach to query-focused summarization</article-title>
          ,
          <source>arXiv preprint arXiv:2404.16130</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Lightrag: Simple and fast retrieval-augmented generation</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Haffari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>Reasoning on graphs: Faithful and interpretable large language model reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2310.01061</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mavromatis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <article-title>Gnn-rag: Graph neural retrieval for large language model reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2405.20139</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Knowledge graph-guided retrieval augmented generation</article-title>
          ,
          <source>arXiv preprint arXiv:2502.06864</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Kag: Boosting llms in professional domains via knowledge augmented generation</article-title>
          ,
          <source>in: Companion Proceedings of the ACM on Web Conference 2025</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>334</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Gesese</surname>
          </string-name>
          , et al.,
          <article-title>Kgmistral: Towards boosting the performance of large language models for question answering with knowledge graph integration</article-title>
          ,
          <source>in: Workshop on Deep Learning and Large Language Models for Knowledge Graphs</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Precise zero-shot dense retrieval without relevance labels</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1762</fpage>
          -
          <lpage>1777</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with gpus</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>535</fpage>
          -
          <lpage>547</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2002.10957</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>arXiv preprint arXiv:1908.10084</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Mingze</surname>
          </string-name>
          , Kgmistral, https://github.com/Mingze101/KGMistral/,
          <year>2023</year>
          . Accessed: 2025-05-26.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>