<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Nara, Japan
angelika.kaplan@kit.edu (A. Kaplan); jan.keim@kit.edu (J. Keim); marco.schneider@student.kit.edu (M. Schneider);
ralf.reussner@kit.edu (R. Reussner)
https://dsis.kastel.kit.edu/staf_angelika_kaplan.php (A. Kaplan); https://mcse.kastel.kit.edu/staf_jan_keim.php (J. Keim);
https://dsis.kastel.kit.edu/staf_ralf_reussner.php (R. Reussner)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>HubLink: A Novel Question Answering Retrieval Approach over Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angelika Kaplan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Keim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Schneider</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Reussner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Karlsruhe Institute of Technology (KIT)</institution>
          ,
          <addr-line>Kaiserstraße 12, 76131 Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>The rapid growth of scholarly literature poses challenges for efficient and effective information retrieval. Existing Question Answering over Knowledge Graphs (KGQA) systems, particularly those relying on Semantic Parsing, struggle with schema dependency and required training data. In this paper, we introduce HubLink, a schema-agnostic, training-free KGQA approach leveraging pre-trained Large Language Models to enhance scholarly search with semantic aspects. HubLink structures (research) knowledge graphs into conceptual hubs, enabling source-aware inference for literature. For evaluation, we use the Open Research Knowledge Graph as the underlying knowledge base and utilize a dataset from software architecture research to populate the graph. The empirical results show that our HubLink approach outperforms three state-of-the-art baselines, especially for complex queries, marking a major advancement in scholarly KGQA. In future work, we aim to explore more advanced techniques to improve the final answer generation.</p>
      </abstract>
      <kwd-group>
        <kwd>Research Knowledge Graphs (RKG)</kwd>
        <kwd>Question Answering over Knowledge Graphs (KGQA)</kwd>
        <kwd>Retrieval-Augmented Generation (RAG)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Despite advances in digital publishing, scholarly communication remains fundamentally
document-centric, with scientific knowledge fragmented across isolated articles [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This prevents researchers
from effectively interlinking related findings, methodologies, and datasets across publications [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
resulting isolation creates major barriers to knowledge discovery, particularly as the exponential growth
of scientific literature makes comprehensive literature review increasingly infeasible [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Furthermore,
traditional keyword-based search methods in this context are hindered by lexical variability, including
synonyms, abbreviations, and misspellings, further complicating information retrieval [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Recent progress in Large Language Models (LLMs) offers new opportunities for natural
language-based scholarly search. In particular, LLMs show potential for enabling question answering (QA)
interfaces that allow users to query large bodies of scientific content. However, relying solely on the internal
knowledge of pre-trained LLMs is problematic, as it often leads to shallow, unverifiable, or hallucinated
responses [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Retrieval-Augmented Generation (RAG) has emerged as a promising approach to mitigate
these issues by enriching LLM responses with context retrieved from an external knowledge base [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
However, current naive RAG systems, i.e., those based on embeddings and vector stores, suffer from limited
retrieval precision and fail to synthesize information coherently across multiple sources [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        To address these limitations, structured representations of scholarly knowledge, specifically Research
Knowledge Graphs (RKGs) such as the Open Research Knowledge Graph (ORKG) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], have been
proposed as a means to transition from document-centric to knowledge-centric scholarly communication.
RKGs represent entities and relationships extracted from scholarly content in a machine-readable format,
enabling more expressive and interconnected queries.
      </p>
      <p>While KGQA has achieved notable success in open-domain applications, its adaptation to scholarly
knowledge remains largely unexplored. The few existing approaches in the scientific domain
predominantly rely on semantic parsing (SP) methods, which require extensive training data and domain-specific
query templates. These SP-based systems face critical limitations: they struggle to scale across diverse
research domains, cannot adapt to the continuously evolving schemas of RKGs, and fail to capture
the nuanced semantic relationships inherent in academic discourse. Moreover, their dependence on
predefined patterns makes them ineffective for the complex and multifaceted queries typical of scholarly
search. Therefore, we aim to advance academic knowledge discovery by creating a schema-agnostic
KGQA system for complex scholarly queries without requiring domain-specific training.</p>
      <p>To address our objective, we ask the following research questions (RQs):
1. How can the limitations of schema-dependent and training-based KGQA systems in scholarly domains
be overcome by integrating RKGs with LLMs for schema-agnostic and provenance-aware retrieval?
2. What is the performance of such an approach when benchmarked against state-of-the-art approaches
on scholarly datasets?</p>
      <p>
        To address our research questions and overcome the limitations of existing approaches, we propose
HubLink, a novel training-free and schema-agnostic retrieval approach for scholarly KGQA that
organizes Knowledge Graphs (KGs) into conceptual hubs to enable efficient, source-aware retrieval. To
systematically evaluate KGQA systems in scholarly contexts, we introduce a comprehensive taxonomy
for categorizing academic queries based on their complexity and information needs. Using this
taxonomy, we evaluate HubLink on a literature search task in the software architecture (SWA) research
domain utilizing the ORKG [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] as KG. Our experimental results show that HubLink outperforms
three state-of-the-art KGQA baselines, performing particularly well on complex, multi-hop queries
that require information from multiple sources. These contributions advance the field toward more
effective and scalable KGQA systems for academic knowledge discovery. The supplementary material
associated with this paper is publicly available in our replication package [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. HubLink: KGQA by Graph Decomposition</title>
      <p>In this section, we present HubLink in response to our first research question, which addresses the
challenge of overcoming schema dependencies while enabling provenance-aware retrieval. HubLink is
our novel schema-agnostic, training-free KGQA approach that employs a GraphRAG-inspired pipeline
comprising indexing, retrieval, and generation stages. The approach decomposes KGs into semantically
coherent subgraph structures, so-called “hubs”, during indexing, enabling source-aware retrieval with
full traceability and provenance, which are essential requirements for scholarly literature search.
The indexing phase identifies root entities, constructs hubs through structured graph traversal, and
stores them in vector databases for efficient retrieval (Section 2.1). HubLink implements two retrieval
strategies: Graph Traversal Strategy and Direct Retrieval Strategy (Section 2.2). The generation phase
synthesizes answers while maintaining explicit source links to graph origins, ensuring both relevance
and traceability (Section 2.3). This architecture delivers transparent KGQA with verifiable provenance,
addressing the interpretability demands of academic and research contexts. An overview of our retrieval
approach, HubLink, is depicted in Figure 1.</p>
      <sec id="sec-2-1">
        <title>2.1. Indexing</title>
        <p>HubLink’s indexing process transforms KGs into structured sets of interlinked subgraphs (Hubs) that
serve as primary retrieval units for Question Answering (QA). The process begins by selecting start
entities as initial traversal points, then identifies Hub Root entities through graph traversal until
reaching leaf nodes or other Hub Roots. Each Hub Root anchors Hub Paths, i.e., directed paths to
terminal nodes. Hub Roots and their associated Paths collectively define Hubs.
</p>
        <p>(Figure 1 overview: the user question in natural language is preprocessed by extracting question components and the topic entity; indexed hubs are stored in a vector store; retrieval uses the Direct Retrieval Strategy or the Graph Traversal Strategy, with hub scoring and pruning and optional linking to an external database; retrieved contexts yield partial answers that are synthesized into the final answer.)</p>
        <p>
          Each Hub Path undergoes a multi-step transformation and indexing pipeline: (1) LLM generation of
textual descriptions, (2) parsing into structured components, including RDF-like triples and entities, (3)
embedding into a vector space via pre-trained models, and (4) storage in vector databases supporting
Approximate Nearest Neighbor (ANN) search [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ].
        </p>
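The four indexing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy deterministic stand-in for the pre-trained embedding model, `describe_path` stands in for the LLM description step, and a plain Python list stands in for the vector database.

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy deterministic embedding standing in for a pre-trained model.
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def describe_path(path: list[tuple[str, str, str]]) -> str:
    # Stand-in for step (1): an LLM would verbalize the Hub Path.
    return "; ".join(f"{s} {p} {o}" for s, p, o in path)

def index_hub(hub_root: str, hub_paths, store: list) -> None:
    # Steps (2)-(4): parse into triples, embed, store with metadata
    # (Hub Root identifier and description) for traceable retrieval.
    for path in hub_paths:
        desc = describe_path(path)
        store.append({
            "vector": embed(desc),
            "hub_root": hub_root,
            "description": desc,
            "triples": path,
        })

store = []
index_hub("Paper:123", [
    [("Paper:123", "hasTitle", "HubLink"), ("Paper:123", "publishedIn", "2025")],
    [("Paper:123", "usesMethod", "GraphRAG")],
], store)
```

The recursive application to Hub Path endpoints and the maximum traversal depth are omitted here for brevity.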
        <p>The vectors include essential metadata (Hub Root identifiers &amp; LLM-generated descriptions), enabling
traceable retrieval during queries. The indexing process is recursively applied to Hub Path endpoints
until it reaches leaf nodes or the maximum traversal depth, ensuring controlled graph coverage.</p>
        <p>Since retrieval relies entirely on indexed data, maintaining consistency between knowledge graphs
and indices is critical. As such, the approach implements two complementary update strategies: Fixed
Update for complete scheduled synchronization and Dynamic Update, where the (real-time) changes in
the graph define the required updates, enabling consistent KGQA in dynamic graph environments.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Retrieval</title>
        <p>HubLink’s retrieval identifies relevant hubs and paths for generating partial answers used in final answer
generation. Two strategies are supported: (a) Direct Retrieval Strategy for fast retrieval with potential
accuracy reduction in locally scoped queries, and (b) Graph Traversal Strategy for higher precision in
local queries at increased execution cost. Both strategies require question preprocessing: (i) computing
full question embeddings via pre-trained models, (ii) extracting semantic components (entities, types,
time expressions, constraints) via LLM for precise matching, e.g., "Which papers have been published
by CEUR-WS.org in 2020?" decomposes to [’Publisher’, ’CEUR-WS.org’, ’2020’], and (iii) individually
embedding and aggregating extracted components. To support fine-grained retrieval for complex
queries, the input includes the original question, semantic components, and vector representations.</p>
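The preprocessing steps (i)-(iii) can be sketched as below; this is a hedged illustration in which `extract_components` hard-codes the paper's own example in place of the LLM call, `embed` is a toy stand-in for the pre-trained model, and element-wise averaging is an assumed aggregation scheme.

```python
import hashlib

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy deterministic embedding standing in for the pre-trained model.
    return [b / 255.0 for b in hashlib.sha256(text.encode()).digest()[:dim]]

def extract_components(question: str) -> list[str]:
    # Stand-in for step (ii): an LLM would extract entities, types,
    # time expressions, and constraints; we hard-code the paper's example.
    examples = {
        "Which papers have been published by CEUR-WS.org in 2020?":
            ["Publisher", "CEUR-WS.org", "2020"],
    }
    return examples.get(question, question.rstrip("?").split())

def preprocess(question: str) -> dict:
    components = extract_components(question)
    comp_vecs = [embed(c) for c in components]
    dim = len(comp_vecs[0])
    # Step (iii): aggregate component embeddings (element-wise mean, assumed).
    aggregated = [sum(v[i] for v in comp_vecs) / len(comp_vecs)
                  for i in range(dim)]
    return {
        "question": question,
        "question_vec": embed(question),  # step (i): full-question embedding
        "components": components,
        "component_vec": aggregated,
    }

q = preprocess("Which papers have been published by CEUR-WS.org in 2020?")
```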
        <p>
          The Direct Retrieval Strategy uses ANN search [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ] within the precomputed vector index (cf.
Section 2.1) for efficient KG content retrieval without graph queries, emphasizing speed and simplicity.
Following question preprocessing, the strategy iteratively queries the vector store to collect candidate
hubs, avoiding duplicates and terminating early when no new results are found. Retrieved hubs undergo
refinement to standardize path counts: excess paths are pruned while insufficient hubs receive additional
paths via vector search based on hub root entities. Final hubs are scored, ranked, and filtered, and are
then used to generate intermediate responses. If one or more partial answers exist, they are synthesized
into a coherent final answer (cf. Section 2.3).
        </p>
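The iterative loop of the Direct Retrieval Strategy can be sketched as follows. This is a simplified sketch: `ann_search` is an exact nearest-neighbour scan standing in for ANN search, the batch size and `max_paths` values are illustrative, and the augmentation of insufficient hubs with additional paths is omitted.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def ann_search(query_vec, store, k, exclude):
    # Exact scan standing in for Approximate Nearest Neighbor search.
    candidates = [(i, e) for i, e in enumerate(store) if i not in exclude]
    candidates.sort(key=lambda ie: cosine(query_vec, ie[1]["vector"]),
                    reverse=True)
    return candidates[:k]

def direct_retrieval(query_vec, store, batch=4, max_paths=2, max_rounds=10):
    hubs, seen = {}, set()
    for _ in range(max_rounds):
        hits = ann_search(query_vec, store, k=batch, exclude=seen)
        if not hits:
            break  # early termination: no new results found
        for i, entry in hits:
            seen.add(i)  # avoid duplicates across rounds
            hubs.setdefault(entry["hub_root"], []).append(entry)
    # Refinement: standardize path counts by pruning excess paths per hub.
    for root, paths in hubs.items():
        paths.sort(key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
        hubs[root] = paths[:max_paths]
    return hubs

store = [
    {"vector": [1.0, 0.0], "hub_root": "H1"},
    {"vector": [0.9, 0.1], "hub_root": "H1"},
    {"vector": [0.8, 0.2], "hub_root": "H1"},
    {"vector": [0.0, 1.0], "hub_root": "H2"},
]
result = direct_retrieval([1.0, 0.0], store)
```

The subsequent scoring, ranking, and threshold filtering of the collected hubs is covered by the pruning step described later in this section.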
        <p>Starting from Hub Roots, the Graph Traversal Strategy explores the KG structure to identify
relevant information, incorporating structural graph information through bidirectional path traversal.
The process initializes bidirectional search from the entry entity, preprocessing the input question as
described above. Using these values, traversal proceeds iteratively up to a predefined maximum depth.
At each level, the algorithm checks for available entities, identifies reachable hub candidates from the
current entity set, and determines the next entities for exploration. Found hub candidates are ranked
by relevance, with top-ranked hubs used to generate partial and final answers (cf. Section 2.3). If no
answers are produced, traversal continues to the next level.</p>
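The level-wise traversal can be sketched as below, under stated simplifications: the paper's traversal is bidirectional, whereas this sketch expands in one direction only, and `generate_partial_answer` is a stub for the LLM-based generation of Section 2.3 that may decline to answer.

```python
def traverse(entry, adjacency, hub_roots, score, max_depth=3):
    # Level-wise exploration from the topic entity (simplified,
    # unidirectional version of the paper's bidirectional traversal).
    frontier, visited, answers = {entry}, {entry}, []
    for depth in range(max_depth):
        if not frontier:
            break  # no entities left to explore
        # Identify hub candidates reachable from the current entity set.
        candidates = [e for e in frontier if e in hub_roots]
        for hub in sorted(candidates, key=score, reverse=True):
            partial = generate_partial_answer(hub)  # may return None
            if partial is not None:
                answers.append(partial)
        if answers:
            return answers  # stop once a level yields answers
        # Determine the next entities for exploration.
        frontier = {n for e in frontier for n in adjacency.get(e, [])} - visited
        visited |= frontier
    return answers

def generate_partial_answer(hub):
    # Stand-in for LLM-based partial-answer generation (Section 2.3):
    # returns None when the hub's data does not support an answer.
    return f"answer from {hub}" if hub.startswith("Hub") else None

adjacency = {"Topic": ["A", "B"], "A": ["HubX"], "B": ["HubY"]}
result = traverse("Topic", adjacency, hub_roots={"HubX", "HubY"},
                  score=lambda h: 1.0)
```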
        <p>HubLink’s Pruning filters irrelevant hubs before answer generation, increasing efficiency and
accuracy by evaluating candidate hubs based on their associated path relevance. Each hub receives a
weighted relevance score computed by aggregating individual path scores, with higher-scoring paths
receiving greater weight to ensure highly relevant information dominates the hub’s final score. This
weighting prioritizes hubs that contain highly informative paths while minimizing the impact of weaker
paths. The approach ranks the scored hubs and keeps only those that exceed a predefined threshold.
The goal is to produce semantically aligned hubs that support focused, accurate responses.</p>
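The weighted scoring can be sketched as follows. The paper does not state the exact weighting formula, so this sketch assumes each path's weight is its own score raised to a power `alpha` (echoing the "path weight alpha" parameter in Section 3.1.5), which makes highly relevant paths dominate the hub's aggregate score.

```python
def hub_score(path_scores, alpha=3):
    # Assumed weighting: weight_i = score_i ** alpha, so higher-scoring
    # paths receive greater weight in the hub's final relevance score.
    if not path_scores:
        return 0.0
    weights = [s ** alpha for s in path_scores]
    total = sum(weights) or 1.0
    return sum(w * s for w, s in zip(weights, path_scores)) / total

def prune(hubs, threshold=0.5, alpha=3):
    # Rank the scored hubs and keep only those above the threshold.
    scored = [(hub, hub_score(scores, alpha)) for hub, scores in hubs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(hub, s) for hub, s in scored if s >= threshold]

# A hub with one highly relevant path keeps a high score despite a weak path;
# a hub with only weak paths falls below the threshold and is pruned.
kept = prune({"H1": [0.9, 0.1], "H2": [0.2, 0.1]})
```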
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Answer Generation and Information Linking</title>
        <p>Partial answer generation serves as HubLink’s key intermediate step, building the foundation for
final answer synthesis. This step utilizes KG information and supports the integration of external
contextual data through linking procedures, enabling enhanced reasoning despite graph incompleteness
or insufficient content. The generation takes a hub and the processed question as input; for the graph
traversal strategy, it includes a path connecting the hub to central topics, converted to natural language.
The LLM receives the question, Hub Path descriptions, triples, and external knowledge, generating
partial answers only when the data sufficiently supports meaningful responses; otherwise, the Hub is
skipped. This ensures context-aware, semantically grounded answers that are potentially enriched
beyond the graph.</p>
        <p>Final answer generation synthesizes partial answers into a single response. To achieve this,
HubLink takes the user’s question, the partial answers, and the paths they came from, and prompts an
LLM to merge them. The prompt can be adjusted based on the task, e.g., in a literature search, the model
is asked to insert citation tags into the answer to indicate where each claim came from. At the end of the
answer, the approach adds a list of all the sources that were referenced. To make the results explainable
and traceable, HubLink also returns a list of the actual triples (subject-predicate-object statements) from
the KG that were used to generate the answer. It starts by sorting the retrieved paths by relevance, then
collects all the triples they contain. These triples are passed, along with the question and final answer,
to an LLM, which identifies only those triples that were actually relevant to the answer. This means
that users not only get a clear and concise answer but also see which parts of the graph were used to
create it, supporting transparency and helping build trust. This feature is especially useful in research
contexts, where it is important to know where a piece of information originates.</p>
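The provenance step (sort paths by relevance, collect their triples, then keep only the triples relevant to the answer) can be sketched as below; the relevance judgement is a keyword stub here, whereas the approach delegates it to an LLM, and the `paths` structure is illustrative.

```python
def provenance_triples(paths, answer):
    # Sort retrieved paths by relevance (most relevant first),
    # collect all their triples, then keep those judged relevant.
    ordered = sorted(paths, key=lambda p: p["score"], reverse=True)
    triples = [t for p in ordered for t in p["triples"]]
    # Stub for the LLM relevance filter: keep triples whose object
    # actually appears in the generated answer.
    return [t for t in triples if t[2] in answer]

paths = [
    {"score": 0.4, "triples": [("PaperA", "publishedIn", "2020")]},
    {"score": 0.9, "triples": [("PaperA", "hasPublisher", "CEUR-WS.org")]},
]
used = provenance_triples(paths, "PaperA was published by CEUR-WS.org.")
```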
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        In this section, we evaluate HubLink in terms of performance benchmarking, addressing our second
research question. To systematically assess retrieval quality and answer alignment, we use the Goal
Question Metric (GQM) approach [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. We first describe our experimental setup (Section 3.1), then
present results organized by evaluation goals (Section 3.2).
      </p>
      <p>3.1. Setup: This section describes the experimental setup for evaluating HubLink against state-of-the-art baselines.</p>
      <sec id="sec-3-1">
        <title>3.1.1. Evaluation Metrics</title>
        <p>In RAG systems, retrieval quality is fundamental to overall performance, as irrelevant or missing
documents directly impact the quality of generated answers.</p>
        <p>We employ various common information retrieval metrics to evaluate the effectiveness of document
retrieval. For this, we use four rank-agnostic metrics. Accuracy provides an assessment of the ratio of
correctly retrieved contexts compared to the total number of retrieved contexts. Precision quantifies
the fraction of retrieved documents that are relevant, indicating the retrieval system’s ability to avoid
false positives. Recall measures the fraction of relevant documents that are successfully retrieved,
capturing the system’s completeness in finding information. The F1 score combines precision and
recall into a single harmonic mean, providing a balanced measure that penalizes systems performing
poorly on either metric. Additionally, we employ rank-aware metrics that consider the order of retrieved
documents, as ranking quality is crucial for RAG systems that typically use only the top-k results.
Hits@K measures the fraction of queries where at least one relevant document appears within the top-k
retrieved results, providing insight into retrieval success at different cut-off points. Mean Reciprocal
Rank (MRR) computes the average of the reciprocal ranks of the first relevant document for each query,
emphasizing the importance of ranking relevant documents highly. Mean Average Precision (MAP)
calculates the average precision across different recall levels for each query, offering a comprehensive
view of ranking quality across all relevant documents. Exact Match (EM) quantifies the proportion of
queries where the retrieved contexts exactly match the expected ground-truth contexts, providing a
strict binary evaluation of retrieval correctness.</p>
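The rank-agnostic and rank-aware metrics above follow standard information-retrieval definitions and can be computed as sketched below, including the macro-averaging used later in this section (compute per query, then average so every query counts equally); the toy query data is illustrative.

```python
def precision_recall_f1(retrieved, relevant):
    tp = len(set(retrieved) & set(relevant))
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def hits_at_k(retrieved, relevant, k):
    # 1 if any relevant document appears within the top-k results.
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0

def reciprocal_rank(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank  # rank of the first relevant document
    return 0.0

def average_precision(retrieved, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0

def exact_match(retrieved, relevant):
    return 1.0 if set(retrieved) == set(relevant) else 0.0

# Macro-averaging: compute each metric per query, then average.
queries = [
    (["d1", "d3", "d2"], ["d1", "d2"]),  # (retrieved ranking, relevant set)
    (["d5", "d4"], ["d4"]),
]
mrr = sum(reciprocal_rank(ret, rel) for ret, rel in queries) / len(queries)
mean_ap = sum(average_precision(ret, rel) for ret, rel in queries) / len(queries)
```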
        <p>
          Beyond retrieval, we evaluate the quality of generated answers using established text generation
metrics adapted for scientific QA contexts. ROUGE [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] calculates n-gram, word sequence, and longest
common subsequence overlap between generated and reference texts, capturing lexical precision and
content coverage essential for factual accuracy. BLEU [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] measures n-gram overlap between generated
and reference answers, originally for machine translation but widely adopted for text generation.
BERTScore [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] uses pre-trained transformer embeddings to compute semantic similarity rather than
lexical overlap, suitable for scientific domains with paraphrasing and terminology variations. Using
the RAGAs framework [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], we also use the following metrics: Semantic Similarity compares
the embeddings of the expected answer with the provided one, using OpenAI’s text-embedding-3-small
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and cosine similarity. Lastly, String Similarity uses basic string comparison metrics such as
Levenshtein, Jaro, Jaro-Winkler, and Hamming distance.
        </p>
        <p>
          Complementing traditional metrics, we employ LLM-as-judge evaluation using LLMs to assess answer
quality beyond lexical similarity [17, 18]. For this, we use the RAGAs framework [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and the LLM
gpt-4o-mini [19] by OpenAI. We use the following key metrics for scientific RAG evaluation from the RAGAs
framework: Faithfulness ensures all answer claims are logically inferred from retrieved contexts,
crucial for scientific QA requiring accuracy and verifiability. Answer Relevancy evaluates question-answer
alignment, penalizing incomplete, verbose, or irrelevant responses. Factual Correctness
measures alignment between generated answers and ground-truth references. Instruction Following
evaluates whether generated answers comply with explicit formatting requirements in questions (e.g.,
ordering, aggregation, specific output structures).
        </p>
        <p>When computing metrics across queries, multiple averaging approaches exist. We employ macro-averaging,
which computes metrics per query and then averages them, treating each query equally. We use this
averaging method for a balanced assessment across diverse query types and complexities, particularly
important for evaluating KGQA systems on heterogeneous scholarly questions.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.2. Knowledge Base Variants</title>
        <p>To evaluate how knowledge representation affects retrieval performance, we constructed four RKG
variants based on two orthogonal characteristics: path length and contribution granularity.</p>
        <p>Path length determines the semantic depth of graph traversals. Long paths preserve rich semantic
relationships through detailed intermediate nodes, enabling more expressive queries but requiring
extensive traversal operations. Short paths collapse these relationships into direct connections, simplifying
retrieval at the potential cost of semantic expressiveness.</p>
        <p>Contribution granularity affects how research contributions are represented. Distinct contribution
nodes create separate graph entities for each contribution within a paper, maintaining clear boundaries
between different research outputs and their associated metadata. Cumulative contribution nodes
consolidate all contributions from a paper into a single node, reducing graph complexity but potentially
blending distinct research outputs.</p>
        <p>Combining these characteristics yields four evaluation variants: Variant 1 (GV1) contains long paths
with distinct contribution nodes (maximal semantic richness); Variant 2 (GV2) has long paths with
cumulative contribution nodes (balanced semantic depth); Variant 3 (GV3) has short paths with distinct
contribution nodes (simplified structure, preserved granularity); Variant 4 (GV4) contains short paths
with cumulative contribution nodes (minimal complexity).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.1.3. Question Taxonomy &amp; Dataset</title>
        <p>To structure our evaluation, we developed a comprehensive question taxonomy. This allows us to
categorize questions and construct a question dataset that contains a fair mixture of different kinds
of questions. For this paper, we focus on the categories based on retrieval operations required for
answer generation. Drawing from KGQA dataset literature and scholarly research question frameworks
[20, 21, 22, 23, 24, 25, 26, 27], we identified the following eight distinct question categories.</p>
        <p>Basic questions enable direct answer retrieval without additional operations (e.g., “What is the
definition of the Client-Server software architecture pattern?”). Relationship questions necessitate identifying
connections or dependencies between information elements, such as causalities or correlations (e.g.,
“Which components in a client-server software architecture need to communicate with each other?”).
Negation questions require detecting conditions that do not hold, based on explicit negation or absent
information (e.g., “Which KGQA approach does not use training?”). Aggregation questions demand
synthesizing multiple information pieces (e.g., “What is the average runtime of systems based on the
client-server architecture?”). Counting questions involve enumerating relevant data points (e.g., “How
many KGQA approaches were published between 2020 and 2024?”). Superlative questions require
identifying extremes among multiple data points (e.g., “Which architecture pattern ensures lowest
latency?”). Ranking questions necessitate ordering multiple data points according to specified criteria
(e.g., “How do training-free KGQA approaches perform, sorted by F1 score?”). Comparison questions
require evaluating attributes across two or more data points (e.g., “Is method A better than method B?”).</p>
        <p>We constructed our evaluation dataset using a curated, domain-specific dataset from the SWA
research domain by Konersmann et al. [28] comprising 153 scientific publications from the European
Conference on Software Architecture (ECSA) and the International Conference on Software Architecture
(ICSA). Each publication is annotated w.r.t. multiple dimensions, including research object, evaluated
property, evaluation method, paper class, bibliographic metadata, and others. An overview of the
data schema is presented in [29]. We use this dataset to populate the ORKG according to each of the
four knowledge base variants (see Section 3.1.2). Using these templates, we generated a QA dataset
grounded in KG statements. The final dataset comprises 170 questions paired with ground-truth answers
and corresponding KG statements required for answer generation, with variations adapted for each
knowledge base variant.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.1.4. Baselines</title>
        <p>For the evaluation against state-of-the-art approaches, we established specific criteria for baseline
selection. Given our focus on training-free scholarly KGQA, we required baseline approaches to
be: (1) LLM-based KGQA systems, (2) training-free to ensure fair comparison with HubLink, and (3)
accompanied by available and functional source code to enable reproducible evaluation. Our baseline
selection process followed a systematic review of two comprehensive KGQA surveys by Pan et al.
[30] and Peng et al. [31]. From this systematic review, we identified three approaches that met all our
criteria and represent different paradigms within training-free KGQA: Direct Fact Retrieval (DiFaR)
[32], FiDeLiS [33], and MindMap [34]. Notably, all selected baselines are published at top-tier ACL
venues (2023-2025), ensuring methodological rigor and community validation.</p>
        <p>Direct Fact Retrieval (DiFaR) [32] indexes KG triples as embeddings, then uses ANN search to
retrieve the closest triples for each query. Retrieved triples serve as context for answer generation, with
an optional LLM-based reranking refinement step.</p>
        <p>FiDeLiS [33] performs beam search from designated entry points using LLM-generated strategic
plans that extract keywords and convert queries to declarative format. It iteratively retrieves and
embeds relational paths, scoring them against keyword embeddings to maintain top-N candidates until
either the termination criterion is met or the maximum path length is reached, then generates answers from
the candidate paths.</p>
        <p>MindMap [34] embeds all KG entities and uses LLM-driven entity extraction with ANN search to
identify relevant entities from questions. It constructs evidence subgraphs, i.e., shortest paths between
entities and 1-hop neighborhoods, then converts them to natural language descriptions via LLM to
serve as context for answer generation.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.1.5. Parameter Selection</title>
        <p>HubLink, as well as the selected approaches, all require certain parameters to be selected. Due to
practical constraints, including high costs and time requirements for LLM experiments, a fully-fledged
parameter optimization process is infeasible. Thus, we employ a parameter selection process using the
One-Factor-at-a-Time (OFAT) method [35] to optimize retrieval performance based on Recall and Hits@10
metrics. The following enumerates the individual parameters. Italic parameters served as the baseline
for the OFAT method. In the evaluation, we use the parameters that have shown the most promising
results. For parameter optimization, we utilized a reduced dataset of 44 diverse questions, following the
same construction methodology as the full dataset, evaluated against GV1. Details on the individual
results can be found in the replication package.</p>
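The OFAT selection can be sketched as below: each parameter is varied individually while all others are held at their baseline values, and the best value per factor is kept. The toy objective function is hypothetical and merely stands in for the Recall and Hits@10 measurements on the reduced dataset.

```python
def ofat(baseline, grid, evaluate):
    # One-Factor-at-a-Time: vary one parameter at a time while holding
    # all others at their baseline values; keep the best value per factor.
    best = dict(baseline)
    for param, values in grid.items():
        scores = {value: evaluate(dict(baseline, **{param: value}))
                  for value in values}
        best[param] = max(scores, key=scores.get)
    return best

# Hypothetical toy objective standing in for Recall/Hits@10 measurements.
grid = {"num_hubs": [10, 20, 30], "reranking": [False, True]}
evaluate = lambda cfg: cfg["num_hubs"] / 30 + (0.2 if cfg["reranking"] else 0.0)
best = ofat({"num_hubs": 10, "reranking": False}, grid, evaluate)
```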
        <p>
          General parameters for all approaches: LLM models (Closed: gpt-4o-mini [19], gpt-4o [36],
o3-mini [37]; Open: Qwen2.5-14B [38], Llama3.1-8B [39]), embedding models (mxbai-embed-large
[40], text-embedding-3-large [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], granite-embedding [41]), question augmentation (false/true), and
reranking (false/true).
        </p>
        <p>HubLink-specific parameters: traversal strategy (false/true), extract question components
(false/true), top paths to keep (10/20/30), number of hubs (10/20/30), filter output context (false/true),
diversity ranking penalty (0/0.01/0.05/0.1), and path weight alpha (0/3/5/9).</p>
        <p>Baseline parameters: DiFaR uses number of results (30/60/90/120/150) and distance metric
(cosine/IP/L2); FiDeLiS employs top-k (10/20/30), top-n (10/20/30), and alpha (0.1/0.3/0.6); MindMap
considers final paths to keep (10/20/30), shortest paths to keep (10/20/30), and neighbors to keep (10/20/30).</p>
        <p>Besides the above parameters, we have different HubLink configurations: The T
variant utilizes the graph traversal strategy of HubLink. This variant was also used in the parameter
selection process. HubLink D instead uses the direct retrieval strategy. The variant F is a fast version that
focuses on reduced runtime, employing the direct retrieval strategy, and limiting the number of hubs to
10 per question. The O variant adapts the T variant to use open models, i.e., the mxbai-embed-large
embedding model and the LLM Qwen2.5-14B.</p>
        <sec id="sec-3-5-1">
          <title>3.2. Evaluation Results</title>
          <p>In this section, we present the empirical results of our comprehensive evaluation across two main goals:
retrieval quality (Section 3.2.1) and answer generation quality (Section 3.2.2).</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>3.2.1. Evaluating Retrieval Quality</title>
        <p>First, we evaluate the retrieval quality (Evaluation Goal 1). To address the following three evaluation
questions associated with this evaluation goal, we use the metrics Precision, Recall, F1-score, Hits@10,
EM@10, MRR@10, and MAP@10.</p>
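For reference, the ranking metrics can be computed as in the following minimal sketch over toy ranked triples. The definitions here are common ones and are assumptions on our part: EM@10 is taken as the full relevant set appearing in the top k, and MAP@10 divides by the number of relevant items; the evaluation framework's established implementations may differ in such details.

```python
def hits_at_k(retrieved, relevant, k=10):
    """1 if any relevant triple appears in the top-k results."""
    return int(any(r in relevant for r in retrieved[:k]))

def exact_match_at_k(retrieved, relevant, k=10):
    """1 only if all relevant triples appear within the top-k results."""
    return int(set(relevant) <= set(retrieved[:k]))

def mrr_at_k(retrieved, relevant, k=10):
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, r in enumerate(retrieved[:k], start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

def map_at_k(retrieved, relevant, k=10):
    """Average of precision values at each rank holding a relevant item."""
    hits, precisions = 0, []
    for rank, r in enumerate(retrieved[:k], start=1):
        if r in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

retrieved = ["t1", "t4", "t2", "t9"]   # ranked retriever output (toy triples)
relevant = {"t2", "t4"}                # gold triples for the question
print(mrr_at_k(retrieved, relevant))   # first relevant item at rank 2 -> 0.5
```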
        <p>Q1.1: What is the overall retrieval performance of the different HubLink variants in comparison
to the baselines, and which HubLink variant performs best?</p>
        <p>Table 1 shows the retrieval performance of the HubLink variants and the baselines on GV1.</p>
        <p>HubLink with graph traversal (T) achieves the best overall performance, surpassing direct retrieval
(D), which trades topic entity requirements for reduced effectiveness in complex graphs. HubLink (F),
runtime-optimized with limited scope, achieves the highest Precision/F1, illustrating the trade-off between
hub count and performance. The open variant (O) performs worst overall, though its highest MRR@10 indicates effective
ranking when correct triples are retrieved. Comparing HubLink to the competing approaches, HubLink
(T) achieves 0.754 Recall, doubling DiFaR's 0.352 and far exceeding the roughly 10% of MindMap and FiDeLiS. Despite low
Precision across models, HubLink (T)'s 0.246 shows improved relevance filtering. It leads in Hits@10
(0.512), MAP@10 (0.299), and MRR@10 (0.502), though ranking limitations persist. HubLink variants
outperform the baselines in retrieval accuracy and contextual relevance, with graph traversal (T) proving most
effective. Higher hub counts improve recall but reduce precision and ranking quality. Embedding-based
methods offer clear advantages for scholarly KGQA despite filtering/ranking limitations.</p>
        <p>Summary Q1.1: HubLink substantially outperforms baseline KGQA approaches in both recall and
precision, more than doubling the recall of the next best baseline. While ranking performance needs
improvement, HubLink demonstrates superior retrieval of contextually relevant triples.</p>
        <p>Q1.2: What influence do the different required retrieval operations have on performance?
Table 2 shows the performance across the eight retrieval operations (cf. Section 3.1.3). The results are
based on graph variant GV1. HubLink excels at Basic operations (highest Recall/Hits@10), indicating
strong single-triple lookup performance. Precision/F1 scores remain lower than Recall, especially for
Negation/Superlative operations, suggesting difficulty in filtering relevant triples during complex reasoning.
HubLink (T)/(O) excel at Basic operations; (D)/(F) perform better on Comparative/Relationship queries.
Negation/Superlative operations consistently show reduced performance across variants.</p>
        <p>Summary Q1.2: Recall peaks for basic operations but declines with reasoning complexity. The retrievers
struggle to distinguish relevant/irrelevant contexts, particularly for negation/superlative operations.</p>
        <p>Q1.3: How robust to other graph schemas is the best HubLink variant compared to the baselines?
Table 3 shows retrieval performance across graph variants. HubLink demonstrates superior
robustness to graph variation compared to baselines. Shorter paths improve performance across all
approaches, especially for Precision/ranking metrics. The baselines show steeper performance declines
with increasing path length while HubLink maintains consistent superiority.</p>
        <p>HubLink excels in multi-hop reasoning: as path lengths increase, baseline performance drops sharply
while HubLink sustains high Recall and better Precision/ranking scores. Despite a slight decrease in
Precision with longer paths, HubLink remains most effective at retrieving deeply embedded information.</p>
        <p>Summary Q1.3: HubLink demonstrates greater robustness across diverse graph structures. While shorter
paths improve all methods, HubLink notably outperforms baselines in multi-hop reasoning tasks.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.2.2. Evaluating Answer Alignment</title>
        <p>Second, we evaluate the answer generation quality (Evaluation Goal 2) of the best HubLink variant
(T) against the baselines on GV1. To answer the corresponding evaluation questions, we are using the
metrics BLEU, ROUGE, Semantic/String Similarity, BERTScore, and the LLM-as-judge metrics Factual
Correctness, Answer Relevancy, Instruction Following, and Faithfulness. Table 4 presents the results.</p>
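As context for the lexical-overlap metrics, ROUGE-1 recall and precision reduce to clipped unigram matching: a verbose candidate can achieve perfect unigram recall while its precision drops. The sentences below are hypothetical examples constructed to show this effect.

```python
from collections import Counter

def rouge1_recall_precision(candidate, reference):
    """Unigram overlap: recall = matches/|reference|, precision = matches/|candidate|."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision

ref = "the pattern improves latency in microservice systems"
cand = ("the pattern improves latency and also throughput "
        "in large microservice systems overall")
r, p = rouge1_recall_precision(cand, ref)
print(round(r, 2), round(p, 2))  # -> 1.0 0.58
```

The verbose candidate covers every reference unigram (recall 1.0) yet nearly half of its own tokens are extraneous (precision 0.58), which is why high ROUGE-1 recall alone does not imply factual accuracy.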
        <p>Q2.1: How semantically and factually consistent are the generated answers based on the retrieved
triples of HubLink in comparison to the baselines?</p>
        <p>HubLink achieves the highest factual correctness Recall (0.543), which is substantially lower than its retrieval
Recall (0.754), indicating fact loss during generation. High ROUGE-1 Recall (0.757) may reflect lexical
overlap rather than factual accuracy. DiFaR/MindMap show better retrieval-generation alignment,
suggesting more efficient fact retention. HubLink's BERTScore Recall (0.678) trails DiFaR's (0.702);
lower precision across metrics indicates structural/lexical divergence from references. Despite higher
factual correctness precision (0.301) than retrieval precision (0.246), HubLink lags behind DiFaR (0.290),
suggesting inclusion of extraneous content. HubLink produces comprehensive but verbose answers,
reducing factual focus.</p>
        <p>Summary Q2.1: HubLink shows limited fact integration and semantic consistency. Answers include
unrequested information and deviate structurally from references, evidenced by lower metric scores.</p>
        <p>Q2.2: How relevant to the questions are HubLink’s answers in comparison to the baselines?
HubLink achieves the highest answer relevancy (0.570), outperforming MindMap (0.545), FiDeLiS
(0.432), and DiFaR (0.203). However, 43% of answers lack optimal alignment. Notably, answer relevancy
measures semantic fit, not factual accuracy or hallucinations.</p>
        <p>Summary Q2.2: HubLink leads in answer relevance, demonstrating strong semantic alignment with
questions, although its absolute performance indicates room for improvement.</p>
        <p>Q2.3: How well does HubLink follow instructions in the question in comparison to the baselines?
Dataset questions often include explicit format instructions like ordering or aggregation. HubLink
achieves the highest Instruction Following score (0.653), 68% above MindMap/FiDeLiS (0.388) and more than
double that of DiFaR (0.312). However, HubLink still fails to follow instructions in 34% of cases.</p>
        <p>Summary Q2.3: HubLink outperforms baselines in instruction following, but inconsistent adherence
still requires improvement.</p>
        <p>Q2.4: How aligned are the answers of HubLink (T) and the baselines to the retrieved contexts?
Faithful answer generation requires that outputs are strictly based on the retrieved context, avoiding
unsupported content. DiFaR achieves the highest Faithfulness score (0.645), while HubLink (0.445),
MindMap (0.396), and FiDeLiS (0.112) show progressively weaker grounding in the retrieved contexts.</p>
        <p>Summary Q2.4: HubLink shows weaker contextual grounding than DiFaR. Improving faithfulness and
reducing unsupported claims remain key areas for future work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Threats to Validity</title>
      <p>This section analyzes potential threats to the validity of our experimental findings following the
classification proposed by Konersmann et al. [28].</p>
      <p>External validity refers to the generalizability of experimental results beyond the specific
experimental context. One primary threat in our study is that the KGQA dataset questions may not fully
capture the diverse information needs of researchers across different domains. To mitigate this risk, we
systematically generated questions based on our structured taxonomy covering multiple query types and
six representative use cases for scholarly literature search, ensuring broad coverage of typical research
scenarios. Another threat concerns the use of experimental settings or tools that do not accurately
reflect real-world usage conditions. To address this, we developed an evaluation framework using the
RAG paradigm, which represents the current state of the art in QA systems. Additionally, we used
standard evaluation metrics for RAG-based systems and their established formulas and implementations
to ensure methodological rigor and comparability with related work.</p>
      <p>Internal validity assesses whether observed effects stem from experimental manipulation rather
than confounding factors. Potential threats include implementation variations between baselines
and adaptation inconsistencies. To mitigate these, we used a custom framework integrating baseline
retrievers implemented according to their original descriptions and source code. Adaptations were
minimal to preserve original implementations, with all modifications documented for transparency.</p>
      <p>Construct validity assesses whether experimental constructs are well-defined and properly
operationalized. Given that our design is grounded in an established RAG evaluation framework and informed
by recent surveys, the risk to construct validity is low. However, several potential threats remain: (1)
testing a limited number of configurations per retriever may not capture the full performance space;
(2) reliance on automated metrics without human evaluation may introduce mono-method bias; and
(3) the OFAT design may overlook parameter interactions. To mitigate these risks, we used widely
accepted metrics (e.g., precision, recall, RAGAS faithfulness, and relevance) and justified OFAT for its
interpretability, applying systematic parameter selection.</p>
      <p>Confirmability addresses the risk that experimental findings are influenced by researcher bias
rather than being grounded in the underlying data. We acknowledge the potential for interpretive
bias in our study: the selection of baseline systems may favor approaches similar to HubLink, the
interpretation of ambiguous results could be influenced by our hypotheses, and the choice of evaluation
scenarios might inadvertently advantage our approach. To mitigate these risks, we implemented multiple
safeguards. First, we provide complete experimental artifacts, including raw results, processing scripts,
and visualization code in our replication package. Second, we report both favorable and unfavorable
results transparently, including cases where baselines outperform HubLink. Third, we selected our
baselines through a systematic review of recent KGQA surveys, choosing applicable retrievers.</p>
      <p>Repeatability refers to the risk that experimental results may not be repeatable under the same or
similar conditions. The primary threat in our study stems from the inherent non-determinism of LLMs,
which may produce varying outputs across runs. While we anticipate that the relative performance of
KGQA approaches (i.e., HubLink and the baselines) will exhibit consistent trends, absolute values may
fluctuate. To mitigate this risk, our framework is designed to capture, preserve, and
replicate the complete experimental configuration.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>The following section reviews four key areas of related work in KGQA retrieval, each offering approaches
conceptually relevant to our proposed HubLink approach.</p>
      <p>KGQA Approaches on RKGs: Recent scholarly QA approaches integrate LLMs with RKGs primarily
through semantic parsing, translating natural language to formal queries (e.g., SPARQL) [42]. JarvisQA
[43] and DBLP-QuAD [44] use annotated datasets and entity linking but suffer from schema dependence
and limited scalability. KGMistral [45] and Tafa and Usbeck [46] employ template-based RAG,
performing well in controlled settings but failing to generalize to unseen questions or dynamic KGs. Evaluation
primarily uses the SciQA dataset, whose auto-generated nature and train-test schema overlap limit
generalizability [47, 48], while semantic parser performance degrades on evolving KGs with unseen
entities [49]. HubLink instead embeds subgraphs (hubs) rooted at specific nodes and transforms paths
to text, enabling schema-agnostic adaptation without annotated training data.</p>
      <p>
        LLM-Guided Stepwise KGQA Approaches: Recent training-free KGQA methods decompose tasks
into iterative queries guided by pre-trained LLMs, differing from HubLink in how they structure and
utilize KG information. The LLM acts as reasoning controller: Think-on-Graph (ToG) [
        <xref ref-type="bibr" rid="ref17">50</xref>
        ] uses beam
search for multi-hop reasoning; Knowledge Solver (KSL) [
        <xref ref-type="bibr" rid="ref18">51</xref>
        ] models traversal as decision-making
dialogue; Observation-Driven Agent (ODA) [
        <xref ref-type="bibr" rid="ref19">52</xref>
        ] operates in observe–act–reflect loops; GRAPH-COT [
        <xref ref-type="bibr" rid="ref20">53</xref>
        ]
and StructGPT [
        <xref ref-type="bibr" rid="ref21">54</xref>
        ] call graph functions or linearize KG data; FiDeLiS [33] combines retrieval with
deductive reasoning. Extensions include ToG-2 [
        <xref ref-type="bibr" rid="ref22">55</xref>
        ] with hybrid KG-text retrieval and Generate-on-Graph
(GoG) [
        <xref ref-type="bibr" rid="ref23">56</xref>
        ] generating plausible triples for incomplete KGs. HubLink instead retrieves precomputed,
embedded subgraphs (hubs), enabling reasoning over structured hubs rather than sequential LLM-driven
exploration, thus incorporating transitive relations unreachable via stepwise traversal.
      </p>
      <p>
        KGQA using Contextual Subgraph Construction: Some approaches construct subgraphs
dynamically during retrieval by identifying question entities and expanding the KG to form localized
subgraphs for LLM-based KGQA. MindMap [34] prompts LLMs to assemble evidence subgraphs from
paths and neighbors; KG-GPT [
        <xref ref-type="bibr" rid="ref24">57</xref>
        ] decomposes questions into triples before reasoning over linearized
subgraphs; RoK [
        <xref ref-type="bibr" rid="ref25">58</xref>
        ] builds multi-hop paths via chain-of-thought reasoning, ranks them with PageRank,
and forwards results to the LLM. Unlike these query-time dynamic approaches based on topology and
proximity, HubLink precomputes semantically coherent subgraphs (hubs) offline, enabling fast retrieval
with structured reasoning and origin tracking, features absent in dynamic methods.
      </p>
      <p>
        Utilizing Vector Representations for KGQA: KGQA approaches using dense vector
representations vary in embedding scope and training requirements. Direct embedding methods use pre-trained
models without KG-specific training: DiFaR [32] encodes triples offline for nearest-neighbor retrieval
but lacks structural context by focusing on isolated triples. HubLink instead embeds precomputed
subgraphs (hubs), capturing richer semantics through textualized paths, triples, and entities.
Training-based methods learn entity and relation vectors from KG structure: Pretrain-KGE [
        <xref ref-type="bibr" rid="ref26">59</xref>
        ] fine-tunes
BERT on triples; KEPLER [
        <xref ref-type="bibr" rid="ref27">60</xref>
        ] combines Knowledge Graph Embedding (KGE) with language modeling;
quaternion-based models [61] capture implicit semantics. However, all require substantial training data.
HubLink uses general-purpose embeddings for textualized hubs, providing structured context without
task-specific fine-tuning, ensuring portability across KGs.
      </p>
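As a rough, hypothetical sketch of this embedding-based indexing idea (not the authors' implementation), the following indexes textualized hub paths offline and ranks them by cosine similarity at query time, with the hub identifier preserving provenance. The `embed()` function here is a toy bag-of-words stand-in for a general-purpose embedding model such as mxbai-embed-large, and the hubs and paths are invented for illustration.

```python
import numpy as np

VOCAB = {}

def embed(text, dim=64):
    """Toy bag-of-words embedding; a stand-in for a general-purpose model."""
    toks = text.lower().split()
    for tok in toks:
        VOCAB.setdefault(tok, len(VOCAB))
    v = np.zeros(dim)
    for tok in toks:
        v[VOCAB[tok] % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Hypothetical hubs: textualized paths rooted at a publication node.
hubs = {
    "paper_42": ["paper_42 -> evaluates -> microservice latency",
                 "paper_42 -> has_author -> A. Kaplan"],
    "paper_77": ["paper_77 -> proposes -> architecture recovery",
                 "paper_77 -> has_author -> J. Keim"],
}

# Offline indexing: one embedding per path, tagged with its source hub.
index = [(hub, path, embed(path))
         for hub, paths in hubs.items() for path in paths]

def retrieve(question, top_k=2):
    """Rank indexed paths by cosine similarity to the question embedding."""
    q = embed(question)
    ranked = sorted(index, key=lambda e: -float(q @ e[2]))
    return [(hub, path) for hub, path, _ in ranked[:top_k]]  # hub id = provenance

print(retrieve("which paper evaluates microservice latency")[0])
```

Because the per-path embeddings are computed once offline, query time reduces to a nearest-neighbor lookup, and every returned path carries its hub of origin, which is the property the comparison above highlights.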
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion &amp; Future Work</title>
      <p>Current KGQA systems face critical limitations: schema dependence, training data requirements, and
lack of standardized evaluation. This paper introduces HubLink, a schema-agnostic, training-free
retrieval approach for scholarly KGQA. HubLink transforms RKGs into hubs, semantically coherent
knowledge clusters, enabling modular reasoning with full provenance tracking. The embedding-based
retrieval uses LLM capabilities to eliminate schema dependencies and adapt to evolving KG structures.</p>
      <p>Evaluation on ORKG with SWA data showed HubLink outperformed DiFaR, FiDeLiS, and MindMap
across almost all metrics, excelling particularly in handling complex queries. The evaluation dataset
based on our question taxonomy demonstrated HubLink’s robustness. These results answer our second
research question, demonstrating superior performance compared to state-of-the-art approaches on
scholarly datasets.</p>
      <p>This work demonstrates that HubLink's schema-agnostic, training-free approach with RKGs can
improve access to scholarly knowledge, paving the way for more effective and scalable KGQA systems.</p>
      <p>However, the evaluation also showed areas for improvement, including the following:
Improving Answer Generation. Our evaluation revealed limited alignment between retrieved
contexts and generated answers. Future work should explore advanced prompt engineering and
alternative synthesis methods beyond single-prompt generation.</p>
      <p>Enhancing the Evaluation Scope. Current KGQA datasets operate at label-based granularity
(Section 3.1.3). Extending HubLink to alternative text granularities could validate findings and enable
HubLink’s linking feature, which is ineffective with abstract label-based data.</p>
      <p>Improving Relevancy Ranking. Results showed limitations in prioritizing relevant contexts.
Future work should explore refined prompts or post-retrieval reranking mechanisms.</p>
      <p>Augmenting Hub Content. Enriching hub indices with summaries or precomputed aggregations
could enhance retrieval accuracy. This remains an open area for development and empirical validation.</p>
      <p>Handling Numerical Constraints. While HubLink demonstrates temporal reasoning, complex
numerical filtering remains underexplored and requires targeted evaluation.</p>
      <p>Document-Based Retrieval Settings. Transitioning HubLink to document retrieval scenarios,
where the KG supports source-aware retrieval, represents a promising direction. Preliminary work
includes baseline implementations (e.g., LightRAG, GraphRAG) and structured knowledge extraction
from scientific texts, though full evaluation remains pending.</p>
      <p>Application to Alternative Graphs and Domains. The evaluation of HubLink was conducted on
the ORKG as knowledge base within the SWA domain. Future work should assess the generalizability
across diverse RDF-based RKGs and other domains.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)
under the National Research Data Infrastructure – NFDI 52/1 – project number 501930651, NFDIxCS,
supported by funding from the pilot program Core Informatics at KIT (KiKIT) of the Helmholtz
Association (HGF) and supported by KASTEL Security Research Labs.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: grammar and
spelling check, paraphrase and reword. After using these tools/services, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
      <p>[17] A. Alinejad, K. Kumar, A. Vahdat, Evaluating the retrieval component in llm-based question
answering systems, 2024. URL: https://arxiv.org/abs/2406.06458.
answering systems, 2024. URL: https://arxiv.org/abs/2406.06458.
[18] H. Yu, A. Gan, K. Zhang, S. Tong, Q. Liu, Z. Liu, Evaluation of retrieval-augmented generation:
A survey, in: W. Zhu, H. Xiong, X. Cheng, L. Cui, Z. Dou, J. Dong, S. Pang, L. Wang, L. Kong,
Z. Chen (Eds.), Big Data, Springer Nature Singapore, 2025, pp. 102–120.
[19] OpenAI, gpt-4o-mini model, https://platform.openai.com/docs/models/gpt-4o-mini, 2024.
Accessed: 2025-09-01.
[20] D. Banerjee, S. Awale, R. Usbeck, C. Biemann, Dblp-quad: A question answering dataset over the
dblp scholarly knowledge graph, 2023. URL: https://arxiv.org/abs/2303.13351.
[21] D. Q. Nguyen, D. Q. Nguyen, S. B. Pham, Ripple down rules for question answering, Semantic
Web 8 (2017) 511–532. doi:10.3233/SW-150204.</p>
      <p>
[22] S. Easterbrook, J. Singer, M.-A. Storey, D. Damian, Selecting empirical methods for software
engineering research, in: Guide to advanced empirical software engineering, Springer, 2008, pp.
285–311.
[23] J. T. Dillon, The classification of research questions, Review of Educational Research 54 (1984)
327–361. doi:10.3102/00346543054003327.
[24] S. K. Ratan, T. Anand, J. Ratan, Formulation of research question - stepwise approach, J. Indian
Assoc. Pediatr. Surg. 24 (2019) 15–20.</p>
      <p>
[25] M. Dubey, D. Banerjee, A. Abdelkawi, J. Lehmann, Lc-quad 2.0: A large dataset for complex
question answering over wikidata and dbpedia, in: C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek,
I. Cruz, A. Hogan, J. Song, M. Lefrançois, F. Gandon (Eds.), The Semantic Web – ISWC 2019,
Springer International Publishing, 2019, pp. 69–78.
[26] V. Bolotova, V. Blinov, F. Scholer, W. B. Croft, M. Sanderson, A non-factoid question-answering
taxonomy, in: Proceedings of the 45th International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’22, Association for Computing Machinery, New
York, NY, USA, 2022, p. 1196–1207. doi:10.1145/3477495.3531926.
[27] M. Y. Jaradeh, M. Stocker, S. Auer, Question answering on scholarly knowledge graphs, in:
M. Hall, T. Merčun, T. Risse, F. Duchateau (Eds.), Digital Libraries for Open Knowledge, Springer
International Publishing, 2020, pp. 19–32.
[28] M. Konersmann, A. Kaplan, T. Kühn, R. Heinrich, A. Koziolek, R. H. Reussner, J. Jürjens, M.
alDoori, N. Boltz, M. Ehl, D. Fuchß, K. Großer, S. Hahner, J. Keim, M. Lohr, T. Saglam, S. Schulz,
J. Töberg, Evaluation methods and replicability of software architecture research objects, in:
19th IEEE International Conference on Software Architecture, ICSA 2022, Honolulu, HI, USA,
March 12-15, 2022, IEEE, 2022, pp. 157–168. URL: https://doi.org/10.1109/ICSA53651.2022.00023.
doi:10.1109/ICSA53651.2022.00023.
[29] M. Konersmann, A. Kaplan, T. Kühn, 2022. URL: https://gitlab.com/SoftwareArchitectureResearch/
StateOfPractice/-/wikis/Data-Extraction, [Last accessed on 2025-08-01].</p>
      <p>
[30] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying large language models and knowledge
graphs: A roadmap, IEEE Transactions on Knowledge and Data Engineering 36 (2024) 3580–3599.
URL: https://doi.org/10.1109/TKDE.2024.3352100. doi:10.1109/TKDE.2024.3352100.</p>
      <p>
[31] B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, S. Tang, Graph retrieval-augmented
generation: A survey, CoRR abs/2408.08921 (2024). URL: https://doi.org/10.48550/arXiv.2408.08921.
doi:10.48550/ARXIV.2408.08921. arXiv:2408.08921.
[32] J. Baek, A. F. Aji, J. Lehmann, S. J. Hwang, Direct fact retrieval from knowledge graphs without
entity linking, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association
for Computational Linguistics, Toronto, Canada, 2023, pp. 10038–10055. URL: https://aclanthology.
org/2023.acl-long.558/. doi:10.18653/v1/2023.acl-long.558.
[33] Y. Sui, Y. He, N. Liu, X. He, K. Wang, B. Hooi, Fidelis: Faithful reasoning in large language models
for knowledge graph question answering, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar
(Eds.), Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria,
July 27 - August 1, 2025, Association for Computational Linguistics, 2025, pp. 8315–8330. URL:
https://aclanthology.org/2025.findings-acl.436/.
[34] Y. Wen, Z. Wang, J. Sun, MindMap: Knowledge graph prompting sparks graph of thoughts in
large language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 10370–10388. URL: https:
//aclanthology.org/2024.acl-long.558/. doi:10.18653/v1/2024.acl-long.558.
[35] D. C. Montgomery, Design and analysis of experiments, John Wiley &amp; Sons, 2017.
[36] OpenAI, gpt-4o model, https://platform.openai.com/docs/models/gpt-4o, 2024. Accessed: 2025-09-01.
[37] OpenAI, o3-mini model, https://platform.openai.com/docs/models/o3-mini, 2025. Accessed: 2025-09-01.
[38] Q. Team, Qwen2.5: A party of foundation models, 2024. URL: https://qwenlm.github.io/blog/qwen2.5/.</p>
      <p>
[39] A. Grattafiori, et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.</p>
      <p>
[40] S. Lee, A. Shakir, D. Koenig, J. Lipp, Open source strikes bread - new fluffy embedding model, 2024. URL: https://www.mixedbread.ai/blog/mxbai-embed-large-v1.</p>
      <p>
[41] I. Granite Team, Granite 3.0 language models, 2024. URL: https://github.com/ibm-granite/granite-3.0-language-models/, accessed: 2025-09-01.</p>
      <p>
[42] L. Zhang, J. Zhang, X. Ke, H. Li, X. Huang, Z. Shao, S. Cao, X. Lv, A survey on complex factual
question answering, AI Open 4 (2023) 1–12. URL: https://doi.org/10.1016/j.aiopen.2022.12.003.
doi:10.1016/J.AIOPEN.2022.12.003.</p>
      <p>[43] M. Y. Jaradeh, M. Stocker, S. Auer, Question answering on scholarly knowledge graphs, in: M. M.
Hall, T. Mercun, T. Risse, F. Duchateau (Eds.), Digital Libraries for Open Knowledge - 24th
International Conference on Theory and Practice of Digital Libraries, TPDL 2020, Lyon, France, August
25-27, 2020, Proceedings, volume 12246 of Lecture Notes in Computer Science, Springer, 2020, pp. 19–
32. URL: https://doi.org/10.1007/978-3-030-54956-5_2. doi:10.1007/978-3-030-54956-5\_2.
[44] D. Banerjee, S. Awale, R. Usbeck, C. Biemann, Dblp-quad: A question answering dataset over the
DBLP scholarly knowledge graph, in: I. Frommholz, P. Mayr, G. Cabanac, S. Verberne, J. Brennan
(Eds.), Proceedings of the 13th International Workshop on Bibliometric-enhanced Information
Retrieval co-located with 45th European Conference on Information Retrieval (ECIR 2023), Dublin,
Ireland, April 2nd, 2023, volume 3617 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp.
37–51. URL: https://ceur-ws.org/Vol-3617/paper-05.pdf.
[45] M. Li, H. Yang, Z. Liu, M. M. Alam, H. Sack, G. A. Gesese, et al., Kgmistral: Towards boosting
the performance of large language models for question answering with knowledge graph
integration, in: Workshop on Deep Learning and Large Language Models for Knowledge Graphs, 2024.
URL: https://www.fiz-karlsruhe.de/sites/default/files/FIZ/Dokumente/Forschung/ISE/Publications/
Conferences-Workshops/7-KGMistral-Towards-Boosting-t.pdf.</p>
      <p>[46] T. A. Tafa, R. Usbeck, Leveraging llms in scholarly knowledge graph question answering, in:
QALD/SemREC@ISWC, 2023. URL: https://ceur-ws.org/Vol-3592/paper5.pdf.
[47] J. Lehmann, A. Meloni, E. Motta, F. Osborne, D. R. Recupero, A. A. Salatino, S. Vahdati, Large
language models for scientific question answering: An extensive analysis of the sciqa benchmark,
in: A. Meroño-Peñuela, A. Dimou, R. Troncy, O. Hartig, M. Acosta, M. Alam, H. Paulheim,
P. Lisena (Eds.), The Semantic Web - 21st International Conference, ESWC 2024, Hersonissos,
Crete, Greece, May 26-30, 2024, Proceedings, Part I, volume 14664 of Lecture Notes in Computer
Science, Springer, 2024, pp. 199–217. URL: https://doi.org/10.1007/978-3-031-60626-7_11. doi:10.
1007/978-3-031-60626-7\_11.
[48] L. Jiang, X. Yan, R. Usbeck, A structure and content prompt-based method for knowledge graph
question answering over scholarly data, in: D. Banerjee, R. Usbeck, N. Mihindukulasooriya,
G. Singh, R. Mutharaju, P. Kapanipathi (Eds.), Joint Proceedings of Scholarly QALD 2023 and
SemREC 2023 co-located with 22nd International Semantic Web Conference ISWC 2023, Athens,
Greece, November 6-10, 2023, volume 3592 of CEUR Workshop Proceedings, CEUR-WS.org, 2023.
URL: https://ceur-ws.org/Vol-3592/paper3.pdf.</p>
      <p>
[49] Y. Gu, V. Pahuja, G. Cheng, Y. Su, Knowledge base question answering: A semantic parsing
[61] M. Nayyeri, Z. Wang, M. M. Akter, M. M. Alam, M. R. A. H. Rony, J. Lehmann, S. Staab, Integrating
knowledge graph embedding and pretrained language models in hypercomplex spaces, CoRR
abs/2208.02743 (2022). URL: https://doi.org/10.48550/arXiv.2208.02743. doi:10.48550/ARXIV.
2208.02743. arXiv:2208.02743.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Auer, V. Kovtun, M. Prinz, A. Kasprzik, M. Stocker, M. Vidal, Towards a knowledge graph for science, in: R. Akerkar, M. Ivanovic, S. Kim, Y. Manolopoulos, R. Rosati, M. Savic, C. Badica, M. Radovanovic (Eds.), Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, WIMS 2018, Novi Sad, Serbia, June 25-27, 2018, ACM, 2018, pp. 1:1-1:6. URL: https://doi.org/10.1145/3227609.3227689. doi:10.1145/3227609.3227689.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D'Souza, G. Kismihók, M. Stocker, S. Auer, Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge, in: M. Kejriwal, P. A. Szekely, R. Troncy (Eds.), Proceedings of the 10th International Conference on Knowledge Capture, K-CAP 2019, Marina Del Rey, CA, USA, November 19-21, 2019, ACM, 2019, pp. 243-246. URL: https://doi.org/10.1145/3360901.3364435. doi:10.1145/3360901.3364435.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Yadav</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Afzal</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Tsatsaronis</surname></string-name>,
          <article-title>Unsupervised dense retrieval for scientific articles</article-title>,
          in:
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lazaridou</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: EMNLP 2022 - Industry Track</source>,
          Abu Dhabi, UAE, December 7-11, <year>2022</year>, Association for Computational Linguistics, 2022, pp.
          <fpage>313</fpage>-<lpage>321</lpage>.
          URL: https://doi.org/10.18653/v1/2022.emnlp-industry.32. doi:10.18653/V1/2022.EMNLP-INDUSTRY.32.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>L.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Ding</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Wu</surname></string-name>,
          <article-title>Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling</article-title>,
          <source>IEEE Trans. Knowl. Data Eng.</source>
          <volume>36</volume> (<year>2024</year>)
          <fpage>3091</fpage>-<lpage>3110</lpage>.
          URL: https://doi.org/10.1109/TKDE.2024.3360454. doi:10.1109/TKDE.2024.3360454.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>P.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Perez</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Piktus</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Petroni</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Karpukhin</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Küttler</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Yih</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Rocktäschel</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Riedel</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Kiela</surname></string-name>,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>,
          in:
          <string-name><given-names>H.</given-names> <surname>Larochelle</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ranzato</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Hadsell</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Balcan</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lin</surname></string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020</source>,
          December 6-12, <year>2020</year>, virtual, 2020.
          URL: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Y.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Xiong</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Jia</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pan</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Bi</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Dai</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>,
          <source>CoRR abs/2312.10997</source> (<year>2023</year>).
          URL: https://doi.org/10.48550/arXiv.2312.10997. doi:10.48550/ARXIV.2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>A.</given-names> <surname>Kaplan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Keim</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Schneider</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Reussner</surname></string-name>,
          <article-title>Replication package for HubLink: A novel question answering retrieval approach over knowledge graphs</article-title>,
          <year>2025</year>.
          URL: https://doi.org/10.5281/zenodo.17036243. doi:10.5281/zenodo.17036243.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>S.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Xiong</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Cui</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yuan</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Kuo</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Guan</surname></string-name>,
          <string-name><given-names>C. J.</given-names> <surname>Xue</surname></string-name>,
          <article-title>Retrieval-augmented generation for natural language processing: A survey</article-title>,
          <source>CoRR abs/2407.13193</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2407.13193. doi:10.48550/ARXIV.2407.13193. arXiv:2407.13193.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>M.</given-names> <surname>Douze</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Guzhva</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Deng</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Johnson</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Szilvasy</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Mazaré</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lomeli</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Hosseini</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Jégou</surname></string-name>,
          <article-title>The faiss library</article-title>,
          <source>CoRR abs/2401.08281</source> (<year>2024</year>).
          URL: https://doi.org/10.48550/arXiv.2401.08281. doi:10.48550/ARXIV.2401.08281. arXiv:2401.08281.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>V. R.</given-names> <surname>Basili</surname></string-name>,
          <string-name><given-names>D. M.</given-names> <surname>Weiss</surname></string-name>,
          <article-title>A methodology for collecting valid software engineering data</article-title>,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>10</volume> (<year>1984</year>)
          <fpage>728</fpage>-<lpage>738</lpage>.
          URL: https://doi.org/10.1109/TSE.1984.5010301. doi:10.1109/TSE.1984.5010301.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>V. R.</given-names> <surname>Basili</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Caldiera</surname></string-name>,
          <string-name><given-names>H. D.</given-names> <surname>Rombach</surname></string-name>,
          <article-title>The goal question metric approach</article-title>,
          <source>Encyclopedia of Software Engineering</source> (<year>1994</year>)
          <fpage>528</fpage>-<lpage>532</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>C.-Y.</given-names> <surname>Lin</surname></string-name>,
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>,
          in:
          <source>Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (WAS 2004)</source>,
          Association for Computational Linguistics, Barcelona, Spain, <year>2004</year>, pp.
          <fpage>74</fpage>-<lpage>81</lpage>.
          URL: https://aclanthology.org/W04-1013.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>K.</given-names> <surname>Papineni</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Roukos</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ward</surname></string-name>,
          <string-name><given-names>W.-J.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>,
          in:
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>,
          <year>2002</year>, pp.
          <fpage>311</fpage>-<lpage>318</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Kishore</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>K. Q.</given-names> <surname>Weinberger</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Artzi</surname></string-name>,
          <article-title>BERTScore: Evaluating Text Generation with BERT</article-title>
          (<year>2020</year>).
          URL: https://openreview.net/forum?id=SkeHuCVFDr.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>S.</given-names> <surname>Es</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>James</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Espinosa Anke</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Schockaert</surname></string-name>,
          <article-title>RAGAs: Automated evaluation of retrieval augmented generation</article-title>,
          in:
          <string-name><given-names>N.</given-names> <surname>Aletras</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>De Clercq</surname></string-name>
          (Eds.),
          <source>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations</source>,
          Association for Computational Linguistics, St. Julians, Malta, <year>2024</year>, pp.
          <fpage>150</fpage>-<lpage>158</lpage>.
          doi:10.18653/v1/2024.eacl-demo.16.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>OpenAI, text-embedding-3-large</article-title>
          , text-embedding-3-large,
          <year>2024</year>
          . Accessed:
          <fpage>2025</fpage>
          -09-01. perspective, in: S.
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Vlachos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Thorne</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Rei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Petroni</surname>
          </string-name>
          (Eds.),
          <source>4th Conference on Automated Knowledge Base Construction, AKBC</source>
          <year>2022</year>
          , London, UK, November 3-
          <issue>5</issue>
          ,
          <year>2022</year>
          ,
          <year>2022</year>
          . URL: https://akbc.ws/2022/papers/23_knowledge_base_question_answer.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [50]
          <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Gong</surname></string-name>,
          <string-name><given-names>L. M.</given-names> <surname>Ni</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Shum</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Guo</surname></string-name>,
          <article-title>Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph</article-title>,
          in:
          <source>The Twelfth International Conference on Learning Representations, ICLR 2024</source>,
          Vienna, Austria, May 7-11, <year>2024</year>, OpenReview.net, 2024.
          URL: https://openreview.net/forum?id=nnVO1PvbTv.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [51]
          <string-name><given-names>C.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Fei</surname></string-name>,
          <article-title>Knowledge solver: Teaching LLMs to search for domain knowledge from knowledge graphs</article-title>,
          <source>CoRR abs/2309.03118</source> (<year>2023</year>).
          URL: https://doi.org/10.48550/arXiv.2309.03118. doi:10.48550/ARXIV.2309.03118. arXiv:2309.03118.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [52]
          <string-name><given-names>L.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Tao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Arakawa</surname></string-name>,
          <article-title>ODA: observation-driven agent for integrating llms and knowledge graphs</article-title>,
          in:
          <string-name><given-names>L.</given-names> <surname>Ku</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Srikumar</surname></string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics, ACL 2024</source>,
          Bangkok, Thailand and virtual meeting, August 11-16, <year>2024</year>, Association for Computational Linguistics, 2024, pp.
          <fpage>7417</fpage>-<lpage>7431</lpage>.
          URL: https://doi.org/10.18653/v1/2024.findings-acl.442. doi:10.18653/V1/2024.FINDINGS-ACL.442.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [53]
          <string-name><given-names>B.</given-names> <surname>Jin</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Xie</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>K. K.</given-names> <surname>Roy</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Meng</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Han</surname></string-name>,
          <article-title>Graph chain-of-thought: Augmenting large language models by reasoning on graphs</article-title>,
          in:
          <string-name><given-names>L.</given-names> <surname>Ku</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Srikumar</surname></string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics, ACL 2024</source>,
          Bangkok, Thailand and virtual meeting, August 11-16, <year>2024</year>, Association for Computational Linguistics, 2024, pp.
          <fpage>163</fpage>-<lpage>184</lpage>.
          URL: https://doi.org/10.18653/v1/2024.findings-acl.11. doi:10.18653/V1/2024.FINDINGS-ACL.11.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [54]
          <string-name><given-names>J.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Ye</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wen</surname></string-name>,
          <article-title>Structgpt: A general framework for large language model to reason over structured data</article-title>,
          in:
          <string-name><given-names>H.</given-names> <surname>Bouamor</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pino</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Bali</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</source>,
          Singapore, December 6-10, <year>2023</year>, Association for Computational Linguistics, 2023, pp.
          <fpage>9237</fpage>-<lpage>9251</lpage>.
          URL: https://doi.org/10.18653/v1/2023.emnlp-main.574. doi:10.18653/V1/2023.EMNLP-MAIN.574.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation</article-title>
          , in:
          <source>The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025</source>
          , OpenReview.net,
          <year>2025</year>
          . URL: https://openreview.net/forum?id=oFBu7qaZpS.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Generate-on-graph: Treat LLM as both agent and KG in incomplete knowledge graph question answering</article-title>
          ,
          <source>CoRR abs/2404.14741</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2404.14741. doi:10.48550/ARXIV.2404.14741. arXiv:2404.14741.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>KG-GPT: A general framework for reasoning on knowledge graphs using large language models</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023</source>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>9410</fpage>
          -
          <lpage>9421</lpage>
          . URL: https://doi.org/10.18653/v1/2023.findings-emnlp.631. doi:10.18653/V1/2023.FINDINGS-EMNLP.631.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Reasoning on efficient knowledge paths: Knowledge graph guides large language model for domain question answering</article-title>
          , in:
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          (Eds.),
          <source>IEEE International Conference on Knowledge Graph, ICKG 2023, Shanghai, China, December 1-2, 2023</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>149</lpage>
          . URL: https://doi.org/10.1109/ICKG63256.2024.00026. doi:10.1109/ICKG63256.2024.00026.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Pretrain-KGE: Learning knowledge representation from pretrained language models</article-title>
          , in:
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020</source>
          , volume EMNLP 2020 of Findings of ACL, Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>266</lpage>
          . URL: https://doi.org/10.18653/v1/2020.findings-emnlp.25. doi:10.18653/V1/2020.FINDINGS-EMNLP.25.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>KEPLER: A unified model for knowledge embedding and pre-trained language representation</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>176</fpage>
          -
          <lpage>194</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00360. doi:10.1162/TACL_A_00360.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>