<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Knowledge</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.2200/S01125ED1V01Y202109DSK022</article-id>
      <title-group>
        <article-title>GenKD: Generative Knowledge Discovery through Knowledge Graphs and Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fouad Zablith</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shadi Youssef</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathieu d'Aquin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LORIA, Université de Lorraine, CNRS</institution>
          ,
          <addr-line>Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Suliman S. Olayan School of Business, American University of Beirut</institution>
          ,
          <addr-line>PO Box 11-0236, Riad El Solh, 1107 2020, Beirut</addr-line>
          ,
          <country country="LB">Lebanon</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>12</volume>
      <issue>2021</issue>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>With the continuous growth of data published on the web, knowledge discovery is getting increasingly challenging. This challenge is mainly driven by the knowledge discovery process that often requires the continuous aggregation and exploration of questions and patterns that span local and external knowledge sources. This work investigates the facilitation of knowledge discovery over distributed sources of knowledge on the web. We present GenKD, a Generative Knowledge Discovery framework that leverages the semantic interconnectedness of knowledge graphs and the generative capabilities of Large Language Models (LLMs). GenKD enables, through a human-AI collaborative process, the automatic generation of relevant questions, executable queries, and visualizations to uncover patterns from local and external knowledge graph sources. We demonstrate the feasibility of the proposed framework through a case study in the context of bee colonies and stressors.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge discovery</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>large language models</kwd>
        <kwd>data exploration</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>human-AI collaboration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge discovery and data exploration often involve manipulating existing data to discover
interesting insights [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Insights usually emerge through an iterative process of pattern detection
and question answering over the available data, which are often distributed among local and external
sources [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Such sources have been increasingly available on the web in the form of knowledge graphs
and linked data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As a result, research on knowledge exploration on the web of data has been
advancing in several directions [5]. For example, some related research focused on query exploration by
example [6]. Other works studied how to enrich and augment knowledge graphs through manual and
automatic entity linking and alignment [
        <xref ref-type="bibr" rid="ref4">4, 7, 8, 9</xref>
        ]. More recently, there has been an increased interest
in leveraging Large Language Models (LLMs) to support data visualization [10, 11, 12] and SPARQL
query generation from natural language [13, 14]. However, it is still challenging for data analysts to
manage local and external knowledge sources to generate relevant questions, construct answerable
queries from the combined data, and design visualizations that support their discovery endeavors.
      </p>
      <p>This demo paper investigates the following research question: How can we facilitate the knowledge
discovery process over distributed sources of knowledge on the web? We present GenKD, a novel
Generative Knowledge Discovery framework that leverages the semantic data connections of knowledge
graphs and the generative capabilities of Large Language Models (LLMs). GenKD supports, through
a human-AI collaborative process, the automatic generation of relevant questions, executable queries,
and data visualizations over local and external knowledge graph sources. A demo is implemented in
the context of bee colonies and stressors in the US to illustrate the feasibility of the proposed
framework. It includes a web-accessible prototype designed to enable a user-guided LLM process to generate
questions, queries, and visualizations that are adaptive to the local and external knowledge graph
contexts. This work contributes to advancing knowledge discovery at a web scale through a human-guided
process building on the semantically rich capabilities of knowledge graphs and the generative features
of LLMs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Knowledge discovery and exploration is an iterative process that involves several steps [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]. It usually
starts with one or several data sources, depending on the data analyst’s goals and context, followed
by processing and manipulating the data to identify interesting patterns and insights [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Knowledge
graphs have been providing substantial opportunities for organizing data at a web scale [
        <xref ref-type="bibr" rid="ref4">4, 15</xref>
        ]. They
are seen as enablers of exploratory data analytics tasks in general, and more specifically at the level
of exploratory search tasks [16]. Existing efforts have focused on supporting analysts in exploring
their knowledge graph data by example [6]. They offer the potential to reverse engineer knowledge
graph SPARQL queries based on user-provided examples. However, with the interconnected nature of
knowledge graphs, it becomes more challenging for analysts to anticipate the potential questions that
may be answered by their local data sources, especially when combined with other sources.
      </p>
      <p>
        Uncovering interesting questions during the knowledge discovery process often requires the fusion
of multiple knowledge sources. To answer such questions, knowledge graphs that focus on certain
contexts must be augmented with additional information. Aligning and mapping knowledge entities is
a core task to enrich and construct knowledge graphs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This entity linking aspect can be done using
tools for linking graphs [7, 8] and by relying on the entities’ external identifiers and manual efforts [9].
We observe that such efforts focus on the backend aspects of linking and mapping knowledge graphs.
However, knowledge discovery requires human guidance and several iterations. This process expects
analysts to ask the right questions based on the data, and to develop queries that can be executed to
fetch and visualize the data to discover insights [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Recently, we have witnessed an increased adoption of Large Language Models such as ChatGPT,
Claude, and Gemini to perform a wide range of tasks [17]. For example, LLMs have been investigated
in the automatic generation of data visualizations [10, 11, 12]. In the context of knowledge graphs,
there are increased attempts to leverage language models in accessing knowledge. For example, some
efforts were invested in supporting the generation of queries from natural language [13, 14]. In this
context, it is assumed that analysts are aware of the questions that they are aiming to answer, or of the
potential questions that they may be able to answer through the knowledge graphs. This may often not be
the case, particularly in situations where the analyst needs more guidance in the question generation
process, which may lead to serendipitous findings [18]. Furthermore, another challenge is at the level of
combining the existing datasets with external knowledge sources to increase the potential richness of
findings aligned with the analysts’ needs. We see an opportunity to further study LLMs and knowledge
graphs’ capabilities in advancing knowledge discovery on the web of data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach: GenKD Framework</title>
      <p>
        Building on the knowledge discovery process [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we propose GenKD, a Generative Knowledge
Discovery framework that aims to help users collaborate more effectively with AI during the knowledge discovery
process. Figure 1 illustrates the framework’s components. One of the challenges in knowledge
discovery is to understand the data available and the potential of extending such data with relevant external
sources that may fill some knowledge gaps. Understanding the context of analysis is not only relevant
to the analyst, but is also a key requirement for guiding and prompting the LLM with appropriate
knowledge entities. The Context Generator component aims at representing the local context from the
user’s knowledge graph, with the potential extension to external knowledge context sources such as
the linked open data cloud. Knowledge graphs facilitate the generation of such contexts on the fly
through a series of SPARQL queries sent to local and external endpoints.
      </p>
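As an illustrative sketch of how such a context could be generated on the fly (the function names, endpoint handling, and query shape below are our assumptions, not GenKD's actual code), a neighbourhood query per clicked entity can be built and sent to a SPARQL endpoint following the SPARQL 1.1 Protocol:

```javascript
// Sketch: build a SPARQL query that fetches the immediate neighbourhood of a
// clicked entity; the query is sent to a local or external endpoint.
function buildContextQuery(entityUri, limit = 50) {
  return `SELECT ?p ?o WHERE { <${entityUri}> ?p ?o . } LIMIT ${limit}`;
}

// Sketch: query an endpoint over HTTP (SPARQL 1.1 Protocol) and flatten the
// JSON result bindings into "predicate object" context lines for prompting.
async function fetchContext(endpoint, entityUri) {
  const url = `${endpoint}?query=${encodeURIComponent(buildContextQuery(entityUri))}`;
  const res = await fetch(url, {
    headers: { Accept: "application/sparql-results+json" },
  });
  const json = await res.json();
  return json.results.bindings.map((b) => `${b.p.value} ${b.o.value}`);
}
```

In this sketch, one such call per clicked node (against the local endpoint or, e.g., DBpedia) would yield the context fragments that are later combined into prompts.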
      <p>The generated context, in the form of knowledge graph entities, is combined into prompts using the
Prompt Builder component. The prompts in the framework include question prompts, query prompts,
and pattern prompts.</p>
      <p>[Figure 1: The GenKD framework, connecting the human (local knowledge graph, new entities, insights, and user refinements) and the AI (LLM) through the Context Generator, Prompt Builder, Questions Generator, SPARQL Builder, Data Integrator, and Visualization Recommender components, with external knowledge drawn from the linked open data cloud.]</p>
      <p>Question prompts are designed to guide the LLM to generate questions that are
relevant to the local and external knowledge contexts through the Questions Generator component. A
clearly articulated context (local and external) plays an important role in crafting prompts that assist the
LLM to generate questions that are answerable by the data. The suggested questions are then used to
generate the SPARQL queries through the SPARQL Builder component. At this level, query prompts are
designed to construct SPARQL queries that can be executed against both local and external knowledge
graph endpoints. The resulting data is then transformed into a format (e.g., tabular) that can be easily
accessed and processed by the Visualization Recommender component. At this level, pattern prompts
are used for the LLM to select data that may contribute to interesting patterns (e.g., highly correlated
data fields) and suggest appropriate visualizations according to the data types and question scopes.
In addition to the choice of visualizations, the LLM is prompted to generate the visualization code
and required libraries to use in the visualization process. While the framework provides predesigned
prompts, it supports human-AI collaboration by allowing users to guide the LLM through custom
prompts according to their needs with respect to the selection of knowledge graph entities, questions
generated, queries suggested, and visualizations recommended.</p>
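A minimal sketch of what such a prompt builder might look like follows; the function and parameter names are illustrative assumptions, and the wording only loosely mirrors the demo's actual prompts:

```javascript
// Sketch of a prompt builder combining a local context (clicked knowledge graph
// entities) with an external context (e.g., DBpedia neighbours) into a question
// prompt for the LLM. Names and phrasing are illustrative, not GenKD's code.
function buildQuestionPrompt(localEntities, externalEntities, datasetUri) {
  return [
    `Based on the information retrieved from DBpedia about: ${externalEntities.join(", ")}.`,
    `Suggest relevant questions that are answerable from DBpedia and relevant to the dataset: ${datasetUri}.`,
    `The node is connected to: ${localEntities.join(", ")}.`,
    "Avoid questions that are too general or require information beyond DBpedia.",
  ].join("\n");
}
```

Keeping full URIs in the prompt, as here, is what lets the LLM later produce queries that actually resolve against the endpoints.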
    </sec>
    <sec id="sec-4">
      <title>4. GenKD Demo: a Use Case on Bee Colonies in the US</title>
      <p>We demonstrate the feasibility of the proposed framework through a case study in the context of
exploring a knowledge graph in the domain of analyzing statistics about bee colonies. The data contains
information about bee colonies (i.e., number of colonies, colonies lost or added, their percent lost, etc.)
and bee stressors (i.e., stress type, month range, percent impacted by stress, etc.) in the United States.
We built a prototype of the GenKD framework based on a JavaScript application that can be loaded
directly in a browser. An OpenAI key is needed to enable the LLM features. A video recording that
showcases the key demo features is also available. Figure 2 shows details of the main application
features.</p>
      <p>The data was transformed into a knowledge graph from the following data source: https://github.com/rfordatascience/tidytuesday/blob/main/data/2022/2022-01-11/readme.md. The GenKD prototype is available online: https://linked.aub.edu.lb/apps/genkd. The video recording is available at: https://youtu.be/Cnj05zOz1pY.</p>
      <p>Users can choose an existing knowledge graph to explore, or specify an endpoint where their local
knowledge graph is hosted (in the current demo implementation, the endpoint is required to follow the RDF Data Cube vocabulary, https://www.w3.org/TR/vocab-data-cube/, to control access to the data measures and dimensions). The application then loads the data and visualizes it as a graph for the
user to interact with. The user can start the discovery process by clicking and expanding the nodes of
interest. In the background, the context generator component captures the context of the knowledge
graph by recording the nodes clicked by the user. For example, in Figure 2, the user clicked a node path
that included the Bees Stressor dataset and the Nebraska dimension instance. After the user finishes
interacting with the data, the prompt builder component combines the clicked nodes with potential
anchors from external linked open data endpoints. The DBpedia endpoint is used in this demo. The
external context is fleshed out by traversing the related entities on DBpedia. For example, the state
Nebraska is anchored in the https://dbpedia.org/page/Nebraska resource, and related
entities (e.g., dbo:areaLand or dbp:areaTotalKm) are combined to generate the external
context. The Prompt Builder component takes the combined contexts and generates question prompts
that are passed to the LLM using the OpenAI API. Having explicit semantic contexts with full DBpedia
and local URIs passed to the LLM increases the chances of having queries answerable by the endpoints.</p>
      <p>The LLM generates a set of questions that are relevant to the context, as shown in Figure 2. For
example, one of the questions proposed by the AI model was: How might Nebraska’s elevation influence
the types of plants available for bee foraging and consequently the stressors they face? It is worth noting
that the question proposed by the LLM cannot be answered directly by the local knowledge graph on bees,
as it does not contain the elevation of Nebraska. This exploratory new question, which is a core feature
of the framework, was enabled by the external context derived from the online DBpedia endpoint.</p>
      <p>The user can then select a question proposed by the AI model that may be of interest, upon which the
SPARQL Builder component is triggered through a Query Prompt. The prompts at this level are designed
to generate executable queries that can fetch the needed data to answer the question. For example,
to answer this specific question, the elevation of Nebraska needs to be extracted from the DBpedia
endpoint. The LLM then automatically generates the SPARQL query, with the option to push the
new data to the local knowledge graph through the Data Integrator component. In the following step,
the user can trigger the Visualization Recommender component that automatically checks potential
correlations between data entities, generates the visualization code through the LLM, and renders
the visualization in the application using JavaScript. The lower part of Figure 2 shows the generated
visualization of a scatter plot proposed by the LLM to visualize the relationship between the elevation of
states and the bees’ stress percentage. The user can manipulate the visualization by further prompting
the LLM, without the need to write any code. Table 1 shows a sample of prompts created by the prompt
builder component.</p>
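The correlation check performed before recommending a visualization could, for instance, be sketched with a plain Pearson coefficient over candidate field pairs. This is a hypothetical illustration; in the demo, pattern selection is delegated to the LLM via pattern prompts:

```javascript
// Sketch: Pearson correlation between two numeric columns, used to flag field
// pairs worth visualizing (e.g., elevation vs. stress percentage). Illustrative
// helper names; not part of the GenKD codebase.
function pearson(xs, ys) {
  const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(xs), my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Flag field pairs whose absolute correlation exceeds a threshold.
function interestingPairs(fields, threshold = 0.7) {
  const names = Object.keys(fields);
  const pairs = [];
  for (let i = 0; i < names.length; i++)
    for (let j = i + 1; j < names.length; j++) {
      const r = pearson(fields[names[i]], fields[names[j]]);
      if (Math.abs(r) >= threshold) pairs.push([names[i], names[j], r]);
    }
  return pairs;
}
```

Pairs flagged this way would then feed the pattern prompt, letting the LLM pick chart types suited to the data types and question scope.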
      <sec id="sec-4-1">
        <title>Question Prompt</title>
        <p>Based on the information retrieved from DBpedia about http://dbpedia.org/resource/Nebraska:
http://www.w3.org/2000/01/rdf-schema#label, http://www.w3.org/2000/01/rdf-schema#comment,
http://www.w3.org/2000/01/rdf-schema#seeAlso &lt;...remaining related entities to the resource...&gt;
Suggest relevant questions that can be answered from dbpedia about: http://dbpedia.org/resource/Nebraska.
Make sure to include the DBpedia property that answers each question directly after each question (without
line break). Make the questions relevant to the dataset:
http://linked.aub.edu.lb/data/Bees/Dataset/Bees_Colonies
and to the domain: http://linked.aub.edu.lb/data/Bees/Domain/Environment_Domain and are able to provide
added value, avoid too general questions that might not be related to the dataset or to the topic. The node is
connected to: http://linked.aub.edu.lb/data/Bees/observation/Bees_Stressors-Disesases-Nebraska-April-June-2021...
Please ensure that the questions are answerable using DBpedia’s data. Avoid questions that require information
beyond what DBpedia provides or questions that are too general. If no specific questions are possible, provide
related general questions that are still within the scope of DBpedia’s knowledge base.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Query Prompt</title>
        <p>Generate a SPARQL query that can be executed on DBpedia to answer this question: “What is the
elevation range in Nebraska, and how might this affect the stress factors on bee populations in the state?
http://dbpedia.org/property/elevationFt, http://dbpedia.org/property/elevationM” note that the relevant
DBpedia property(s) is with the question, avoid querying other things, if it cannot answer the question, let me know.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Pattern Prompt</title>
        <p>You are an expert JavaScript developer and data visualization designer. Given these CSV headers:
Observation URI, http://purl.org/linked-data/cube#dataset, http://purl.org/linked-data/sdmx/2009/dimension#refArea,
http://purl.org/linked-data/sdmx/2009/dimension#refPeriod,
http://purl.org/linkeddata/sdmx/2009/measure#stress_pct, http://purl.org/linked-data/sdmx/2009/measure#colony_added,
http://purl.org/linked-data/sdmx/2009/measure#colony_lost ...
generate a complete JavaScript function using Chart.js that takes parsed CSV data as input and renders at least
6 diverse and meaningful visualizations that can answer this question: “What is the elevation range in Nebraska,
and how might this affect the stress factors...”
- Choose appropriate chart types based on the data.
- Group and sort data when relevant especially on the x-axis (e.g., by refArea, refPeriod, or year).
- Include meaningful axis labels, chart titles, and legends.
- Output only the function wrapped in double quotes (no markdown formatting or explanations).</p>
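For illustration, a hand-written counterpart of the kind of function this prompt requests might look as follows, assuming Chart.js and illustrative column names (the LLM-generated version in the demo may differ):

```javascript
// Sketch: turn parsed CSV rows into a Chart.js scatter-plot configuration
// relating one numeric column to another (e.g., state elevation vs. stress
// percentage). Column names are illustrative assumptions.
function buildScatterConfig(rows, xField, yField, title) {
  return {
    type: "scatter",
    data: {
      datasets: [{
        label: title,
        data: rows.map((r) => ({ x: Number(r[xField]), y: Number(r[yField]) })),
      }],
    },
    options: {
      plugins: { title: { display: true, text: title } },
      scales: {
        x: { title: { display: true, text: xField } },
        y: { title: { display: true, text: yField } },
      },
    },
  };
}
// In the browser, the config would be rendered with: new Chart(ctx, config);
```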
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We presented in this paper our vision of a Generative Knowledge Discovery framework. The
proposed framework aims to facilitate the knowledge discovery process through a human-AI collaborative
process that dynamically helps to build context and prompts to assist in the generation of questions,
queries, and visualizations of information that spans local and external knowledge sources. Our initial
demonstration on the bees’ context illustrates the feasibility of the approach.</p>
      <p>With respect to future research directions, this work can be extended from a theoretical, practical,
and technical perspective. At a theoretical level, it is worth exploring the implications of the
knowledge discovery process in the context of human-AI collaboration from a socio-technical angle. At a
practical level, it may be interesting to explore the impact of this approach in organizational contexts,
where knowledge can be locally distributed at departmental and personal levels. At a technical level, it
can be valuable to study opportunities for enhancing the framework to automate the identification of
relevant external knowledge sources based on the nature of local data. It is also valuable to investigate
means for reducing the reliance on user input during the prompt-building stage, and study optimal
strategies for user input to implement better and more sophisticated context-building techniques that
may lead to more semantically relevant and diverse questions. Furthermore, additional user studies are
needed to evaluate the feasibility and usability of the tool in other contexts, including, for example, more
generic local knowledge graphs in other domains and schemas beyond RDF Data Cubes and DBpedia.
This will inform further design requirements, future research paths, and more potential serendipitous
knowledge discovery.</p>
      <p>This work offers the following contributions. First, it advances the notion of human-AI collaboration
in the context of knowledge discovery on the web of data. Second, it offers new insights into merging
the capabilities of knowledge graphs and LLMs to generate insights from data. Third, it provides
practical insights through a demo that can be used and extended by researchers and practitioners
interested in advancing the field of knowledge discovery at a web-scale.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Suliman S. Olayan School of Business (OSB) Research
Initiative fund and the American University of Beirut Research Board (URB).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT 4.1 and Grammarly (free version) in order to
fix coding, grammar, and spelling issues. After using this tool/service, the authors debugged, reviewed
and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>U. M.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Piatetsky-Shapiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <article-title>Knowledge Discovery and Data Mining: Towards a Unifying Framework</article-title>
          .,
          <source>in: KDD</source>
          , volume
          <volume>96</volume>
          ,
          <year>1996</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Yau</surname>
          </string-name>
          ,
          <article-title>Visualize this: the FlowingData guide to design, visualization</article-title>
          , and statistics, John Wiley &amp; Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Sanders</surname>
          </string-name>
          ,
          <article-title>Data Aggregation and Exploratory Visualization</article-title>
          , in: A. R. Sanders (Ed.),
          <article-title>Visualizing History's Fragments: A Computational Approach</article-title>
          to Humanistic Research, Springer International Publishing, Cham,
          <year>2024</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>185</lpage>
          . doi:10.1007/978-3-031-46976-3_5.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato, G. d. Melo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmelzeisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          , Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>