<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Introducing GPTKB to the Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yujia Hu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tuan-Phong Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrestha Ghosh</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moritz Müller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Razniewski</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for AI, VNU University of Engineering and Technology</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ScaDS.AI Dresden/Leipzig &amp; TU Dresden</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Tübingen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge bases (KBs) are a cornerstone of the Semantic Web, yet they still struggle with scale and scope, and their construction and curation still involve a lot of manual effort. Large language models (LLMs) have recently emerged as powerful tools for a range of tasks, yet their potential for automated KB construction is still poorly understood. In this demonstrator, we showcase GPTKB, a methodology and KB entirely built from GPT-4.1. GPTKB is constructed by massive-recursive LLM knowledge materialization [1], using over 9M API calls for $14,000 to construct a 100M-triple knowledge base with over 6M entities. Our demonstration focuses on two use cases: (i) Link-based KG exploration and (ii) SPARQL-based analysis and comparison to Wikidata. The GPTKB demonstrator is accessible at https://gptkb.org.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our demonstration focuses on two use cases:
1. Link-based interactive knowledge graph (KG) exploration;
2. SPARQL-based analysis and comparison to Wikidata.</p>
    </sec>
    <sec id="sec-2">
      <title>2. GPTKB Methodology</title>
      <p>
        The GPTKB methodology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] combines a recursive knowledge elicitation process with a post-hoc
knowledge consolidation phase.
      </p>
      <p>Knowledge elicitation. Starting from a seed subject, the LLM is prompted to return knowledge
about it in the form of triples. New named entities in these triple objects are identified via LLM-based
named-entity recognition (NER) and are enqueued for further elicitation in a recursive BFS-based graph
exploration process. Constrained decoding is used to make sure that outputs stay within the triple
format.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Statistics of GPTKB v1.5.</p>
        </caption>
        <table>
          <tbody>
            <tr><td>Entities</td><td>6.1M</td></tr>
            <tr><td>Triples</td><td>100M (120M with meta-relations)</td></tr>
            <tr><td>Relations</td><td>936k (381k after canonicalization)</td></tr>
            <tr><td>Classes</td><td>220k (32k after canonicalization)</td></tr>
            <tr><td>Triple objects</td><td>59M entities, 41M literals</td></tr>
            <tr><td>Avg. triples/entity</td><td>16.3</td></tr>
            <tr><td>Avg. label length</td><td>19.8 characters</td></tr>
            <tr><td>Subject-precision</td><td>85.3% Verifiable, 3.4% Plausible, 11.3% Unverifiable</td></tr>
            <tr><td>Subjects in Wikidata</td><td>43%</td></tr>
            <tr><td>Triple-precision</td><td>75.5% True, 5.0% Plausible, 19.5% False</td></tr>
            <tr><td>Cost of API-calls</td><td>$14,136</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab2">
        <table>
          <thead>
            <tr><th>KB</th><th>#entities</th><th>#assertions</th></tr>
          </thead>
          <tbody>
            <tr><th colspan="3">Wikimedia-related</th></tr>
            <tr><td>Wikidata</td><td>113M</td><td>1.62B</td></tr>
            <tr><td>Wikidata5m</td><td>5M</td><td>20M</td></tr>
            <tr><td>Yago 4.5</td><td>50M</td><td>140M</td></tr>
            <tr><td>DBpedia</td><td>3.8M</td><td>75M</td></tr>
            <tr><th colspan="3">Text-extracted</th></tr>
            <tr><td>NELL</td><td>?</td><td>12M</td></tr>
            <tr><td>ReVerb</td><td>?</td><td>15M</td></tr>
            <tr><th colspan="3">Generative</th></tr>
            <tr><td>GPTKB v1.5</td><td>6.1M</td><td>100M</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Knowledge consolidation. To address the redundancy and variance introduced during knowledge
elicitation, post-hoc knowledge consolidation is performed. In particular, we apply a greedy clustering
algorithm to iteratively merge relations and classes into more frequent ones, given a sufficiently high
label embedding similarity.</p>
      <p>
        Further methodological details can be found in Hu et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
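      <p>As a rough illustration of the greedy consolidation step described in this section, the following Python sketch merges less frequent labels into more frequent ones whenever their embedding similarity reaches a threshold. The function name greedy_merge, the cosine measure, the threshold of 0.9, and the toy counts and two-dimensional embeddings are all assumptions made for illustration; the sketch performs a single pass and is not the GPTKB implementation, which is described in Hu et al. [1].</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_merge(freq, emb, threshold=0.9):
    """Map each label to a canonical label.

    Labels are visited from most to least frequent. A label is merged into
    the most similar already-accepted (hence more frequent) label if the
    similarity reaches the threshold; otherwise it becomes canonical itself.
    """
    canonical = []  # accepted labels, most frequent first
    mapping = {}
    for label in sorted(freq, key=lambda l: (-freq[l], l)):
        best, best_sim = None, threshold
        for target in canonical:
            sim = cosine(emb[label], emb[target])
            if sim >= best_sim:
                best, best_sim = target, sim
        if best is None:
            canonical.append(label)
            mapping[label] = label
        else:
            mapping[label] = best
    return mapping


# Toy data: hypothetical relation labels with made-up counts and embeddings.
freq = {"birthPlace": 900, "placeOfBirth": 40, "bornIn": 25, "spouse": 300}
emb = {
    "birthPlace": [1.0, 0.0],
    "placeOfBirth": [0.98, 0.05],
    "bornIn": [0.95, 0.2],
    "spouse": [0.0, 1.0],
}
mapping = greedy_merge(freq, emb)
print(mapping)
```
]]></preformat>
      <p>In this toy example, placeOfBirth and bornIn are folded into the more frequent birthPlace, while spouse remains a separate relation. In practice the threshold trades off redundancy against the risk of merging distinct relations.</p>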
    </sec>
    <sec id="sec-3">
      <title>3. GPTKB Construction</title>
      <p>
        Two versions of GPTKB are available: GPTKB v1.1, based on GPT-4o-mini [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and GPTKB v1.5, based
on GPT-4.1 [8]. While GPTKB v1.1 was the first proof of the viability of our methodology, its output
quality was below expectations. In particular, more than 60% of the triples were estimated
to be hallucinations, and output skew caused significant problems, with some entities having
over 100k (virtually entirely hallucinated) triples. For v1.5, we therefore decided to use a significantly
stronger LLM. We opted for GPT-4.1 because it is among the strongest frontier models available as of
summer 2025 and was released less than 3 months ago.
      </p>
      <p>Following the paradigm described in Section 2, we extracted knowledge from GPT-4.1 starting with
the seed entity Vannevar Bush. The whole process cost $14,136 for OpenAI API calls and took 18 days.
The final KB contains 100 million triples derived from 6.1 million entities in total, organized into 381k
relations and 32k classes. We provide statistics of GPTKB v1.5 in Table 1.</p>
      <p>Since crawl parallelization distorts BFS search order, we post-hoc recomputed the shortest paths of
each node to the root, and stored this information in two meta-relations, bfsLayer and bfsParent, to
enable structural insights. To facilitate data interchange, we also converted GPTKB into RDF format,
and serialized it into Turtle syntax.</p>
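      <p>The recomputation of shortest paths to the root can be pictured with a plain breadth-first search. The sketch below is only an illustration: the function name bfs_layers, the adjacency-dictionary representation, the choice of layer 0 for the root, and the toy graph are assumptions, and ties between equally short paths are broken by discovery order, which may differ from the procedure actually used for GPTKB.</p>
      <preformat><![CDATA[
```python
from collections import deque


def bfs_layers(edges, root):
    """Return (layer, parent) dicts for all nodes reachable from root.

    edges maps each subject to the entity objects of its triples.
    The layer of a node is the length of its shortest path from the root.
    """
    layer = {root: 0}  # assumption: the root sits in layer 0
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in layer:  # first visit is along a shortest path
                layer[child] = layer[node] + 1
                parent[child] = node
                queue.append(child)
    return layer, parent


# Hypothetical toy graph: D is reachable via B (2 hops) and via C -> E (3 hops).
edges = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "E": ["D"],
}
layer, parent = bfs_layers(edges, "A")
print(layer["D"], parent["D"])
```
]]></preformat>
      <p>The resulting layer and parent values play the role of the bfsLayer and bfsParent meta-relations: even if a parallel crawl first reached D through the longer path via E, the recomputation assigns it the shorter path via B.</p>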
      <p>
        We performed two quality evaluations: an automated method based on web search, as in Hu
et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], using 1,000 random triples, and a manual assessment of 100 triples. Both evaluations agree
on the fraction of correct triples (75.5% and 75%), while the automated evaluation reported a slightly
higher share of incorrect ones (19.5% versus 14% in the manual assessment). In both cases, the truth of some triples
remains undecidable, mostly because parts of them are semantically incomprehensible.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. GPTKB Web Provision</title>
      <p>We provide multiple modes of access to GPTKB.</p>
      <p>[Figure 1: Screenshots of the GPTKB entity pages for Nara City (https://gptkb.org/entity/Nara_City, 31 statements; bfsParent Honshu, bfsLayer 5) and Todai-ji Temple (https://gptkb.org/entity/Todai-ji_Temple, 51 statements; bfsParent Nara City, bfsLayer 6), each listing predicate-object statements such as instanceOf, country, and famousFor.]</p>
      <p>Firstly, GPTKB is hosted on the https://gptkb.org web server, which provides a user interface to search
entities via keyword queries and to perform link-based exploration to discover new connections and
entities. Section 5.1 and Figure 1 provide a demonstration experience of this link-based exploration.
The web interface is implemented with the Python Django framework and hosted on an Nginx web
server. The KB is stored in an OpenLink Virtuoso server.</p>
      <p>Secondly, we provide a SPARQL endpoint at https://gptkb.org/query/ that supports structured queries
within a timeout window of 100 seconds.</p>
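      <p>For programmatic access, a client might issue queries roughly as sketched below. This is a minimal sketch under assumptions: it presumes that the endpoint accepts GET requests in the style of the SPARQL 1.1 Protocol (a query URL parameter and the JSON results media type), which the text above does not confirm; the helper names build_request and run_query are hypothetical. Only the request is constructed here; run_query would contact the server when called.</p>
      <preformat><![CDATA[
```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://gptkb.org/query/"  # endpoint URL given in the text

QUERY = """
PREFIX gptkbp: <https://gptkb.org/prop/>
SELECT ?o (COUNT(*) AS ?ofreq)
WHERE { ?s gptkbp:instanceOf ?o . }
GROUP BY ?o
ORDER BY DESC(?ofreq)
LIMIT 10
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request in the style of the SPARQL 1.1 Protocol.

    Assumption: the endpoint reads the query from a 'query' URL parameter
    and honours the JSON results media type.
    """
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, timeout=100):
    """Send the query; the client timeout mirrors the 100-second server limit."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]


# Construct (but do not send) the request to inspect what would be issued.
request = build_request(QUERY)
print(request.full_url[:60])
```
]]></preformat>
      <p>Setting the client-side timeout to the same 100 seconds avoids waiting longer than the server would allow a query to run.</p>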
      <p>Thirdly, we provide the RDF dump under the CC BY 4.0 license on the HuggingFace datasets library
at https://huggingface.co/Knowledge-aware-AI.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Demonstration Experience</title>
      <p>We divide the demonstration experience into two parts: (1) link-based KB exploration, and (2) structured
SPARQL analytics and Wikidata comparison.</p>
      <sec id="sec-5-1">
        <title>5.1. Link-based KB Exploration</title>
        <p>Data about specific entities can be accessed via multiple routes:</p>
        <list list-type="order">
          <list-item>
            <p>The start page features a selection of direct links to entities such as Vannevar Bush and San Francisco.</p>
          </list-item>
          <list-item>
            <p>The web portal features a search field in the top-right corner, which can be used for string-based search.</p>
          </list-item>
          <list-item>
            <p>If an unambiguous entity name is known, one can directly access the entity’s KB entry via the URL https://gptkb.org/entity/&lt;NAME&gt;.</p>
          </list-item>
        </list>
        <p>Figure 1 shows how to initiate an entity search and continue with a link-based exploration. We start by
typing this year’s ISWC venue, Nara, which returns 314 results. Clicking on the Nara City result takes
us to its entity page. The entity Nara City contains 31 statements, and a user can click on any of the
entity objects, for instance Todai-ji Temple, to further explore connected entities.</p>
        <p>Since each entity in GPTKB was identified as an object of a parent entity during the BFS algorithm,
we provide this information via the bfsParent relation. Additionally, the bfsLayer relation tells us at
which layer knowledge elicitation was performed for the entity. In Figure 1, we learn that Nara City is
a child entity of Honshu, and that its triples were elicited in layer 5. Clicking on the parent entity lets a
user move up the layers of GPTKB.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. SPARQL Querying for Analytics and Wikidata Comparison</title>
        <p>A core feature of structured query languages is that they allow statistical analysis at scale. For this
purpose, the GPTKB content is stored in a Virtuoso triple store, whose content is exposed via a SPARQL
query interface available at https://gptkb.org/query/. In the following, we show two analyses that this enables.</p>
        <p>Most frequent classes. Just what kind of entities does GPT know about? An overview is provided by
the following query:</p>
        <preformat>PREFIX gptkb: &lt;https://gptkb.org/entity/&gt;
PREFIX gptkbp: &lt;https://gptkb.org/prop/&gt;
SELECT ?o (COUNT(*) AS ?ofreq)
WHERE {
  ?s gptkbp:instanceOf ?o.
}
GROUP BY ?o
ORDER BY DESC(?ofreq)
LIMIT 100</preformat>
        <table-wrap>
          <table>
            <thead>
              <tr><th>o</th><th>ofreq</th></tr>
            </thead>
            <tbody>
              <tr><td>gptkb:person</td><td>1,077,803</td></tr>
              <tr><td>gptkb:human</td><td>138,646</td></tr>
              <tr><td>gptkb:film</td><td>120,497</td></tr>
              <tr><td>gptkb:company</td><td>118,993</td></tr>
              <tr><td>gptkb:book</td><td>111,414</td></tr>
              <tr><td>gptkb:song</td><td>103,538</td></tr>
              <tr><td>gptkb:fictional_character</td><td>90,499</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The results are fundamentally different from, e.g., Wikidata, with a much stronger focus on
digital artifacts (films, songs) and fiction.</p>
        <p>Nationality bias. Existing KBs as well as LLM training corpora are known to be Western- and
English-language dominated [9]. Can this bias also be observed at the factual level of GPTKB? A quick glimpse
can be obtained by counting the number of citizens per country known to GPTKB:</p>
        <preformat>PREFIX gptkb: &lt;https://gptkb.org/entity/&gt;
PREFIX gptkbp: &lt;https://gptkb.org/prop/&gt;
SELECT ?o (COUNT(*) AS ?ofreq)
WHERE {
  ?s gptkbp:nationality ?o.
}
# closing clauses assumed to mirror the class-frequency query
GROUP BY ?o
ORDER BY DESC(?ofreq)</preformat>
        <p>The highest-ranked values of ?o are, in order: gptkb:American, British, gptkb:French, gptkb:German,
gptkb:Indian, gptkb:Canadian, gptkb:Australian, and Japanese.</p>
        <p>Notably, English-language nationalities occupy the top places, with a much stronger bias than in
existing resources like Wikidata.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <p>Factual knowledge of LLMs is intensively researched, mostly via sample-based benchmarks or probes,
such as the seminal LAMA probe by Petroni et al. [10]. However, these works typically draw samples
from existing web resources, thereby introducing a confirmation bias that prevents the discovery of
unexpected knowledge (or errors). For example, LAMA drew 50k triples from Wikidata, Wikipedia, and
ConceptNet.</p>
      <p>
        Few works have harvested LLM knowledge at scale. Nguyen and Razniewski [11] harvested one
million commonsense assertions from BART and GPT-2, based on a pre-defined subject list. Cohen
et al. [12] proposed to crawl the factual knowledge of LLMs by recursively prompting them. Parović et al. [13]
proposed domain-specific KB construction from LLMs, but did this only at the scale of a few hundred
thousand entities. In this demo, we build upon the GPTKB methodology by Hu et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is a
recursive methodology with judicious optimizations towards scalability, prompt-efficiency, and scoping,
via parallelization, prompt design, and dedicated NER.
      </p>
      <p>Several large knowledge bases are deployed online [14], most notably Wikidata [2], Yago [3] and
DBpedia [4]. Our web browsing and query interfaces are inspired by those.</p>
      <p>In terms of LLM-generated datasets, the closest to ours might be Cosmopedia [15], an LLM-generated
25 billion token text corpus. However, Cosmopedia is intentionally designed to synthesize
realistic-looking but invented texts, and has no goal of collecting factual LLM knowledge.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>
        We have presented the https://gptkb.org web demonstrator, a knowledge base browser and query
interface to GPTKB, a massive 100-million-triple KB built from GPT-4.1 using the GPTKB methodology
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our demonstrator enables experimental insights into the potential of LLMs for complementing
existing KB construction paradigms.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <article-title>Enabling LLM knowledge analysis via extensive materialization</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2025</year>
          . URL: https://aclanthology.org/2025.acl-long.789.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. Vrandecic, M. Krötzsch, Wikidata: a free collaborative knowledge base, Communications of the ACM 57 (2014). doi:10.1145/2629489.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] F. M. Suchanek, G. Kasneci, G. Weikum, Yago: a core of semantic knowledge, in: WWW, 2007. doi:10.1145/1242572.1242667.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. G. Ives, DBpedia: A nucleus for a web of open data, in: ISWC, 2007. doi:10.1007/978-3-540-76298-0_52.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Fader, S. Soderland, O. Etzioni, Identifying relations for open information extraction, in: EMNLP, 2011. URL: https://aclanthology.org/D11-1142.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al., Never-ending learning, Communications of the ACM 61 (2018). doi:10.1145/3191513.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Z. Pan, S. Razniewski, J.-C. Kalo, S. Singhania, J. Chen, S. Dietze, H. Jabeen, J. Omeliyanenko, W. Zhang, M. Lissandrini, et al., Large language models and knowledge graphs: Opportunities and challenges, Transactions on Graph Data and Knowledge (2023). doi:10.4230/TGDK.1.1.2.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Y. Hu, T.-P. Nguyen, S. Ghosh, M. Müller, S. Razniewski, GPTKB v1.5: A massive knowledge base for exploring factual LLM knowledge, arXiv (2025). doi:10.48550/arXiv.2507.05740.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, É. Grave, CCNet: Extracting high quality monolingual datasets from web crawl data, in: LREC, 2020. doi:10.48550/arXiv.1911.00359.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, A. Miller, Language models as knowledge bases?, in: EMNLP, 2019. doi:10.18653/v1/D19-1250.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T.-P. Nguyen, S. Razniewski, Materialized knowledge bases from commonsense transformers, in: Workshop on Commonsense Representation and Reasoning (CSRR), 2022. doi:10.18653/v1/2022.csrr-1.5.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] R. Cohen, M. Geva, J. Berant, A. Globerson, Crawling the internal knowledge-base of language models, in: EACL Findings, 2023. doi:10.18653/v1/2023.findings-eacl.139.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Parović, Z. Li, J. Du, Generating domain-specific knowledge graphs from large language models, in: Findings of the Association for Computational Linguistics: ACL 2025, 2025.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] G. Weikum, X. L. Dong, S. Razniewski, F. Suchanek, Machine knowledge: Creation and curation of comprehensive knowledge bases, Foundations and Trends in Databases 10 (2021). doi:10.1561/1900000064.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. B. Allal, A. Lozhkov, D. van Strien, Cosmopedia: A new frontier for open-source language models, 2024. URL: https://huggingface.co/blog/cosmopedia, Hugging Face blog.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>