<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5281/zenodo.8251944</article-id>
      <title-group>
        <article-title>Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lars-Peter Meyer</string-name>
          <email>lpmeyer@infai.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Frey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kurt Junghanns</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Brei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kirill Bulert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabine Gründer-Fahrer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Applied Informatics</institution>
          ,
<addr-line>Goerdelerring 9, 04109 Leipzig, Germany, https://infai.org</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University, Institute for Informatics</institution>
          ,
          <addr-line>Germany, https://</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
<p>As the field of Large Language Models (LLMs) evolves at an accelerated pace, the critical need to assess and monitor their performance emerges. We introduce a benchmarking framework focused on knowledge graph engineering (KGE), accompanied by three challenges addressing syntax and error correction, fact extraction and dataset generation. We show that, while being useful tools, LLMs are not yet fit to assist in knowledge graph generation with zero-shot prompting. Consequently, our LLM-KG-Bench framework provides automatic evaluation and storage of LLM responses as well as statistical data and visualization tools to support tracking of prompt engineering and model performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) hold the potential to change the way we interact with data
and technology. Especially models like GPT-3 and GPT-4 have shown proficient capabilities
in solving textual assignments [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and spawned a wave of subsequent models and the field of
prompt engineering.
      </p>
      <p>
        But the fast evolution and rapidly growing landscape of different LLMs make it challenging
to keep track of their individual capabilities and to choose the best model and best prompt
for the job. There are efforts on generic LLM benchmarks (e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). However, despite these
advancements, the application and (automated) assessment of LLMs in the context of knowledge
graph engineering (KGE) and the Semantic Web is still a highly under-explored area. In response
to this gap, this paper proposes LLM-KG-Bench (https://github.com/AKSW/LLM-KG-Bench), a first LLM KGE benchmarking framework
that follows our vision of an automated and continuous evaluation platform for different tasks
in KGE scenarios. A test of the framework is presented by comparing three LLMs on three
exemplary KGE tasks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The utilization of an LLM in the Semantic Web domain benefits from its capability to handle
RDF-related syntaxes such as JSON-LD, Turtle and SPARQL. A comprehensive amalgamation of LLMs
and knowledge graphs (KGs) is described in the Dagstuhl Seminar report [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The Knowledge Base
Construction from Pre-trained Language Models (LM-KBC) Challenge (https://lm-kbc.github.io/challenge2023/) emphasises the relevance
of this combination.
      </p>
      <p>
        The basis of this study is [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where ChatGPT’s use in knowledge graph engineering is
assessed. Impressive capabilities were revealed, suggesting two conclusions: Firstly, such
studies offer insight into LLMs’ potential and limitations, aiding knowledge graph engineers.
Secondly, comparing different LLMs can lead to superior results by addressing inherent model
issues.
      </p>
      <p>
        Recognizing the potential of Large Language Models (LLMs) in knowledge graph engineering,
it’s vital to evaluate their performance across diverse tasks. Google’s Beyond the Imitation Game
(BIG-bench) Benchmark3[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the Large Model Systems (LMSys) leaderboard4 are community
eforts that assess the performance of various models with regard to a plethora of tasks. The
Language Model Evaluation Harness5 ofers further testing of generative language models on
various evaluation tasks. However all of them are not perfect for assessing an LLM’s use for
KGE. They are missing KGE specific scoring and do not evaluate scores relative to problem size.
The size seems to be relevant for KGE as KGs get quite big in relation to current LLMs context
sizes[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Acknowledging the existing appraoches limitations we introduce the LLM-KG-Bench
framework.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The LLM-KG-Bench Framework</title>
      <p>
        Our current (and ongoing) work presented in this paper comprises the design and
implementation of the modular LLM-KG-Bench framework for benchmarking LLMs in the context
of knowledge graph engineering. The main focus is on automated evaluation procedures to
allow for many repeated test executions. The framework supports configurable task sizing, as
prior work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] suggests the relevance of the LLM’s context size for KGE tasks.
      </p>
      <p>As we aim for as much compatibility as possible, especially in the direction of BIG-bench,
the LLM-KG-Bench framework is organized around benchmark tasks and LLM model connectors,
glued together by some code for execution organisation and result persistence. LLM model
connectors encapsulate the connection to a specific LLM and offer the function generate_text.
With this function a benchmark task can send a prompt to the LLM and get its answer. Benchmark
tasks handle the LLM evaluation for a single task. In the function evaluate_model they usually
build a prompt or task description for the LLM, hand this task over to a given LLM via an LLM
model connector and evaluate the given answer. If necessary, the benchmark task can send
additional prompts to the LLM during the evaluation process. The evaluation results in score values
for the task-specific score types and additional information.</p>
      <p>[Figure 1: Architecture of the LLM-KG-Bench framework: a benchmark runner iterates over collections of AI model connectors, benchmark tasks and task sizes; each benchmark task generates queries and exchanges text (including add-on queries) with an AI model connector wrapping the AI API; an answer evaluator produces stats that are persisted to storage and plotted.]</p>
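      <p>As an illustration of this plugin structure, the following minimal Python sketch shows a model connector and a benchmark task. Only the names generate_text and evaluate_model come from the framework description above; all other identifiers, the toy connector and the score dictionary are illustrative assumptions, not the framework’s actual code.</p>
      <preformat>
# Minimal sketch of the connector/task split described in Section 3.
# Assumption: the real LLM-KG-Bench classes differ in layout and signatures.
from abc import ABC, abstractmethod


class ModelConnector(ABC):
    """Encapsulates the connection to one specific LLM."""

    @abstractmethod
    def generate_text(self, prompt: str) -> str:
        """Send a prompt to the LLM and return its answer."""


class EchoConnector(ModelConnector):
    """Toy connector used here instead of a real LLM API."""

    def generate_text(self, prompt: str) -> str:
        return prompt  # a real connector would call an LLM API here


class BenchmarkTask:
    """Handles the LLM evaluation for a single task."""

    def __init__(self, size: int):
        self.size = size  # configurable task size (see Section 3)

    def evaluate_model(self, connector: ModelConnector) -> dict:
        prompt = f"Generate a synthetic KG with {self.size} foaf:Person objects."
        answer = connector.generate_text(prompt)
        # Task-specific scoring would happen here; we return dummy scores.
        return {"answer_length": len(answer), "size": self.size}


if __name__ == "__main__":
    task = BenchmarkTask(size=5)
    print(task.evaluate_model(EchoConnector()))
      </preformat>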
      <p>Due to LLM-KG-Bench’s modularization, as shown in Figure 1, additional benchmark tasks
and LLM model connectors can be added by just adding corresponding Python class definitions.
The framework supports basic result visualization with the help of seaborn (https://seaborn.pydata.org/). The plots shown
in Figure 2 are generated this way.</p>
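      <p>As a sketch of this kind of result visualization, assuming hypothetical score data and standard seaborn calls (not the framework’s actual plotting code):</p>
      <preformat>
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical benchmark results: several F1 scores per model.
results = pd.DataFrame({
    "model": ["GPT-4", "GPT-4", "GPT-3.5", "GPT-3.5", "Claude-1.3", "Claude-1.3"],
    "f1": [0.9, 0.2, 0.4, 0.5, 0.7, 0.6],
})

# Box plot of the score distribution per model, similar in spirit to Figure 2a.
sns.boxplot(data=results, x="model", y="f1")
plt.savefig("f1_scores.png")
      </preformat>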
    </sec>
    <sec id="sec-4">
      <title>4. Initial Evaluation of the Framework with first Tasks</title>
      <p>To test the LLM-KG-Bench framework we added a couple of benchmark tasks and evaluated
three of the currently highest-ranking LLMs on the LMSys Chatbot Arena leaderboard. The
test setup is detailed in Table 1.</p>
      <sec id="sec-4-1">
        <title>6Website: https://seaborn.pydata.org/</title>
        <p>Task a: Fixing of Errors in Turtle Files: Turtle is a common serialization format for
knowledge graphs. By asking the LLMs to fix errors in given manipulated Turtle files, we test
their knowledge of Turtle syntax as well as strict adherence to the given task and facts. One of the
scores calculated during evaluation is the F1 measure on parsable normalized triples, comparing
the LLM’s answer with a perfect answer. A plot of the F1 measure results for this task is shown in
Figure 2a. GPT-3.5 often claims that the file is already correct and returns no Turtle. This accounts
for the high frequency of zero-value F1 scores. The answers given by Claude-1.3 and GPT-4
score better.</p>
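        <p>A minimal sketch of such a triple-level F1 score, assuming rdflib parsing and plain set comparison (the framework’s actual normalization, e.g. of blank nodes, may differ):</p>
        <preformat>
# Sketch: F1 measure on parsable triples, comparing an LLM answer
# with a reference answer. Illustrative only.
from rdflib import Graph

def triple_f1(answer_ttl: str, reference_ttl: str) -> float:
    try:
        answer = set(Graph().parse(data=answer_ttl, format="turtle"))
    except Exception:
        return 0.0  # unparsable answers score zero
    reference = set(Graph().parse(data=reference_ttl, format="turtle"))
    true_pos = len(answer.intersection(reference))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(answer)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)
        </preformat>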
        <p>Task b: KG Creation from Factsheet Plaintext: To evaluate knowledge extraction and
modelling capabilities, we use a plaintext excerpt of a PDF factsheet. The text describes various
specifications of a 3D printer in a key-value style, including the usual formatting irregularities
associated with PDF extraction. We ask the model to generate a Turtle file that captures a subset
of the information. The prompt is engineered to be very specific with regard to which properties
or ontologies have to be used and how IRI identifiers and literals should be represented.
Subsequently, we can evaluate the quality of a single response using the F1 measure, counting
the set of parsable triples that match, mismatch or are missing compared to a manually curated
reference document. Fig. 2b shows that the GPT models outperform Claude in this task.
While GPT-4 has a better mean, due to one very good response, it often replied with
unparseable content, which did not happen for GPT-3.5, leading to a slightly better
median for GPT-3.5.</p>
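        <p>The match, mismatch and missing counts mentioned above can be sketched with the same set-based approach; the file names below are placeholders:</p>
        <preformat>
# Sketch: classify the triples of a response against a manually
# curated reference document (illustrative file names).
from rdflib import Graph

reference = set(Graph().parse("reference.ttl", format="turtle"))
response = set(Graph().parse("response.ttl", format="turtle"))

matching = response.intersection(reference)   # correct triples
mismatching = response.difference(reference)  # wrong or surplus triples
missing = reference.difference(response)      # expected triples not produced
print(len(matching), len(mismatching), len(missing))
        </preformat>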
        <p>Task c: Synthetic Dataset Generation: Creating example data is an important task, and
the help of LLMs would be highly appreciated. We created a basic test for this capability. We ask
the LLM to generate a synthetic dataset using the well-known FOAF terms foaf:Person and foaf:knows,
with a varying number of desired objects and links in the final KG. In the evaluation we used,
besides other scores, the persons_relative_error, indicating the difference between the actual
number of person objects generated and the number asked for. This value is normalized to be = 0
if they match, &gt; 0 if there are more persons than asked for and &lt; 0 if there are fewer persons,
with the special case of −1 meaning an empty graph. The results presented in Figure 2c show a
relation between the persons_relative_error and the problem size, in this case the number of person
objects to generate.</p>
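        <p>Under this definition, a plausible formula is (actual − requested) / requested, which yields 0 for an exact match and −1 for an empty graph. A minimal sketch assuming this formula and rdflib-based counting:</p>
        <preformat>
# Sketch of persons_relative_error as defined above. Illustrative only.
from rdflib import Graph, RDF, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

def persons_relative_error(answer_ttl: str, requested: int) -> float:
    graph = Graph().parse(data=answer_ttl, format="turtle")
    actual = len(set(graph.subjects(RDF.type, FOAF.Person)))
    return (actual - requested) / requested
        </preformat>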
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>We showed that there is a need for measuring the knowledge graph engineering capabilities
of the rapidly evolving LLMs. We proposed and described the novel LLM-KG-Bench framework
for this task. A first evaluation of three high-ranking LLMs on first benchmark tasks shows the
benefit of the automated evaluation with the new framework.</p>
      <p>The LLM-KG-Bench framework is prepared to enable dialogs between benchmark tasks and
LLMs. It will be interesting to evaluate LLMs’ capabilities to fix their answers given feedback,
e.g. error messages, in improved or additional tasks. We are looking forward to extending the framework to
more LLMs and more benchmark tasks with the help of a bigger community.</p>
      <p>[Figure 2: Evaluation results: (a) Turtle Fixing, (b) Fact Extraction, (c) Mean Error Dataset Generation.]</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by grants from the German Federal Ministry for Economic
Affairs and Climate Action (BMWK) to the CoyPu project (01MK21007A) and the KISS project
(01MK22001A) as well as from the German Federal Ministry of Education and Research (BMBF)
to the projects StahlDigital (13XP5116B) and KupferDigital (F13XP5119F).</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Srivastava, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, Transactions on Machine Learning Research (2023). arXiv:2206.04615.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] P. Groth, E. Simperl, M. van Erp, D. Vrandečić, Knowledge graphs and their role in the knowledge engineering of the 21st century (Dagstuhl Seminar 22372) (2023). doi:10.4230/DAGREP.12.9.60.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying large language models and knowledge graphs: A roadmap, 2023. arXiv:2306.08302.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] L.-P. Meyer, C. Stadler, J. Frey, N. Radtke, K. Junghanns, R. Meissner, G. Dziwis, K. Bulert, M. Martin, LLM-assisted knowledge graph engineering: Experiments with ChatGPT, 2023. arXiv:2307.06917, to appear in proceedings of the AI-Tomorrow track at Data Week 2023 in Leipzig.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>• LLM-KG-Bench</surname>
            <given-names>repository</given-names>
          </string-name>
          : https://github.com/AKSW/LLM-KG-Bench or doi:
          <volume>10</volume>
          .5281/zenodo.8251944 • experiment data: https://github.com/AKSW/LLM-KG-Bench-Results/tree/main/ 2023-SEMANTICS_
          <article-title>LLM-KGE-Bench-Results or</article-title>
          doi:
          <volume>10</volume>
          .5281/zenodo.8250646
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>