<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3233/SSW240028</article-id>
      <title-group>
        <article-title>Do Scaling Laws Apply to Knowledge Graph Engineering Tasks? The Impact of Model Size on Large Language Model Performance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Desiree Heim</string-name>
          <email>desiree.heim@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars-Peter Meyer</string-name>
          <email>lpmeyer@infai.org</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schröder</string-name>
          <email>markus.schroeder@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Frey</string-name>
          <email>frey@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Dengel</string-name>
          <email>andreas.dengel@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DFKI</institution>
          ,
          <addr-line>Kaiserslautern</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>InfAI</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RPTU</institution>
          ,
          <addr-line>Kaiserslautern</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>TU Chemnitz</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Uni Leipzig</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>3874</volume>
      <fpage>35</fpage>
      <lpage>53</lpage>
      <abstract>
        <p>When using Large Language Models (LLMs) to support Knowledge Graph Engineering (KGE), one of the first indications when searching for an appropriate model is its size. According to the scaling laws, larger models typically show higher capabilities. However, in practice, resource costs are also an important factor and thus it makes sense to consider the ratio between model performance and costs. The LLM-KG-Bench framework enables the comparison of LLMs in the context of KGE tasks and assesses their capabilities of understanding and producing KGs and KG queries. Based on a dataset created in an LLM-KG-Bench run covering 26 open state-of-the-art LLMs, we explore the model size scaling laws specific to KGE tasks. In our analyses, we assess how benchmark scores evolve between different model size categories. Additionally, we inspect how the general score development of single models and families of models correlates to their size. Our analyses revealed that, with a few exceptions, the model size scaling laws generally also apply to the selected KGE tasks. However, in some cases, plateau or ceiling effects occurred, i.e., the task performance did not change much between a model and the next larger model. In these cases, smaller models could be considered to achieve high cost-effectiveness. Regarding models of the same family, larger models sometimes performed worse than smaller ones. These effects occurred only locally. Hence, it is advisable to additionally test the next smallest and largest model of the same family.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge Graphs (KGs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] store facts about real-world domains in a structured way that facilitates
reasoning to derive new information based on rules and existing knowledge. However, their creation
and maintenance, commonly known as Knowledge Graph Engineering (KGE), usually requires manual,
labour-intensive efforts including activities such as drafting an appropriate ontology, transforming
data sources to fit the required format, and solving data integrity problems. With the emergence
of Large Language Models (LLMs), various approaches have been developed that use LLMs to support
KGE tasks [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7 ref8">2, 3, 4, 5, 6, 7, 8</xref>
        ]. Once LLMs are employed, the question arises of how well they can cope with
KGs and KGE challenges. To address this, the LLM-KG-Bench benchmark framework [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] assesses
the performance of LLMs on tasks requiring the comprehension of KGs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], their schemata and
query languages [
        <xref ref-type="bibr" rid="ref10">10, 13</xref>
        ].
      </p>
      <p>The results on this benchmark not only show single model performances but might also provide
valuable indications on KGE-specific scaling laws of LLMs. Such scaling laws typically examine LLM
task performances concerning their model sizes, training data size, or utilized computational training
resources [14]. Especially regarding model sizes, a common expectation is that the larger the LLM, the
better its performance on downstream tasks. Yet, this assumption can be wrong. Moreover, larger
models typically involve higher costs. In particular, the higher memory consumption of larger models
compared to smaller ones is a highly relevant cost factor, since more, or more powerful, hardware like
GPUs is required. At the same time, the parameter count also influences inference time on a fixed
hardware setting, as with more parameters more weights need to be calculated. Here,
Mixture-of-Experts (MoE) models form an exception since only the number of active parameters, i.e.,
the proportion of the total parameters selected at runtime, influences the inference
time. Hence, when choosing between an MoE LLM and another one with the same amount of total
parameters and a similar task performance, the MoE model has a higher cost-effectiveness. However, hosting
smaller and larger LLMs exclusively on the same hardware is, in practice, not necessarily
realistic, since smaller models would not fully exhaust the hardware's (e.g., the GPUs') capacity for the
same setting, e.g., the same targeted number of concurrent requests and maximum input lengths. Hence,
except for MoE models in comparison with similarly-sized LLMs, considering the memory requirement
of LLMs is preferable. In conclusion, considering cost-effectiveness, i.e., particularly the memory
requirements, the largest models may not be the best choice, and a good trade-off between model
performance and the model's resource demand has to be found.</p>
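      <p>To make the memory argument concrete, the following back-of-the-envelope sketch contrasts hosting cost with active parameters (a minimal sketch in Python, assuming 16-bit weights and ignoring KV cache and serving overhead; the printed numbers are illustrative estimates, not measurements from this work):</p>
      <preformat>
# Rough GPU memory needed to host an LLM's weights, assuming 2 bytes per
# parameter (16-bit). MoE models must host all parameters, although only
# the active subset is used per token.
def weight_memory_gb(params_billion, bytes_per_param=2):
    # params_billion * 1e9 params * bytes / 1e9 bytes-per-GB = params_billion * bytes_per_param
    return params_billion * bytes_per_param

for name, total_b, active_b in [
    ("Qwen2-57B-A14B (MoE)", 57, 14),
    ("Qwen2.5-72B (dense)", 72, 72),
    ("Phi-3.5-MoE (42B)", 42, 6.6),
]:
    print(f"{name}: ~{weight_memory_gb(total_b):.0f} GB to host, "
          f"{active_b}B parameters active per token")
      </preformat>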
      <p>
        In this paper, we therefore analyse LLM scaling laws on KGE tasks with respect to model sizes. The
data for our analysis are drawn from a recently published LLM-KG-Bench benchmark run [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It
covers 26 open state-of-the-art LLMs from five providers and in total eleven model families, i.e., series
of models released by a specific provider. Using the benchmark results and a combination of statistical
analysis and visualizations, we would like to give initial answers to the following questions: How do
benchmark scores …
1. …relate to different LLM model size groups?
2. …develop with respect to model sizes in general?
3. …develop with respect to model sizes within specific model families?
      </p>
      <p>By answering these questions, we aim to get more general insights about model size scaling laws on
KGE tasks that can also be transferred to models not included in the benchmark run.</p>
      <p>This paper is structured as follows: Section 2 introduces related works. In Section 3, we describe the
LLM-KG Bench run and the obtained dataset used in our analyses. In Section 4, we analyse the dataset
in particular with respect to the correlation between model size and performance on the benchmark.
Following the analysis, we summarize and discuss the gained insights in Section 5. Section 6 concludes
this paper and gives an outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In order to compare the vast amount of LLMs, there are several LLM leaderboards, which rank various
LLMs based on a selection of benchmarks or workloads. Among the well-known leaderboards are
Chatbot Arena [15], which evaluates models using human preference on interactive tasks, and
OpenLLMLeaderboard [16], covering numerous standard tasks like MMLU, BBH, and GPQA across more than
2,000 models. Similarly, HELM [17] provides comprehensive evaluations including domain-specific
benchmarks such as LegalBench and MedQA.</p>
      <p>Regarding code generation, which shares a few conceptual similarities with KGE benchmarking,
several dedicated benchmarks and leaderboards exist too. Prominent code benchmarks include
HumanEval and MultiPL-E, evaluated in the Big Code Models Leaderboard
(https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), as well as EvalPlus [18],
employing both HumanEval and the Mostly Basic Python Programming (MBPP) benchmark. The
CanAiCode Leaderboard (https://huggingface.co/spaces/mike-ravkine/can-ai-code-results)
specifically targets text-to-code tasks for smaller LLMs. These code-focused
benchmarks emphasize structured output, syntactical correctness, and execution correctness, mirroring
the evaluation criteria in KGE tasks, thereby offering insights relevant to benchmarking structured
outputs from LLMs.</p>
      <p>However, the mentioned attempts do not cover the evaluation of tasks specifically relevant to
Knowledge Graph Engineering (KGE) [19], such as RDF syntax correctness, SPARQL query generation,
or graph comprehension.</p>
      <p>
        Efforts addressing KG-related evaluations frequently target specific problems like Text-to-RDF
conversion [20, 21], Knowledge Graph Question Answering (KGQA) [22], and SPARQL query generation [
        <xref ref-type="bibr" rid="ref7">7, 23</xref>
        ].
These evaluations typically focus only on isolated tasks and often involve manual assessments, which
limits scalability and adaptability to newer LLMs and task variations. An exception closely related to
our interest in structured output is StructuredRAG [24], evaluating JSON-based structured responses
from LLMs.
      </p>
      <p>
        To address gaps in existing benchmarking efforts, especially regarding RDF and SPARQL tasks,
LLM-KG-Bench [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] provides a specialized automated benchmarking environment for evaluating
semantic correctness and syntax handling in RDF and SPARQL tasks. In contrast to general benchmarks
like HELM or BigBench [25], LLM-KG-Bench emphasizes semantic and syntactic correctness rather than
multiple-choice accuracy, significantly reducing technological complexity for creating and evaluating
KG-related tasks [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12, 13</xref>
        ].
      </p>
      <p>Prior research already investigated the correlation between LLM parameter size and task
performance [14, 26]. Larger LLMs typically outperform smaller models for the same tasks but also exhibit
emergent capabilities (not present in smaller models) such as complex reasoning or nuanced
instruction-following abilities [26, 27]. However, this relationship is not universally linear; task type, complexity,
and input and output structure can significantly influence whether larger models yield proportionally
better performance. Scenarios and tasks w.r.t. Knowledge Graph Engineering, which typically require
dealing with RDF serialization formats and paradigms, remain underexplored. This study addresses
this gap by explicitly examining how model size impacts performance across diverse RDF and SPARQL
tasks within the context of KG engineering.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        This work analyses data generated by the LLM-KG-Bench framework [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. The LLM-KG-Bench
framework offers the infrastructure to define KG engineering-related automated tasks that can be
repeatedly executed on many LLMs to evaluate their performance. Since the evaluation is also automated,
the same experiments can be repeated, which increases reproducibility and gives a broader sample size
for statistical analysis, taking the probabilistic nature of LLM-generated answers into account.
      </p>
      <p>The dataset used in this work covers the evaluation of over 30 open and proprietary LLMs on 26 RDF- and
SPARQL-related task variations.</p>
      <p>The dataset contains LLMs from three open LLM providers: Qwen, Meta-LLama, and Microsoft-Phi.
They were selected because they provide official instruction-finetuned Large Language Models that
were the highest-ranked on the Open LLM Leaderboard [16] in December 2024 with respect to their
average score across all benchmarks included in the leaderboard (Upstage, providing the solar LLM
family, was excluded here since its models only support a maximum context length of 4096 tokens,
which was not sufficient for all tasks). We restricted our selection to
models of up to 80B parameters due to restrictions on the hardware resources available to us. In addition to
that, the dataset includes three LLMs fine-tuned or optimized for code understanding and generation,
which also requires handling structured data similar to KG-related tasks: Qwen2.5-Coder-Instruct-32B,
Infly-OpenCoder-8B-Instruct, and deepseek-coder-33b-instruct. For the selection of these models, we
consulted the Mostly Basic Python Programming (MBPP) benchmark score reported by the EvalPlus
Leaderboard [18] and chose the top-ranked instruction-finetuned models of at most 80B
parameters that are reported to be explicitly optimized or fine-tuned for code.</p>
      <p>
        The model sizes range from 0.5 billion parameters up to 72 billion parameters. Two included LLMs
are Mixture-of-Experts models: Qwen2-Instruct with 57 billion parameters (14 billion active) and
Phi-3.5-instruct with 42 billion parameters (6.6 billion active). With mixture-of-expert models only a
subset of parameters is active during inference, resulting in a lower effective parameter count compared
to the total model size. An overview of the evaluated models can be found in Table 1. More details on
the models and their selection can be found in the dedicated paper [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>In addition to the open LLMs, several proprietary LLMs from the OpenAI GPT, Google Gemini and
Anthropic Claude families that achieved consistently high scores on the Chatbot Arena Leaderboard [15]
were included in the benchmark run, namely ChatGPT 3.5 turbo, ChatGPT 4o, ChatGPT 4o-mini,
ChatGPT o1, ChatGPT o1-mini, Gemini 2.0 Flash, Gemini 1.5 Pro, Gemini 1.5 Flash, Claude 3.5 Sonnet
and Claude 3.5 Haiku. However, since model sizes for proprietary LLMs are not documented, we
selected only the remaining 26 open LLMs for our main analysis and refer to the achieved scores of the
proprietary models only briefly for comparison to better classify the open LLM performance.</p>
      <p>From the 26 task variations included in the dataset, we analyse 23 variations of seven task classes in
the KG engineering areas of RDF and SPARQL handling. To keep the comparison focused on the various
input formats, for consistency reasons, three task variations of Text2Sparql using other KGs as
datasets were excluded, and eight task variations of RdfFriendCount in the dataset were aggregated into
four task variations for the analysis. Task variations of a task class have a similar prompt and evaluation
but differ, e.g., by the serialization format (JSON-LD, N-Triples, Turtle, XML) presented to the LLM.</p>
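      <p>For illustration, the following minimal sketch (assuming rdflib, which is not necessarily the benchmark's own tooling) renders one tiny graph in the four serialization formats used by the task variations:</p>
      <preformat>
# Render the same two-triple graph in the four formats presented to the
# LLMs. Requires rdflib (JSON-LD support is built in from rdflib 6.0 on).
from rdflib import Graph

data = """
@prefix ex: &lt;http://example.org/&gt; .
ex:alice ex:knows ex:bob .
ex:bob ex:knows ex:carol .
"""
g = Graph()
g.parse(data=data, format="turtle")
for fmt in ["turtle", "nt", "json-ld", "xml"]:
    print(f"--- {fmt} ---")
    print(g.serialize(format=fmt))
      </preformat>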
      <p>
        For each open LLM, 50 repetitions per task variation were executed. To assess the
performance of the LLMs, tasks compute several measures based on the LLM answers, with values in
the interval [0, 1] and better answers resulting in higher scores. These measures often include ones
based on recall, precision, and F1 measure as well as, e.g., brevity measures or ones indicating whether the
answer was at least syntactically correct. For some tasks, there are variations of measures defined with
different levels of strictness in the evaluation.
      </p>
      <p>We selected measures that examine the result correctness in sufficiently different ways to provide a
concise overview. Therefore, brevity measures are not included, and F1-based measures were selected
over precision- and recall-based measures. In the case of similar measures, only one representative was
chosen, e.g., among measures that check the responses relying on the requested output format or measures
that search the answer for expected components. Here, stricter measures have been preferred over more
relaxed ones. Regarding measures that operate on output lists, we selected measures that remove
leading and trailing whitespace, since this is only a minor correction. Additionally, for tasks yielding
RDF graphs or SPARQL queries, measures indicating their syntactical correctness were included.</p>
      <p>The different calculated measures can be classified into three types:
Central These format-sensitive answer quality measures assess the answer correctness sensitive to
the instructed output format, i.e., the output accuracy is assessed assuming that the requested
format is respected. They are listTrimF1, f1, strSimilarity and trimF1.</p>
      <p>Fragment The fragment-based answer quality measures assess the answer correctness but are less
strict regarding the answer format when evaluating the answer and account for correct answer
parts. They include textHttpF1, contentF1 and sparqlIrisF1.</p>
      <p>Syntax Syntactical answer correctness measures inspect whether the output is syntactically correct, i.e.
fulfills all criteria for valid graphs or queries. The two measures parsableSyntax and answerParse
belong to this category.</p>
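      <p>As an illustration of the central measure type, a minimal sketch of a listTrimF1-style computation follows; it only mirrors the verbal description of the measure, and the actual LLM-KG-Bench implementation may differ:</p>
      <preformat>
# F1 between expected and given list entries after trimming surrounding
# whitespace (listTrimF1-style); sketch only, not the framework's code.
def list_trim_f1(expected, given):
    exp = {e.strip() for e in expected}
    got = {g.strip() for g in given}
    if not exp or not got:
        return 0.0
    tp = len(exp.intersection(got))  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(got)
    recall = tp / len(exp)
    return 2 * precision * recall / (precision + recall)

print(list_trim_f1(["ex:a", "ex:b"], [" ex:a ", "ex:c"]))  # 0.5
      </preformat>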
      <p>
        In the following, the seven task classes and the measures selected for this analysis are briefly described.
More information can be found in the LLM-KG-Bench task documentation
(https://github.com/AKSW/LLM-KG-Bench/blob/v3.0.0/LlmKgBench/tasks/README.md) or in the articles
introducing them [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12, 13</xref>
        ].
      </p>
      <p>[Table 1: Overview of the evaluated open models and their sizes (number of parameters). Model families: Qwen2-Instruct, Qwen2.5-Instruct, Qwen2.5-Coder-Instruct, Meta-LLama-3-Instruct, Meta-LLama-3.1-Instruct, Meta-LLama-3.2-Instruct, Meta-LLama-3.3-Instruct, Microsoft-Phi-3-instruct, Microsoft-Phi-3.5-instruct, Infly-OpenCoder-8B-Instruct and deepseek-coder-33b-instruct; sizes range from 0.5B to 72B parameters, with * marking Mixture-of-Experts models.]</p>
      <p>
RdfConnectionExplain This task consists of finding the shortest connection between two nodes
in a small KG which requires a basic understanding of serialization formats and RDF concepts. There
are four variations of this task. Each presents the graph in a different serialization format: JSON-LD,
N-Triples, Turtle, or RDF/XML. Here, a list of IRIs representing the shortest path is the expected answer
format. For the given answer, the task computes listTrimF1 as F1-measure on trimmed list entries
without leading or trailing whitespaces. The textHttpF1 measure is an F1-measure on IRI-like answer
parts starting, e.g., with “http://”.</p>
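      <p>A minimal sketch of the IRI harvesting behind a textHttpF1-style measure could look as follows (the token-based extraction is an assumption for illustration; the framework's actual extraction may differ):</p>
      <preformat>
# Harvest IRI-like answer parts (starting with "http") from free text,
# stripping common trailing punctuation; the resulting set can then be
# scored with an F1 measure against the expected IRIs.
def extract_iris(text):
    return {tok.rstrip(".,;)") for tok in text.split() if tok.startswith("http")}

answer = "The shortest path is http://ex.org/a then http://ex.org/b."
print(extract_iris(answer))  # {'http://ex.org/a', 'http://ex.org/b'}
      </preformat>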
      <p>RdfFriendCount This task presents a small KG with nodes of one type and edges of one type. The
LLM is asked to return the node with the most incoming edges. There are 4 KG serialization format
variations: JSON-LD, N-Triples, Turtle, and RDF/XML. The task computes the f1 measure on the nodes
found in the answer.</p>
      <p>RdfSyntaxFixing A KG with syntax errors is provided and the LLM is queried to correct it. There
are 3 variations introduced with the serialization formats JSON-LD, N-Triples, and Turtle. The measure
parsableSyntax equals 1 if the RDF syntax in the answer is parsable (0 otherwise). strSimilarity is
computed by comparing the given RDF with the expected answer, and contentF1 is the F1-measure on
the given RDF graph on a triple level.</p>
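      <p>A parsableSyntax-style check can be sketched with rdflib as follows (an assumption for illustration; the framework's actual check may differ):</p>
      <preformat>
# 1 if the answer parses as RDF in the requested serialization, else 0.
from rdflib import Graph

def parsable_syntax(answer_text, fmt="turtle"):
    try:
        Graph().parse(data=answer_text, format=fmt)
        return 1
    except Exception:
        return 0

print(parsable_syntax("@prefix ex: &lt;http://example.org/&gt; . ex:a ex:b ex:c ."))  # 1
print(parsable_syntax("this is not turtle"))  # 0
      </preformat>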
      <p>Sparql2Answer In this task, the LLM is asked to respond with the result set for a given SPARQL
SELECT query given the KG. There are 2 variations with the graph serialization formats JSON-LD and
Turtle. The answer should be a list of entities and the trimF1 measure is computed as F1-measure on
the trimmed entities, where leading and trailing whitespaces are removed.</p>
      <p>SparqlSyntaxFixing Similar to the RdfSyntaxFixing task, the LLM is asked to fix syntactically
erroneous SPARQL SELECT queries. The measure answerParse equals 1 if the adapted SPARQL query
syntax is correct (0 otherwise). The sparqlIrisF1 measure is the F1-measure on the IRIs found in the modified
SPARQL query. The f1 measure refers to the result set obtained when executing the corrected SPARQL
SELECT query.</p>
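      <p>Analogously, an answerParse-style check for SPARQL can be sketched with rdflib's query parser (again an assumption for illustration):</p>
      <preformat>
# 1 if the query is syntactically valid SPARQL, else 0.
from rdflib.plugins.sparql import prepareQuery

def answer_parse(query_text):
    try:
        prepareQuery(query_text)
        return 1
    except Exception:
        return 0

print(answer_parse("SELECT ?s WHERE { ?s ?p ?o }"))  # 1
print(answer_parse("SELECT WHERE {"))                # 0
      </preformat>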
      <p>Text2Answer Similar to the Sparql2Answer task, the LLM is asked to respond with the result set for
a given natural language question given a small KG. There are 2 variations of the graph presented in the
serialization formats JSON-LD and Turtle. Similar to the Sparql2Answer task, the answer is expected as
a list and the trimF1 measure is computed on the trimmed list elements.</p>
      <p>Text2Sparql Here a natural language question is presented together with information on a KG and
the LLM is asked to translate the question into a suitable SPARQL SELECT query. There are 3 variations
with the KG presented in the form of a complete schema, only the relevant schema or the relevant
subgraph, all in Turtle syntax. The same measures as described for the SparqlSyntaxFixing task were
selected: answerParse, sparqlIrisF1 and f1.</p>
      <p>For all tasks, the prompts are kept relatively simple and are not specifically optimized using prompt
engineering, in order to assess the basic capabilities of LLMs. Moreover, we refrain from using advanced
prompting techniques to prevent certain models from gaining an unfair advantage that could arise if
they were used in the prompt engineering process. In the following section, we analyse the described
dataset.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Result Analysis</title>
      <p>In this section, we report and analyse the results of the benchmark run. First, the overall task
performance is examined to explore task-centered tendencies (Section 4.1). Second, we take a closer
look at task performances with respect to model sizes and shed light on two aspects: comparison of
performances between different size categories (Section 4.2) and the development of scores with respect
to model sizes and families (Section 4.3).</p>
      <sec id="sec-4-1">
        <title>4.1. Overall Task Performance</title>
        <p>To get an overview of the benchmark scores achieved by the open LLMs included in the experiments,
Table 2 lists the means and standard deviations of all LLM scores per task variation. Additionally, mean
scores of individual LLMs as well as the highest and lowest intra-LLM mean are reported.</p>
        <p>Regarding mean calculation, missing values of central and fragment measures originating from
unparsable RDF or SPARQL outputs were filled with 0 to account for the fact that the outputs do
not even meet the minimum quality criterion of syntactical correctness. Concerning tasks that allow
corrections to initial answers (i.e. multiple Prompt-Answer-Evaluate loops), only the last answer scores
are considered in the table, since the means over all scores for the first and last answers show only minor
differences.</p>
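        <p>A minimal sketch of this aggregation could look as follows (the CSV export and its column names are hypothetical, not the dataset's actual schema):</p>
        <preformat>
# Fill missing scores from unparsable outputs with 0, keep only the
# last answer of each Prompt-Answer-Evaluate loop, then aggregate.
import pandas as pd

df = pd.read_csv("benchmark_scores.csv")  # hypothetical export of the run
df["score"] = df["score"].fillna(0.0)
last_idx = df.groupby(["model", "task", "run"])["answer_index"].transform("max")
last = df[df["answer_index"] == last_idx]
summary = last.groupby(["task", "measure"])["score"].agg(["mean", "std"])
print(summary)
        </preformat>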
        <p>In the following, we will examine the scores for each measure type that are listed in Table 2.</p>
        <p>Regarding the mean scores of central measures, they are on average medium-high, close to a
score of 0.6, on the tasks SparqlSyntaxFixing, RdfConnectionExplain, Text2Answer, RdfSyntaxFixing and
Sparql2Answer. In contrast, RdfFriendCount stands out with low means between 0.06 and 0.29. For the
Text2SPARQL task, the two input variations turtle schema and subschema also got low scores of 0.13 and
0.10, while the input variation turtle subgraph achieved a comparatively high mean score of 0.57. For other
tasks, the difference between the mean scores for input variations is comparatively small, with a maximum
difference of 0.25 between the lowest and highest mean in the task class RdfSyntaxFixing. Moreover, no
clear task-overarching preference for a specific KG format (turtle, nt, jsonld, xml) is recognizable.</p>
        <p>Looking at the standard deviation, the scores on the central measures are widespread. Roughly 20%
of the central measures have a standard deviation between 0.2 and 0.3, 40% have one between 0.3 and
0.4, and the remaining 40% have a dispersion between 0.4 and 0.5. This is also reflected in the minimum
and maximum average central measure score per LLM. The highest minimum mean is 0.12 while, except
for three outliers, the maximum means are 0.75 or higher, and most even have a mean of 1 or close
to it. Here, only the turtle input variation of the RdfFriendCount task with a maximum score of 0.47
and the turtle schema and turtle subschema input variations of the Text2SPARQL task with maximum
intra-LLM means of 0.31 and 0.28 differ substantially from the other means. In all of these cases, this
circumstance is also apparent in a low overall mean score.</p>
        <p>The fragment measures show tendencies similar to the central measures. As expected, for all
task variations their means are higher compared to the central measures. This is also reflected in the
minimum and maximum intra-LLM means. Only for the RdfSyntaxFixing task, the minimum mean
per LLM is lower than those of the central measures. Notably, in contrast to the central measures, the
fragment measures are only calculated if the output graph is syntactically correct, otherwise, they are 0.</p>
        <p>Last but not least, the means of syntax measures for the RdfSyntaxFixing, SparqlSyntaxFixing, and
Text2Sparql tasks are rather high ranging between 0.68 and 0.81. However, the dispersion of values
around the means is relatively high with standard deviations between 0.39 and 0.47. Without exceptions,
the minimum intra-LLM means are all close or equal to 0, and the maximums close or equal to 1.</p>
        <p>This subsection gave an overview of the models’ performances on all tasks to classify task classes
and their variations. Building upon that, the following two subsections focus on comparing the models’
answer quality based on their sizes, i.e., trained parameters. First, in Section 4.2, task performances are
compared between model size categories, and in Section 4.3 the development of scores in general and
within families is visually assessed with respect to model sizes.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Size Category Performance Similarities</title>
        <p>
          In the following analyses, only the central measures are included since they indicate most accurately
whether a given answer is correct also taking into account the adherence to the requested output format.
To examine whether LLM size afects task performance, we first divided the LLMs into four groups with
respect to their sizes. We classify models into the size categories tiny [0 − 3] , small (3 − 8] , medium
(8 − 33] and large (33 − 72] . Subsequently, to assess whether there are any significant diferences
in achieved central benchmark scores between the LLM size groups, we conducted Kruskal–Wallis
tests [
          <xref ref-type="bibr" rid="ref13">28</xref>
          ] for each task variation with the null hypothesis that the score distributions of all groups
are identical. For all tests performed, null hypotheses were rejected with a significance level of less
than 0.001 indicating that for all task variations, there are significant diferences between model size
groups. The highest significance level was obtained for the RdfConnectionExplain xml variation with
p≈5 − 12 and the lowest significance level was p ≈7 − 122 for the Text2Sparql turtle subschema variation
indicating highly significant diferences between the groups.
        </p>
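        <p>The grouping and test can be sketched as follows (pandas and scipy; the data frame columns are again hypothetical):</p>
        <preformat>
# Bin models into the four size categories and run a Kruskal-Wallis test
# on the central measure scores per task variation.
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("benchmark_scores.csv")  # hypothetical, as above
bins = [0, 3, 8, 33, 72]                  # (0,3], (3,8], (8,33], (33,72]
labels = ["tiny", "small", "medium", "large"]
df["size_group"] = pd.cut(df["params_billion"], bins=bins, labels=labels)

for task, sub in df.groupby("task"):
    groups = [g["score"].values for _, g in sub.groupby("size_group", observed=True)]
    stat, p = kruskal(*groups)
    print(f"{task}: H={stat:.1f}, p={p:.2e}")
        </preformat>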
        <p>
          Since the Kruskal-Wallis test only measures whether there is a significant difference between a set
of groups, next, post-hoc Dunn tests [
          <xref ref-type="bibr" rid="ref14">29</xref>
          ] with Bonferroni correction [
          <xref ref-type="bibr" rid="ref15">30</xref>
          ] were conducted to examine
which groups are dissimilar. Again, the null hypothesis was that there is no difference between the
group pairs. Table 3 shows the results of the post-hoc tests for each task variation. Group pairs that
are dissimilar with a significance of 5% or less are left blank. For all pairs not classified as dissimilar, the
p-value is provided. The higher this value, the more similar the groups can be considered. Additionally,
on the left-hand side of the table, the mean scores per model size group are given as a reference. Groups
with a high standard deviation (0.3, 0.4] are marked with ∼, and those with a very high standard
deviation (0.4, 0.5] with ≈. All other groups have a standard deviation of 0.3 or lower.
        </p>
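        <p>The post-hoc step can be sketched with the scikit-posthocs package (column names and the task label are hypothetical):</p>
        <preformat>
# Pairwise Dunn tests with Bonferroni correction between size groups for
# one task variation; returns a matrix of adjusted p-values. Assumes a
# size_group column as computed in the previous sketch.
import pandas as pd
import scikit_posthocs as sp

df = pd.read_csv("benchmark_scores.csv")  # hypothetical, as above
sub = df[df["task"] == "RdfFriendCount turtle"]
pvals = sp.posthoc_dunn(sub, val_col="score", group_col="size_group",
                        p_adjust="bonferroni")
print(pvals)
        </preformat>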
        <p>Overall, as expected, most comparisons show, with a significance of 5%, a dissimilarity between the
respective group pairs, i.e., their respective scores were significantly different (the null hypotheses
were rejected). Except for six dissimilar pairs, the differences were also very significant with p &lt; 0.001.
Typically, the identified significant score differences between groups are associated with rising scores
from groups containing smaller LLMs to groups with larger model sizes. However, the input
variations turtle and xml of the RdfFriendCount task show unexpectedly decreasing scores from a group
of smaller to a group of larger LLMs. Additionally, deviating from the assumptions, there were also
pairs for which no significant differences were recognizable. For these groups, the p-value is given
in Table 3. Higher values indicate that size categories can be considered more similar with respect
to their task performance. For the pairs of medium and large groups, this applies to roughly half of
the cases. Mostly, for both groups, the average scores were high, i.e., a ceiling effect occurred. In
three cases, for the RdfFriendCount nt, Text2Sparql turtle schema and Text2Sparql turtle subschema task
variations, the overall scores are low and show plateau effects, i.e., they do not change perceivably, so
no significant differences were detected. The pair with the second highest number of insignificant
score differences is small-medium with five cases. Here, mainly plateau effects occurred. Similarly,
the three insignificant differences between scores of the pair tiny-small are plateaus. In contrast to
groups adjacent with respect to the size class they represent, pairs that are not directly adjacent have
predominantly significantly different scores. Hence, there are only two cases of rather similar scores
forming plateaus between the groups tiny and medium and one case between the pair small and large.</p>
        <p>In summary, overall the most frequent perceivable effect is a rise of the average scores from smaller
to larger model groups. Notable exceptions are the turtle and xml variations of RdfFriendCount, for
which the scores between smaller and larger model size groups decrease significantly. Other than that,
plateaus occur for which the scores between adjacent groups do not change significantly. However, in
some cases, the plateaus occur only locally and a rise of scores is detectable in particular between the
medium and large groups (see e.g. RdfFriendCount jsonld or RdfConnectionExplain nt). For some task
variations, the scores of the groups medium and large also almost reach the upper score bound of 1.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Task Performance by Model Size and Family</title>
        <p>Complementary to the last section, Figure 1 shows the average central measure scores for each task
class by LLM relative to their sizes. In addition, dashed lines connect LLMs of the same family. For
the tasks Text2Sparql and RdfFriendCount, the plots again expose the overall poor performance of the
included LLMs. Moreover, other previously found patterns are visible in the figures. Hence, the overall
tendency of scores to rise with the model size is noticeable. In addition, the plateau and ceiling effects
can be seen.</p>
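        <p>A plot in the style of Figure 1 can be sketched as follows (matplotlib; the aggregated input file and its columns are hypothetical):</p>
        <preformat>
# Mean central score per model over model size (log x-axis), with dashed
# lines connecting models of the same family.
import pandas as pd
import matplotlib.pyplot as plt

means = pd.read_csv("model_means.csv")  # hypothetical: model, family, params_billion, score
fig, ax = plt.subplots()
for family, sub in means.groupby("family"):
    sub = sub.sort_values("params_billion")
    ax.plot(sub["params_billion"], sub["score"], "o--", label=family)
ax.set_xscale("log")
ax.set_xlabel("model size (billion parameters)")
ax.set_ylabel("mean central measure score")
ax.legend(fontsize="small")
plt.show()
        </preformat>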
        <p>Furthermore, additional insights are visible in the plots. For reference, the plots also show the highest
average score achieved by a proprietary model included in the benchmark run as a red horizontal line.
Here, we see that, except for the Text2Sparql task, the best proprietary LLM always reaches an average
score of 0.99 or 1.00. With that, the best-performing proprietary model is on par with the best-performing
open LLM, except for the RdfFriendCount task, in which the best open LLM achieves only a score of 0.55.
Especially the mean scores of tiny [0-3] and small (3-8] models differ greatly.</p>
        <p>Moreover, for most tasks, the highest score growth occurs within the range of tiny to smaller
medium-sized models around 13B. Frequently, for tasks showing ceiling effects, some smaller models of
around 8B or 14B already reach average scores of 0.8 or higher. Here, especially the 8B and 14B Qwen2.5 models
stand out.</p>
        <p>The two included Mixture-of-Experts (MoE) LLMs, namely Phi-3.5-MoE-instruct (42B parameters,
thereof 6.6B active) and Qwen2-57B-A14B-Instruct (57B parameters, thereof 14B active), show for most
tasks scores similar to dense models of comparable total parameter count, i.e., models that use all parameters
during inference. Nevertheless, there are models with lower total parameter counts, in the range of
the MoE models' active parameter counts, that perform comparably.</p>
        <p>Besides, the code-specialized models Qwen2.5-Coder (32B), OpenCoder (8B), and Deepseek-Coder
(33B) are all roughly on par with, or in a few cases slightly better than, similarly sized
LLMs on the tasks RdfSyntaxFixing, SparqlSyntaxFixing and Text2Sparql,
which all yield either an RDF graph or a SPARQL query. Here, DeepSeek-Coder and Qwen2.5-Coder perform
similarly to the best-performing open models, but this also holds for some non-code-specialized models.
For the other tasks, except for the RdfFriendCount task, Qwen2.5-Coder also performs
similarly to or even slightly better than similar-sized models. In contrast, DeepSeek-Coder and OpenCoder
perform worse than other similarly-sized models on the tasks not yielding a KG or SPARQL query.</p>
        <p>In the next paragraphs, we look at intra-family and inter-family developments of benchmark scores
with respect to the model size.</p>
        <p>Models of the same family also reflect the overall tendency of scores to rise with the model size. In
addition, the largest models of the families are typically the best performing. However, on the family
level, occasionally the task performance also drops between size-wise adjacent smaller and larger
models. These drops in performance with larger models typically remain only local and the next larger
model of the family often shows a higher or at least steady task performance compared to the model
before the drop. Global, family-overarching saturation effects, i.e., ceiling and plateau effects, are also
visible, in particular for families with many models covering all size bands.</p>
        <p>Between different families, in some cases, the score development with rising model size is similar
and almost parallel. However, in general, clear global parallelisms in the score developments with rising
model sizes are not recognizable.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we summarize and discuss key insights from our analysis, structured into paragraphs
each covering a different insight.</p>
      <p>Larger models typically achieve higher scores than smaller ones, but there are plateau and
ceiling effects. In the size category analyses, we saw that, as expected, typically
larger model size groups achieved significantly higher benchmark scores (see also Table 3). For easier
tasks, especially the medium and large category pairs got similarly high scores. Hence, here medium-sized
models could be a good choice to optimize cost-effectiveness. In contrast, especially for more difficult
tasks, plateaus occurred. Some were only local, and larger models got significantly higher scores than
the models within the plateau range. Consequently, it makes sense to consider the detected local
plateaus and decide on a larger LLM, since even if the costs increase, the performance also increases
significantly. For global plateaus that are not yet close to the maximum score and span up to the
large models, it could also make sense to use smaller models, since this saves costs and does not affect the
task performance much.</p>
      <sec id="sec-5-1">
        <title>Some smaller models also perform comparably well; however, the performance across individual smaller models varies.</title>
        <p>Furthermore, Figure 1 confirms that even some small (∼8B)
or medium-sized (∼13B) models might be a good choice since they already achieve reasonably high
scores. Nevertheless, individual models within the same size band have to be tested explicitly, since
their performance also varies. However, the insights help to guide the overall model search and indicate
whether it seems promising to consider models within a certain size band or not.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Performance drops occur between smaller and larger models from the same family.</title>
        <p>Moreover, within model families, we again saw that local task performance drops can occur between smaller and
larger members. Hence, it is advisable to also study models adjacent with respect to their size within a
family.</p>
      </sec>
      <sec id="sec-5-3">
        <title>The examined open LLMs cannot cope well with the RdfFriendCount and Text2Sparql tasks.</title>
        <p>In addition to guiding the model choice within open LLMs, the results also indicate that current
state-of-the-art open LLMs as of December 2024, up to a parameter count of 70B, cannot cope well with the
RdfFriendCount and the Text2Sparql tasks. Here, the tasks likely require an even larger model. For the
RdfFriendCount task, a proprietary LLM included in the benchmark run got a mean central score of 0.99
or 1.00. Hence, current proprietary models can cope well with this task, in contrast to the comparably
much smaller large open LLMs. Nevertheless, for the Text2Sparql task, the proprietary LLMs also achieve
no substantially higher mean scores. Here, the best-performing model achieves only a mean score of
0.49, indicating that the identified plateau effect even continues for much larger proprietary models.</p>
      </sec>
      <sec id="sec-5-4">
        <title>The examined code-specialized models showed better performance on tasks where a KG or a SPARQL query was requested.</title>
        <p>Especially DeepseekCoder
and OpenCoder performed best on the tasks RdfSyntaxFixing, SparqlSyntaxFixing and Text2Sparql in
comparison to the other tasks.</p>
      </sec>
      <sec id="sec-5-5">
        <title>The examined Mixture-of-Experts (MoE) models do not show superior performance in comparison with models of the MoE's active parameter count.</title>
        <p>Looking at the individual task scores
of models, the MoE LLMs Phi-3.5-MoE-instruct and Qwen2-57B-A14B-Instruct performed mostly
comparably to models having a similar total parameter count. However, models with a size in the range of the
MoE models' active parameters also performed similarly. Hence, for the given tasks, it makes sense to
prefer these smaller models over the MoE models concerning cost-effectiveness.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we analysed the scores of open LLMs from a run of the LLM-KG-Bench benchmark for
knowledge graph engineering-related tasks with a focus on the correlation between model size and
achieved scores. Overall, we saw that, as expected, usually the larger the model was, the higher the
scores were. However, our analysis also showed plateau and ceiling effects in which model scores
did not differ substantially between smaller and larger models. Hence, for comparably easy tasks,
smaller models already achieved reasonably high scores. Consequently, it makes sense to also
consider smaller models for similarly complex tasks. For the RdfFriendCount and Text2Sparql tasks,
the benchmark scores were overall low, and plateau effects spanned up to the largest models analysed.
Here, we can conclude that the capabilities of SOTA open LLMs are not yet sufficient to solve tasks
of this complexity. While the RdfFriendCount task can be solved by much larger proprietary models,
for Text2Sparql the plateau effects continue, and potentially even larger models are required to sufficiently
solve this task.</p>
      <p>For future work, we believe that similar analyses of benchmark runs are meaningful to get an
overview of the status of SOTA models, but also to derive generalizable insights that might help to classify
whether newly introduced models or models not part of the benchmark run seem promising to consider.
Here, it would also be interesting to additionally examine other scaling law-related factors like the
training data, the number of training steps, and factors related to the model architecture. This would
allow for further examination and possible explanations of effects that became apparent in this work,
like the performance differences between similarly-sized models or performance drops of larger models
compared to smaller ones belonging to the same model family. Moreover, it is meaningful to extend the
LLM-KG-Bench framework with more complex variations of already well-solved tasks to be able to figure
out whether medium-sized models are still on par with large models in more difficult scenarios. Besides,
also motivated by the preference of code-specialized LLMs towards tasks requiring a KG or SPARQL
query as output, exploring which kinds of capabilities are required for specific tasks, and why some tasks
seem particularly challenging, would be a meaningful future contribution to guide targeted solutions.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by grants from the German Federal Ministry of Education and
Research (BMBF) to the projects ScaleTrust (16DTM312D) and KupferDigital2 (13XP5230L) as well
as from the German Federal Ministry for Economic Affairs and Climate Action (BMWK) to the KISS
project (01MK22001A).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT4o and ChatGPT4.5-RP to: Grammar and
spelling check, paraphrase, and reword to improve the writing style. After using these tools/services,
the authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Online Resources</title>
      <p>The LLM-KG-Bench framework is available here: LLM-KG-Bench. The raw benchmark run results and
further figures are available here: Results LLM-KG-Bench v3. The code written to perform this analysis
can be found here: Analysis Code.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Hogan, E. Blomqvist, M. Cochez, C. d'Amato, G. de Melo, C. Gutierrez, J. E. L. Gayo, S. Kirrane, S. Neumaier, A. Polleres, R. Navigli, A.-C. N. Ngomo, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann, Knowledge graphs, ACM Computing Surveys (CSUR) 54 (2020) 1-37. doi:10.1145/3447772.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying large language models and knowledge graphs: A roadmap, IEEE Transactions on Knowledge and Data Engineering (TKDE) (2024). doi:10.1109/TKDE.2024.3352100.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. P. Allen, L. Stork, P. Groth, Knowledge engineering using large language models (2023). doi:10.4230/TGDK.1.1.3.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R. Buchmann, J. Eder, H.-G. Fill, U. Frank, D. Karagiannis, E. Laurenzi, J. Mylopoulos, D. Plexousakis, M. Y. Santos, Large language models: Expectations for semantics-driven systems engineering, Data and Knowledge Engineering 152 (2024) 102324. doi:10.1016/j.datak.2024.102324.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] T. A. Taffa, R. Usbeck, Leveraging LLMs in scholarly knowledge graph question answering (2023). doi:10.48550/ARXIV.2311.09841.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Hofer, J. Frey, E. Rahm, Towards self-configuring knowledge graph construction pipelines using LLMs - a case study with RML, in: Fifth International Workshop on Knowledge Graph Construction @ ESWC2024, volume 3718 of CEUR Workshop Proceedings, CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3718/paper6.pdf.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] L. Kovriguina, R. Teucher, D. Radyush, D. Mouromtsev, SPARQLGEN: One-shot prompt-based approach for SPARQL query generation, in: International Conference on Semantic Systems, volume 3526 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3526/paper-08.pdf.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] H. Babaei Giglou, J. D'Souza, S. Auer, LLMs4OL: Large Language Models for Ontology Learning, Springer Nature Switzerland, 2023, pp. 408-427. doi:10.1007/978-3-031-47240-4_22.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Meyer</surname>
          </string-name>
          , J. Frey,
          <string-name>
            <given-names>K.</given-names>
            <surname>Junghanns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bulert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gründer-Fahrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Developing a scalable benchmark for assessing large language models in knowledge graph engineering</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Keshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gentile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vahdati</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Posters and Demo Track of the 19th International Conference on Semantic Systems (SEMANTICS 2023)</source>
          , volume
          <volume>3526</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.org/Vol-3526/paper-04.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Heim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Junghanns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>LLM-KG-Bench 3.0: A compass for semantic technology capabilities in the ocean of LLMs</article-title>
          ,
          <source>in: Proceedings of ESWC 2025 Resources Track</source>
          ,
          <year>2025</year>
          . Accepted for publication.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bulert</surname>
          </string-name>
          ,
          <article-title>Benchmarking the abilities of large language models for RDF knowledge graph creation and comprehension: How well do LLMs speak Turtle?</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2023) co-located with the 21st International Semantic Web Conference (ISWC 2023), Athens, November 6-10, 2023</source>
          , volume
          <volume>3559</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.org/Vol-3559/paper-3.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gruender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Assessing the evolution of LLM capabilities for knowledge graph engineering in 2023</article-title>
          ,
          <source>in: Proceedings of Special Track Large Language Models for Knowledge Engineering at Extended Semantic Web Conference 2024 (ESWC24)</source>
          ,
          <year>2024</year>
          . doi: 10.1007/978-3-031-78952-6_5.
        </mixed-citation>
      </ref>
      <ref id="ref12a">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Buchatskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de Las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hennigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Noland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>van den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Damoc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osindero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <article-title>Training compute-optimal large language models</article-title>
          (
          <year>2022</year>
          ). arXiv:2203.15556.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Kruskal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <article-title>Use of ranks in one-criterion variance analysis</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          <volume>47</volume>
          (
          <year>1952</year>
          )
          <fpage>583</fpage>
          -
          <lpage>621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>O. J.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <article-title>Multiple comparisons using rank sums</article-title>
          ,
          <source>Technometrics</source>
          <volume>6</volume>
          (
          <year>1964</year>
          )
          <fpage>241</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Bonferroni</surname>
          </string-name>
          ,
          <article-title>Il calcolo delle assicurazioni su gruppi di teste</article-title>
          ,
          <year>1935</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>