<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Models for SPARQL Query Generation in Scientific Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonello Meloni</string-name>
          <email>antonello.meloni@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Reforgiato Recupero</string-name>
          <email>diego.reforgiato@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Osborne</string-name>
          <email>francesco.osborne@open.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Salatino</string-name>
          <email>angelo.salatino@open.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <email>enrico.motta@open.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Sahar Vahdati</string-name>
          <email>sahar.vahdati@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>jens.lehmann@tu-dresden.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Knowledge Graphs, Large Language Models, Machine Translation, SPARQL</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Knowledge Media Institute, Open University</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ScaDS.AI - TU Dresden</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>3617</volume>
      <fpage>1</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Scientific question answering remains a significant challenge for the current generation of large language models (LLMs) due to the requirement of engaging with highly specialised concepts. A promising solution is to integrate LLMs with knowledge graphs of research concepts, ensuring that responses are grounded in structured, verifiable information. One effective approach involves using LLMs to translate questions posed in natural language into SPARQL queries, enabling the retrieval of relevant data. In this paper, we analyse the performance of several LLMs on this task using two scientific question-answering benchmarks: SciQA and DBLP-QuAD. We explore both few-shot learning and fine-tuning strategies, investigate error patterns across different models, and propose directions for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Machine Translation</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Answering scientific questions poses a significant challenge for current LLMs due to the need to engage
with highly specialised and complex concepts. In this domain, common limitations of LLMs, such
as hallucinations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the “long-tail” issue [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where LLMs struggle with rare or less frequently
occurring concepts, become especially crucial. While a new generation of LLM-based systems for
literature reviews and scientific writing support has emerged [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], their output still falls short of the
standards expected in high-quality scientific literature.
      </p>
      <p>
        One promising solution is the integration of LLMs with Knowledge Graphs (KGs) of research
concepts, which helps ensure that responses are grounded in structured, verifiable information [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The
scientific domain benefits from a wide array of knowledge organization systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], such as taxonomies
and ontologies of research topics, which play a crucial role in categorizing, managing, and retrieving
information. Additionally, numerous knowledge graphs have been developed in this space, providing
machine-readable, semantically rich, and interlinked descriptions of the content of research
publications [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ]. Notable examples include the Open Research Knowledge Graph (ORKG) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        An effective approach for integrating LLMs with KGs involves using LLMs to translate scientific
questions, posed in natural language, into SPARQL queries [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This allows for the retrieval of relevant
data from the KG, which can either be presented directly to the user or further refined by the LLM. This
solution also enables less technically proficient users to query and navigate complex knowledge graphs
of scientific concepts through a natural language interface.</p>
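      <p>As a concrete illustration, a question such as the one below might be translated into a query of the following form. All prefixes, predicates, and identifiers in this sketch are invented for illustration and do not correspond to the schema of any specific KG:</p>
      <preformat><![CDATA[
# A hypothetical natural language question and a SPARQL translation of it.
# All IRIs and predicate names are invented for illustration.
question = "Which papers evaluate their approach on the CoNLL-2003 dataset?"

sparql_query = """
PREFIX ex:  <http://example.org/schema/>    # hypothetical prefix
PREFIX res: <http://example.org/resource/>  # hypothetical prefix

SELECT DISTINCT ?paper ?title WHERE {
  ?paper ex:evaluatesOn res:CoNLL2003 ;
         ex:title       ?title .
}
"""
]]></preformat>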
      <p>
        In this paper, we evaluate the performance of several LLMs in translating scientific questions into
SPARQL queries. We first conduct a comprehensive evaluation on the SciQA benchmark [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], followed
by testing the best-performing methods on the DBLP-QuAD benchmark [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Our goal is to assess
how effectively LLMs perform this task and determine whether current training or prompting methods
are sufficient or if more advanced techniques are required. We explore the effects of fine-tuning
and various prompting strategies, including zero-shot and few-shot learning, using different example
selection methods such as semantic similarity [15] and diversity [16, 17]. Furthermore, we analyse error
patterns across models to identify areas for improvement. The insights gained from this study provide
a foundation for advancing the field by developing more comprehensive benchmarks and designing
systems better equipped to answer complex scientific questions.
      </p>
      <p>This short paper extends [18] by introducing several additional experiments. These include the
evaluation of Mistral, enhanced error analysis, and the integration of the DBLP-QuAD benchmark. It
should be noted that here we intentionally focus on testing off-the-shelf LLMs using general-purpose
optimisation strategies that can be widely applied across diverse tasks and datasets. In contrast, other
researchers have focused on developing specialised approaches for SciQA, often incorporating additional
components to integrate information from the ORKG ontological schema [19, 20, 21].</p>
      <p>In summary, the key contributions of this study are as follows: i) we conduct a performance analysis
of five language models, evaluated across zero-shot, few-shot, and fine-tuning approaches; ii) we
demonstrate that the best models can achieve an F1 score exceeding 97% on both benchmarks; and iii)
we release the complete codebase of our experiments (https://github.com/paper-support-materials/Analysis-of-the-SciQA-Benchmark) to support further research into LLM performance
on similar benchmarking tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experiments on the SciQA dataset</title>
      <p>The SciQA dataset contains 2,565 pairs of natural language questions and corresponding SPARQL
queries, designed to retrieve relevant information from the ORKG, which includes 170,000 resources
detailing research from nearly 15,000 scholarly articles across 709 topics. The dataset consists of
both manually curated and automatically generated question-query pairs. Specifically, 100 pairs were
manually created, revealing eight distinct question templates. Using these templates, an additional
2,465 pairs were generated by GPT-3 and verified by human experts [22]. The SciQA benchmark dataset
is divided into three parts: 70% for training (1,795 samples), 10% for validation (257 samples), and 20%
for testing (513 samples).</p>
      <p>For our experiments, we evaluated five LLMs: T5-base [23], GPT-2-large [24], Dolly-v2-3b [25],
Mistral-7B-v0.1 [26], and GPT-3.5 Turbo [22]. We examined three optimization methods for LLMs:
fine-tuning (FT), zero-shot learning (ZSL), and few-shot learning (FSL). In FSL, to evaluate a question
from the test set, we used different methods from the literature to select the most relevant samples for
each question (a minimal implementation sketch follows the list below).</p>
      <p>Random: Select n samples randomly from the training set for each test question.</p>
      <p>Similarity: Order samples by their semantic similarity to the test question, and choose the top n most
similar samples.</p>
      <p>Diversity - Test A (All Diverse Templates): Rank samples by semantic similarity to the test question
and select the top n samples, ensuring they represent different templates.</p>
      <p>Diversity - Test B (Same Template for All): Rank samples by semantic similarity and select the top n
samples that share the same template as the first sample.</p>
      <p>Diversity - DPPs (Determinantal Point Processes): Start with the most similar sample to the question
and select additional samples with minimal semantic similarity to each other and the initial sample [27].</p>
      <p>Model sources: T5-base - https://huggingface.co/t5-base; GPT-2-large - https://huggingface.co/gpt2-large; Dolly-v2-3b - https://huggingface.co/databricks/dolly-v2-3b; Mistral-7B-v0.1 - https://huggingface.co/mistralai/Mistral-7B-v0.1; GPT-3.5 Turbo - https://platform.openai.com/docs/models/gpt-3-5.</p>
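      <p>The following is a minimal sketch of the similarity-based and DPP-style strategies, assuming sentence-transformers embeddings; the encoder name and the greedy approximation of DPP selection are illustrative choices, not necessarily those of our actual implementation (see the linked repository):</p>
      <preformat><![CDATA[
# A minimal sketch of few-shot sample selection, assuming sentence-transformers
# embeddings. The encoder choice and the greedy DPP approximation are illustrative.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def select_by_similarity(test_question, train_questions, k=7):
    """Indices of the k training questions most similar to the test question."""
    q_emb = encoder.encode(test_question, convert_to_tensor=True)
    t_emb = encoder.encode(train_questions, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, t_emb)[0]  # cosine similarity to each sample
    return scores.topk(k).indices.tolist()

def select_diverse(test_question, train_questions, k=7):
    """Greedy DPP-like selection: start from the most similar sample, then
    repeatedly add the candidate least similar to the samples chosen so far."""
    q_emb = encoder.encode(test_question, convert_to_tensor=True)
    t_emb = encoder.encode(train_questions, convert_to_tensor=True)
    selected = [int(util.cos_sim(q_emb, t_emb)[0].argmax())]
    while len(selected) < k:
        pairwise = util.cos_sim(t_emb, t_emb[selected])  # (n, |selected|)
        redundancy = pairwise.max(dim=1).values
        redundancy[selected] = float("inf")              # exclude chosen ones
        selected.append(int(redundancy.argmin()))
    return selected
]]></preformat>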
      <p>Table 1 provides a comparative analysis of all configurations according to their F1 scores and
exact match rates. The fine-tuned Mistral achieved the highest F1 score (97.69%), slightly surpassing
the fine-tuned T5 (97.51%) and GPT-3.5, using the 7-sample few-shot method based on similarity (97.36%).
Next are the fine-tuned GPT-2 (96.69%) and Dolly (96.58%). In terms of exact matches, T5 and Dolly
attained the highest score (483/513, 94.1%). Models utilising FSL performed well overall, though they
did not reach the level of the fine-tuned models. For example, Mistral, with a
7-sample FSL approach based on similarity, achieved a solid 94.7% F1. Semantic similarity is the most
effective selection method for FSL across all models. Notably, the benchmark proved highly challenging for all
models under ZSL conditions, as none achieved any exact matches and F1 scores were under 26%.</p>
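      <p>For clarity, the two measures can be sketched as follows, assuming a simple token-level formulation over whitespace-tokenised queries; the benchmarks' official evaluation scripts may normalise queries differently:</p>
      <preformat><![CDATA[
from collections import Counter

def query_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a generated and a gold SPARQL query
    (whitespace tokenisation; illustrative, not the official script)."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def exact_match(pred: str, gold: str) -> int:
    """1 if the whitespace-normalised queries coincide, else 0."""
    return int(" ".join(pred.split()) == " ".join(gold.split()))
]]></preformat>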
      <p>We performed an in-depth review of the queries generated by the top three models that were classified
as incorrect due to their deviation from the benchmark responses. The most common error category
involved the generation of incorrect predicates (56.8% of erroneous queries on average), followed by
semantic errors, where the query failed to accurately reflect the user’s question (51.4%), and misspelled
entities (51.3%). A query can be associated with multiple categories, meaning the total percentage
across all categories exceeds 100%.</p>
      <p>The most common error types for both T5 and GPT-3.5 stemmed from a limited understanding of
the underlying ontological schema. A prevalent issue was the generation of misspelled entities, which
constituted 60.0% of T5’s and 60.5% of GPT-3.5’s incorrect outputs. Furthermore, both models had
difficulty accurately assigning entity types, contributing to 36.6% of T5’s errors and 52.6% of GPT-3.5’s
errors.</p>
      <p>In contrast, Mistral performed significantly better in these two categories, with error rates of 33.3%
and 15.2%, respectively. However, Mistral exhibited a much higher rate of semantic misunderstandings,
with 69.7% of its errors resulting from incorrect interpretations of the query.</p>
      <p>The first type of error appears easier to address, potentially by incorporating an additional entity
recognition component. To investigate this approach further, we evaluated the top three models
(fine-tuned Mistral, fine-tuned T5, and GPT-3.5 with 7-shot learning) on the DBLP-QuAD benchmark.
DBLP-QuAD is similar to SciQA but also includes relevant entities and relationships as part of the input.
This enabled us to assess whether providing the correct entities could help reduce the occurrence of
entity-related errors.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments on the DBLP-QuAD dataset</title>
      <p>
        The DBLP-QuAD benchmark (https://huggingface.co/datasets/awalesushil/DBLP-QuAD) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] includes 10,000 distinct question-query pairs, divided into training,
validation, and test sets in a 7:1:2 ratio. The dataset covers 13,348 entities (creators and publications)
and 11 predicates from the DBLP Knowledge Graph (https://blog.dblp.org/tag/knowledge-graph/). It offers 10 query types, each with 1,000
question-query pairs, equally split between creator-focused and publication-focused queries. Additionally, 2,350
of the questions in DBLP-QuAD are temporal, requiring the analysis of statistics across a specified
timeframe, e.g., “in the last five years”. The key difference from the SciQA dataset is that the entities
and relationships involved in the natural language query are also provided.
      </p>
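      <p>A sketch of how this additional input can be serialised into a few-shot prompt is shown below; the field labels and wording are our own illustration rather than the exact prompt used in the experiments (which is available in the linked repository):</p>
      <preformat><![CDATA[
def build_prompt(question, entities, relations, examples):
    """Assemble a few-shot prompt that exposes the gold entities and
    relations provided by DBLP-QuAD (illustrative format)."""
    parts = []
    for ex in examples:  # few-shot demonstrations from the training set
        parts.append(
            f"Question: {ex['question']}\n"
            f"Entities: {', '.join(ex['entities'])}\n"
            f"Relations: {', '.join(ex['relations'])}\n"
            f"SPARQL: {ex['query']}\n"
        )
    # The test question, left open for the model to complete.
    parts.append(
        f"Question: {question}\n"
        f"Entities: {', '.join(entities)}\n"
        f"Relations: {', '.join(relations)}\n"
        f"SPARQL:"
    )
    return "\n".join(parts)
]]></preformat>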
      <p>
        We evaluated the top three models from our previous experiments on the DBLP-QuAD dataset. Please
note that, unlike the original paper that introduced the benchmark [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we fine-tuned the T5 model
for 20 epochs instead of 5, and applied a slightly modified prompt. Full implementation details can be
found in the GitHub repository linked in the introduction.
      </p>
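      <p>A minimal fine-tuning sketch in the spirit of this setup is given below. It assumes a HuggingFace dataset with "question" and "query" columns; apart from the 20 epochs mentioned above, all hyperparameters are illustrative, and the exact configuration is in the repository:</p>
      <preformat><![CDATA[
from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration)

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def preprocess(batch):
    # Tokenise natural language questions as inputs and SPARQL queries
    # as target labels (the column names are assumptions, not the
    # dataset's guaranteed schema).
    inputs = tokenizer(batch["question"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["query"], truncation=True,
                       max_length=512)
    inputs["labels"] = labels["input_ids"]
    return inputs

# train_set = load_dataset(...).map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="t5-base-dblp-quad",
    num_train_epochs=20,              # as described above
    per_device_train_batch_size=8,    # illustrative
    learning_rate=3e-4,               # illustrative
)
# Seq2SeqTrainer(model=model, args=args, train_dataset=train_set,
#                data_collator=DataCollatorForSeq2Seq(tokenizer,
#                                                     model=model)).train()
]]></preformat>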
      <p>As reported in Table 2, T5-base achieved the best results (97.5%), slightly outperforming GPT-3.5
(94.7%). Mistral exhibited lower performance on this benchmark (88.7%). The advantage of T5-base
becomes even more evident when considering exact matches. In this case, T5-base leads significantly
(1,693/2,000, 84.6%), outperforming both GPT-3.5 (68.2%) and Mistral (60.7%).</p>
      <p>These findings are consistent with previously observed error patterns. T5 and GPT-3.5, which had
previously struggled the most with entity identification, now seem to leverage the provided entities
effectively and outperform Mistral. Additionally, the results suggest that a lightweight encoder-decoder
model such as T5, which can be fine-tuned effectively for translating natural language into SPARQL,
has significant potential as a scalable, resource-efficient, and effective method for this task, particularly
when paired with an entity resolution mechanism.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>The experiments reported in this short paper provide several valuable insights about the capability of
LLMs to address scientific question answering on KGs.</p>
      <p>First, it appears that current benchmarks are not sufficiently challenging for the latest generation of
LLMs, as the best configurations achieved over 97% F1 on both benchmarks. This may be attributed to
the regularities within the benchmarks, allowing fine-tuned models to learn and reproduce patterns. To
advance this field further, it is crucial to develop more diverse and challenging datasets that cover a
broader range of realistic query types, while actively involving human users in the dataset creation
process to minimise the risk of LLMs learning only a limited set of templates. Conducting user studies
with real-world applications could also be beneficial, as users tend to formulate more varied and complex
questions [28].</p>
      <p>Second, incorporating additional components to resolve entities and relations seems to be highly
useful, particularly for encoder-decoder models like T5 and generalist LLMs using few-shot learning,
such as GPT-3.5. This approach allows LLMs to focus on semantic interpretation and the generation of
accurate, well-formed SPARQL queries without needing to understand the specific schema of a given KG.
A possible enhancement would be to also provide systems with an ontological schema representation
as context, as explored in recent specialised approaches [19, 20].</p>
      <p>
        We are currently developing a new and more challenging benchmark, building upon the
Academia/Industry DynAmics (AIDA) Knowledge Graph [29] and the Computer Science Knowledge Graph
(CS-KG) [30]. To expand the diversity of question types, we are leveraging various question
templates drawn from large-scale question-answering benchmarks, such as Mintaka [31]. We are also
analysing the performance of large language models across several tasks relevant to scientific research,
including the construction of scientific knowledge graphs [32], link prediction between research
concepts [33, 34], research paper classification [35], citation recommendation [36], and the generation of
literature reviews [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <article-title>The devil is in the tails: How long-tailed code distributions impact large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2309.03567</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bolanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          , E. Motta,
          <article-title>Artificial intelligence for literature reviews: Opportunities and challenges</article-title>
          ,
          <source>arXiv preprint arXiv:2402.08565</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>54</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naseriparsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <article-title>Knowledge graphs: Opportunities and challenges</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mannocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge organization systems of research fields: Resources and challenges</article-title>
          ,
          <source>arXiv preprint arXiv:2409.04432</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Jaradeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Farfar</surname>
          </string-name>
          , et al.,
          <article-title>Open research knowledge graph: Next generation infrastructure for semantic scholarly knowledge</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Knowledge Capture</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birukou</surname>
          </string-name>
          , E. Motta,
          <article-title>Improving editorial workflow and metadata quality at springer nature</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2019</source>
          , Springer International Publishing, Cham,
          <year>2019</year>
          , pp.
          <fpage>507</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wijkstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Welbers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Steijaert</surname>
          </string-name>
          , Living literature reviews,
          <source>arXiv preprint arXiv:2111.00824</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Angioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Recupero</surname>
          </string-name>
          , E. Motta,
          <article-title>AIDA: A knowledge graph about research dynamics in academia and industry</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>1356</fpage>
          -
          <lpage>1398</lpage>
          . URL: https://doi.org/10.1162/qss_a_00162. doi:10.1162/qss_a_00162.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Velterop</surname>
          </string-name>
          ,
          <article-title>The anatomy of a nanopublication</article-title>
          ,
          <source>Information Services &amp; Use</source>
          <volume>30</volume>
          (
          <year>2010</year>
          )
          <fpage>51</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <article-title>Assessing SPARQL capabilities of large language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2409.05925. arXiv:2409.05925.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A. C.</given-names>
            <surname>Barone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Jaradeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Karras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mouromtsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pliukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Radyush</surname>
          </string-name>
          , I. Shilin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tsalapati</surname>
          </string-name>
          ,
          <article-title>The SciQA scientific question answering benchmark for scholarly knowledge</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>7240</fpage>
          . URL: https://doi.org/10.1038/s41598-023-33607-z. doi:10.1038/s41598-023-33607-z.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Awale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <article-title>DBLP-QuAD: A question answering dataset over the DBLP scholarly knowledge graph</article-title>
          , in: I. Frommholz,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mayr</surname>
          </string-name>
          , G. Cabanac,
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          , J. Brennan (Eds.),
          <source>Proceedings of the 13th International Workshop on Bibliometric-enhanced Information Retrieval co-located with the 45th European Conference on Information Retrieval (ECIR 2023)</source>
          , Dublin, Ireland,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] D. Dessì, F. Osborne, D. R. Recupero, D. Buscaldi, E. Motta, SCICERO: A deep learning and NLP approach for generating scientific knowledge graphs in the computer science domain, Knowledge-Based Systems 258 (2022). URL: https://oro.open.ac.uk/85472/. doi:10.1016/j.knosys.2022.109945.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] M. Nayyeri, G. M. Cil, S. Vahdati, F. Osborne, M. Rahman, S. Angioni, A. Salatino, D. R. Recupero, N. Vassilyeva, E. Motta, et al., Trans4E: Link prediction on scholarly knowledge graphs, Neurocomputing 461 (2021) 530-542.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] A. Borrego, D. Dessì, I. Hernández, F. Osborne, D. R. Recupero, D. Ruiz, D. Buscaldi, E. Motta, Completing scientific facts in knowledge graphs of research concepts, IEEE Access 10 (2022) 125867-125880.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] A. Cadeddu, A. Chessa, V. De Leo, G. Fenu, E. Motta, F. Osborne, D. R. Recupero, A. Salatino, L. Secchi, A comparative analysis of knowledge injection strategies for large language models in the scholarly domain, Engineering Applications of Artificial Intelligence 133 (2024) 108166.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] D. Buscaldi, D. Dessì, E. Motta, M. Murgia, F. Osborne, D. R. Recupero, Citation prediction by leveraging transformers and natural language processing heuristics, Information Processing &amp; Management 61 (2024) 103583.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>