<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Study on Schema Coverage and Query Content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maryam Mohammadi</string-name>
          <email>m.mohammadi@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michel Dumontier</string-name>
          <email>michel.dumontier@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>RDF Knowledge Graph</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SPARQL Query Logs</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SPARQL Schema Coverage</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Data Science, Maastricht University</institution>
          ,
          <addr-line>Paul-Henri Spaaklaan 1, 6229 GT, Maastricht</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>26</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>The advent of Knowledge Graphs (KGs) has revolutionized knowledge representation, enabling enhanced understanding, reasoning, and interpretation of complex data for both humans and machines. As a crucial tool for addressing real-world challenges, KGs rely heavily on SPARQL queries for data access and manipulation. Despite their extensive use, the alignment of these queries with the underlying KG schema is not well understood. This paper introduces 'SPARQL schema coverage' as a novel measure to assess the extent to which SPARQL queries reflect the KGs' content and structure. Using Bio2RDF SPARQL logs as a case study, the paper reveals a SPARQL schema coverage of 98%, demonstrating a strong alignment between user queries and the KG schema, thereby highlighting high user engagement. This finding is significant for KG engineers in reshaping ontology and for triple store administrators in enhancing performance through targeted caching. The study addresses key research questions regarding the nature and extent of KG elements referred to in user queries and their coverage of the available data. This approach not only provides a new perspective on KG utilization but also aids in optimizing KG design and application, offering valuable insights for the future development and optimization of knowledge graphs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The advent of Knowledge Graphs (KGs) has revolutionized knowledge representation, enhancing the
understanding, reasoning, and interpretation of complex data for both humans and machines [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As
the primary tool for a myriad of real-world challenges, KGs have necessitated sophisticated tools
for data access and manipulation, with SPARQL queries playing a pivotal role [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, despite
the extensive use of SPARQL queries, our understanding of their alignment with the underlying
KG schema remains limited. This paper seeks to bridge this gap by introducing a novel measure of
’SPARQL schema coverage’ to assess the extent to which SPARQL queries reflect the content and
structure of the KGs.
      </p>
      <p>
        Previous studies have predominantly focused on the syntactical aspects of SPARQL queries or
their general structure [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ], neglecting the rich insights that can be gleaned from their content.
For instance, Asprino et al. (2023) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] emphasized categorizing SPARQL queries into templates to
identify common usage patterns, while Bielefeldt et al. (2018) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] differentiated between organic and
robotic queries in Wikidata, uncovering distinct patterns in human and automated querying behaviors.
These studies, while insightful, stop short of examining how comprehensively these queries cover
the schema of the targeted KGs.
      </p>
      <p>This paper aims to fill this research gap. We propose a comprehensive analysis method to determine
how well user queries in SPARQL logs align with the KG’s schema. The study includes a case study
that measures the schema coverage of Bio2RDF SPARQL logs. This insight is crucial for KG engineers and
triple store administrators, guiding them in optimizing KG performance and reshaping ontology
models. Our research addresses two fundamental questions:</p>
      <p>1) To what extent do user queries in SPARQL logs refer to specific elements of the knowledge
graph? 2) How comprehensive is this coverage in terms of the diversity of data available in the
knowledge graph?</p>
      <p>By answering these questions, this study not only contributes to the technical understanding
of KG utilization but also provides practical insights for optimizing KG systems for enhanced user
engagement and efficiency.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Previous work by Asprino et al. (2023) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced a content-centric method on LSQ logs. Their
method, termed ’query log summarization’, categorizes SPARQL queries into common templates
based on their content. In their experiments, they examined the ten most executed templates for
each log. This examination led to the identification of common usage patterns and the prevalence of
queries originating from a single code source. Moreover, their study explored template relationships,
uncovering evidence that related templates often participate in a common process, frequently executed
by a similar set of hosts. Additionally, the authors observed relationships between different queries
applied to data across various logs, indicating a systematic, automated approach to data querying.
      </p>
      <p>
        Bielefeldt et al. (2018) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduced the concept of organic and robotic SPARQL queries as a
fundamental principle for query log analysis. They proposed a method to partition a dataset of over
200 million SPARQL queries of Wikidata into organic and robotic queries. They defined robotic
queries as those generated by software tools or bot-like agents, while organic queries originated from
browser-like agents. By distinguishing between these two types of queries, the authors identified
clear patterns in human usage, while observing that the robotic component displayed high volatility
and unpredictability, even over extended time periods. In their experiments, they showed that organic
and robotic traffic are significantly different in many respects. For instance, from examining the most
frequent Wikidata properties used in queries as annotations on statements and the most frequent
Wikidata properties of statements whose complex form occurs in queries, they conclude that complex
statements are a larger fraction in organic queries. Another finding from their experiments suggests
that few users from Asia are accessing Wikidata via SPARQL-based applications. Although these
studies focus on content-centric analysis of queries, to the best of our knowledge to date, there are
no existing studies that show whether SPARQL query logs provide comprehensive coverage of the
schema of the target KG, or if certain sections of the target KG remain largely unexplored.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>This section outlines the workflow of our method, illustrated in Figure 1. The workflow comprises
four main steps. The first step involves gathering all schema elements from the KG. Next, we clean
the SPARQL logs in preparation for the third step, which entails extracting the schema elements used
within these logs. The final step involves calculating the SPARQL schema coverage, as detailed in
Equation 1, to assess how well the queries in the logs represent the KG’s schema. Details of each step
are described below.</p>
      <p>In Step 1, we retrieve the Bio2RDF schema elements by executing a series of queries against
the KG, as depicted in Figure 2. We initiate this process with Query 1, which isolates Bio2RDF-specific
schema elements, deliberately excluding other domain datasets linked to Bio2RDF, such as
“http://www.openlinksw.com/schemas/virtrdf#”. Following this, Query 2 is employed to extract unique
classes from the Bio2RDF graphs. Notably, our approach ensures that only those classes with at least
one instance are returned, omitting unused classes like “http://bio2rdf.org/hgnc_vocabulary:Status”.
Finally, Query 3 is executed to retrieve unique schema predicates.</p>
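      <p>The three retrieval queries of Step 1 can be approximated as follows. This is a minimal sketch and not the queries shown in Figure 2: the query texts, the function names, and the example graph IRI used in the test are assumptions made for illustration. Only the "bio2rdf.dataset" substring filter and the requirement that a class has at least one instance come from the paper.</p>
      <preformat><![CDATA[
```python
# Hypothetical stand-ins for Query 1, Query 2 and Query 3 (not the authors' text).

# Query 1 (assumed form): list every named graph that holds at least one triple.
GRAPHS_QUERY = "SELECT DISTINCT ?graph WHERE { GRAPH ?graph { ?s ?p ?o } }"


def bio2rdf_graphs(graph_iris):
    """Keep only graphs whose IRI contains 'bio2rdf.dataset'."""
    return [iri for iri in graph_iris if "bio2rdf.dataset" in iri]


def classes_query(graph_iri):
    """Query 2 (assumed form): classes with at least one instance in a graph.

    Matching '?instance a ?class' only returns classes that are actually
    instantiated, so declared-but-unused classes are omitted.
    """
    return (
        "SELECT DISTINCT ?class WHERE { "
        f"GRAPH <{graph_iri}> {{ ?instance a ?class }} }}"
    )


def predicates_query(graph_iri):
    """Query 3 (assumed form): distinct predicates used in a graph."""
    return (
        "SELECT DISTINCT ?predicate WHERE { "
        f"GRAPH <{graph_iri}> {{ ?s ?predicate ?o }} }}"
    )
```
]]></preformat>
      <p>In such a setup, the graph list returned by the endpoint would be passed through bio2rdf_graphs first, and the two per-graph queries would then be issued for each remaining graph.</p>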
      <p>In Step 2, we downloaded the Bio2RDF SPARQL query logs for the period 2019-2021, which contain
3,880,939 queries and are available at https://download.dumontierlab.com/bio2rdf/logs/, and cleaned
them. We applied the following processes to remove repetitive and invalid queries: elimination of
redundant HTTP parameters, variable standardization, and prefix addition. The output of
this preprocessing is a set of unique, normalized, valid queries.</p>
      <p>To eliminate redundant HTTP parameters, we stripped non-essential HTTP
parameters from the queries to resolve parsing issues. We then standardized variable names to ensure
uniformity. This step is crucial because, in the next phase of our method, we need to distinguish
variables from actual entities; substituting variable names with placeholders of the form ’varX’ makes
this distinction explicit. The standardization is shown in Figure 3. For example, three different
variable names (’?drug’, ’?compound’, and ’?s1’) were all changed to ’?var1’.</p>
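      <p>The parameter stripping and variable standardization described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the authors' repository: it assumes each log entry is a request URL whose SPARQL text is carried in a "query" parameter, and the function names are hypothetical. The regular expression is a heuristic and would also rewrite a "?name" sequence that happens to occur inside a string literal or an IRI.</p>
      <preformat><![CDATA[
```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches SPARQL variables written as ?name or $name (heuristic).
VARIABLE = re.compile(r"[?$]([A-Za-z_][A-Za-z0-9_]*)")


def extract_query(request_url):
    """Return the decoded 'query' parameter of a logged request URL, or None.

    All other HTTP parameters (format, default-graph-uri, ...) are dropped.
    """
    params = parse_qs(urlsplit(request_url).query)
    values = params.get("query")
    return values[0].strip() if values else None


def standardize_variables(query):
    """Rename variables to ?var1, ?var2, ... in order of first appearance."""
    names = {}

    def rename(match):
        name = match.group(1)
        if name not in names:
            names[name] = f"var{len(names) + 1}"
        return "?" + names[name]

    return VARIABLE.sub(rename, query)


def unique_normalized(request_urls):
    """Extract, normalize and de-duplicate queries, keeping first-seen order."""
    seen = {}
    for url in request_urls:
        query = extract_query(url)
        if query is None:
            continue  # entry carries no SPARQL text
        # Collapse whitespace (an assumption of this sketch), then rename variables.
        normalized = standardize_variables(" ".join(query.split()))
        seen.setdefault(normalized, None)
    return list(seen)
```
]]></preformat>
      <p>With this normalization, two queries that differ only in their variable names collapse to the same string, which is what allows repetitive queries to be removed.</p>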
      <p>
        For the Prefix Addition step, essential RDF syntax prefixes (such as rdf, rdfs, owl, etc.) were
incorporated to enhance the syntactical validity of the queries. This addition was necessary because
some queries lacked the required prefix declarations. In the context of the Virtuoso endpoint, users
are not obliged to explicitly include these prefixes in their queries, as it contains pre-registered prefix
declarations. However, our parser, ’sparqljs’ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a SPARQL 1.1 parser written in JavaScript,
did not have this feature. Therefore, we manually added these prefixes to ensure proper parsing
of the SPARQL queries.
      </p>
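      <p>A minimal sketch of the prefix addition step is given below. The prefix table is limited to four standard W3C namespaces, and the function name and the regular-expression heuristic for detecting prefix usage are assumptions of this sketch; the set of prefixes actually added by the authors is not listed in the paper.</p>
      <preformat><![CDATA[
```python
import re

# Standard W3C namespaces; the authors' full prefix list is not given in the paper.
COMMON_PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def add_missing_prefixes(query, prefixes=COMMON_PREFIXES):
    """Prepend PREFIX declarations for known prefixes used but not declared."""
    declared = set(
        re.findall(r"PREFIX\s+([A-Za-z][\w.-]*)\s*:", query, flags=re.IGNORECASE)
    )
    missing = [
        label
        for label in prefixes
        if label not in declared
        # Heuristic: the label followed by ':' and not preceded by a word
        # character, ':' or '/', so 'rdf:' does not match inside 'xrdf:'.
        and re.search(r"(?<![\w:/])" + re.escape(label) + r":", query)
    ]
    header = "".join(f"PREFIX {label}: <{prefixes[label]}>\n" for label in missing)
    return header + query
```
]]></preformat>
      <p>Queries that already declare every prefix they use are returned unchanged, so the step is safe to apply to the whole log.</p>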
      <p>In Step 3, triple patterns are extracted from the queries, and unique schema elements are then
isolated from these triples. To identify schema classes, the type of URLs or literals in subject and object
positions is determined using Query 4 and Query 5 as depicted in Figure 4. For schema predicates,
Query 5 is executed to check their validity within the Bio2RDF schema. The entities that start with
’var’ (indicating a Variable node resulting from the variable standardization step), or ’g_’ (indicating a
blank node as per sparqljs parser) are disregarded.</p>
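      <p>The filtering rule at the end of Step 3 can be sketched as follows. The sketch assumes that each triple pattern has already been reduced to a tuple of three strings, which is a simplification of the parser output; the function names and the example IRIs in the test are illustrative. Checking the collected terms against the KG (Query 4 and Query 5) is not covered here.</p>
      <preformat><![CDATA[
```python
def is_placeholder(term):
    """True for standardized variables ('var...') and blank nodes ('g_...')."""
    return term.startswith("var") or term.startswith("g_")


def collect_terms(triples):
    """Split the concrete terms of triple patterns into entities and predicates.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    Variables and blank nodes are disregarded, as described in Step 3.
    """
    entities, predicates = set(), set()
    for subject, predicate, obj in triples:
        for term in (subject, obj):
            if not is_placeholder(term):
                entities.add(term)
        if not is_placeholder(predicate):
            predicates.add(predicate)
    return entities, predicates
```
]]></preformat>
      <p>The resulting entity set would then be typed against the KG to obtain the used classes, and the predicate set would be checked against the schema predicates gathered in Step 1.</p>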
      <p>In Step 4, we calculate SPARQL schema coverage (SC) using the results from Steps 1 and 3, applying the
formula outlined in Equation 1. This equation involves the count of all distinct classes and predicates
in the Bio2RDF dataset (Total Schema Elements, or TSE) and the count of all distinct classes and
predicates used in user SPARQL query logs (Used Schema Elements, or USE). The code for this method
is available on our GitHub repository: https://github.com/marmhm/SPARQL_queries.</p>
      <p>$$\mathrm{SC}\,(\%) = \frac{\mathrm{USE}}{\mathrm{TSE}} \times 100 \tag{1}$$</p>
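      <p>Computed over sets of schema elements, the coverage measure can be sketched as below. The function name is hypothetical, and intersecting the used elements with the schema before counting is an assumption of this sketch, made so that terms outside the schema cannot inflate the result.</p>
      <preformat><![CDATA[
```python
def schema_coverage(used_elements, total_elements):
    """Return SC (%) = USE / TSE * 100 for two collections of schema IRIs.

    USE counts only the used elements that also belong to the schema.
    """
    total = set(total_elements)
    if not total:
        raise ValueError("the schema must contain at least one element")
    used = set(used_elements) & total
    return 100.0 * len(used) / len(total)
```
]]></preformat>
      <p>For a toy schema of four elements of which two appear in the logs, the function returns 50.0.</p>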
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we describe the outcomes from the different steps of our method. Query 1 (Figure 2)
returned 44 results, of which only 26 were used, as these are graphs within the Bio2RDF domain.
This was determined by filtering out results that do not have “bio2rdf.dataset” in their graph URL.
In the preprocessing step, from an initial dataset of 3,880,939 queries, we derived 633,453 unique
and valid queries. It is important to note that the addition of prefixes allowed for the parsing of an
additional 8,176 queries.</p>
      <p>The variable standardization process further refined the dataset to 545,046 queries. The details of
these findings are summarized in Figure 5. Our analysis extracted 2,022,827 triples from these queries,
which included 30,918 unique entities (URLs and literals), 506 unique entity types, and 1,100 unique
predicates. The 506 entity types and 1,100 predicates together give 1,606 used schema elements:</p>
      <p>$$\mathrm{SC}\,(\%) = \frac{\mathrm{USE}}{\mathrm{TSE}} \times 100 = \frac{1606}{1635} \times 100 \approx 98\% \tag{2}$$</p>
      <p>This assessment quantifies the extent of Bio2RDF schema utilization, indicating a substantial
coverage of approximately 98%.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>
        This study sheds light on the alignment between SPARQL query logs and the schema of Knowledge
Graphs (KGs), improving our understanding of how well queries cover that schema. This work reveals that
the SPARQL queries make references to nearly the entirety of the actual RDF schema. This result
is surprising because we expected that only some parts of the RDF graph would be of interest to
users. Historically, there are two main reasons for the inclusion of datasets: (1) to support query
answering / data mining use cases that spanned a subset of elements across multiple datasets, and (2)
the opportunistic inclusion of new or popular datasets as biological linked data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For future work,
we plan to undertake two main initiatives: Firstly, we aim to incorporate additional datasets, like
the LSQ, to broaden our analysis. Secondly, we intend to segment the dataset used in this paper into
shorter periods, such as six or three-month intervals, to investigate how schema coverage fluctuates
over these timescales. These steps are expected to deepen our understanding of schema usage patterns
and improve the methodology’s applicability and effectiveness in diverse real-world scenarios.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgment</title>
      <p>This project has received funding from the European Union’s Horizon 2020 research and innovation
program under the Marie Skłodowska-Curie grant agreement No 860801.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Abu-Salih</surname>
          </string-name>
          ,
          <article-title>Domain-specific knowledge graphs: A survey</article-title>
          ,
          <source>Journal of Network and Computer Applications</source>
          <volume>185</volume>
          (
          <year>2021</year>
          )
          <fpage>103076</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>A survey of rdf stores &amp; sparql engines for querying knowledge graphs</article-title>
          ,
          <source>The VLDB Journal</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Timm</surname>
          </string-name>
          ,
          <article-title>An analytical study of large sparql query logs</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>655</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buil-Aranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ugarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <article-title>A preliminary investigation into sparql query complexity and federation in bio2rdf</article-title>
          ,
          <source>in: Alberto Mendelzon International Workshop on Foundations of Data Management</source>
          ,
          <year>2015</year>
          , p.
          <fpage>196</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mehmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>Lsq: the linked sparql queries dataset</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2015: 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II 14</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>261</fpage>
          -
          <lpage>269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mehmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buil-Aranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <article-title>Lsq 2.0: A linked dataset of sparql query logs</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Asprino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceriani</surname>
          </string-name>
          ,
          <article-title>How is your knowledge graph used: Content-centric analysis of sparql query logs</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2023</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bielefeldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonsior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Practical linked data access via sparql: The case of wikidata</article-title>
          ,
          <source>in: LDOW@ WWW</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] SPARQL.js,
          <article-title>SPARQL.js: a SPARQL 1.1 parser for JavaScript</article-title>
          ,
          <year>2023</year>
          . URL: https://www.npmjs.com/package/sparqljs.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Callahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cruz-Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ansell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <article-title>Bio2rdf release 2: improved coverage, interoperability and provenance of life science linked data</article-title>
          ,
          <source>in: The Semantic Web: Semantics and Big Data: 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings 10</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>200</fpage>
          -
          <lpage>212</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>