<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discoverability of SPARQL Endpoints in Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Hertling</string-name>
          <email>hertling@ke.tu-darmstadt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universitat Darmstadt Knowledge Engineering Group</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mannheim, Germany, Research Group Data and Web Science</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Accessing Linked Open Data sources with query languages such as SPARQL provides more flexible possibilities than access based on dereferenceable URIs only. However, discovering a SPARQL endpoint on the fly, given a URI, is not trivial. This paper provides a quantitative analysis of the automatic discoverability of SPARQL endpoints using different mechanisms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Strategies for Discovering SPARQL Endpoints</p>
      <p>We examine two basic strategies for discovering SPARQL endpoints from a URI:
trying to retrieve VoID descriptions, and leveraging external catalogs of datasets.</p>
      <sec id="sec-1-1">
        <title>1 http://datahub.io/ 2 http://dsi.lod-cloud.net/</title>
        <sec id="sec-1-1-1">
          <title>Retrieving VoID Descriptions</title>
          <p>
            The VoID specification [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] recommends using the RFC 5785 standard [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] for
publishing discoverable VoID descriptions of a dataset. This means that a URI of the
form http://hostname/.well-known/void is to be used for publishing VoID
descriptions. Although the specification states that the /.well-known/void path
segment should be located at the root level, i.e., directly follow the host name
part of the URI, our experiments have shown that it is sometimes located at
deeper locations. Thus, we use the following approach for trying to discover
VoID descriptions:
          </p>
          <p>Given a URI, remove the portion after the last slash (/), and append
.well-known/void. If no VoID description is found at that location,
and there are path segments left after the host name, remove the last
remaining path segment and repeat.</p>
          <p>
            For example, given the URI http://www.example.org/data/xyz, we would
try the following URLs for retrieving a VoID description, using the VoID [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] and
Provenance [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] vocabularies:
1. http://www.example.org/data/.well-known/void and
2. http://www.example.org/.well-known/void,
assuming that the first URL does not return a VoID description.
          </p>
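          <p>As an illustration, the lookup order described above can be sketched in Python as follows. This is a minimal sketch under our own assumptions, not the SEnF implementation: the function name candidate_void_urls is hypothetical, and the sketch only generates the candidate URLs without actually retrieving them.</p>
          <preformat><![CDATA[
```python
from urllib.parse import urlsplit, urlunsplit


def candidate_void_urls(uri):
    """Return the .well-known/void URLs to try, deepest location first."""
    parts = urlsplit(uri)
    # Drop everything after the last slash of the path.
    path = parts.path.rsplit("/", 1)[0] if "/" in parts.path else ""
    segments = [s for s in path.split("/") if s]
    candidates = []
    while True:
        prefix = "/" + "/".join(segments) + "/" if segments else "/"
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, prefix + ".well-known/void", "", ""))
        )
        if not segments:
            break  # the host root has been reached
        segments.pop()  # move one level up towards the host root
    return candidates


for url in candidate_void_urls("http://www.example.org/data/xyz"):
    print(url)
```
]]></preformat>
          <p>For the example URI, the sketch yields the two URLs listed above, in the same order. A client would request them one after another and stop at the first one that returns a VoID description.</p>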
          <p>As a second strategy for retrieving VoID descriptions, we retrieve the RDF
dataset from the (dereferenceable) sample URI, and look for one of the following
axioms:
1. ?x void:inDataset ?d
2. ?x prv:containedBy ?d<sup>3</sup></p>
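          <p>The two patterns can be matched with a simple filter over the retrieved triples. The following Python sketch assumes that the RDF document has already been parsed into (subject, predicate, object) tuples of plain strings; the function name and the sample triples are hypothetical, and the namespace IRIs are the ones we assume for the void: and prv: prefixes.</p>
          <preformat><![CDATA[
```python
# Assumed namespace IRIs for the void: and prv: prefixes.
VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"
PRV_CONTAINED_BY = "http://purl.org/net/provenance/ns#containedBy"


def linked_dataset_descriptions(triples):
    """Return the objects ?d of all triples matching ?x void:inDataset ?d
    or ?x prv:containedBy ?d, in order of first appearance."""
    found = []
    for _subject, predicate, obj in triples:
        # The subject is deliberately not compared with the sample URI.
        if predicate in (VOID_IN_DATASET, PRV_CONTAINED_BY) and obj not in found:
            found.append(obj)
    return found


# Hypothetical triples, as they might be parsed from a dereferenced URI.
triples = [
    ("http://www.example.org/data/xyz",
     "http://www.w3.org/2000/01/rdf-schema#label", "xyz"),
    ("http://www.example.org/data/xyz",
     VOID_IN_DATASET, "http://www.example.org/void#dataset"),
]
print(linked_dataset_descriptions(triples))
```
]]></preformat>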
          <p>
            Although, in the literature, means other than VoID descriptions have been
proposed to link data to SPARQL endpoints [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], we do not expect them to be
too widely spread, since they are not backed by a standardization document.
          </p>
        </sec>
        <sec id="sec-1-1-2">
          <title>Leveraging External Catalogs</title>
          <p>Catalogs of datasets, such as datahub, list datasets as well as their metadata,
including SPARQL endpoints, if applicable. For our prototype, we use the datahub
catalog. Similar searches could be issued against any catalog of Linked Open Data.</p>
          <p>We have implemented all those strategies in the SEnF (SPARQL Endpoint
Finder) service, a simple web service which can be used to retrieve SPARQL
endpoints for a URI.<sup>4</sup></p>
          <p>3 We do not demand that ?x is connected to &lt;URI&gt;, e.g., by an rdfs:isDefinedBy
statement, in order to make this approach as versatile as possible, and since we assume
that a VoID description linked from a dataset will in most cases be the description
of that dataset, and not of another one.</p>
          <p>4 http://tinyurl.com/sparqlsenf</p>
          <p>We have tested the approaches discussed above on a random sample of 10,000
subjects in the 2012 Billion Triple Challenge dataset,<sup>5</sup> which we deem a
representative sample of Linked Open Data in the wild. Out of those 10,000 URIs, 8,893
were dereferenceable.</p>
          <p>For each endpoint retrieved by any strategy, we have checked the correctness
of the result by issuing a query of the form ASK {&lt;URI&gt; ?r ?x} to the endpoint,
and consider the returned endpoint a valid result if the query returns
TRUE.</p>
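          <p>The validity check can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the endpoint URL is a placeholder, and we assume that the endpoint accepts queries via HTTP GET with a query parameter and that its URL does not already contain a query string. The sketch builds the request but does not send it.</p>
          <preformat><![CDATA[
```python
from urllib.parse import urlencode


def build_ask_query(uri):
    """Build the check query ASK { <URI> ?r ?x } for the given resource URI."""
    # Reject characters that must not occur inside an IRI reference in SPARQL.
    if any(c in uri for c in '<>"{}|^`\\ '):
        raise ValueError("not usable as an IRI in SPARQL: %r" % uri)
    return "ASK { <%s> ?r ?x }" % uri


def build_request_url(endpoint, uri):
    """Build a GET request URL that sends the ASK query to the endpoint."""
    return endpoint + "?" + urlencode({"query": build_ask_query(uri)})


print(build_ask_query("http://www.example.org/data/xyz"))
print(build_request_url("http://www.example.org/sparql",
                        "http://www.example.org/data/xyz"))
```
]]></preformat>
          <p>An endpoint would then count as valid for the URI if the response to this request carries the boolean result true.</p>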
          <p>Table 1 shows the results of our evaluation. The first observation is that in
many cases and by most strategies, more than one endpoint is returned, which
shows that there is some redundancy in terms of SPARQL endpoints (i.e., more
than one endpoint may contain information on a resource).</p>
          <p>The main observation is that using external catalogs clearly outperforms
other methods in terms of coverage, being able to locate endpoints for 74% of all
URIs. However, in only 14% of the cases, at least one of the retrieved endpoints<sup>6</sup>
was online during our experiment<sup>7</sup> and actually contained data about the resource
in question, which also demonstrates the limitations of the approach.</p>
          <p>The approaches using VoID and the provenance vocabulary are still not
adopted on a large scale; thus, the coverage of those approaches is much lower.
On the other hand, the results found by following /.well-known/void are much
more precise than those delivered by catalogs, showing a precision of 0.48 (in
contrast to 0.19 for the catalog-based approach). The approach looking for direct
links to VoID descriptions provided information on endpoints in nine cases;
however, the SPARQL query for checking the validity of the endpoint failed in
all of them because the original URI was redirected, and the redirect URI,
which pointed to the dataset, not the resource, was not found in the endpoint.</p>
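          <p>The precision reported for the catalog-based approach can be approximately reproduced from the two percentages given above. The following sketch assumes that precision is the share of URIs with at least one valid endpoint among the URIs for which any endpoint was returned; this definition and the function name are our assumptions, and the inputs are the rounded percentages from the text.</p>
          <preformat><![CDATA[
```python
def precision(valid_share, retrieved_share):
    """Share of URIs with a valid endpoint among URIs with any endpoint."""
    return valid_share / retrieved_share


# Catalog-based approach: endpoints located for 74% of all URIs,
# at least one valid endpoint for 14% of all URIs.
catalog_precision = precision(0.14, 0.74)
print(round(catalog_precision, 2))
```
]]></preformat>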
          <p>Furthermore, we can observe that there is a deviation between the standard
specification for providing VoID descriptions (i.e., providing them at the server's
root directory), and the actual deployment (in some cases, they are located at
deeper levels). This may hint at a practical problem with implementing the
standard, i.e., hosting data sets on servers for which the authority providing the
data set does not have root access rights.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>5 http://km.aifb.kit.edu/projects/btc-2012/</title>
        <p>6 In some cases, more than one endpoint is retrieved.</p>
        <p>7 Carried out between July 22nd and July 23rd, 2013.</p>
        <p>It is further remarkable that there is no URI in our sample for which
every strategy retrieved an endpoint. This shows that there is a need to use multiple
strategies in parallel, as our implementation of the SEnF service does.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Conclusion</title>
      <p>The capability of locating SPARQL endpoints for a given URI has been stated
as a desired property of Linked Open Data. In this paper, we have evaluated
several strategies for performing that URI-to-endpoint resolution, based on a
large random sample of the Billion Triple Challenge Dataset.</p>
      <p>Approaches using proposed methods such as VoID and the provenance
vocabulary are scarcely in use (and sometimes not implemented according to the
specification); they lead to a valid SPARQL endpoint in less than 1% of all
cases. That finding means that catalogs are essential for discovering SPARQL
endpoints, at least in the short and medium term. However, although
performing better than the approaches mentioned before, catalogs do not yet provide
information of sufficient quality either.</p>
      <p>Overall, we were not able to locate suitable SPARQL endpoints in most of
the cases: for more than 85% of all URIs, no SPARQL endpoint could be found.
The reasons may be two-fold: (i) it is not possible to discover the endpoints with
the methods described in this paper, or (ii) no such endpoints exist. While the
latter is likely in many cases (e.g., for single FOAF documents at websites,
or blogging software that publishes RDF(a), but does not provide a SPARQL
endpoint), it is beyond the scope of this paper (if not completely infeasible due
to the open world assumption) to make a statement about the actual availability
of SPARQL endpoints for Linked Open Data URIs.</p>
      <p>Our evaluation has furthermore shown that no single strategy outperforms all
other strategies. Thus, for practical purposes, using multi-strategy approaches
such as the SEnF service is the most suitable way for discovering endpoints. Since
the SEnF service follows a modular architecture, new catalogs and/or resolution
strategies may be plugged in as they become available and/or standardized.</p>
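      <p>A multi-strategy lookup of this kind can be sketched in Python as follows. The sketch is not the SEnF implementation: it assumes that each strategy is a plain function mapping a URI to a list of endpoint URLs, and all function names as well as the example endpoint are hypothetical stand-ins.</p>
      <preformat><![CDATA[
```python
def find_endpoints(uri, strategies):
    """Run every strategy on the URI and merge the endpoints they return."""
    endpoints = []
    for strategy in strategies:
        try:
            results = strategy(uri)
        except Exception:
            # A failing strategy (e.g., a timeout) must not block the others.
            continue
        for endpoint in results:
            if endpoint not in endpoints:
                endpoints.append(endpoint)
    return endpoints


# Hypothetical stand-ins for real resolution strategies.
def well_known_void(uri):
    return []


def catalog_lookup(uri):
    return ["http://www.example.org/sparql"]


def unreachable_catalog(uri):
    raise RuntimeError("timeout")


strategies = [well_known_void, unreachable_catalog, catalog_lookup]
print(find_endpoints("http://www.example.org/data/xyz", strategies))
```
]]></preformat>
      <p>Plugging in a new catalog or resolution strategy then amounts to appending another function to the list of strategies.</p>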
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Keith</given-names>
            <surname>Alexander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Describing Linked Datasets with the VoID Vocabulary</article-title>
          . http://www.w3.org/TR/void/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data</article-title>
          . http://www.w3.org/DesignIssues/LinkedData.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Hartig</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Provenance Vocabulary Core Ontology Specification</article-title>
          . http://trdf.sourceforge.net/provenance/ns.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Kjetil</given-names>
            <surname>Kjernsmo</surname>
          </string-name>
          .
          <article-title>The necessity of hypermedia RDF and an approach to achieve it</article-title>
          .
          <source>In Proceedings of the First Linked APIs Workshop</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Mark</given-names>
            <surname>Nottingham</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eran</given-names>
            <surname>Hammer-Lahav</surname>
          </string-name>
          .
          <article-title>RFC 5785: Defining Well-Known Uniform Resource Identifiers (URIs)</article-title>
          . http://tools.ietf.org/html/rfc5785.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>