<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Highway to Queryable Linked Data: Self-Describing Web APIs with Varying Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laurens De Vocht</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miel Vander Sande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Van Herwegen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Mannens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rik Van de Walle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Lab - Ghent University - iMinds</institution>
          ,
          <addr-line>Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Making Linked Data queryable on the Web is not an easy task for publishers, for technical and logistical reasons. Can they afford to offer a SPARQL endpoint, or should they offer an API or data dump instead? And what technical knowledge is needed for that? This demo presents a user-friendly pipeline to compose APIs for Linked Datasets, consisting of a customizable set of reusable features, e.g., Triple Pattern Fragments, substring search, and membership metadata. These APIs indicate their supported features in hypermedia responses, so that clients can discover which server-provided functionality they understand, and divide the evaluation of SPARQL queries accordingly between client and server. That way, publishers can determine the complexity of the resulting API, and thus the maximal set of server tasks. This demo shows how publishers can easily set up an API with this pipeline, and demonstrates the client-side execution of federated SPARQL queries against such APIs.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>self-descriptive APIs</kwd>
        <kwd>querying</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Querying live Linked Data on the Web used to be a story of two extremes, with Linked
Data documents and the SPARQL protocol on opposite sides of the spectrum. Despite
the advancements in SPARQL query techniques, current numbers<sup>1</sup> indicate a rather low
presence of queryable Linked Data. The limitations of Linked Data traversal and the
limited availability of existing public SPARQL endpoints call for exploring other
client–server trade-offs. Recently, Linked Data Fragments [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] were introduced to analyze such
trade-offs by proposing a uniform view on all interfaces to RDF data. This view reveals a
complete spectrum between Linked Data documents and SPARQL endpoints to publish
Linked Data. Each possible interface in this spectrum sends its own kind of responses,
which are characterized by three parts:
– Data, determined by which selectors a server allows (e.g., SPARQL query, triple pattern);
– Metadata, extra triples about the data (e.g., statistics, provenance);
– Controls, guidance on how to access the data (e.g., forms, links).
<sup>1</sup> http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state
      </p>
      <p>
        For example, the Triple Pattern Fragments (TPF) interface [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] responds to a client
request for a triple pattern with a paged document containing matching triples (data),
the estimated total number of matches (metadata), and a form to retrieve other TPFs
(controls). Complex SPARQL queries can be evaluated on the client side by requesting
multiple TPFs, and using the data, metadata, and controls inside the TPF responses sent
by the server. A higher query execution time and bandwidth usage compared to SPARQL
endpoints are accepted in exchange for a minimal server load. This interface thereby
strikes a more sustainable load balance between clients and servers.
      </p>
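      <p>The three parts of such a response can be illustrated with a small in-memory sketch. It is not the actual TPF server: the function names, the dictionary layout, the exact (rather than estimated) count, and the plain next-page number standing in for hypermedia controls are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
# None plays the role of a variable in the triple pattern.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(triple: Triple, pattern: Pattern) -> bool:
    """A triple matches when every bound pattern position equals the triple's term."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def fragment(dataset: List[Triple], pattern: Pattern,
             page: int = 1, page_size: int = 2) -> Dict[str, object]:
    """Build one page of a TPF-like response (hypothetical, simplified shape)."""
    matching = [t for t in dataset if matches(t, pattern)]
    start = (page - 1) * page_size
    has_next = start + page_size < len(matching)
    return {
        # Data: the matching triples on this page.
        "data": matching[start:start + page_size],
        # Metadata: a real server may return an estimate; this toy counts exactly.
        "metadata": {"total_matches": len(matching)},
        # Controls: stand-in for the form and paging links of a real response.
        "controls": {"next_page": page + 1 if has_next else None},
    }


# Hypothetical sample data with prefixed names for brevity.
DATASET: List[Triple] = [
    ("ex:picasso", "rdf:type", "ex:Artist"),
    ("ex:braque", "rdf:type", "ex:Artist"),
    ("ex:gris", "rdf:type", "ex:Artist"),
    ("ex:picasso", "ex:influenced", "ex:braque"),
]

first_page = fragment(DATASET, (None, "rdf:type", "ex:Artist"))
```
]]></preformat>
      <p>A client would read the count in the metadata to decide which triple pattern of a query to request first, and follow the controls to fetch further pages only when it needs them.</p>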
      <p>By defining new features for such interfaces that vary along the data and metadata
dimensions, we can realize new balances of client–server trade-offs. Allowing more
complex data selectors makes individual requests more expensive for the server, but the
client might be able to evaluate SPARQL queries faster and with less bandwidth. Additional
metadata in responses might enable clients to plan query executions more efficiently, at
the cost of a server-side preparation step. Setting up such interfaces, however, is currently
a difficult task for Linked Data publishers, and it is not straightforward for them to decide
which features they want to enable or disable, especially if features change over time.</p>
      <p>
        To this end, this demonstration extends two of our ISWC 2015 research track papers [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]
by proposing a pipeline which can i) publish a dataset as a queryable Linked Data API
quickly and easily, and ii) customize the mix of supported features on demand. We will
demonstrate the applicability of this pipeline through an in-browser federated SPARQL client
that dynamically discovers supported features through self-describing controls in APIs
created by this pipeline. Dynamic feature support is exemplified by the following features:
– a data feature for substring search from the corresponding research track paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ];
– a metadata feature for approximate membership functions (e.g., Bloom filters)
from the corresponding research track paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ];
– a metadata feature for dataset summaries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The self-descriptiveness allows clients to dynamically discover and use features if
they are offered by a server, and to ignore metadata or controls they do not recognize.
This means that feature-rich clients can consume single-feature APIs, while single-feature
clients can also consume feature-rich APIs, regardless of which features are supported
on either side. This contrasts with hard-coded client–server contracts, in which servers
sometimes unilaterally decide not to implement a certain part of a specification.</p>
      <p>In the remainder of this paper, we first present the publication pipeline in more detail
(Section 2). Then, we explain how clients use the self-descriptive features to execute
SPARQL queries (Section 3), and conclude with the demonstration setup (Section 4).</p>
    </sec>
    <sec id="sec-2">
      <title>Queryable Linked Data API feature pipeline</title>
      <p>
        Many data dumps on the Web have quality issues, and a significant portion of them
contain RDF syntax errors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Quality issues can be a source of frustration, as they require
restarting data ingestion when errors are found. The first pipeline step is therefore to
repair data automatically to the extent possible. We achieve this using a component of
the LOD Laundromat [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that takes care of refining a dataset and removing invalid or bad
triples. The Laundromat guarantees that data conforms to a specified set of best practices,
thereby greatly improving the chance of data actually being (re)used.
      </p>
      <p>Once data has been cleaned, it needs to be converted into one or more (indexed) formats
that support the given features. The pipeline chooses formats based on performance and
server impact, and can combine multiple formats if needed to satisfy a certain feature
combination. For instance, triple-pattern searches might be supported by one format,
whereas full-text searches are covered by a dedicated index. Below, we discuss some of
the available features.</p>
      <p>
        Triple Pattern Fragments. The TPF interface allows clients to decompose basic graph
patterns of SPARQL queries into more elementary triple patterns, providing metadata
for planning. This interface could be realized by a triple store, but because triple stores
are designed for more complex patterns, they often involve too much overhead for such
simple patterns. Therefore, the pipeline will likely opt for the compressed HDT format [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which requires
fewer server resources to look up triple patterns. The pipeline covers the entire HDT
generation process, which currently consists of manual steps and choices that are rather
difficult for data publishers.
      </p>
      <p>
        Substring search. There are multiple ways to support substring search on the server.
Either the pipeline generates an HDT file with a dedicated literal index (an FM-index),
or an Elasticsearch index is populated [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This choice can be influenced by the
desired performance and the availability of an Elasticsearch instance on the server. The
cumbersome process of setting up either option is handled by the pipeline and can be
switched on demand. When activated, this feature introduces a new kind of supported
request, allowing clients to request a list of all literal objects that contain a given
substring. This can greatly speed up queries that use textual features such as
regular expressions, at the cost of more expensive individual requests for the server.
      </p>
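      <p>What such a request returns can be sketched as follows. The linear scan below only stands in for the FM-index or Elasticsearch lookup, and the function names, the N-Triples-like literal encoding, and the case-insensitive matching are assumptions for illustration, not the behavior specified for the feature.</p>
      <preformat><![CDATA[
```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def is_literal(term: str) -> bool:
    # In this toy encoding, literals start with a double quote (N-Triples style).
    return term.startswith('"')


def lexical_form(literal: str) -> str:
    """Return the text between the opening quote and the last closing quote."""
    return literal[1:literal.rindex('"')]


def literals_containing(triples: Iterable[Triple], substring: str) -> List[str]:
    """List the distinct literal objects whose text contains the substring.

    Assumption: matching is case-insensitive. A dedicated index would replace
    this linear scan in a real deployment.
    """
    needle = substring.lower()
    found = {
        obj for (_, _, obj) in triples
        if is_literal(obj) and needle in lexical_form(obj).lower()
    }
    return sorted(found)


# Hypothetical sample data with prefixed names for brevity.
TRIPLES: List[Triple] = [
    ("ex:picasso", "rdfs:label", '"Pablo Picasso"@en'),
    ("ex:braque", "rdfs:label", '"Georges Braque"'),
    ("ex:picasso", "ex:influenced", "ex:braque"),
]

hits = literals_containing(TRIPLES, "pic")
```
]]></preformat>
      <p>A client can bind the returned literals into the remaining triple patterns, instead of downloading every literal and filtering locally.</p>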
      <p>
        Membership metadata. Membership metadata provides a client with the possibility of
locally checking whether certain triple patterns are present on the server [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This allows
for smarter decision making during query execution, since the membership of many
triples can be verified without actually having to contact the server, thereby reducing the
number of required HTTP calls. For example, servers can send approximate membership
metadata with Bloom filters or Golomb codes to compactly indicate whether a triple
pattern has potential matches [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Another possibility is to use dataset summaries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which
contain the URI authorities of subjects and objects, partitioned by predicate. The trade-off is
that metadata needs to be prepared on the server side and that individual responses can
become larger. The pipeline can provide these features by pre-generating summaries.
This requires extra storage space and processing time, so the pipeline lets a Linked Data
publisher decide whether this is acceptable and/or desired.
      </p>
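      <p>The idea behind approximate membership metadata can be sketched with a minimal Bloom filter. The filter size, the number of hash functions, the use of SHA-256, and the way a triple is turned into a key are assumptions chosen for readability; they are not the parameters or encoding used by the actual feature.</p>
      <preformat><![CDATA[
```python
import hashlib
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]


class BloomFilter:
    """Bit array with several hash positions per item; no false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def triple_key(triple: Triple) -> str:
    # Hypothetical serialization of a triple into a single filter key.
    return " ".join(triple)


def summarize(triples: Iterable[Triple]) -> BloomFilter:
    """Server-side preparation step: add every triple to a fresh filter."""
    bloom = BloomFilter()
    for triple in triples:
        bloom.add(triple_key(triple))
    return bloom
```
]]></preformat>
      <p>A client holding such a filter can skip the HTTP request for any triple where <monospace>might_contain</monospace> returns false, and only needs to contact the server to confirm possible matches.</p>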
    </sec>
    <sec id="sec-3">
      <title>Self-descriptive API publication and consumption</title>
      <p>The premise of the feature-based publication approach is that data publishers decide
which features they offer on the server. Therefore, clients need a means to discover which
features a server supports in order to split a SPARQL query appropriately into HTTP requests.
Our multi-feature federated SPARQL client takes as parameters a SPARQL query and a list
of URLs to RDF interfaces. To evaluate the SPARQL query, it i) requests each of the interface
URLs through HTTP GET; ii) looks inside the responses for control triples that describe
the features of the interface; iii) splits the SPARQL query based on the features of the
interface that both the client and the server support. This highlights the necessity of
explicit feature description on the server side, which the pipeline automatically sets up.</p>
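      <p>Steps ii) and iii) amount to intersecting the features a server advertises with those the client implements. The sketch below assumes the responses have already been fetched and parsed into triples, and it uses a made-up <monospace>http://example.org/features#</monospace> vocabulary purely for illustration; the actual interfaces describe their features with hypermedia controls.</p>
      <preformat><![CDATA[
```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary, used for illustration only.
FEATURE_NS = "http://example.org/features#"
SUPPORTS = FEATURE_NS + "supports"
TPF = FEATURE_NS + "TriplePatternFragments"
SUBSTRING = FEATURE_NS + "SubstringSearch"
MEMBERSHIP = FEATURE_NS + "MembershipMetadata"

# Features this particular (hypothetical) client knows how to use.
CLIENT_FEATURES: FrozenSet[str] = frozenset({TPF, SUBSTRING})


def advertised_features(interface_url: str,
                        control_triples: Iterable[Triple]) -> Set[str]:
    """Collect the features that the response declares for this interface."""
    return {o for (s, p, o) in control_triples
            if s == interface_url and p == SUPPORTS}


def usable_features(interface_url: str,
                    control_triples: Iterable[Triple],
                    client_features: FrozenSet[str] = CLIENT_FEATURES) -> Set[str]:
    """Keep only features both sides support; unknown ones are ignored."""
    return advertised_features(interface_url, control_triples) & client_features


def plan(responses: Dict[str, Iterable[Triple]]) -> Dict[str, Set[str]]:
    """Map each interface URL to the features the client may rely on there."""
    return {url: usable_features(url, triples)
            for url, triples in responses.items()}
```
]]></preformat>
      <p>The resulting per-interface feature sets tell the client which parts of a query it can delegate to each server and which parts it has to evaluate itself.</p>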
      <p>
        As is the case for the TPF interface, each of the features is described in API responses
through the Hydra Core Vocabulary [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This vocabulary can be regarded as the RDF
equivalent of HTML’s &lt;a&gt; and &lt;form&gt; tags, which similarly inform human users of
the “features” a website supports. In contrast to implicit static contracts, such in-band
hypermedia controls dynamically inform clients about how to use the interface [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For
example, when an interface supports substring search through Elasticsearch, but the index
is rebuilding, it can make substring search temporarily unavailable. Also, publishers can
switch off certain features at peak moments, or try new features if they have spare server
resources. In either case, clients can adapt by interpreting the controls in the response.
With the same mechanism, simpler clients can safely ignore unrecognized features.
      </p>
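      <p>How a client might act on such a control can be sketched as follows. The dictionary mirrors the Hydra notions of a search template with variable-to-property mappings, but its layout, the <monospace>http://example.org/dataset</monospace> URL, and the helper function are assumptions for illustration. The expansion handles only a single form-style <monospace>{?a,b,c}</monospace> expression rather than full URI Template syntax.</p>
      <preformat><![CDATA[
```python
import re
from typing import Dict
from urllib.parse import quote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical, already-parsed form of a search control found in a response.
SEARCH_CONTROL = {
    "template": "http://example.org/dataset{?subject,predicate,object}",
    "mapping": [
        {"variable": "subject", "property": RDF + "subject"},
        {"variable": "predicate", "property": RDF + "predicate"},
        {"variable": "object", "property": RDF + "object"},
    ],
}


def expand(control: dict, values: Dict[str, str]) -> str:
    """Fill the template with values keyed by property IRI.

    Properties the control does not mention are skipped, so a client never
    sends parameters that the server did not announce.
    """
    variable_of = {m["property"]: m["variable"] for m in control["mapping"]}
    bound = {variable_of[prop]: value
             for prop, value in values.items() if prop in variable_of}
    template = control["template"]
    match = re.search(r"\{\?([^}]*)\}", template)
    if match is None:
        return template
    # Keep the variable order of the template and percent-encode each value.
    pairs = [name + "=" + quote(bound[name], safe="")
             for name in match.group(1).split(",") if name in bound]
    query = "?" + "&".join(pairs) if pairs else ""
    return template[:match.start()] + query + template[match.end():]


url = expand(SEARCH_CONTROL, {RDF + "predicate": "http://xmlns.com/foaf/0.1/name"})
```
]]></preformat>
      <p>Because the request URL is derived from the control in each response, a feature that a server switches off simply stops appearing, and the client falls back to the controls that remain.</p>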
    </sec>
    <sec id="sec-4">
      <title>Demonstration</title>
      <p>The demonstration at ISWC 2015 will let people set up a custom API through a
user-friendly pipeline that allows them to select the features they want to offer. Visitors will
be able to bring their dataset, which will then be automatically transformed and loaded
into a dedicated server during their visit to the booth.</p>
      <p>Afterwards, they can execute federated SPARQL queries on their own and others’ datasets
using the in-browser client. This multi-feature federated client is available online<sup>2</sup> with
example SPARQL queries. For instance, the federated query at http://bit.ly/cubist-works is
evaluated using DBpedia and VIAF interfaces set up by the pipeline. We will show how the
client can use these new features to improve SPARQL queries, and how it handles multiple
endpoints with heterogeneous feature sets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Beek</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rietveld</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bazoobandi</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wielemaker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>LOD Laundromat: a uniform way of publishing other people's dirty data</article-title>
          . In:
          <source>ISWC</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fernández</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Prieto</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutiérrez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arias</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Binary RDF representation for publication and exchange (HDT)</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>19</volume>
          ,
          <fpage>22</fpage>
          -
          <lpage>41</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kjernsmo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The necessity of hypermedia RDF and an approach to achieve it</article-title>
          . In:
          <source>Proceedings of the Workshop on Linked APIs for the Semantic Web</source>
          (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lanthaler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gütl</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Hydra: A vocabulary for hypermedia-driven Web APIs</article-title>
          . In:
          <source>Proceedings of the 6th Workshop on Linked Data on the Web</source>
          (May
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngonga Ngomo</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>HiBISCuS: Hypergraph-based source selection for SPARQL endpoint federation</article-title>
          . In:
          <source>ESWC</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Van Herwegen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Vocht</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Substring filtering for low-cost Linked Data interfaces</article-title>
          . In:
          <source>ISWC</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Herwegen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Opportunistic Linked Data querying through approximate membership metadata</article-title>
          . In:
          <source>ISWC</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haesendonck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Vocht</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Querying datasets on the Web with high availability</article-title>
          . In:
          <source>13th International Semantic Web Conference</source>
          (Oct
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coppens</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Web-scale querying through Linked Data Fragments</article-title>
          . In:
          <source>Linked Data on the Web</source>
          (Apr
          <year>2014</year>
          )
          <sup>2</sup> http://client.linkeddatafragments.org/ (source code: http://github.com/LinkedDataFragments/)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>