<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>To SCRY Linked Data: Extending SPARQL the Easy Way</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bas Stringer</string-name>
          <email>b.stringer@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Albert Meroño-Peñuela</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonis Loizou</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanne Abeln</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaap Heringa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Integrative Bioinformatics, VU University Amsterdam</institution>
          ,
          <addr-line>NL</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Archiving and Networked Services</institution>
          ,
          <addr-line>KNAW, NL</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Knowledge Representation and Reasoning Group, VU University Amsterdam</institution>
          ,
          <addr-line>NL</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Scientific communities are increasingly publishing datasets on the Web following the Linked Data principles, storing RDF graphs in triplestores and making them available for querying through SPARQL. However, solving domain-specific problems often relies on information that cannot be included in such triplestores. For example, it is virtually impossible to foresee, and precompute, all statistical tests users will want to run on these datasets, especially if data from external triplestores is involved. A straightforward solution is to query the triplestore with SPARQL and compute the required information post hoc. However, post-hoc scripting is laborious and typically not reusable, and the computed information is not accessible within the original query. Other solutions allow this computation to happen at query time, as with SPARQL Extensible Value Testing (EVT) and Linked Data APIs. However, such approaches can be difficult to apply, due to limited interoperability and poor extensibility. In this paper we present SCRY, the SPARQL compatible service layer: a lightweight SPARQL endpoint that interprets parts of basic graph patterns as calls to user-defined services. SCRY allows users to incorporate algorithms of arbitrary complexity within standards-compliant SPARQL queries, and to use the generated outputs directly within these same queries. Unlike traditional SPARQL endpoints, the RDF graph against which SCRY resolves its queries is generated at query time, by executing services encoded in the basic graph patterns. SCRY's federation-oriented design allows for easy integration with existing SPARQL endpoints, effectively extending their functionality in a decoupled, tool-independent way and allowing the power of Semantic Web technology to be more easily applied to domain-specific problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
The Semantic Web continues to grow, reaching an increasing number of
scientific communities [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This is driven in part by the adoption of Linked Data
principles and convergent practices by these communities [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which publish a
great variety of linked scientific datasets in the Linked Open Data (LOD) cloud.
This cloud currently contains over 600K RDF dumps (37B triples), ready to be
queried through 640 SPARQL endpoints [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The diversity of available Linked Data is matched by the diversity of its
consumers and their needs. For example, statisticians may want to exclude outliers
from their analysis, or filter results based on the p-value of some statistical test;
geographers typically need to select coordinates which fall within a certain area
or distance from another point; and bioinformaticians often use shared
evolutionary ancestry to transfer information between entities.</p>
      <p>These and many other cases rely on information which is either impossible
or impractical to materialize in triplestores beforehand. Whether or not an
observation should be treated as an outlier depends on how one defines outliers,
and the observations it is being compared with. One could precompute all
pairwise distances between coordinates, but this scales quadratically with the
number of entries and precludes queries spanning multiple datasets.
Bioinformaticians use many different methods to predict evolutionary relatedness between
biomolecules, and interpreting their results is highly context-dependent. More
generally, solving domain-specific problems typically requires domain-specific
tools and algorithms, whose outputs cannot always be sensibly precomputed.
Thus, querying such information requires it to be derived at query time.</p>
      <p>
        Several approaches enabling the generation of new data and relations at query
time already exist:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>
            The SPARQL query language includes built-in functions for basic arithmetic
and string handling, and widely supported extensions are available for the
most common forms of data processing, such as datatype-aware handling of
literals annotated with XML Schema [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. However, such general extensions
cannot accommodate the diverse set of domain-specific algorithms and
procedures required by many users.
          </p>
        </list-item>
        <list-item>
          <p>
            SPARQL 1.1 [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] allows the definition of customizable procedures attached to
a specific URI via Extensible Value Testing (EVT), which is currently
supported by several triplestore vendors. However, EVT has some fundamental
limitations: custom procedures may only appear in a limited set of query
environments (e.g. BIND(), FILTER()), and queries incorporating them are not
interoperable between endpoints.
          </p>
        </list-item>
        <list-item>
          <p>
            Linked Data APIs [
            <xref ref-type="bibr" rid="ref1 ref6">1,6</xref>
            ] offer access to Linked Data in Web standard formats
[
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] without requiring users to have extensive knowledge of RDF or SPARQL.
They offer access to custom procedures through user-friendly interfaces,
accessing Linked Data under the hood. Such APIs enable functional extension
of Linked Data queries in a more flexible way than EVT, but greatly restrict
interoperability with other Linked Data sources and the type of information
that can be retrieved.
          </p>
        </list-item>
        <list-item>
          <p>
            Several SPARQL endpoints, e.g. Virtuoso, Jena and Stardog, allow expert users
to define custom functions under the hood. Although very powerful,
these features typically have a steep learning curve and, like EVT, are not
interoperable with other endpoints.
          </p>
        </list-item>
      </list>
      <p>Each of these approaches varies in terms of flexibility, interoperability, ease
of implementation and user-friendliness. We argue many scientific communities
would benefit from a combination of SPARQL’s flexible, efficient manner of
querying RDF data, and user-friendly access to easily customized procedures
which generate RDF data at query time.</p>
      <p>In this paper we present SCRY, the SPARQL compatible service layer. SCRY
is a lightweight SPARQL endpoint that allows users to define their own services,
assign them to a URI, and incorporate them in standards-compliant SPARQL
queries. These services take RDF data as input and return RDF data as
output, allowing users to generate and incorporate relevant information at query
time. SCRY leverages SPARQL’s query federation protocol to maintain
interoperability with other SPARQL endpoints. Essentially, this embeds API-like
functionality into pure SPARQL queries, in a standards-compliant format.
</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Definition</title>
      <p>Domain-specific questions often require domain-specific solutions, particularly
with regard to information and relations which must be derived at query time
because they are impractical to precompute. Currently available approaches
facilitating this are limited in terms of flexibility, interoperability, user-friendliness,
ease of implementation, or a combination thereof. We propose to address these
issues by executing services at query time, generating requested data on demand.
Consider the following query:</p>
      <sec id="sec-2-1">
        <title>Example Query</title>
        <p>SELECT * WHERE { ?array stats:mean ?mean ; stats:sd ?sd . }</p>
        <p>If treated as a standard graph pattern, this query would only return arrays
which have their mean and standard deviation materialized in the triplestore.
However, if interpreted as service calls, the query engine could execute matching
statistical procedures for stats:mean and stats:sd, and return bindings with a
mean and standard deviation generated at query time.</p>
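<p>As an illustration of the computation such services would perform, stats:mean and stats:sd could be backed by procedures along the following lines (plain Python, not SCRY's actual service API; the choice of population standard deviation is an assumption):</p>
<preformat>
```python
# Illustrative sketch of the procedures a stats:mean / stats:sd service
# might execute at query time. Function names are hypothetical.
import statistics

def stats_mean(values):
    # Arithmetic mean, the value bound to ?mean in the example query.
    return statistics.mean(values)

def stats_sd(values):
    # Population standard deviation, the value bound to ?sd.
    return statistics.pstdev(values)

array = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(stats_mean(array))  # 5.0
print(stats_sd(array))    # 2.0
```
</preformat>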
        <p>Derived values like means and standard deviations are impractical to
materialize statically in a dataset, but there are many scenarios where making them
query-accessible is useful, if not essential, to answer domain-specific questions.
Thus, the problem we address in this paper is to access Linked Data in a
manner which (1) combines SPARQL’s flexibility and efficiency with the functional
extension provided by Linked Data APIs or under-the-hood endpoint
customisation, (2) coexists and integrates with extant SPARQL tools and endpoints by
complying with current standards and (3) is easy for users to extend with their
own domain-specific services.
</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>SCRY</title>
      <p>Our SPARQL compatible service layer (SCRY) acts as a lightweight SPARQL
endpoint, granting users access to easily customized services during query
execution. SCRY allows services and their inputs to be encoded by URIs in the
basic graph patterns of SPARQL queries. Users must configure an instance of
SCRY, which we will hereafter refer to as an orb, with a set of services and
associated URIs. Whenever a SCRY orb is queried, it searches for these URIs in
the query’s graph patterns and executes the associated services, prior to
resolving the query itself. Upon execution, services generate RDF data, populating an
RDF graph against which the original query will be resolved. Thus, what sets
SCRY apart from traditional endpoints is that it resolves queries against RDF
data generated at query time, rather than against a persistent RDF graph.</p>
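<p>The dispatch idea can be sketched as follows: scan the query's basic graph pattern for predicates under the orb's service namespace, execute the matching procedures, and use their outputs as the bindings of the query-time graph. The namespace and service table below are illustrative assumptions, not SCRY's internals:</p>
<preformat>
```python
# Toy dispatcher: triple patterns whose predicate names a registered service
# get their object variable bound to the service's output at query time.
# The stats namespace and service registry are illustrative.
import statistics

SERVICES = {
    "http://scry.example/stats/mean": statistics.mean,
    "http://scry.example/stats/sd": statistics.pstdev,
}

def resolve(bgp, array):
    bindings = {}
    for subj, pred, obj in bgp:
        if pred in SERVICES:
            # Execute the service and bind its result to the object variable.
            bindings[obj] = SERVICES[pred](array)
    return bindings

bgp = [("?array", "http://scry.example/stats/mean", "?mean"),
       ("?array", "http://scry.example/stats/sd", "?sd")]
print(resolve(bgp, [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
# {'?mean': 5.0, '?sd': 2.0}
```
</preformat>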
      <p>Services accessed through SCRY can involve simple tasks like rounding off a
number, or running complex secondary programs using local or remote resources.
Typical use involves sending a query to any conventional SPARQL endpoint,
which then invokes SCRY through a federated query. Information retrieved from
the primary endpoint’s persistent RDF graph can be used as input for a service
made available through a personalized, locally hosted SCRY orb. The SCRY
orb then generates an RDF graph by executing the encoded services, evaluates
the federated query against said graph, and returns the results to the primary
endpoint (see Figure 1). Use case-driven examples are given below.</p>
      <p>This federation-oriented design is completely compliant with current
standards, allowing SCRY to be used with any federation-capable primary endpoint.
However, it also means SCRY necessarily inherits a susceptibility to network
latency, from the way in which the SPARQL protocol implements query federation.
Using HTTP to push serialized SPARQL queries and RDF data back and forth
is relatively expensive in terms of overhead, which is particularly wasteful if the
computational steps to get from input to output are short and straightforward.</p>
      <p>
        SCRY is implemented in Python, using the RDFLib package [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to interpret
and resolve SPARQL queries, and the Flask microframework [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to handle
federation via HTTP. Services must thus be accessible from Python, either by being
implemented as Python code or via calls to the shell, e.g. with os.system().
SCRY’s source code, including the services demonstrated below, is available at
https://github.com/bas-stringer/scry/.
We have implemented several services to supplement SPARQL’s basic built-in
arithmetic, for example to calculate the standard deviation of an array, and the
Pearson correlation between two arrays5. In SCRY, these can be implemented
in as few as two lines of code each, roughly an order of magnitude fewer lines
than needed in Jena, Virtuoso or Stardog (see Table 1).
      </p>
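<p>As an indication of scale, the Pearson correlation service reduces to a few lines of ordinary Python. The sketch below shows the computation only; SCRY's actual service-registration API is not reproduced here:</p>
<preformat>
```python
# Pearson's r between two equal-length arrays: covariance divided by the
# product of the population standard deviations. Plain-Python stand-in
# for the SCRY statistics service mentioned in the text.
from statistics import mean, pstdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))  # approximately 1.0
```
</preformat>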
      <p>
        <bold>Use Case 1: Statistics.</bold>
        Social historians running the CEDAR project published statistical data of
Dutch historical censuses [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. They can now run queries which include, for
example, the standard deviation of the population counts of 1899 6.
      </p>
      <p>Likewise, the Linked Statistical Data Analysis project7 provides Linked Data
for various metrics, including several precomputed statistics such as Kendall’s
correlation. However, we are interested in Pearson’s r correlation instead.
Querying the raw data through their endpoint and federating it to a SCRY orb allows
us to calculate Pearson’s correlation coefficient between, for example, infant
mortality rate and corruption perception indices in 2009 8.
</p>
      <p>
        <bold>Use Case 2: Bioinformatics.</bold>
        Homology is one of the most important concepts in bioinformatics. It is a term
used to indicate that two entities share evolutionary ancestry, which suggests those
entities have a similar biological function. Thus, knowledge of an entity can
cautiously be inferred from knowledge about its homologs.
      </p>
      <p>
        The Bio2RDF project has compiled one of the largest collections of biological
Linked Data, comprising nearly 12B triples which describe 1.1B unique entities
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. More recently published sources of RDF data, such as neXtProt [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the
Human Protein Atlas (HPA) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], are not yet included therein.
      </p>
      <p>
Given the sheer volume of biological RDF data, making homology a
query-accessible property would have many applications in bioinformatics. To this end,
we have implemented a procedure that runs the BLAST program [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: the most
commonly used method to find homologs, cited nearly 55,000 times to date9.
      </p>
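<p>Wrapping an external program such as BLAST follows the shell-call pattern described in the implementation section. A minimal sketch is shown below; echo stands in for the real command so the example is self-contained, and subprocess is used instead of os.system so the output can be captured:</p>
<preformat>
```python
# Hedged sketch of a service that shells out to an external program.
# SCRY's actual BLAST wrapper differs; `echo` is a runnable stand-in.
import subprocess

def run_tool(args):
    # Execute the command, raise on failure, and return captured stdout,
    # which a service would then convert into RDF terms.
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(run_tool(["echo", "P69905"]))  # P69905
```
</preformat>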
      <p>Footnotes: (5) see http://bit.ly/stats-impl; (6) see query at http://bit.ly/scry-sd;
(7) see http://stats.270a.info/.html; (8) see http://bit.ly/transparency-270a;
(9) citations counted by Google Scholar.</p>
      <p>The Human Protein Atlas lists which proteins are found where in the human
body. This information is exposed as RDF, which we have loaded into a private
primary endpoint. From this endpoint, we can now federate queries to a SCRY
orb to invoke services. Using our BLAST procedure, for example, we can
investigate coexpression: for a given query protein, we ask the primary endpoint in
which tissues it is expressed; we invoke the BLAST service through a federated
query to find the protein's homologs; and we ask the primary endpoint how many
of those homologs are expressed in the same tissues, all within a single SPARQL
query. Running such a query for hemoglobin reveals it is found in 8 different
tissues, and that at least 3 of its homologs are found in each of those tissues (see
Table 2).</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>An ever increasing number of scientific communities are adopting Semantic Web
technology and Linked Data principles. Their many domain-specific problems
require equally many domain-specific solutions. This is especially true when
considering derived information, which is impractical to precompute, and thus must
be generated at query time.</p>
      <p>We present SCRY, an easily customized, lightweight SPARQL endpoint that
facilitates executing user-defined services at query time, making their results
accessible immediately within SPARQL queries. Custom procedures are
implemented with relative ease, whether they perform simple statistical analysis or run
complex secondary programs like BLAST. We find that extending SPARQL in
this novel way is (i) roughly an order of magnitude faster to implement than
extending other SPARQL endpoints, and (ii) compatible with any existing
SPARQL 1.1 compliant endpoint.</p>
      <p>These benefits come at the cost of a dependence on SPARQL’s
implementation of query federation. In particular, network latency can become an issue.
Despite this limitation, SCRY provides a platform through which statistics,
bioinformatics, and a variety of other scientific disciplines can incorporate
domain-specific programs and algorithms within SPARQL queries, better enabling these
diverse communities to harness the power of Semantic Web technologies.</p>
      <p>Many roads are open for the future. First and foremost, we intend to develop
a community-managed service repository, through which users can share and
receive feedback on the services they implement. Furthermore, we plan on
extending this work by implementing: (i) a browser-based query interface, allowing
users to query their SCRY orb directly (i.e. not through federated queries); (ii)
efficiency, security and authorization features, which will make it feasible to host
public SCRY orbs; and (iii) more domain-specific procedures, to further
demonstrate SCRY’s versatility and enable more scientific communities to harness the
power of Semantic Web technologies.</p>
      <p>Acknowledgements. The authors wish to express great gratitude towards Frank van Harmelen
and Paul Groth, for their advice and feedback during the project; resident Python guru Maurits
Dijkstra for his support with development and implementation of the program; and Laurens Rietveld
and Ali Khalili for their valuable comments on this manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Linked Data API</article-title>
          .
          <source>Tech. rep., UK Government Linked Data</source>
          (
          <year>2009</year>
          ), https://github.com/UKGovLD/linked-data-api
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Altschul</surname>
            ,
            <given-names>S.F.</given-names>
          </string-name>
          , et al.:
          <article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
          .
          <source>Nucleic Acids Research</source>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Beek</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , et al.:
          <article-title>LOD Laundromat: A Uniform Way of Publishing Other People's Dirty Data</article-title>
          .
          <source>In: ISWC</source>
          <year>2014</year>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Belleau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.:
          <article-title>Bio2RDF: Towards a mashup to build bioinformatics knowledge systems</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fallside</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walmsley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>XML Schema Part 0: Primer Second Edition</article-title>
          .
          <source>Tech. rep., World Wide Web Consortium</source>
          (
          <year>2004</year>
          ), http://www.w3.org/TR/xmlschema-0/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>API-centric Linked Data integration: The Open PHACTS Discovery Platform case study</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>29</volume>
          (
          <issue>0</issue>
          ),
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2014</year>
          ), http://www.sciencedirect.com/science/article/pii/S1570826814000195, Life Science and e-Science
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SPARQL 1.1 Query Language</article-title>
          .
          <source>Tech. rep., World Wide Web Consortium</source>
          (
          <year>2013</year>
          ), http://www.w3.org/TR/sparql11-query/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Linked Data: Evolving the Web into a Global Data Space</article-title>
          , vol.
          <volume>1</volume>
          :
          <fpage>1</fpage>
          . Morgan and Claypool, 1st edn. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Krech</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>RDFLib Python Library</article-title>
          .
          <source>Tech. rep. (</source>
          <year>2002</year>
          ), https://github.com/RDFLib/rdflib
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lane</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>neXtProt: a knowledge platform for human proteins</article-title>
          .
          <source>Nucleic Acids Research</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schmachtenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Linking Open Data cloud diagram 2014</article-title>
          . http://lod-cloud.net/ (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Meroño-Peñuela</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guéret</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashkpour</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>CEDAR: The Dutch Historical Censuses as Linked Open Data</article-title>
          . Semantic Web - Interoperability, Usability, Applicability (
          <year>2015</year>
          ), under review
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ronacher</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Flask Python micro web application framework</article-title>
          .
          <source>Tech. rep. (</source>
          <year>2010</year>
          ), http://flask.pocoo.org/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sporny</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>JSON-LD: A JSON-based Serialization for Linked Data</article-title>
          .
          <source>Tech. rep., World Wide Web Consortium</source>
          (
          <year>2014</year>
          ), http://www.w3.org/TR/json-ld/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Uhlén</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Tissue-based map of the human proteome</article-title>
          .
          <source>Science</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>