<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Simple API to the KnowledgeStore</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ian Hopkinson</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Maude</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Rospocher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>18 Via Sommarive, 38123 - Trento Povo (TN)</addr-line>
          ,
          <country country="IT">ITALY</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ScraperWiki</institution>
          ,
          <addr-line>ic2, Liverpool Science Park, 146 Brownlow Hill, Liverpool L3 5RF</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>7</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>RDF and SPARQL technologies have not gained widespread acceptance amongst web developers. We describe a simple API which is used to provide access to the KnowledgeStore, a scalable, fault-tolerant, and Semantic Web grounded storage system for interlinking structured and unstructured data, developed in the context of the FP7 NewsReader EU project. The simple API wraps a set of parameterised SPARQL queries to access the KnowledgeStore RDF structured content, and calls to the KnowledgeStore CRUD endpoint to retrieve unstructured resources. Responses are delivered as JSON, JSONP, HTML or CSV. The API is largely self-documenting. The API is written using the Flask Python library, and includes an extensive suite of tests. It is modular, so that new SPARQL queries can be added easily or it could be used as a template for providing an API to any SPARQL endpoint. The simple API was successfully exploited by 38 web developers, many of them unfamiliar with RDF and SPARQL technologies, to build web applications on top of the KnowledgeStore.</p>
      </abstract>
      <kwd-group>
        <kwd>SPARQL</kwd>
        <kwd>Virtuoso</kwd>
        <kwd>Python</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        NewsReader [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is an FP7 project which seeks to provide a “news recorder” which
processes news articles to extract “events” using a combination of natural language
processing and semantic web technologies with the aim of enhancing our ability to
“understand” the news. News articles are marked up using an NLP pipeline. Events are
extracted from the output of the pipeline, and they are coreferenced across multiple
documents.
      </p>
      <p>
        The original articles, marked up articles and RDF event descriptions are stored in the
KnowledgeStore [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ], a scalable, fault-tolerant, and Semantic Web grounded storage
system for interlinking structured and unstructured data. Specifically, the
KnowledgeStore consists of an HBase database, for storing (marked up) articles and metadata, and a
Virtuoso [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] triplestore, containing details of the events and their participants. In addition,
a subset of DBpedia [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is included in the triplestore component to provide background
knowledge.
      </p>
      <p>The KnowledgeStore exposes the triplestore component via a SPARQL endpoint,
while documents and document metadata are made available through a CRUD
(create, retrieve, update, delete) endpoint. Both endpoints are part of the KnowledgeStore
HTTP ReST API. However, SPARQL (and RDF) has not gained widespread acceptance
amongst web developers, and in addition there are issues with managing free-form
SPARQL queries against a large triplestore (around 300M triples). There is therefore a
need for a Simple API to provide access to this resource, which we describe in the
following sections.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        Python is a well-established language for general computing with a rich, accessible
ecosystem of libraries. In this work we use the Flask [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] web serving library which
provides simple routing to help build web applications. Although there are Python
libraries designed to interact with SPARQL and RDF, they do not yet appear to be
stable or well-supported. Therefore we handle the SPARQL queries and responses using
the requests library. A useful feature of the requests [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] library is that caching can be
implemented very easily via the requests-cache library [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This allows us to reduce load on the SPARQL endpoint
and improve responsiveness to user queries.
      </p>
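      <p>The caching idea can be illustrated without any third-party dependency. The sketch below is not the requests-cache library used by the API; it is a minimal in-memory stand-in, with a hypothetical CachedFetcher class and an injected fetch function, that shows how repeated requests for the same URL can be answered without contacting the SPARQL endpoint again until an expiry time has passed.</p>
      <preformat><![CDATA[
```python
import time


class CachedFetcher:
    """Remember responses per URL for a fixed time (illustrative stand-in)."""

    def __init__(self, fetch, expire_after=3600, clock=time.monotonic):
        self._fetch = fetch                # function that really performs the request
        self._expire_after = expire_after  # seconds a cached response stays valid
        self._clock = clock                # injectable so expiry can be tested
        self._store = {}                   # url -> (time stored, response body)

    def get(self, url):
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._expire_after:
            return hit[1]                  # still fresh: the endpoint is not contacted
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body
```
]]></preformat>
      <p>In a sketch like this the expiry time trades freshness against load on the endpoint; the value of one hour is an assumption, not a setting taken from the API.</p>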
      <p>The Simple API is currently online<sup>1</sup> and its code is available<sup>2</sup>, released under
the Apache License.</p>
      <p>The API supports the following queries:
– actors of a type
– describe uri
– event details filtered by actor
– event label frequency count
– get document
– get document metadata
– get mention metadata
– people sharing event with a person
– properties of a type
– property of actors of a type
– summary of events with actor
– summary of events with actor type
– summary of events with event label
– summary of events with framenet
– summary of events with two actors
– types of actors</p>
      <p>These queries can variously take the following parameters:
– callback = a function with which to wrap the JSON response to make JSONP
– datefilter = YYYY, YYYY-MM or YYYY-MM-DD, filter to a year, month or day
– filter = a character string on which to filter, it can take combinations such as
bribery+OR+bribe
– limit = a number of results to return
– offset = an offset into the returned results
– output = json|html|csv
– uris.[n] = a URI to a thing, e.g. dbpedia:David_Beckham</p>
      <p>1 https://newsreader.scraperwiki.com/
2 http://www.bitbucket.org/scraperwikids/newsreader_api_flask_app</p>
      <p>Fig. 1. Example of API response</p>
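      <p>As an illustration of how these parameters might be validated before they reach a query, the following is a minimal sketch. The function name parse_params and the default values for limit, offset and output are assumptions made for the example and are not taken from the API code.</p>
      <preformat><![CDATA[
```python
import re

# YYYY, YYYY-MM or YYYY-MM-DD
DATE_PATTERN = re.compile(r"\d{4}(-\d{2}){0,2}")
# Keys of the form uris.0, uris.1, ...
URI_KEY = re.compile(r"uris\.(\d+)")


def parse_params(args):
    """Validate a flat dict of query-string values (hypothetical helper)."""
    output = args.get("output", "json")
    if output not in ("json", "html", "csv"):
        raise ValueError("output must be json, html or csv")
    datefilter = args.get("datefilter")
    if datefilter is not None and not DATE_PATTERN.fullmatch(datefilter):
        raise ValueError("datefilter must be YYYY, YYYY-MM or YYYY-MM-DD")
    # Collect the numbered URI arguments so they can be put in numeric order.
    numbered = []
    for key, value in args.items():
        match = URI_KEY.fullmatch(key)
        if match:
            numbered.append((int(match.group(1)), value))
    return {
        "output": output,
        "datefilter": datefilter,
        "filter": args.get("filter"),
        "limit": int(args.get("limit", 100)),   # assumed default
        "offset": int(args.get("offset", 0)),   # assumed default
        "uris": [uri for _, uri in sorted(numbered)],
    }
```
]]></preformat>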
      <sec id="sec-2-1">
        <title>Queries look like this:</title>
        <p>– http://newsreader.scraperwiki.com:5000/actors_of_a_type?uris.0=dbo:Person&amp;filter=david
– http://newsreader.scraperwiki.com:5000/summary_of_events_with_framenet?uris.0=framenet:Omen</p>
      </sec>
      <sec id="sec-2-2">
        <title>A typical response is shown in Figure 1.</title>
        <p>ISWC 2014 Developers Workshop. Copyright held by the authors.</p>
        <p>The application contains two modules. A viewer module, based on Flask, contains
routing patterns which handle the URL queries the user can present, converting them
into calls to the queries module. The viewer module also converts the query responses
to the required format and similarly presents any error messages. The queries module
defines a parent SPARQLQuery class which contains functions to process the query
arguments supplied by the viewer module and feed them into SPARQL query templates.
Each SPARQL query is implemented in a separate file which defines a class based on the
parent SPARQLQuery class. The key component of this file is a template for the query
into which parameters are inserted. It also contains fields for a description of the query,
an example query, and the required and optional parameters. This separation of concerns
makes implementing new queries straightforward.</p>
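        <p>The parent-class arrangement described for the queries module could look roughly like the sketch below. Only the class name SPARQLQuery comes from the text; the attribute names, the use of string.Template, the default paging values and the example SPARQL are assumptions chosen to illustrate the pattern rather than the code of the API.</p>
        <preformat><![CDATA[
```python
from string import Template


class SPARQLQuery:
    """Parent class: checks arguments and fills them into a query template."""

    description = ""
    example = ""
    required_parameters = []
    optional_parameters = []
    template = Template("")

    def __init__(self, **params):
        missing = [name for name in self.required_parameters if name not in params]
        if missing:
            raise ValueError("missing parameters: " + ", ".join(missing))
        # Assumed paging defaults, overridden by any supplied values.
        self.params = {"limit": 100, "offset": 0, **params}

    def build(self):
        # safe_substitute leaves unknown placeholders in place instead of raising.
        return self.template.safe_substitute(self.params)


class ActorsOfAType(SPARQLQuery):
    """A query file only needs to supply metadata and a template."""

    description = "Actors of a given type"
    example = "actors_of_a_type?uris.0=dbo:Person"
    required_parameters = ["uri_0"]
    optional_parameters = ["limit", "offset"]
    template = Template(
        "SELECT DISTINCT ?actor WHERE { ?actor a $uri_0 . } "
        "LIMIT $limit OFFSET $offset"
    )
```
]]></preformat>
        <p>Under these assumptions, adding a query amounts to writing one more small subclass, which is the property the separation of concerns is meant to provide.</p>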
        <p>The queries module also defines a CountQuery class and a CRUDQuery class,
derived from the SPARQLQuery class. Each SPARQL query also defines a count query
so that the user can be informed of the number of results returned by a query, and the
results can be paged. The CRUDQuery class is defined to cover the three queries that use
the CRUD endpoint (thus, the Simple API also has the effect of hiding the two separate
KnowledgeStore endpoints behind a common interface).</p>
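        <p>To make the link between count queries and paging concrete, here is a small sketch. In the API each query defines its own count query; the generic wrapper below, which uses a SPARQL 1.1 subquery, and the function names count_query and page_info are assumptions used only for illustration.</p>
        <preformat><![CDATA[
```python
import math


def count_query(select_query):
    """Wrap a SELECT (written without LIMIT/OFFSET) in a COUNT subquery."""
    return "SELECT (COUNT(*) AS ?count) WHERE { { %s } }" % select_query


def page_info(total, limit, offset):
    """Describe where the current page sits within the full result set."""
    return {
        "count": total,
        "page": offset // limit + 1,
        "pages": math.ceil(total / limit) if total else 0,
        "has_next": offset + limit < total,
    }
```
]]></preformat>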
        <p>The filter arguments for queries are implemented in a modular fashion such that they
can be added to suitable SPARQL queries with little effort. If no filter arguments are
supplied, then the statements implementing the filter are not applied.</p>
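        <p>One way to obtain the behaviour described, where a filter contributes nothing to the query when it is absent, is to generate the FILTER statement separately and splice it into the template. The sketch below assumes the filter is matched against a ?label variable using standard SPARQL 1.1 string functions; the API may well use a different matching mechanism, and the function name filter_clause is hypothetical.</p>
        <preformat><![CDATA[
```python
def filter_clause(filter_text, variable="?label"):
    """Return a FILTER statement, or an empty string when no filter is given."""
    if not filter_text:
        return ""
    # Accept both the raw form "bribery+OR+bribe" and the decoded "bribery OR bribe".
    terms = [t.strip().lower() for t in filter_text.replace("+", " ").split(" OR ")]
    tests = []
    for term in terms:
        if term:
            # Escape backslashes and quotes so the term stays inside the literal.
            safe = term.replace("\\", "\\\\").replace('"', '\\"')
            tests.append('CONTAINS(LCASE(STR(%s)), "%s")' % (variable, safe))
    if not tests:
        return ""
    return "FILTER(%s)" % " || ".join(tests)
```
]]></preformat>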
        <p>Visiting the API root provides an HTML documentation page which is built
dynamically from the queries that have been defined. In early development the example queries
were used directly as a manual integration test. The advantage of this approach is that
text required for documentation is largely derived directly from executable code, and
“optional” descriptive text is entered close to where the thing it describes is implemented.</p>
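        <p>A documentation page derived from the query definitions might be generated along the following lines. The API uses the jinja templating library for its HTML; this sketch uses plain string formatting to stay self-contained, and the two query classes and their attribute values are hypothetical stand-ins for real query definitions.</p>
        <preformat><![CDATA[
```python
from html import escape


class DescribeUri:
    description = "Describe a URI"
    example = "describe_uri?uris.0=dbpedia:David_Beckham"
    required_parameters = ["uris.0"]


class TypesOfActors:
    description = "Types of actors"
    example = "types_of_actors"
    required_parameters = []


def render_docs(query_classes):
    """Build one table row per query from the metadata on each class."""
    rows = []
    for cls in query_classes:
        rows.append(
            '<tr><td>%s</td><td><a href="/%s">%s</a></td><td>%s</td></tr>'
            % (
                escape(cls.description),
                escape(cls.example),
                escape(cls.example),
                escape(", ".join(cls.required_parameters)),
            )
        )
    return "<table>\n%s\n</table>" % "\n".join(rows)
```
]]></preformat>
        <p>Because the example query is rendered as a link, each row of such a page doubles as a manual check that the query still runs.</p>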
        <p>During the Hack Day (see Section 2.1) we identified that URIs containing the period
character outside of the prefix were not handled by the SPARQL endpoint, although
expanding prefixes and wrapping in &lt; and &gt; worked for these URIs. Therefore we
implemented prefix expansion.</p>
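        <p>Prefix expansion of this kind can be sketched in a few lines. The two DBpedia namespaces below are the standard ones, but the contents of the prefix table and the function name expand_prefix are assumptions; the API's own table is not reproduced here.</p>
        <preformat><![CDATA[
```python
# Illustrative prefix table; the real API may define more prefixes.
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbpedia": "http://dbpedia.org/resource/",
}


def expand_prefix(uri):
    """Turn prefix:local into a full URI wrapped in angle brackets."""
    if uri.startswith("<") and uri.endswith(">"):
        return uri  # already a full URI
    prefix, separator, local = uri.partition(":")
    if separator and prefix in PREFIXES:
        return "<%s%s>" % (PREFIXES[prefix], local)
    return uri  # unknown prefix: leave the value unchanged
```
]]></preformat>
        <p>With this approach a value such as dbpedia:Example.com, which contains a period outside the prefix, is sent to the endpoint in its expanded, bracketed form.</p>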
        <p>The API was implemented on REST principles, which implies that it holds no state
information, meaning that queries must be delivered “live”, i.e. they cannot ask the
user to come back later for a response. This presents some problems for the Simple API
since the underlying triplestore is large and queries can potentially be slow running. We
mitigated these risks by optimising the available SPARQL queries: we were able to
achieve speed increases of up to a factor of 10 by re-ordering the execution order of SPARQL
statements and using some Virtuoso built-ins (e.g. BIND) rather than generic functions.</p>
        <p>
          The API contains moderate levels of testing which we implemented using the
nosetests [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] library. We used the mock library to emulate error states on the SPARQL
endpoint in order to demonstrate correct handling of these error states.
        </p>
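        <p>The following sketch shows how a mock object can emulate an error state on the endpoint. It uses unittest.mock from the Python standard library; the run_query function, the EndpointError exception and the example URL are hypothetical and stand in for the API's own request-handling code.</p>
        <preformat><![CDATA[
```python
from unittest import mock


class EndpointError(Exception):
    """Raised when the SPARQL endpoint cannot answer (hypothetical name)."""


def run_query(http_get, url, query):
    """Send a query with the supplied HTTP function and return the parsed JSON."""
    response = http_get(url, params={"query": query})
    if response.status_code != 200:
        raise EndpointError("endpoint returned %d" % response.status_code)
    return response.json()


def test_server_error_is_reported():
    # The mock plays the role of an HTTP GET function whose response is a 500.
    fake_get = mock.Mock()
    fake_get.return_value.status_code = 500
    try:
        run_query(fake_get, "http://example.org/sparql", "ASK {}")
    except EndpointError as error:
        assert "500" in str(error)
    else:
        raise AssertionError("expected EndpointError")
    fake_get.assert_called_once_with(
        "http://example.org/sparql", params={"query": "ASK {}"}
    )
```
]]></preformat>
        <p>Passing the HTTP function in as an argument is a design choice made for this sketch; patching the function where it is imported would serve the same purpose.</p>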
        <p>
          The Bootstrap [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] framework is used to style the HTML output of the API, and the
jinja [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] templating library is used to build HTML pages from the SPARQL query
responses. Since the majority of SPARQL queries implemented use the SELECT keyword,
the output is tabular in form. This means that the output of new queries of this form can
easily be inserted into a single template which handles all tabular output. Graph-like
output is more difficult to handle generically and for this reason it is currently only
presented as JSON.
        </p>
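        <p>The reason tabular output can be handled generically is that every SELECT response in the standard SPARQL JSON results format has the same shape: a list of variable names and a list of bindings. The sketch below converts such a response to CSV; the function name select_to_csv is an assumption, and the same loop could feed a single HTML table template.</p>
        <preformat><![CDATA[
```python
import csv
import io


def select_to_csv(result):
    """Convert a SPARQL JSON SELECT result into CSV text."""
    variables = result["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(variables)
    for binding in result["results"]["bindings"]:
        # A variable may be unbound in a given row; emit an empty cell for it.
        writer.writerow([binding.get(name, {}).get("value", "") for name in variables])
    return buffer.getvalue()
```
]]></preformat>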
        <sec id="sec-2-2-1">
          <title>Applications</title>
          <p>The Simple API was used by 38 web developers, many of them unfamiliar with RDF and
SPARQL technologies, for a Hack Day. The SPARQL endpoint received 30,000 queries
during the course of the day, with a peak of 20 queries per second. The SPARQL server
rebooted automatically three times during the day and only one query timed out. This
highlights the protective functionality of the Simple API. Users, in general, only had
access to pre-determined queries which were well understood in terms of their execution
costs and had been optimised upfront. The penalty for this was reduced flexibility, but
for a SPARQL-naive audience this is no penalty. It does mean the onus is on the authors
of the Simple API to provide the right suite of queries.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Further work</title>
          <p>The Simple API is a new piece of work, and we have identified a number of areas for
improvement.</p>
          <p>Currently the API does not output graph responses from SPARQL queries as HTML;
it simply provides the raw query response in JSON. During our Hack Day we observed
that users frequently made describe uri queries; typically for a limited range of entity
types. Therefore there is the potential to provide HTML templates for certain types of
query returning graph responses.</p>
          <p>In principle, we can combine the results of multiple SPARQL or CRUD queries
inside the API, to provide results which could not be returned by a single SPARQL
query.</p>
          <p>During the Hack Day users created a number of applications. These were good
proofs-of-concept, though somewhat unpolished given the time constraints. We would
like to build some sample applications on top of the API.</p>
          <p>We made the decision early on to provide output in paged form, in keeping
with how web APIs typically deliver responses. This is an issue for our implementation of
the Simple API since we use the LIMIT and OFFSET statements to page through results.
Virtuoso has a maximum OFFSET (currently set to 10,000), so for queries
returning a large number of results we cannot page to all the results. The maximum
OFFSET is in place for a reason: performance degrades when handling large offsets.
There are a couple of approaches to dealing with this problem. One would be to return
entire result sets, without using the LIMIT and OFFSET keywords. Alternatively, SQL
databases use cursors to handle this problem and it may be possible to apply this approach
using the Virtuoso triplestore.</p>
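          <p>The effect of the offset cap on paging can be made explicit with a short sketch. The value of 10,000 is the one quoted in the text; the function names, and the assumption that an offset equal to the cap is still accepted, are ours.</p>
          <preformat><![CDATA[
```python
MAX_OFFSET = 10000  # maximum OFFSET quoted in the text


def paging_clause(limit, offset, max_offset=MAX_OFFSET):
    """Build the LIMIT/OFFSET suffix, refusing offsets the server would reject."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    if offset > max_offset:
        raise ValueError("offset %d exceeds the maximum of %d" % (offset, max_offset))
    return "LIMIT %d OFFSET %d" % (limit, offset)


def reachable_results(total, limit, max_offset=MAX_OFFSET):
    """How many of `total` results can be paged to under the offset cap."""
    # The last permitted page starts at max_offset and holds up to `limit` rows.
    return min(total, max_offset + limit)
```
]]></preformat>
          <p>For example, with a page size of 100 a query that matches 50,000 results would expose only the first 10,100 of them under these assumptions.</p>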
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Lessons learned</title>
      <p>
        The core developers for the Simple API were effectively SPARQL-naive at the start
of the process but with a background in SQL. This was remedied by reading Bob Ducharme’s
book Learning SPARQL [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and by consulting with other members of the team. The
biggest challenge was that learners typically learn against toy datasets rather than a
system containing hundreds of millions of triples, such as the KnowledgeStore. This
slows their skill acquisition.
      </p>
      <p>Conceptually, the triple-matching style of SPARQL querying is alien but in some
ways more straightforward than SQL, particularly when carrying out the equivalent of a
join. Using SELECT queries means that responses are returned in familiar table shapes
for which we found generic processing patterns.</p>
      <p>The second challenge for RDF-naive developers is the feeling that they are forever
being pointed somewhere else for an answer to a query. A URI is not an answer; it is a
pointer to an answer. Discovering the rdfs:label and rdfs:comment properties helps resolve
this tension, and should highlight to triplestore developers their importance.</p>
      <p>By the end of the development process the naive members were writing functional
SPARQL queries. The SPARQL-expert members of the team were able to significantly
optimise those queries using a combination of query ordering, a wider knowledge of
SPARQL constructs and better knowledge of the particular triplestore implementation.</p>
      <p>Non-RDF lessons learned:
– It is easy to make everything look better with Bootstrap;
– Testing web applications using mocking is surprisingly straightforward with the
appropriate libraries, and very useful.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have described the development of a Simple API which can be used to access a
complex resource containing both a triplestore accessed via a SPARQL endpoint and a
document collection with a CRUD API. The Simple API has been used in a Hack Day
where it enabled users to access content without the need to know SPARQL. Furthermore,
the API ensured a responsive system by limiting the available queries.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The research leading to this paper was supported by the European Union’s 7th Framework
Programme via the NewsReader Project (ICT-316404).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. NewsReader, http://www.newsreader-project.eu/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Corcoglioniti</surname>
          </string-name>
          , Marco Rospocher, Roldano Cattoni, Bernardo Magnini, Luciano Serafini:
          <article-title>Interlinking Unstructured and Structured Knowledge in an Integrated Framework</article-title>
          .
          <source>7th IEEE International Conference on Semantic Computing (ICSC)</source>
          , Irvine, CA, USA,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. KnowledgeStore, http://knowledgestore.fbk.eu</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Virtuoso, http://virtuoso.openlinksw.com/</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. DBpedia, http://dbpedia.org/</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Flask, http://flask.pocoo.org/</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Requests, http://docs.python-requests.org/en/latest/</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. requests cache, http://requests-cache.readthedocs.org/</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. nosetests, https://nose.readthedocs.org/en/latest/</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Bootstrap, http://getbootstrap.com/</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. jinja, http://jinja.pocoo.org/</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ducharme</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <source>Learning SPARQL</source>
          , O'Reilly, (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>