<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Use cases for triple stores and graph databases in scalable data infrastructures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vasily Bunakov</string-name>
          <email>vasily.bunakov@stfc.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Science and Technology Facilities Council</institution>
          ,
          <addr-line>Harwell, Oxfordshire</addr-line>
          ,
          <country country="GB">United Kingdom</country>
        </aff>
      </contrib-group>
      <fpage>37</fpage>
      <lpage>40</lpage>
      <permissions>
        <copyright-statement>© Vasily Bunakov</copyright-statement>
      </permissions>
      <abstract>
        <p>Various types of NoSQL databases may present a sound alternative to relational databases in eInfrastructures that require managing and analysing data supplied from disparate sources. This work considers a few use cases where particular types of NoSQL databases, namely triple stores and graph databases, may be a natural choice for scalable data services. The purpose of this work is to report briefly on the experiments performed and to provide a roadmap for further technology evaluation. The suggested cases are mapped to EUDAT common services [<xref ref-type="bibr" rid="ref11">11</xref>] yet should be of interest to other eInfrastructures.</p>
      </abstract>
      <conference>
        <conf-name>Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015)</conf-name>
        <conf-loc>Obninsk, Russia</conf-loc>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The EUDAT project (www.eudat.eu) builds common data sharing, data preservation, data discovery and data analysis services for multidisciplinary and cross-border European research communities. The services currently under development or in their pilot phase are:</p>
      <list list-type="bullet">
        <list-item><p>B2SHARE, a data publishing service;</p></list-item>
        <list-item><p>B2SAFE, a secure and reliable data replication service;</p></list-item>
        <list-item><p>B2FIND, a data discovery service;</p></list-item>
        <list-item><p>B2STAGE, a service for the delivery of data to high-performance computation.</p></list-item>
      </list>
      <p>Further services for semantic annotation, provenance and data retrieval are under consideration.</p>
      <p>There are indications that some mainstream platforms chosen for pilots of the aforementioned services during the previous phase of EUDAT, often underpinned by relational databases, may not scale well. There are also cases where a relational back-end requires excessive mapping in order to accommodate metadata structures that fit more naturally into a graph representation.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Use cases explained</title>
      <sec id="sec-2-1">
        <title>2.1 Triple store as a back-end to data catalogue</title>
        <p>
          The pilot version of the B2FIND service relies on the CKAN platform [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which has the advantage of configurability and enjoys the support of a thriving community of platform adopters and software developers.
        </p>
        <p>
          Experiments showed, though, that CKAN is not particularly efficient at data ingest, even after all the tuning recommended by the platform developers, such as switching off database triggers before a bulk data upload. Taking in a few hundred thousand metadata records for the B2FIND data catalogue using the CKAN API could take over one full day [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]; this might present a problem in production environments, which occasionally need to move data, restore it from back-ups, or replicate it.
        </p>
        <p>Of course, it is possible to bypass the CKAN API and ingest data records directly into the relational database that underpins a CKAN instance, yet this solution defeats the initial reason for choosing CKAN as a ready-to-use platform for the data catalogue. Also, although the metadata schema offered by CKAN fits the initial B2FIND requirements, it may be difficult to extend with semantically meaningful links to richer metadata in domain-specific external repositories.</p>
        <p>
          In search of alternatives to CKAN, EUDAT B2FIND looked into the results of the triple store evaluation performed earlier by the Europeana project [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and selected the Jena TDB triple store [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for experiments on scalability.
        </p>
        <p>To populate the Jena TDB instance with data, RDF triples exported from the CKAN B2FIND instance were used; this ensured that the experiments were run on metadata of the same complexity. About 25 thousand unique records, represented as RDF triples, were harvested from the B2FIND CKAN instance and then multiplied by factors of 10, 20, 30 and 40 to simulate an EUDAT catalogue with 250K, 500K, 750K and 1M records respectively (so that some records, up to 40 of them, differed only by their IDs, with all other attributes being the same). The average number of RDF triples per B2FIND record was 33.3, so the test data resulted in corresponding RDF graphs of about 8.5M, 17M, 25.5M and 34M triples.</p>
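        <p>The dataset sizes quoted above follow from simple arithmetic, sketched here in Python. The constant and function names are illustrative, and the base of about 25 thousand is read as a number of records, which is the reading consistent with the 250K to 1M record counts. Multiplying by the quoted average of 33.3 triples per record gives slightly smaller totals than the rounded figures in the text, which correspond to about 34 triples per record.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Figures taken from the text; names are illustrative only.
BASE_RECORDS = 25_000
FACTORS = [10, 20, 30, 40]
AVG_TRIPLES_PER_RECORD = 33.3
REPORTED_TRIPLES = [8_500_000, 17_000_000, 25_500_000, 34_000_000]


def simulated_sizes(base_records, factors, avg_triples):
    """Return (records, estimated_triples) for each multiplication factor."""
    return [
        (base_records * f, round(base_records * f * avg_triples))
        for f in factors
    ]


sizes = simulated_sizes(BASE_RECORDS, FACTORS, AVG_TRIPLES_PER_RECORD)
for (records, estimated), reported in zip(sizes, REPORTED_TRIPLES):
    implied = reported / records  # triples per record implied by the text
    print(f"{records:,} records: {estimated:,} triples estimated, "
          f"{reported:,} reported ({implied:.1f} per record)")
```
]]></preformat>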
        <p>
          The evaluation results obtained [
          <xref ref-type="bibr" rid="ref1 ref2">1-2</xref>
          ] showed higher performance of Jena TDB for the ingest of data records compared to CKAN. Jena TDB search performance was lower than that of CKAN when search requests required ordering of the results; when no ordering was requested, Jena TDB performed on par with CKAN. This suggests that a triple store can be a competitive back-end for faceted search over graphs pre-indexed by certain attributes, but less so when the ordering of search results has to be specified by the user.
        </p>
        <p>Overall, the experiments showed linear scaling of performance for both data ingest and data search in Jena TDB in the range of up to 1 million B2FIND records, which should be enough to satisfy the current needs of B2FIND.</p>
        <p>
          CKAN still has the aforementioned advantage of high configurability, with multiple ready-to-use modules developed by the CKAN user community, so it remains the EUDAT B2FIND engine for the time being. Another B2FIND concern, and a potential disadvantage of a triple store compared to CKAN, was the need for a configurable user interface to the B2FIND back-end, which CKAN provides out of the box. To meet this need for a GUI to the triple store, the Elda platform [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which is a Java implementation of the Linked Data API Specification [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], was installed on top of the Jena TDB instance and tested with EUDAT B2FIND data records. This provided a GUI and the ability to get search results in a variety of popular data formats: RDF Turtle, JSON and XML.
        </p>
        <p>CKAN and triple store-based solutions are likely to co-exist in EUDAT B2FIND. The decision about a full service migration to a triple store back-end will be made later, depending on the volumes of data records acquired and on further evaluation of the usability and performance of triple stores across a few differently configured infrastructure instances, so that the effects of infrastructure quality can be clearly distinguished from the performance of the database engines. Figure 1 presents the current vision of RDF technology in the EUDAT B2FIND technology stack.</p>
        <p>
          Hence the current use case for a triple store in B2FIND is as a supplement to the existing data catalogue, catering for machine agents that use SPARQL endpoints or the LOD API in order to support third-party information services. These services can include data cleansing and enrichment in the spirit of the “five stars” model of Web content quality [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], or mash-ups that mix EUDAT B2FIND triples with those for DBpedia entries and other established Linked Open Data sources.
        </p>
        <p>A further development, and a specific incentive for the adoption of triple stores and associated RDF technology in EUDAT B2FIND, could be the exploration of federated search using remote SPARQL endpoints, as opposed to the currently adopted approach of harvesting data records into a centralized B2FIND data catalogue. SPARQL allows requests to local stores to be combined with requests to remote stores. This logical scalability, as well as the currency of data records retrieved from their source of origin, may prove attractive for certain EUDAT user communities and third-party software developers, even taking into account the communication overheads of sending requests to remote hosts.</p>
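        <p>The combination of local and remote requests mentioned above can be sketched with the SERVICE keyword of SPARQL 1.1 Federated Query. The Python helper below only assembles the query text. The endpoint URLs are hypothetical, and the use of the Dublin Core dct:title property is an assumption about how B2FIND records might be modelled, not a description of the actual B2FIND export.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def build_federated_query(remote_endpoints, limit=50):
    """Build one SPARQL 1.1 query over the local store plus remote endpoints."""
    pattern = "?record dct:title ?title ."
    branches = ["  { " + pattern + " }"]  # matched against the local store
    for url in remote_endpoints:
        # SERVICE delegates the pattern to a remote endpoint; SILENT asks the
        # engine to ignore errors from that endpoint rather than fail the query.
        branches.append(
            "  UNION { SERVICE SILENT <" + url + "> { " + pattern + " } }"
        )
    return "\n".join(
        ["PREFIX dct: <http://purl.org/dc/terms/>",
         "SELECT ?record ?title WHERE {"]
        + branches
        + ["}", "LIMIT " + str(int(limit))]
    )


# Hypothetical community endpoints, for illustration only.
ENDPOINTS = [
    "http://example.org/community-a/sparql",
    "http://example.org/community-b/sparql",
]
query = build_federated_query(ENDPOINTS)
print(query)
```
]]></preformat>
        <p>The resulting text could be submitted to any SPARQL 1.1 endpoint that supports federation; how well this performs against harvesting remains a matter for the evaluation discussed in this section.</p>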
        <p>
          Further experiments planned for B2FIND involve setting up the Neo4j graph database [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] as a back-end to a triple store. The advantage of this would be the use of the same piece of infrastructure (a scalable graph database) for more than one EUDAT service; unlike B2FIND, which is likely to be interested only in the RDF representation, other services may benefit from native graph database methods. Neo4j has been tried out on a relatively small number (a few thousand) of records from B2SHARE with good signs of scalability, so B2SHARE and B2FIND could be good candidates for being backed by one graph database instance. The disadvantage of a graph database with a triple store built upon it could be lower performance compared to a native triple store like Jena TDB. Thorough performance measurement is therefore required, as well as balancing the benefits of a unified infrastructure (where a graph database supports all EUDAT services) against the potentially higher infrastructure requirements of a graph database.
        </p>
        <p>Whether the technology stack can be simplified by replacing CKAN and its relational database with a triple store or a graph database will depend on further technology evaluation, software licensing considerations and the business sustainability model chosen for the EUDAT B2FIND service.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Data provenance and semantic logging</title>
        <p>Another opportunity for the use of triple stores in EUDAT is the B2SAFE service, which requires the collection and management of data provenance records.</p>
        <p>
          There are a few semantic models that can support the data provenance use case in EUDAT: the group of PROV recommendations developed by W3C [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], the Open Provenance Model [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], or CERIF with its semantic representation under way [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], as well as simpler models suitable for the definition of granular research activities, including those related to data movement and transformation [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. All these models can be supported by a triple store back-end. There may be a need to employ graph databases too, as they offer additional means of naming and manipulating graphs, so that a data provenance record can be clearly defined as a persistent chain (graph) of all actions performed over a particular dataset. An example of where a graph database may have an advantage over a triple store for the data provenance use case is the quick extraction of provenance subgraphs and the calculation of their properties with graph-optimized algorithms.
        </p>
        <p>A specific case of data provenance is data movement between EUDAT services. One scenario could be a user placing “long tail” research data in B2SHARE, which then automatically pushes it to B2SAFE for long-term preservation and to B2FIND to get it registered in a common data catalogue. The data can then be retrieved by a request coming through B2SHARE, B2FIND or B2STAGE (for computation); a user or a machine agent that has retrieved the data may be interested in its origins, checksum and other parameters typically associated with the notion of provenance.</p>
        <p>One way to achieve this is to query each of the EUDAT services involved using the dataset PID and to construct a provenance chain on the fly; whether built upon SPARQL endpoints or other kinds of APIs to EUDAT services, this requires sending a request to each of them. Another way is to record the granular actions of data movement between EUDAT services in a log, and then to build data provenance chains (graphs) from the information in the log.</p>
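        <p>The log-based approach can be illustrated with a minimal Python sketch. The log layout (PID, source, target, timestamp), the PIDs and the function names are all hypothetical; a real deployment would keep such records in a triple store or graph database rather than in an in-memory list.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical log of data movements: (pid, source, target, ISO 8601 timestamp).
LOG = [
    ("pid:example/1", "user", "B2SHARE", "2015-01-10T09:00:00Z"),
    ("pid:example/1", "B2SHARE", "B2SAFE", "2015-01-10T09:05:00Z"),
    ("pid:example/1", "B2SHARE", "B2FIND", "2015-01-10T09:06:00Z"),
    ("pid:example/2", "user", "B2SHARE", "2015-01-11T12:00:00Z"),
    ("pid:example/1", "B2SAFE", "B2STAGE", "2015-02-01T08:00:00Z"),
]


def provenance_edges(log, pid):
    """Return time-ordered (source, target, timestamp) edges for one PID."""
    edges = [(s, t, ts) for p, s, t, ts in log if p == pid]
    return sorted(edges, key=lambda edge: edge[2])


def trace_back(log, pid, service):
    """Return the chain of holders through which a dataset reached a service."""
    first_source = {}
    for source, target, _ in provenance_edges(log, pid):
        first_source.setdefault(target, source)  # keep the earliest arrival
    chain = [service]
    # Walk backwards until the origin is reached; stop on a repeated node
    # so that a cyclic log cannot cause an endless loop.
    while chain[-1] in first_source and first_source[chain[-1]] not in chain:
        chain.append(first_source[chain[-1]])
    return chain[::-1]


print(trace_back(LOG, "pid:example/1", "B2STAGE"))
```
]]></preformat>
        <p>The same traversal is what a graph database would perform natively over persistent provenance records, which is the kind of subgraph extraction discussed earlier in this section.</p>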
        <p>In the former case of on-the-fly inquiries about data provenance, a triple store-based engine sending federated requests to multiple EUDAT services will be more appropriate. In the latter case of producing permanent data provenance records, a graph database may be better suited, perhaps with a triple store layer on top in order to harness the power of the aforementioned semantic models based on RDF technology.</p>
        <p>
          Another reason for using graph databases can be the availability of mature open source frameworks for them, such as TinkerPop [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], which can be harnessed for the development of middleware. Even if a graph database does not present a data modeling advantage over a triple store, there may be technological considerations that make the choice of a graph database more reasonable. The technology stack for the hybrid data provenance platform is presented in Figure 2.
        </p>
        <p>
          Data provenance can be seen as a special case of the more general use case of “semantic logging”. Major vendors such as Microsoft have started offering software development frameworks that allow meaningful recording of events within software applications and feed these events into event tracing services at the OS level (ETW, Event Tracing for Windows, in Microsoft's case). The back-end for capturing application-specific events can be a flat file, a relational database, or Azure Table Storage [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>If any kind of machine reasoning over the captured application events is foreseen, then a triple store, or a graph database with a semantic layer on top, can be a more natural choice than the aforementioned back-ends. One of the key factors for the actual adoption of triple stores and graph databases in semantic logging will be their performance in capturing (writing down) application-specific and service-specific events.</p>
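        <p>As an illustration of what capturing an event semantically might involve, the Python sketch below turns one application event into N-Triples lines using terms from the W3C PROV-O vocabulary (prov:Activity, prov:used, prov:wasAssociatedWith, prov:startedAtTime). The event structure, the example.org namespace and the choice of these particular terms are assumptions made for the example, not an EUDAT design.</p>
        <preformat preformat-type="code"><![CDATA[
```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
BASE = "http://example.org/"  # hypothetical namespace, not an EUDAT one


def iri(value):
    """Wrap an absolute IRI in N-Triples angle brackets."""
    return "<" + value + ">"


def event_to_ntriples(event):
    """Translate one application event (a dict) into N-Triples lines."""
    activity = iri(BASE + "event/" + event["id"])
    dataset = iri(BASE + "dataset/" + event["dataset"])
    service = iri(BASE + "service/" + event["service"])
    started = '"' + event["time"] + '"^^' + iri(XSD_DATETIME)
    triples = [
        (activity, iri(RDF_TYPE), iri(PROV + "Activity")),
        (activity, iri(PROV + "used"), dataset),
        (activity, iri(PROV + "wasAssociatedWith"), service),
        (activity, iri(PROV + "startedAtTime"), started),
    ]
    return [" ".join(triple) + " ." for triple in triples]


# Hypothetical event; identifiers are assumed to be IRI-safe strings.
event = {"id": "e42", "dataset": "d7", "service": "B2STAGE",
         "time": "2015-10-13T09:00:00Z"}
lines = event_to_ntriples(event)
print("\n".join(lines))
```
]]></preformat>
        <p>Each event yields only a handful of triples, so the write throughput of the back-end under a stream of such events is the figure that further evaluation would need to measure.</p>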
        <p>An EUDAT candidate service for the adoption of semantic logging is B2STAGE, so that all data supplied for high-performance computation, as well as data resulting from it, come with clear provenance and contextual information for inclusion in the chain or network of events shared with other EUDAT services. Further candidates are third-party services that use other EUDAT common services (B2FIND, B2SHARE) and are willing to share or combine their internal event logs with those generated by EUDAT services.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Data retrieval via PIDs and semantic annotation</title>
        <p>
          Using persistent data identifiers for data citation is becoming commonplace across many research disciplines; there are good services that help researchers with minting data PIDs, e.g. DataCite [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] or CrossRef [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. However, what is called “data” for the purposes of citation is highly specific to a particular research domain or to the actual practices of the data centres that mint PIDs [
          <xref ref-type="bibr" rid="ref4 ref5">4-5</xref>
          ], so the reasonable idea of using data PIDs for automated data retrieval presents a real challenge.
        </p>
        <p>Fig. 3. The place of the data PID semantic annotation service in the PID curation and data retrieval workflow.</p>
        <p>
          One possible response to this challenge is to use semantic annotation in order to describe the context of data PIDs, including the actual protocol for data retrieval [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Figure 3 presents the flow of information entities in a data retrieval service based on the semantic annotation of data PIDs.
        </p>
        <p>A triple store seems a natural choice to support this service, yet a hybrid solution that combines a graph database with a triple store and a semantic reasoner on top (in the spirit of Figure 2) may prove more viable if a persistence layer is wanted for data retrieval protocols (executable workflows) represented as identifiable graphs. Data retrieval via HTTP using data PIDs can be used either by the B2STAGE service or by third-party services built by EUDAT user communities that exploit other EUDAT services (B2FIND, B2SHARE).</p>
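        <p>How an annotation might drive retrieval can be sketched in a few lines of Python. The annotation is held as plain (subject, predicate, object) triples; the PID, the ex: predicates and the URLs are all hypothetical placeholders rather than an existing vocabulary, and the sketch only selects a retrieval step without performing any network request.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical annotation triples describing the context of one data PID.
ANNOTATIONS = [
    ("doi:10.1234/example", "ex:retrievalMethod", "HTTP-GET"),
    ("doi:10.1234/example", "ex:landingPage", "http://example.org/data/landing"),
    ("doi:10.1234/example", "ex:downloadUrl", "http://example.org/data/file.csv"),
    ("doi:10.1234/other", "ex:landingPage", "http://example.org/other/landing"),
]


def describe(pid, triples):
    """Collect the annotation properties of one PID into a dict."""
    return {p: o for s, p, o in triples if s == pid}


def retrieval_plan(pid, triples):
    """Choose a retrieval step for a PID from its annotation, if any."""
    props = describe(pid, triples)
    if props.get("ex:retrievalMethod") == "HTTP-GET" and "ex:downloadUrl" in props:
        return ("GET", props["ex:downloadUrl"])  # direct machine retrieval
    if "ex:landingPage" in props:
        return ("MANUAL", props["ex:landingPage"])  # needs human navigation
    return None  # PID carries no usable annotation


print(retrieval_plan("doi:10.1234/example", ANNOTATIONS))
print(retrieval_plan("doi:10.1234/other", ANNOTATIONS))
```
]]></preformat>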
        <p>Semantic annotation of data PIDs is just one case for semantic annotation in EUDAT. More cases may be suggested by the EUDAT user community or discovered via EUON (European Ontology Network).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Conclusion</title>
      <p>EUDAT has started considering triple stores and graph databases in support of existing and emerging requirements of the EUDAT infrastructure. The major use cases, with the cross-links and generalizations between them outlined earlier, are:</p>
      <list list-type="bullet">
        <list-item><p>Massive migration of data records from B2FIND and B2SHARE data catalogues.</p></list-item>
        <list-item><p>Machine access to B2FIND, with the ability of third parties to build their own information services supplementing the “mainstream” EUDAT B2FIND Web interface.</p></list-item>
        <list-item><p>Federated search in B2FIND using requests to remote sources of data records, as opposed to the current harvesting of data records.</p></list-item>
        <list-item><p>Data provenance within particular EUDAT services (B2SHARE, B2SAFE, B2STAGE) and across them.</p></list-item>
        <list-item><p>Semantic logging in software applications, which can be used by EUDAT B2STAGE or by third-party applications calling other EUDAT services.</p></list-item>
        <list-item><p>Data retrieval via data PIDs using semantic annotation, which can be used by EUDAT B2STAGE or by third-party applications calling other EUDAT services.</p></list-item>
        <list-item><p>Other use cases for semantic annotation required by EUDAT services, suggested by the EUDAT user community, or identified through information practitioners’ networks, specifically EUON.</p></list-item>
      </list>
      <p>This work refers to technology evaluation already performed and aims to identify sensible use cases that contribute to the technology evaluation roadmap, with more experiments to be performed in support of existing and prospective eInfrastructure services.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work is supported by funding from the EUDAT project (www.eudat.eu). The author would like to thank his colleagues in EUDAT for their input, although the views expressed are the views of the author and not necessarily of the project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Vasily</given-names>
            <surname>Bunakov</surname>
          </string-name>
          .
          <article-title>Triple store evaluation</article-title>
          . Presented at the EUDAT project meeting, Amsterdam, Netherlands, 13-15 Jan
          <year>2014</year>
          . http://purl.org/net/epubs/work/11477713
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Vasily</given-names>
            <surname>Bunakov</surname>
          </string-name>
          .
          <article-title>Triple store testing on DKRZ virtual machine</article-title>
          .
          <source>EUDAT internal report</source>
          , May
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Vasily</given-names>
            <surname>Bunakov</surname>
          </string-name>
          .
          <article-title>Core semantic model for generic research activity</article-title>
          .
          <source>In 15th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" (RCDL 2013)</source>
          , Yaroslavl, Russia, 14-17 Oct 2013.
          <source>CEUR Workshop Proceedings (ISSN 1613-0073)</source>
          Vol-
          <volume>1108</volume>
          (
          <year>2013</year>
          ):
          <fpage>79</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Vasily</given-names>
            <surname>Bunakov</surname>
          </string-name>
          .
          <article-title>Investigation as a member of research discourse</article-title>
          .
          <source>In 16th All-Russian Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections"</source>
          , Dubna, Russia, 13-16 Oct 2014.
          <source>CEUR Workshop Proceedings</source>
          Vol-
          <volume>1297</volume>
          (
          <year>2014</year>
          ):
          <fpage>160</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Vasily</given-names>
            <surname>Bunakov</surname>
          </string-name>
          .
          <article-title>Service for data retrieval via persistent identifiers</article-title>
          .
          <source>In DATA 2015: 4th International Conference on Data Management Technologies and Applications</source>
          . Colmar, Alsace, France, 20-22 July
          <year>2015</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>CERIF ontology</article-title>
          . http://www.eurocris.org/ontologies/semcerif/1.3
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <article-title>CKAN open source data portal software</article-title>
          . http://ckan.org/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] CrossRef. http://www.crossref.org/</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] DataCite. http://www.datacite.org/</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>Elda: the linked-data API in Java</article-title>
          . http://www.epimorphics.com/web/tools/elda.html
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>EUDAT project</article-title>
          . http://www.eudat.eu/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Binyam Gebrekidan</given-names>
            <surname>Gebre</surname>
          </string-name>
          .
          <article-title>CKAN evaluation</article-title>
          .
          <source>EUDAT internal report</source>
          , September
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Haslhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elaheh</given-names>
            <surname>Momeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Schandl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Zander</surname>
          </string-name>
          .
          <source>Europeana RDF Store Report</source>
          .
          <article-title>The results of qualitative and quantitative study of existing RDF stores in the context of Europeana</article-title>
          . March
          <year>2011</year>
          . https://eprints.cs.univie.ac.at/2833/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <article-title>Jena TDB</article-title>
          . https://jena.apache.org/documentation/tdb/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Is your Linked Open Data 5 Star?</article-title>
          http://www.w3.org/DesignIssues/LinkedData.html
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>Linked Data API Specification</article-title>
          . https://code.google.com/p/linked-data-api/wiki/Specification
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Microsoft semantic logging: patterns &amp; practices</article-title>
          . https://msdn.microsoft.com/en-us/library/dn775006.aspx
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>neo4j graph database</article-title>
          . http://neo4j.com/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Open Provenance Model. http://openprovenance.org/</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <article-title>PROV-Overview: An Overview of the PROV Family of Documents</article-title>
          . http://www.w3.org/TR/2012/WD-prov-overview-20121211/
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <article-title>TinkerPop: An Open Source Graph Computing Framework</article-title>
          . http://tinkerpop.incubator.apache.org/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>