<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Implementing Triple-Stores using NoSQL Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleni Stefani</string-name>
          <email>eleni.stefani@fshnstudent.info</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klesti Hoxha</string-name>
          <email>klesti.hoxha@fshn.edu.al</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Faculty of Natural Sciences, University of Tirana</institution>
          ,
          <addr-line>Tirana</addr-line>
          ,
          <country country="AL">Albania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>Knowledge bases empower various information-retrieval systems nowadays. They are usually implemented through RDF-based triple-stores. The available toolkits that enable this are usually less mature than well-established document-oriented NoSQL databases. In this work we report on an alternative implementation of triple-stores using NoSQL databases. In contrast with similar solutions in this regard, we decided not to use RDF at all, therefore no data or query mapping was needed. We propose the implementation of a vocabulary using a separate document collection. This also facilitates its dynamic enlargement in automated fact-extraction scenarios. Our results show that using a document-oriented NoSQL database for storing and retrieving triples offers considerable performance. The preprocessing required because of the limitations of a non-RDF-based solution did not affect this. The achieved performance was also higher than doing the same knowledge retrieval operations using a purposely built linked data toolkit.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Triple-stores have not yet reached an industry acceptance
comparable with other traditional ways of storing data
(i.e. RDBMS, NoSQL).</p>
      <p>Big players of the information retrieval landscape
have already reported success stories of using a
knowledge base for enriching their information output
[Don16], [Pau17]. However, these are proprietary
implementations that have not been open sourced so far.
Considering the increase in usability perception when
incorporating information from a knowledge base in
search results [ALC15], many have created or planned
adding a knowledge base (usually triple-store powered)
to their actual information retrieval system setup.</p>
      <p>The lack of fully mature solutions in this regard, and
the complexity of the existing ones, have hindered the
ubiquity of knowledge bases in comparison with traditional
data stores. Furthermore, for simple scenarios that just
try to take advantage of linked data, having to deal with
a constraining RDF schema might be unnecessary.</p>
      <p>On the other hand, NoSQL databases are very well
accepted at the time of this writing. They offer an
incomparable performance in extensive data creation
scenarios, are very scalable, and the existing solutions
for implementing them allow for quick deployment on
traditional servers or in the cloud. Furthermore, there
is an already trained crowd of developers with hands-on
experience with this data store category.</p>
      <p>In this work we report on a prototype
implementation of a triple-store using a document-oriented
NoSQL database. It does not use RDF, just simple
subject-predicate-object triples stored in a MongoDB
database. Our goal is to provide preliminary insights
on using already established NoSQL databases for
storing graph-oriented data (a typical setting for linked
data contexts). We aimed at experimenting with
incorporating triple-structured data in information
systems without having to rely on a heavyweight RDF
manipulation framework.
We provide a sample vocabulary based on DBPedia
[ABK+07] predicates. It was stored in a separate
document collection. Our experiments were performed
using a subset of DBPedia knowledge. We evaluate
our approach in terms of performance, comparing it
also with Apache Jena, an open source Linked Data
framework that supports RDF-based implementations
of triple-stores.</p>
      <p>In the rest of this paper, after giving a short overview
of various triple-store implementation approaches in
section 2, we give detailed insights into the developed
prototype in section 3. The paper is concluded with
detailed data about the evaluation of our prototype.</p>
    </sec>
    <sec id="sec-2">
      <title>Triple-Store Implementation Approaches</title>
      <p>There are several approaches for implementing
triple-stores. The most frequent ones are purpose-built
implementation frameworks and graph databases.</p>
      <sec id="sec-3-1">
        <title>Purpose-built frameworks</title>
        <p>Also known as native triple-stores, purpose-built
frameworks are technologies developed for the storage
and retrieval of RDF data.</p>
        <sec id="sec-3-1-1">
          <title>Apache Jena</title>
          <p>Jena (https://jena.apache.org/) is a Java-based framework
for dealing with semantic web/linked data scenarios. It
provides a Java library that allows the manipulation of
RDF graphs. It supports RDF, RDFS, RDFa, and OWL for
storing triples (according to published W3C
recommendations) and the SPARQL query language for retrieving
information from graphs. The supported data serializations
are RDF/XML, Turtle, and Notation 3. Apache Jena also
includes a SPARQL server, Apache Jena Fuseki, which
can be run as a standalone server. It offers access to
the same Jena features through an HTTP interface.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Sesame</title>
          <p>Sesame is another alternative for implementing
triple-stores using RDF data [BKVH02]. It needs a
repository for data storage, but this repository is not
included in the Sesame architecture, which is why Sesame
is database-independent: it can be combined with a
variety of DBMSs. Its architecture contains a layer
named SAIL (Storage and Inference Layer) for
managing communication with the database in use. Sesame
accepts queries written in SeRQL (an RDF
query language) and converts them into queries suitable
for the underlying repository.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>4Store</title>
          <p>4Store is an RDF DBMS which stores RDF data as
quads, adding an additional property for storing the
graph name. 4store uses a custom data structure for
storing the quad data and its own tool, 4s-query, for
querying them [CMEF+13].</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Graph Databases</title>
        <p>Because of their structure, graph databases offer a
natural option for storing triples, since the standard
representation of triples is also a graph. In this section
we describe some examples of this category of databases.</p>
        <sec id="sec-3-2-1">
          <title>AllegroGraph</title>
          <p>AllegroGraph (https://franz.com/agraph/) enables
linked data applications through a graph database and an
application framework. It offers similar features to the
above described tools: storing and retrieving triple data.
Data retrieval can be done using SPARQL or Prolog. It
supports data serializations such as N-Quads, N-Triples,
RDF/XML, TriG, TriX, and Turtle, and it can be
used with various programming languages. Similarly
to a relational database, it supports ACID
transactions.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Virtuoso</title>
          <p>Virtuoso Universal Server (https://virtuoso.openlinksw.com/)
is another alternative for implementing triple-stores. It can
access RDF data stored in an RDBMS repository which may be
part of Virtuoso itself, or an external one [EM09]. The usual
database schema is relatively simple: RDF data are
stored as quads in a table with four columns. A quad
includes the triple and the graph name: subject S,
predicate P, object O, and graph G. Regarding the query
language, Virtuoso uses a combination of SPARQL
and SQL. It translates SPARQL queries to SQL ones
according to the database schema.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Research Prototypes</title>
        <p>Other approaches for implementing triple-stores
include research prototypes which try to exploit
technologies that were not initially developed for this
purpose.</p>
        <p>Dominik Tomaszuk [Tom10] has experimented
with implementing RDF triples as JSON or BSON
documents in MongoDB. The document structure that
he suggests stores subject, predicate, and object
data as document fields. Consequently, each
document can store only one triple. Regarding knowledge
retrieval, the author analyzes some algorithms for
interpreting SPARQL queries.</p>
        <sec id="sec-3-3-1">
          <title>Data Structure Schema of Knowledge Documents</title>
          <p>When a triple is inserted, the following preprocessing
applies: the subject is replaced by its id in the
corresponding entity document (by default the subject is
an entity); the predicate is stored as it is; the object
is replaced by its id as defined in the entity documents
if it is an entity, or stored as it is otherwise.</p>
          <p>Let T(S,P,O) be a triple where S is the subject, P the
predicate, and O the object. A knowledge document is
then structured as follows:
{
  "_id": document id,
  "subject": S(T),
  "P(T)": O(T)
}</p>
          <p>A document can store information about only one
subject. Two fields are required in each document: _id,
an auto-generated number attached to every document,
and subject, the actual subject of the triple. The third
pair has the predicate of the triple in question as its key
and the object of the triple as its value. In the same
way we can continue adding knowledge about a certain
subject. If T1, ..., Tn is a set of triples sharing the same
subject, the knowledge document is created as:
{
  "_id": document id,
  "subject": S(T1),
  "P(T1)": O(T1),
  ...
  "P(Tn)": O(Tn)
}</p>
          <p>Franck Michel et al. [MFZM16] present
xR2RML, a mapping language for querying
MongoDB with SPARQL.</p>
          <p>Another approach stores RDF triples in
CouchDB using JSON documents [CMEF+13]. In this
approach, each document stores only one JSON
object (even if technically there can be more than one),
whose key represents the subject of the triples. The
value consists of two JSON arrays, one storing
predicates and the other storing the objects of the
triples. Predicates and objects are related by their
indexes in the arrays. Because of this
structure, a document can store more than one triple
only if all triples share the same subject. To add new
triples, existing documents can be modified or
new documents can be created. For running queries,
the proposed system accepts SPARQL queries which
are then converted into queries CouchDB can process.</p>
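          <p>The index-paired arrays described above can be sketched as follows. This is a minimal, illustrative reading of the structure in [CMEF+13], not its exact document format; the field names are our assumptions.</p>
```python
# Hypothetical CouchDB-style document: one subject, with predicates and
# objects kept in two parallel arrays, related by their array indexes.
doc = {
    "Albania": {
        "predicates": ["type", "capital"],
        "objects": ["country", "Tirana"],
    }
}

def to_triples(document):
    """Rebuild (subject, predicate, object) triples by pairing the arrays."""
    triples = []
    for subject, value in document.items():
        for predicate, obj in zip(value["predicates"], value["objects"]):
            triples.append((subject, predicate, obj))
    return triples

print(to_triples(doc))
```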
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Our Approach</title>
      <sec id="sec-4-1">
        <title>Data Structure Schema</title>
        <p>This section describes the schema that is used
for storing triples in JSON documents and how we
dealt with the unique identification of entities. We did
not use an RDF-based specification for this approach, so
our resources are not described by URIs. We wanted to
experiment with creating simple knowledge graphs
without the overhead involved in using traditional linked
data tools (RDF, SPARQL). To cope with the unique
identification of entities we suggest the use of two types
of documents:
knowledge documents for storing knowledge, and
entity documents for identifying entities.
Each of them needs to be stored in a separate
MongoDB collection; consequently two collections are
needed: the KNOWLEDGE collection and the ENTITIES
collection.</p>
        <p>An entity document defines only one entity and
includes two fields, _id and name. A document of this
type looks like this:
{
  "_id": entity ID,
  "name": entity name
}</p>
        <p>Knowledge documents are responsible for storing
triples. The set of all knowledge documents stored
in the database represents the knowledge graph.</p>
        <p>We also run some preprocessing when inserting
triples into our triple-store. Let T(S,P,O) be a triple
where S is the subject, P the predicate, and O the object.
The respective knowledge document then stores the
subject replaced by its entity id, the predicate as it is,
and the object replaced by its entity id if it is an entity:
{
  "_id": document id,
  "subject": id(S(T)),
  "P(T)": O(T), or id(O(T)) if O(T) is an entity
}</p>
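        <p>The preprocessing above can be sketched as follows. Plain Python dicts stand in for the ENTITIES and KNOWLEDGE MongoDB collections; all names here are our illustrative assumptions, not the prototype's actual code.</p>
```python
# In-memory stand-ins for the two MongoDB collections.
entities = {}   # entity name -> generated id (ENTITIES collection)
knowledge = []  # knowledge documents (KNOWLEDGE collection)

def entity_id(name):
    """Get-or-create the id of an entity in the ENTITIES stand-in."""
    if name not in entities:
        entities[name] = len(entities) + 1
    return entities[name]

def add_triple(subject, predicate, obj, obj_is_entity):
    """Store one triple T(S,P,O): the subject (always an entity) and entity
    objects are replaced by their ids; the predicate is stored as it is."""
    doc = {
        "_id": len(knowledge) + 1,
        "subject": entity_id(subject),
        predicate: entity_id(obj) if obj_is_entity else obj,
    }
    knowledge.append(doc)
    return doc

add_triple("Albania", "capital", "Tirana", obj_is_entity=True)
add_triple("Albania", "type", "country", obj_is_entity=False)
```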
      </sec>
      <sec id="sec-4-2">
        <title>Interaction with the Database</title>
        <p>In this work we did not use SPARQL, as it is
suitable only for querying RDF data, which we also
excluded from this approach. Under these
conditions, we have utilized the MongoDB query
language for database interaction. The developed system
offers the possibility of adding and retrieving
information through triples. This can be done through an
exposed RESTful API.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Adding Knowledge</title>
        <p>One of the first things needed for managing knowledge
is the specification of a vocabulary. We have defined
the vocabulary as the set of all valid predicates that
can be added to our knowledge base, specifying also
the valid data type for each vocabulary entry. There
are three valid data types:</p>
        <p>1. Single value,
2. Array of values,
3. Map, or in MongoDB terms an inner document (in this
work we support only one level of inner documents).</p>
        <p>For example, if we specify a vocabulary entry
"name | single value"
and we try to add the two triples
"John name John" and
"John name Johny",
our developed prototype will store only the last valid
triple, "John name Johny". Table 1 presents some
vocabulary entries used in our prototype, mostly for
storing triples about locations.</p>
        <p>Other than the document schema, there is also the
need to specify a strategy for managing documents
when new knowledge is added (through an API request
in our case). Generally there are two options:
modifying existing documents in the database in order to
add the new triples, or creating new documents. The one
followed in this approach is to create new documents for
each API request that adds new data. If the request
contains an array of triples, the system groups them
by subject and then, for each group (subject), adds a
new knowledge document to the database.</p>
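        <p>The vocabulary check and the group-by-subject insertion strategy can be sketched as follows. The vocabulary contents and function names are our illustrative assumptions; this is a simplified stand-in for the prototype, not its actual code.</p>
```python
from collections import defaultdict

# Hypothetical vocabulary: predicate -> allowed data type.
vocabulary = {"name": "single value", "languages": "array of values"}

def build_documents(triples):
    """Group incoming (subject, predicate, object) triples by subject and
    build one knowledge document per subject, enforcing the vocabulary:
    for 'single value' predicates the last valid triple wins."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        if p in vocabulary:          # drop predicates not in the vocabulary
            grouped[s].append((p, o))
    documents = []
    for subject, pairs in grouped.items():
        doc = {"subject": subject}
        for p, o in pairs:
            if vocabulary[p] == "array of values":
                doc.setdefault(p, []).append(o)
            else:                    # single value: overwrite, last one wins
                doc[p] = o
        documents.append(doc)
    return documents

docs = build_documents([
    ("John", "name", "John"),
    ("John", "name", "Johny"),       # overwrites the previous name
    ("John", "languages", "English"),
])
print(docs)
```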
      </sec>
      <sec id="sec-4-4">
        <title>Querying Knowledge</title>
        <p>As mentioned above, the system uses the MongoDB
query language for retrieving knowledge. An external user
only needs to create a JSON filter according to the
MongoDB specification and pass it to the system through
RESTful calls. Again, preprocessing might be needed:
it consists of replacing entity names found in the filter
with their corresponding ids as defined in the entity
documents.</p>
        <p>There are two particular cases of querying
knowledge:
1. the requested entity is known,
2. the requested entity or entities are not known, but
have to be in some relation (expressed through JSON
filters).</p>
        <sec id="sec-4-4-1">
          <title>Query Processing</title>
          <p>In the first case, users can send an HTTP GET
request with the name of the entity/subject. For
example, if knowledge about Albania is requested, the user
sends the request http://server/Albania and the system
responds with all available triples about this subject.
This is the only case where there is no need to create
a JSON filter.</p>
          <p>In the second case, users can send an HTTP GET
request that contains a JSON filter. For example,
http://server/{"type":"country","capital":"Tirana"}
requests an entity of type country whose
capital is Tirana. The other supported logical operators
(besides and) are or and in. It is also possible to
use the following comparison operators: equal, less
than, and greater than.</p>
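          <p>A filter like the one above could be evaluated roughly as follows. This is a deliberately simplified in-memory matcher covering only top-level equality and an $or operator, as an illustration of the filter semantics; it is not the MongoDB query engine, and the sample documents are ours.</p>
```python
# Simplified evaluation of MongoDB-style JSON filters over plain dicts.
def matches(doc, flt):
    """Return True if doc satisfies the filter (equality and $or only)."""
    if "$or" in flt:
        return any(matches(doc, alt) for alt in flt["$or"])
    return all(doc.get(field) == value for field, value in flt.items())

docs = [
    {"subject": "Albania", "type": "country", "capital": "Tirana"},
    {"subject": "Naim Frasheri", "type": "poet"},
]

# The example filter from the text: an entity of type country whose capital is Tirana.
hits = [d["subject"] for d in docs if matches(d, {"type": "country", "capital": "Tirana"})]
print(hits)
```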
          <p>Figure 1 shows the workflow diagram of querying
knowledge using filters. After finding the requested
entities, the system gets all available knowledge stored
about them and performs a merge process for each
entity.</p>
          <p>Merge is the process of combining documents that
share a common subject into one. The final document
has a field containing the subject and all the pairs
of predicate/object found in the merged documents.
The second process performed before returning a
response is serialization. Serialization refers to
replacing entity ids with the actual names as defined in
the entity documents. Figure 2 shows the workflow
diagram of processing knowledge before returning a
response.</p>
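          <p>The merge and serialization steps can be sketched as follows; the structures and names are our illustrative stand-ins, assuming the id-based knowledge documents described in section 3.1.</p>
```python
entities = {1: "Albania", 2: "Tirana"}   # entity id -> name (ENTITIES stand-in)

def merge(documents):
    """Combine knowledge documents sharing a subject into one document."""
    merged = {"subject": documents[0]["subject"]}
    for doc in documents:
        for key, value in doc.items():
            if key not in ("_id", "subject"):
                merged[key] = value
    return merged

def serialize(document):
    """Replace entity ids with the actual entity names; literals stay as-is."""
    out = {}
    for key, value in document.items():
        if key == "_id":
            continue
        out[key] = entities.get(value, value)
    return out

docs = [
    {"_id": 1, "subject": 1, "type": "country"},
    {"_id": 2, "subject": 1, "capital": 2},
]
final = serialize(merge(docs))
print(final)
```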
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>In this section we describe the evaluation that we have
performed for the developed prototype. We performed
our experiments with a set of 3,000 triples extracted
from DBPedia. The system was tested for both use
cases: adding and retrieving knowledge. In addition,
we executed some triple-querying tests in Apache Jena
in order to make a comparison with our approach in
terms of performance. For this purpose the same
dataset was fed as RDF to the Jena framework. Because
of this we cannot compare insert operations with our
implemented insertion strategy.</p>
      <sec id="sec-5-1">
        <title>Adding Knowledge</title>
        <p>In our first experiment (Table 2) we test the
performance of adding all triples with a single HTTP POST
request. We measure the execution time, which also
includes all preprocessing needed before storing the
knowledge: grouping triples by subject, creating
entity documents, and finally adding the knowledge
documents. Our triples contain knowledge about 15
entities (15 unique subjects), hence after the insertion
there will be 15 knowledge documents in our database,
since all triples are added with one request.</p>
        <p>In the second experiment (also Table 2) we send an
HTTP request for each triple that needs to be stored.
Because each request contains a single triple, there will
be 3,000 knowledge documents in the database.</p>
        <p>It can be noticed that the first insertion strategy
performs better: using a single document that
contains all the triples related to a subject is faster.</p>
      </sec>
      <sec id="sec-5a">
        <title>Retrieving Knowledge</title>
        <p>In order to evaluate the performance of knowledge
retrieval operations in our developed prototype, we
also executed the same queries against the same
knowledge set (RDF) loaded in Apache Jena. We performed
this experiment using different numbers of triples. The
execution time for our prototype includes all the
required preprocessing steps. In addition, we
measured the performance of these queries using both
triple insertion strategies described above. The
experiments were performed using the same hardware in
order to avoid biased results. We executed the
following queries:</p>
        <p>1. get all triples,
2. get triples where the subject is Albania,
3. get triples where the type is poet,
4. get triples where the type is poet or political leader.</p>
      </sec>
      <sec id="sec-5b">
        <title>Results</title>
        <p>Table 3 shows the results of these experiments. It
can be noticed that our approach is more efficient in
terms of performance. Also, for the same number of
triples the response size is bigger in Apache Jena,
because of the overhead of RDF data. Regarding the two
data insertion approaches described above (which affect
the number of documents in our database), the results
show that storing triples in fewer documents reduced
the response time when retrieving knowledge.</p>
        <p>The performance of our developed prototype was also
tested with regard to updating documents. Table 4
shows the results. The dataset contains 3,000 triples
in total, but we experimented with a different number
of documents storing them. We ran the same update
query for all setups. As we also noticed in our
knowledge retrieval experiments, the results show that
storing triples in fewer documents yields a shorter
response time when updating knowledge.</p>
      </sec>
      <sec id="sec-5c">
        <title>Conclusions</title>
        <p>In this work we propose the implementation of
knowledge-base triple-stores using already mature NoSQL
solutions like MongoDB. Lacking the unique
identification scheme included in RDF, we propose the
implementation of a vocabulary using a separate document
collection. This involved the addition of some
preprocessing steps when inserting or retrieving knowledge.
In comparison with other related works that serialize
knowledge data in a separate DBMS, we do not deal
with mapping RDF-stored data or SPARQL queries
to the serialized dataset. This approach greatly reduces
the complexity of the solution and focuses the effort
on the actual triple-stored knowledge itself.</p>
        <p>Our experiments show that by avoiding the
overhead of the traditional way of storing triple-structured
data (RDF) and taking advantage of the performance-oriented
features of document-oriented databases, we
can achieve considerable performance when performing
basic knowledge operations (insert, update, retrieve).
This was confirmed also when comparing the
performance of our developed prototype with
Apache Jena, a well-established open source
linked data toolkit.</p>
        <p>Based on our results, a better performance can be
achieved when storing multiple triples per document.
Increasing the number of documents considerably
increased the response time.</p>
        <p>Concluding, we showed that it is possible to
implement a triple-store knowledge base using toolkits
not purposely built for this. In various scenarios, an
RDF-based implementation can create unnecessary
complexities that would increase the implementation
time of a knowledge base. Furthermore, when dealing
with previously unknown relations (fact extraction
through text mining), a NoSQL-based implementation
facilitates a dynamic enlargement of the vocabulary.</p>
      </sec>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[ABK+07] Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722-735. Springer, 2007.</p>
      <p>[ALC15] Ioannis Arapakis, Luis A. Leiva, and B. Barla Cambazoglu. Know your onions: Understanding the user experience with the knowledge module in web search. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1695-1698. ACM, 2015.</p>
      <p>[BHBL11] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data: The story so far. In Semantic Services, Interoperability and Web Applications: Emerging Concepts, pages 205-227. IGI Global, 2011.</p>
      <p>[BKVH02] Jeen Broekstra, Arjohn Kampman, and Frank van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. In International Semantic Web Conference, pages 54-68. Springer, 2002.</p>
      <p>[CMEF+13] Philippe Cudre-Mauroux, Iliya Enchev, Sever Fundatureanu, Paul Groth, Albert Haque, Andreas Harth, Felix Leif Keppmann, Daniel Miranker, Juan F. Sequeda, and Marcin Wylot. NoSQL databases for RDF: An empirical evaluation. In International Semantic Web Conference. Springer, 2013.</p>
      <p>[Don16] Xin Luna Dong. How far are we from collecting the knowledge in the world? In International Conference on Web Engineering, 2016.</p>
      <p>[EM09] Orri Erling and Ivan Mikhailov. RDF support in the Virtuoso DBMS. In Networked Knowledge - Networked Media, pages 7-24. Springer, 2009.</p>
      <p>[Pau17] Heiko Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489-508, 2017.</p>
      <p>[Tom10] Dominik Tomaszuk. Document-oriented triple store based on RDF/JSON. Studies in Logic, Grammar and Rhetoric, (22(35)), 2010.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>