<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SPARQL Query Answering on a Shared-nothing Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Spyros Kotoulas</string-name>
          <email>kot@few.vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SPARQL</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>MapReduce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Algorithms</institution>
          ,
          <addr-line>Design, Experimentation, Performance</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science VU University Amsterdam</institution>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>The amount of Semantic Web data is outgrowing the capacity of Semantic Web stores. Similar to traditional databases, scaling up RDF stores is faced with a design dilemma: increase the number of nodes at the cost of increased complexity or use sophisticated, and expensive, hardware that can support large amounts of memory, high disk bandwidth and low seek latency. In this paper, we propose a technique to do distributed and join-less RDF query answering based on query pattern-driven indexing. To this end, we rst propose an extension of SPARQL to specify query patterns. These patterns are used to build a query-speci c indexes using MapReduce, which are later queried using a NoSQL store. We provide a preliminary evaluation of our results using Hadoop and HBase, indicating that, for a prede ned query pattern, our system o ers very high query throughput and fast response times.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>E.1 [Data Structures]: Distributed data structures; C.2.4
[Distributed Systems]: Distributed applications</p>
    </sec>
    <sec id="sec-2">
      <title>General Terms</title>
    </sec>
    <sec id="sec-3">
      <title>1. INTRODUCTION</title>
      <p>The amount of data in the Semantic Web is growing with
a faster pace than the computational power of computers.
To cope with this disparity, we either need more e cient
single-machine algorithms for data management or scalable
distributed methods to tap on the computational resources
of more machines.</p>
      <p>When designing an RDF store, one is often confronted
with a design dilemma: use centralized or clustered storage.
To copy without fee all or part of this material is permitted only for private
and academic purposes, given that the title of the publication, the authors
and its date of publication appear. Copying or use for commercial purposes,
or to republish, to post on servers or to redistribute to lists, is forbidden
unless an explicit permission is acquired from the copyright owners; the
authors of the material.</p>
      <p>Workshop on Semantic Data Management (SemData@VLDB) 2010,
September 17, 2010, Singapore.</p>
      <p>Copyright 2010: www.semdata.org.</p>
      <p>Jacopo Urbani
The former allows easier management but only scales by
using more powerful hardware, which is often infeasible or
prohibitively expensive. The latter allows adding additional
computational resources to increase performance at the cost
of higher complexity and, often, large query response times.</p>
      <p>Typically, clustered stores split their indexes across the
nodes at loading time. When resolving queries, nodes may
need to exchange data. These data exchanges should be
minimized since they increase load and response time.
Ideally, only data that will be part of the answer should be
exchanged. This is a very di cult task in the Semantic
Web, because the semi-structured nature of the data makes
access patterns di cult to predict.</p>
      <p>There is another problem that we must consider. In
relational databases, the information that is inserted in the
system is carefully selected. In Semantic Web stores, we
often encounter the situation that, while an entire dataset
is included in the knowledge base, only a small subset is
eventually used, for a given application.</p>
      <p>In this work, we focus on clustered storage and we claim
that we should only index the part of the data that we
need, and index for speci c access patterns. We propose
a technique to decrease the complexity by calculating query
pattern-driven indexes and this will enable us to exploit
clustered architectures without the need to combine data from
di erent nodes at query time.</p>
      <p>There are several scenarios where the user is able to de ne
the queries in advance. For example, application developers
very often know the query patterns of their applications, or
they can be very easily extracted from the application source
code. Furthermore, analytical workloads typically have a
xed set of query patterns. For instance, when analysing
social network data, the user may only be interested in
resolving queries with the predicate \foaf:knows". Finally,
such an approach could be combined with a standard store.
This approach would answer queries that match some
popular query patterns, o oading the standard store that would
match the rest.</p>
      <p>
        In section 2, we will introduce SP ARQLP , an extension
of the SP ARQL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] query language to allow for de ning
SP ARQL query patterns. These query patterns serve as a
mapping between a subset of the unbound variables of the
query, called the pattern variables, and the result set of the
query.
      </p>
      <p>
        In section 3, we will present a method for building the
indexes based on the query patterns. This method constructs
only the indexes required for each SP ARQLP query pattern
using the MapReduce[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] parallel programming paradigm.
To this end, we have adapted some standard database
techniques to MapReduce/RDF and developed some new ones.
Furthermore, we will show how we query these indexes.
      </p>
      <p>In section 4 we will present an implementation of our
methods using the Hadoop MapReduce framework and the
HBase NoSQL database. Our results indicate that our
approach has good performance on large datasets, where it is
able to answer speci c queries in a few milliseconds.</p>
    </sec>
    <sec id="sec-4">
      <title>SPARQLP</title>
      <p>In this section, we will de ne an extension to SPARQL
for query patterns. We explain the details of this extension
with an example. Consider an application that needs to
retrieve the names and email addresses of all professors that
work for a university in a given country. The corresponding
SPARQL query1 would be:</p>
      <p>SELECT ?name ?email where f ?x a
professor. ?x worksfor ?y. ?y locatedin ?country. ?x
email ?email. ?x name ?name. g</p>
      <p>The application will repeatedly post the query above
replacing \?country" with the country it is interested in and
retrieve the corresponding values of (?name, ?email).</p>
      <p>SP ARQLP aims at capturing such usage patterns by
rewriting the query as a function</p>
      <p>?country ! (?name; ?email)</p>
      <p>In SPARQLp we de ne such a function using the DEF IN E
construct and execute it using the GET construct. Looking
at our example, the mapping function is de ned as:</p>
      <p>DEFINE professorsOfCountry(?country) AS
SELECT ?name ?email where f ?x a professor.
?x worksfor ?y. ?y locatedin ?country. ?x email
?email. ?x name ?name. g</p>
      <p>Using this de nition, an RDF store prepares to handle
requests on the given function. Once this operation is
completed, we use the GET construct to execute this query for
speci c countries. For example, in order to retrieve the
professors from the Netherlands we would launch the query:</p>
      <p>GET professorsOfCountry(Netherlands)</p>
      <p>In the appendix A, we formally de ne the SP ARQLP
grammar by extending the EBNF notation.</p>
    </sec>
    <sec id="sec-5">
      <title>SPARQLP ON A SHARED-NOTHING</title>
    </sec>
    <sec id="sec-6">
      <title>ARCHITECTURE</title>
      <p>To optimally support the querying model presented in the
previous section, for each SP ARQLP DEFINE query, we
can maintain a pattern index, i.e. an index from the pattern
variables to the other unbounded variables. In fact, these
are the only indexes we need to answer SP ARQLP GET
queries.</p>
      <p>Typically, RDF stores maintain a set of triple indexes over
the entire input. These indexes consist of permutations
of the triple terms (Subject-Predicate-Object,
PredicateObject-Subject etc). In our approach, we do not maintain
1for brevity, we will omit namespaces for the rest of this
paper
DEFINE
Query
Input</p>
      <p>Map
Map
Map
Map</p>
      <p>DEFINE</p>
      <p>1.
GET Query
full triple indexes, so that we avoid loading costs. Instead,
whenever we process a SP ARQLP DEFINE query, we
usually read the entire input and construct only the indexes
that we need for that query.</p>
      <p>Calculating these indexes implies resolving the embedded
SPARQL query within the SP ARQLP DEFINE queries.
We consider the following in the resolution of these queries:
The embedded queries will always be more expensive
to resolve and will produce more bindings than the
corresponding standard SPARQL queries, since they
have less bound variables.</p>
      <p>We are interested in the time to retrieve all results.
Thus time-to- rst result is not relevant, neither is
streaming of results. The performance goal is to maximize
result throughput.</p>
      <p>We are focusing in answering queries over vast amounts
of data rather than the response time of individual
SP ARQLP DEFINE queries. Query response time is
more relevant for SP ARQLP GET queries, which we
will discuss in section 3.2</p>
      <p>
        Considering the above, we will use the MapReduce [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
paradigm to construct pattern indexes. In Appendix B, we
are providing a short introduction to MapReduce, which is
essential for comprehending section 3.1.
      </p>
      <p>In Figure 1, we give an overview of our approach. For
DEFINE queries, we extract the embedded query and we
execute it using MapReduce. We index the results on the
pattern variables and store the name of the query in a
separate table. The indexes are loaded in a NoSQL database.
For SPARQLP GET queries, we retrieve a reference to the
index using the query name, and we query it using the
pattern variables. Since the variables are used as keys for the
index, we do not need to perform any joins to retrieve the
results.</p>
      <p>In the next subsections, we will describe how we deal with
SPARQLP DEFINE and GET queries.
3.1</p>
    </sec>
    <sec id="sec-7">
      <title>SPARQLP DEFINE</title>
      <p>Resolving a SPARQLP DEFINE query implies resolving
the embedded SPARQL query which will typically be an
expensive query.</p>
      <p>Querying Semantic Web data with MapReduce and
without a-priori constructed indexes di ers from querying in
traditional stores in the following:
Load-balancing: Clustered databases typically share
data across nodes, and they aim at placing data that
is used together on the same node. This decreases
response time, since less data needs to be accessed over
the network. In contrast, MapReduce focuses mainly
at dividing computation equally across nodes,
aiming at maximising parallelisation rather than response
time. MapReduce achieves this by dynamically
partitioning the data.</p>
      <p>Setting up MapReduce jobs is an expensive operation,
in terms of coordination e ort. Thus, it incurs high
latency.</p>
      <p>Without a-priori indexing, selectivity estimation is
difcult, since accurately calculating expression
cardinality is impossible without a-priori data processing.
Passing information between running Map or Reduce
tasks in the same job is not allowed by the paradigm.
Therefore, the query plan can only be modi ed
between di erent jobs. This is practical because SPARQLP
DEFINE queries are expensive, and the overhead of
starting new jobs is high.</p>
      <p>
        Semantic Web triples are small in size, typically
containing three of four terms and occupying dozens of
bytes, if dictionary encoding [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is used, and
hundreds of bytes otherwise. Furthermore, table scans are
scheduled across all processors. Thus, scanning the
entire input is relatively fast.
      </p>
      <p>Based on the above observations, we are proposing the
following for the resolution of MapReduce SP ARQLP
DEFINE queries:
3.1.1</p>
      <sec id="sec-7-1">
        <title>Tuple-at-a-time vs operator-at-a-time</title>
        <p>Typical RDF stores perform tuple-at-a-time evaluation.
In MapReduce, it is only possible to partition data once per
job. Thus, tuple-at-a-time evaluation would pose signi cant
limitations on the supported queries. Thus, within jobs, we
process one tuple at a time, while each job implements
several operations. Jobs are run in sequence, so our method
uses a mix of tuple-at-a-time and operation-at-a-time
processing.
3.1.2</p>
      </sec>
      <sec id="sec-7-2">
        <title>Selectivity estimation</title>
        <p>Since, in a MapReduce architecture, full table scans are
relatively cheap, we use an accurate selectivity estimation
algorithm, performing a full table scan. During this scan,
we calculate the cardinality of each statement pattern in
the query. Furthermore, for statement patterns with few
bindings, we also store the bindings.
3.1.3</p>
      </sec>
      <sec id="sec-7-3">
        <title>Joins during the Map phase</title>
        <p>Typically, joins are performed in the Reduce phase. This
gives rise to load balancing problems, since making
equallysized partitions for the Reduce phase is not trivial.
Furthermore, it incurs signi cant overhead, since in a MapReduce
job, we can only have a single Reduce function.</p>
        <p>
          Neither of the above is true for Maps. Thus, if one side
of the join is small enough to t in main memory, we
perform a Grace hash-join[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This implies all nodes loading the
small side in memory and iterating through the large side
in parallel. The advantage of this approach is that we do
not need to re-partition our data, eliminating load balancing
problems and the need for an additional job.
        </p>
        <p>If all sides of a join are large, then we resort to a hybrid
of hash join and sort-merge join. First the input tuples are
partitioned among nodes using a hash function. Then, each
partition is sorted locally at each node and a sort-merge join
is performed. Note that for larger joins, the load balancing
problem is dissipated by the fact that we have more bindings
per join variable.
3.1.4</p>
      </sec>
      <sec id="sec-7-4">
        <title>Recycling</title>
        <p>
          Since we are not maintaining traditional indexes, and disk
storage in a MapReduce cluster is cheap, our method uses
intermediate result recycling[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Intermediate results are not
indexed, but stored in parallel as a collection of local les.
In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], it was shown that this method improves performance
for systems doing operator-at-a-time processing.
3.1.5
        </p>
      </sec>
      <sec id="sec-7-5">
        <title>Index construction</title>
        <p>As the last step of SP ARQLP DEFINE query evaluation,
we are constructing indexes mapping template variables to
the results of the query. To this end, we are tapping on the
massive sorting functionality of MapReduce frameworks to
construct indexes in a scalable manner.</p>
        <p>This is accomplished by a job that groups tuples according
to template variables, performs a global sort and writes the
results to les. After that a suitable partitioning of the
data is calculated, this task is highly parallelisable. The
sorted les are written to the local disk and accessed by the
SP ARQLP GET method, described in the following section.
3.2</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>SPARQLP GET</title>
      <p>The indexes created during the SPARQLP DEFINE query
resolution are queried using a NoSQL store.</p>
      <p>For each SPARQLP DEFINE, a new table is created in
the NoSQL database. We use the bindings of the pattern
variables as key and the results as values. In case that we
have a large number of results for a given variable, it is more
e cient if we store a pointer to a le in the Distributed File
System, instead of the values themselves. This implies an
additional lookup for queries, but this cost is amortised over
the number of retrieved results and the reduced size of the
table.</p>
      <p>
        Lacking any joins, SPARQLP GET queries are very fast to
evaluate: a single lookup is enough to retrieve all results for
a given query. Furthermore, clustered NoSQL
implementations, assure high fault tolerance, horizontal scalability and
no single point of failure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
4.
      </p>
    </sec>
    <sec id="sec-9">
      <title>PRELIMINARY EVALUATION</title>
      <p>We have performed a preliminary study to analyze the
bene ts of our approach. More thorough evaluation and
comparison to existing approaches is deferred to future work.</p>
      <p>This section is organized as follows: in subsection 4.1, we
describe the testbed and the implementation details of our
prototype. We continue reporting the performance of the
DEFINE query in subsection 4.2. At last, we report the
performance of the GET query in subsection 4.3.
4.1</p>
    </sec>
    <sec id="sec-10">
      <title>Implementation and Testbed</title>
      <p>
        We have implemented a prototype which implements our
methods. We have used Sesame [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for query parsing and
basic query planning, the Hadoop framework for
MapReduce and the HBase NoSQL database for answering GET
queries.
      </p>
      <p>
        As a preprocessing step, we compress our input data using
the method described in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and, optionally, materialize the
closure under the OWL ter horst semantics using previous
work in WebPIE [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>For the DEFINE queries, we do the following: Firstly,
we extract and index the name and template variables of
the query and extract the embedded SPARQL query. We
use Sesame to parse the SPARQL query and launch our
selectivity estimation algorithm. The results of the selectivity
estimation are passed back to Sesame, which creates a static
query plan. This query plan is executed in the MapReduce
framework, and is dynamically adjusted to cater for
recycling and bindings already calculated during the selectivity
estimation step. Finally, indexes are created for the query
plans. Currently, we use a very simple index construction
mechanism that uses a single node, but that can be easily
extended by carefully partitioning data. The index is loaded
by HBase.</p>
      <p>For GET queries, we rst retrieve the query index, using
the name of the query as a key. Then, we post a query
consisting of the template variables on HBase on the retrieved
index and retrieve the results.</p>
      <p>The performance of our prototype was evaluated using the
DAS3 cluster2 of the Vrije Universiteit Amsterdam. We set
up an Hadoop and HBase cluster using 32 data nodes, with
4 cores, a single SATA drive and 4GB of memory each.</p>
      <p>We have used the LUBM benchmark for our evaluation.
We chose query number 8, since it is one of the most
challenging queries in the dataset and used the University as a
parameter. We have generated the LUBM(8000) dataset,
consisting of 1 billion triples. Dictionary encoding took
46 minutes and materializing the closure took 37 minutes,
yielding a total of 1.52 billion triples. We created a
variant of query number 8 of the standard LUBM query set as
a DEFINE query and executed it against the 1.52 billion
triples. The query is reported below:
DEFINE getStudentsOfUniversity(?P) AS
SELECT
{?X type Student .
?Y type Department .
?X memberOf ?Y .
?Y subOrganizationOf ?P .
}</p>
      <p>For GET queries, we have extracted the URIs for all
universities and evaluated by posting queries for the students
of random universities. Note that, for our prototype, we do
not perform dictionary decoding, of the results.</p>
      <p>In the following two subsections we describe the
performance obtained from rst launching a DEFINE query and
later a sequence of GET queries.
4.2</p>
    </sec>
    <sec id="sec-11">
      <title>Performance of the DEFINE query</title>
      <p>For this task, our prototype produced four MapReduce
jobs in a sequence. In Table 1, we report the execution time
for each job.</p>
      <p>Job 1 performed the selectivity estimation and extracted
the patterns which are small enough to t in the main
mem2http://www.cs.vu.nl/das3
Job
job 1: selectivity estimation
job 2: rst join job
job 3: second join job
job 4: HBase upload
Runtime (in sec.)
ory. The ordering of the patterns according to their
increasing selectivity was: (?X type Student), (?X memberOf ?Y),
(?Y type Department), (?Y subOrganizationOf ?P). The
bindings for the last two patterns were small enough to t
in memory, thus they were extracted.</p>
      <p>Job 2 loaded the extracted bindings, (?Y type Department)
and (?Y subOrganizationOf ?P), and joined them with the
input during the Map phase. In the Reduce phase, the
results were joined with (?X memberOf ?Y).</p>
      <p>Job 3 performed a hybrid join with (?X type Student ),
since both sets were too large to t in memory.</p>
      <p>Job 4 grouped the tuples according to ?P and uploaded
the bindings to HBase. The poor performance is attributed
to the fact that we did not create suitable partitions for the
data. Also, we used the HBase APIs to create the index
while we could generate it using MapReduce and then load
it on HBase. The second approach would give about an
order of magnitude better loading time, compared to our
current implementation3.
4.3</p>
    </sec>
    <sec id="sec-12">
      <title>Performance of the GET query</title>
      <p>After we have generated the index, we can e ciently query
the knowledge base with the GET query. This query will
receive in input the university and it will return all the
students for that university.</p>
      <p>A single query took on average 3.83 milliseconds to be
executed. We decided to stress the system by launching
queries simultaneously from di erent nodes. The purpose of
this stress test was to evaluate the scalability of our method
as the number of queries per second increases. We started by
launching the queries from a single client and we increased
the amount of clients up to 128. Queries were posted from
3see http://hbase.apache.org/docs/current/api/org/
apache/hadoop/hbase/mapreduce/package-summary.
html#bulk
several nodes and we have assigned a CPU core to each
client.</p>
      <p>We report the results in Figure 2. From the table, we see
how the throughput of the system increases until it can serve
about 1100 queries per second (i.e. 1ms per query) and then
it stabilizes regardless the number of clients. Again, the
performance could be dramatically increased if we would
tune the framework on our needs. For example, our index
is spread in 8 di erent regions, which were served by only 5
nodes, e ectively using only 5 of the 32 available nodes. If
we further distribute the table we could reach even higher
throughput.</p>
    </sec>
    <sec id="sec-13">
      <title>RELATED WORK</title>
      <p>This work is related to RDF stores, MapReduce
processing and view materialization in SQL databases.</p>
      <p>
        HadoopDB [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a hybrid MapReduce/database
implementation, focused on the processing of SQL queries. Data
is maintained by traditional databases in the data nodes.
For every SQL query received by the system, a MapReduce
query plan is generated. The input for the corresponding
MapReduce job is retrieved from the SQL databases
maintained by the data nodes. In contrast, our approach does
not build indexes at the loading time and does not perform
joins at runtime.
      </p>
      <p>
        Relational views[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is mature concept in the eld of SQL
databases, providing functionality to de ne a virtual table.
Furthermore, SQLServer, DB2 and Oracle [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also
materialize these tables. Our work captures SPARQL data usage
patterns at a greater detail, since we can embed any kind
of query in our DEFINE queries and we specify the key on
which the results of this query should be indexed. On the
other hand, SQL views have the advantage that they use
the same model of normal SQL tables.
      </p>
      <p>RDF stores typically employ tuple-at-a-time processing
and avoid materialisation of intermediate results.
Furthermore, they rely on doing joins at run-time. Our
preliminary results indicate that our approach vastly outperforms
these stores in terms of throughput and response time of the
SPARQLP GET queries, since the latter require no joins.
Nevertheless, our system is at a disadvantage when
processing arbitrary SPARQL queries. Furthermore, updating data
in our system is expected to be an expensive operation.</p>
      <p>
        The work reported in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] demonstrates a method to
reduce a dataset to an interesting subset using a
supercomputer. It is focused on extracting a relatively small portion
of the input, which can be queried using a conventional RDF
store (with the associated limitations). In comparison, our
architecture is not limited by the size of the indexed data
(since the indexes created in our NoSQL store can be very
large) but it is limited by the access patterns to this data.
      </p>
    </sec>
    <sec id="sec-14">
      <title>FUTURE WORK AND CONCLUSION</title>
      <p>We intend to perform a thorough evaluation of our system
using larger datasets and more queries. Our results should
be compared with those of standard RDF stores. Moreover,
we expect that our method will also be useful for answering
standard SPARQL queries. Nevertheless, we expect to get
good performance only for very expensive queries, since the
overhead of the platform is very high. Furthermore, research
should be carried out in update mechanisms for indexes. To
this end, techniques for materialised SQL views can be used.</p>
      <p>In this paper, we have presented a method for RDF data
management based on a shared-nothing architecture.
Instead of maintaining indexes over the entire dataset, only
the indexes required for given query patterns are generated.
We have implemented a rst prototype and we tested the
performance on a sample query. Our preliminary
evaluation shows that the system can reach high throughput but,
though this approach is promising, further work is necessary
for a complete evaluation.</p>
      <p>This work was supported by the EU-IST project LarKC
(FP7-215535).
7.</p>
    </sec>
    <sec id="sec-15">
      <title>APPENDIX A.</title>
    </sec>
    <sec id="sec-16">
      <title>SPARQLP GRAMMAR</title>
      <p>
        In this appendix we formalize the SPARQLp grammar
using the EBNF notation de ned in the W3C recommendation
for XML [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We do not de ne the grammar from scratch
but we extend the SPARQL grammar [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with the following
constructs:
DefineQuery ::= 'DEFINE' ( IRIref+ )
      </p>
      <p>'AS' Query
GetQuery ::= 'GET' FunctionCall
B.</p>
    </sec>
    <sec id="sec-17">
      <title>MAPREDUCE</title>
      <p>MapReduce is a parallel programming paradigm for
parallel and distributed processing of batch jobs. Each job
consists of two phases: a map and a reduce. The mapping phase
partitions the input data by associating each element with
a key. The reduce phase processes each partition
independently. All data is processed based on key/value pairs: the
map function processes a key/value pair and produces a set
of new key/value pairs; the reduce merges all intermediate
values with the same key into nal results.</p>
      <p>B.1</p>
    </sec>
    <sec id="sec-18">
      <title>MapReduce example: term count</title>
      <p>We illustrate the use of MapReduce through an example
application that counts the occurrences of each term in a
collection of triples. As shown in Algorithm 1, the map
function partitions these triples based on each term. Thus,
it emits intermediate key/value pairs, using the triple terms
(s,p,o) as keys and blank, irrelevant, value. The framework
will group all the intermediate pairs with the same key, and
invoke the reduce function with the corresponding list of
values. The reduce function will sum the number of values
into an aggregate term count (one value was emitted for
each term occurrence) and return the result as output.</p>
      <p>This job could be executed as shown in Figure 3. The
input data is split in several blocks. Each computation node
operates on one or more blocks, and performs the map
function on that block. All intermediate values with the same
key are sent to one node, where the reduce is applied.
B.2</p>
    </sec>
    <sec id="sec-19">
      <title>Characteristics of MapReduce</title>
      <p>This simple example illustrates some important elements
of the MapReduce programming model:
since the map operates on single pieces of data without
dependencies, partitions can be created arbitrarily and
Algorithm 1 Counting term occurrences in RDF NTriples
les
map( key , v a l u e ) :
// key : l i n e number
// v a l u e : t r i p l e
// emit a b l a n k v a l u e , s i n c e
// o n l y amount o f terms m a t t e r s
emit ( v a l u e . s u b j e c t , b l a n k ) ;
emit ( v a l u e . p r e d i c a t e , b l a n k ) ;
emit ( v a l u e . o b j e c t , b l a n k ) ;
r e d u c e ( key , i t e r a t o r v a l u e s ) :
// key : t r i p l e term ( URI o r l i t e r a l )
// v a l u e s : l i s t o f i r r e l e v a n t v a l u e s
// f o r each term
i n t c o u n t =0;
f o r ( v a l u e i n v a l u e s )
c o u n t++; // c o u n t number o f v a l u e s ,</p>
      <p>// e q u a l l i n g o c c u r r e n c e s
emit ( key , c o u n t ) ;
INPUT
A p C
A q B
D r D
E r D
F r C</p>
      <p>Map &lt;A,...&gt;</p>
      <p>&lt;C,...&gt;
&lt;p,...&gt;
.
.
.
&lt;C,...&gt;
&lt;r,...&gt;
&lt;F,...&gt;</p>
      <p>OUTPUT
Reduce &lt;C,2&gt; C 2
.
.</p>
      <p>.</p>
      <p>Reduce
p 1
r 3
q 1
D 3</p>
      <p>F 1
&lt;F,1&gt; ...
can be scheduled in parallel across many nodes. In this
example, the input triples can be split across nodes
arbitrarily, since the computations on these triples
(emitting the key/value pairs), are independent of each other.
the reduce operates on an iterator of values because
the set of values is typically far too large to t in
memory. This means that the reducer can only partially
use correlations between these items while processing:
it receives them as a stream instead of a set. In this
example, operating on the stream is trivial, since the
reducer simply increments the counter for each item.
the reduce operates on all pieces of data that share some
key. By assigning proper keys to data items during the
map, the data is partitioned for the reduce phase. A
skewed partitioning (i.e. skewed key distribution) will
lead to imbalances in the load of the compute nodes.
If term x is relatively popular the node performing the
reduce for term x will be slower than others. To use
MapReduce e ciently, we must nd balanced
partitions of the data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abouzeid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bajda-Pawlikowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rasin</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Silberschatz</surname>
          </string-name>
          . Hadoopdb:
          <article-title>An architectural hybrid of mapreduce and dbms technologies for analytical workloads</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <volume>922</volume>
          {
          <fpage>933</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Bello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Downing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Feenan</surname>
          </string-name>
          , Jr.,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Finnerty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. D.</given-names>
            <surname>Norcott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Witkowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ziauddin</surname>
          </string-name>
          .
          <article-title>Materialized views in oracle</article-title>
          .
          <source>In VLDB '98: Proceedings of the 24rd International Conference on Very Large Data Bases</source>
          , pages
          <volume>659</volume>
          {
          <fpage>664</fpage>
          , San Francisco, CA, USA,
          <year>1998</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Broekstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kampman</surname>
          </string-name>
          , and
          <string-name>
            <surname>F. van Harmelen. Sesame:</surname>
          </string-name>
          <article-title>A generic architecture for storing and querying RDF and RDF schema</article-title>
          .
          <source>In Proceedings of the International Semantic Web Conference (ISWC)</source>
          , pages
          <fpage>54</fpage>
          {
          <fpage>68</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burrows</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fikes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Gruber</surname>
          </string-name>
          .
          <article-title>Bigtable: A distributed storage system for structured data</article-title>
          .
          <source>ACM Transactions on Computer Systems (TOCS)</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <fpage>4</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Codd</surname>
          </string-name>
          .
          <article-title>Recent investigations in relational data base systems</article-title>
          .
          <source>In IFIP Congress</source>
          , pages
          <volume>1017</volume>
          {
          <fpage>1021</fpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          . Mapreduce:
          <article-title>Simpli ed data processing on large clusters</article-title>
          .
          <source>In Proceedings of the USENIX Symposium on Operating Systems Design &amp; Implementation (OSDI)</source>
          , pages
          <fpage>137</fpage>
          {
          <fpage>147</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Ivanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Kersten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Nes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          .
          <article-title>An architecture for recycling intermediates in a column-store</article-title>
          .
          <source>In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data</source>
          , pages
          <volume>309</volume>
          {
          <fpage>320</fpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kitsuregawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          , and
          <string-name>
            <surname>T.</surname>
          </string-name>
          Moto-Oka.
          <article-title>Application of hash to data base machine and its architecture</article-title>
          .
          <source>New Generation Comput.</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>63</volume>
          {
          <fpage>74</fpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Maler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Paoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Sperberg-McQueen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yergeau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Bray</surname>
          </string-name>
          .
          <article-title>Extensible markup language (XML) 1.0 (third edition). rst edition of a recommendation, W3C</article-title>
          , Feb.
          <year>2004</year>
          . http://www.w3.org/TR/2004/REC-xml-
          <volume>20040204</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud</surname>
          </string-name>
          <article-title>'hommeaux and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Seaborne</surname>
          </string-name>
          .
          <article-title>SPARQL query language for RDF. W3C recommendation, W3C</article-title>
          , Jan.
          <year>2008</year>
          . http://www.w3.org/TR/2008/REC-rdf-sparqlquery-
          <volume>20080115</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Urbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kotoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maassen</surname>
          </string-name>
          ,
          <string-name>
            <surname>F. van Harmelen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Bal</surname>
          </string-name>
          .
          <article-title>Owl reasoning with webpie: Calculating the closure of 100 billion triples</article-title>
          . In L. Aroyo,
          <string-name>
            <given-names>G.</given-names>
            <surname>Antoniou</surname>
          </string-name>
          , E. Hyvonen, A. ten Teije,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cabral</surname>
          </string-name>
          , and T. Tudorache, editors,
          <source>ESWC (1)</source>
          , volume
          <volume>6088</volume>
          of Lecture Notes in Computer Science, pages
          <volume>213</volume>
          {
          <fpage>227</fpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Urbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maassen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Bal</surname>
          </string-name>
          .
          <article-title>Massive semantic web data compression with mapreduce</article-title>
          .
          <source>In Proceedings of the MapReduce workshop at HPDC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weaver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Atre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hendler</surname>
          </string-name>
          .
          <article-title>Scalable reduction of large datasets to interesting subsets</article-title>
          .
          <source>In 8th International Semantic Web Conference (Billion Triples Challenge)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>