SpiderStore: Exploiting Main Memory for Efficient RDF Graph Representation and Fast Querying

Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, Günther Specht
Databases and Information Systems, Institute of Computer Science
University of Innsbruck, Austria
{firstname.lastname}@uibk.ac.at

ABSTRACT
The constant growth of available RDF data requires fast and efficient querying facilities for graph data. So far, such data sets have been stored by using mapping techniques from graph structures to relational models, secondary memory structures or even complex main memory based models. We present the main memory database SpiderStore, which is capable of efficiently managing large RDF data sets and provides powerful and fast SPARQL processing facilities. The SpiderStore storage concept aims at storing the graph structure in main memory without performing any complex mappings, thereby exploiting the natural web-structure of RDF through the fast, random access of main memory. The abandonment of additional mappings and meta-information leads to a significant performance gain compared to other common RDF stores.

Categories and Subject Descriptors
H.3.2 [Information Storage and Retrieval]: Information Storage; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.2 [Database Management]: Query Processing

General Terms
Performance, Algorithms, Design, Experimentation

Keywords
RDF, Main Memory, Database, RDF Store, Triple Store, SPARQL, Query Processing

To copy without fee all or part of this material is permitted only for private and academic purposes, given that the title of the publication, the authors and its date of publication appear. Copying or use for commercial purposes, or to republish, to post on servers or to redistribute to lists, is forbidden unless an explicit permission is acquired from the copyright owners, the authors of the material. Workshop on Semantic Data Management (SemData@VLDB) 2010, September 17, 2010, Singapore. Copyright 2010: www.semdata.org.

1. INTRODUCTION
Due to the ever growing, vast amounts of RDF data, the storage of this data is mostly realized on persistent media – either in a native store or within a relational database system. Storing huge amounts of RDF data in relational databases has been facilitated by most of the popular RDF stores and Semantic Web frameworks, as very big ontologies did not fit into main memory so far.

However, the relational database model was not intended for the storage of large graph structures. Current solutions for the storage of graphs in relational tables feature a mapping between the graph structure and the relational tables of the database system [2, 13, 20]. This mapping significantly slows down both the storing and the querying process, which causes a significant loss of performance. Another important factor is that the sequential access characteristics of persistent memory require multiple indices for fast data access, which basically requires a duplication of the stored data. Storing all information on persistent media also entails very costly I/O operations. This is especially limiting when performing join operations, which produce potentially large intermediate results that have to be joined. Depending on the type of mapping and the queries executed, these operations over a huge number of triples on secondary media can be an expensive task in terms of execution time or memory consumption.

Recent developments in the area of high-performance main memory OLTP and OLAP databases have minimized the I/O bottleneck by moving the whole database system into main memory, which has become appropriate in terms of space capacity even for large datasets. Systems facilitating such an architecture – like VoltDB [1] or HyPer [12] – process several 100,000 transactions per second while providing full ACID properties. Though they have moved to main memory, these approaches are still based on the relational paradigm and therefore suffer from an impedance mismatch when it comes to mapping graph-based data into a tabular layout. Hence these approaches have to deal with costly join operations and potentially large intermediate results when processing graph structures.

We present SpiderStore, a main memory based RDF database which overcomes these limitations, exploits the fast random read and write operations of main memory, and provides time and space efficient SPARQL query processing facilities.

The remainder of this paper is structured as follows. Section 2 outlines the memory layout of SpiderStore. Subsequently, Section 3 is concerned with the fast and efficient processing of SPARQL queries on the proposed main memory database. Section 4 features the evaluation of the SpiderStore prototype and Section 5 discusses related work. The paper is concluded by a summary and the description of future work in Section 6.

2. MEMORY LAYOUT
In the following section we sketch the lightweight storage layout of SpiderStore, which is optimized for fast and efficient SPARQL query processing on RDF graphs. In order to utilize the nature of the main memory architecture, the SpiderStore approach stores a graph natively as a set of nodes and pointers (edges). Due to the fact that main memory is more expensive than disk based memory and therefore limited, a very lightweight layout – without any complex mappings, index structures or additional meta-information – is required.
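The node-and-pointer idea can be made concrete with a small model. The following Python sketch is purely illustrative and not the SpiderStore implementation (the paper's layout uses dense per-node memory blocks; all class and method names here are ours): every subject, predicate and object becomes a node, a triple is wired as direct references between those nodes, and the references are grouped by predicate in both directions so the graph can be traversed from any starting point without a join.

```python
# Illustrative model of a node-and-pointer RDF store (names are ours, not SpiderStore's).
from collections import defaultdict

class Node:
    def __init__(self, identifier):
        self.identifier = identifier       # the identifier is stored in the node itself
        self.outgoing = defaultdict(list)  # predicate node -> list of object nodes
        self.incoming = defaultdict(list)  # predicate node -> list of subject nodes

class GraphStore:
    def __init__(self):
        self.index = {}                    # dictionary index: string identifier -> node

    def node(self, identifier):
        if identifier not in self.index:
            self.index[identifier] = Node(identifier)
        return self.index[identifier]

    def add_triple(self, s, p, o):
        subj, pred, obj = self.node(s), self.node(p), self.node(o)
        subj.outgoing[pred].append(obj)    # forward edge: subject -> object
        obj.incoming[pred].append(subj)    # backward edge enables right-to-left traversal

store = GraphStore()
store.add_triple("Node 1", "Node 2", "Node 3")
store.add_triple("Node 4", "Node 3", "Node 1")
# Traversal is pure pointer chasing: all objects reachable from "Node 1" via "Node 2".
objs = [o.identifier for o in store.node("Node 1").outgoing[store.node("Node 2")]]
```

Grouping the adjacency lists by predicate mirrors the clustering of same-predicate edges inside a node's memory block, and keeping both `outgoing` and `incoming` mirrors the bidirectional edges the layout relies on.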
These conditions are satisfied by the following memory layout, which is based on two basic building blocks:

Nodes within an RDF graph are either subjects, objects or predicates, where each subject is connected to an object by an appropriate edge, such that the predicate annotating the edge is again a node. Furthermore, the identifier of each node is stored within the node itself.

Edges connect two nodes, one serving as subject and one as object. Edges are implicitly realized by pointers from the subject node to the object node.

To be able to browse through the graph structure in order to answer queries efficiently, the memory layout is optimized for graph traversal operations. This is implemented by storing all edges belonging to a certain node in a dense memory block. This implies that scanning all edges of a certain node only requires a linear scan of a contiguous memory block. Within this block, all edges of the same predicate are clustered and linked to the predicate node itself. Furthermore, all predicates are stored as nodes as well. The RDF data model is a directed graph where all edges point from the subject to the object. However, our approach features bidirectional edges, since query processing performance can be increased significantly by the possibility of browsing through the graph in any direction. The benefit of such a structure – where subjects, objects and predicates are nodes connected by a pointer structure – is that the graph can be traversed efficiently in any direction from any starting point within the graph. An additional index which consists of all predicates and lists of their sources (subjects) even makes it possible to start the traversal at a predicate. Access to nodes via their identifier is guaranteed within a time complexity of O(log(n)) by using an additional index. This index is used for the fast and efficient conversion between strings (e.g. URIs) and main memory nodes.

Figure 1 sketches the memory model for the storage of RDF data.

Figure 1: RDF Store Main Memory Layout

The illustrated example features the following nodes: Node 1 (subject) is connected by Node 2 (predicate) to the object Node 3. Node 3 itself is also the predicate of the connection between Node 4 and Node 1, where Node 4 is the subject and Node 1 is the object. The following listing shows the example data from Figure 1 represented in triple notation:

<Node 1> <Node 2> <Node 3> .
<Node 4> <Node 3> <Node 1> .

The estimated memory consumption (number of bytes m) for a given RDF data set stored in the described memory layout can be calculated by applying this formula:

    m = (#nodes * 5 + #edges * 3) * sizeof(pointer)

where #nodes is the overall number of nodes and #edges is the number of edges (triples) within the RDF data set. The subformula #edges * 3 contains the space estimate for the incoming and the outgoing edge (bidirectional) and the link between the edge and the corresponding entry node. For each node, space for a pointer to all its incoming edges and a pointer to all its outgoing edges, together with their degrees and a pointer from the string index (dictionary), has to be allocated, which accounts for the subformula #nodes * 5. The default pointer size on a 64-bit architecture is 8 bytes. The formula does not consider the strings (identifiers) themselves, which are stored in a dictionary-like string index. Strings (URIs, literals) are stored only once, which results in an additional memory consumption of the sum of length_in_bytes(unique_string_i) over all unique strings, where the number of unique strings within the data set is denoted by n.

3. QUERY ENGINE
In the SpiderStore system, the benefits of storing all graph data within main memory are exploited for the query process over the stored data.

Secondary memory-based RDF stores prefer breadth-first search, because the huge amount of random memory accesses incurred by graph traversal would be disadvantageous on disk. This approach naturally leads to intermediate results. Due to the typically high volume of RDF data sets, intermediate results tend to be very large [3], which is a limiting factor for the performance of the search algorithm, as writing and reading of intermediate results is very expensive in terms of time and space. In the context of a main memory store, the limited amount of memory available is a crucial factor when handling intermediate results.

A memory efficient solution is accomplished by applying depth-first search instead of breadth-first search. As argued by Gardarin et al. [9], depth-first search is an efficient operation for traversing paths as long as all data is kept in memory, which is true for SpiderStore by design. Furthermore, they point out that by using depth-first search no intermediate results are required to traverse existing paths. Therefore, the complexity of a query is only restricted by the amount of main memory used to store the overall result. For queries which do not require any postprocessing operations like sorting or grouping, SpiderStore provides the results in a stream-based way and therefore does not require any additional memory.

The SpiderStore query engine splits SPARQL queries into so-called "restriction triples", which are similar to Basic Graph Patterns [16]. The result set of a query is defined by all variable assignments (paths) satisfying a given set of restrictions. Therefore, these restriction triples can be used to traverse the subgraph that contains the result data in depth-first order. Figure 2 illustrates an exemplary query which selects all scientists who were born in a Swiss city and whose doctoral advisor was born in a German city. Nodes which are marked by a "?" represent variables and are bound by the query engine during the graph walk.
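The depth-first evaluation of restriction triples can be sketched as follows. This is a hypothetical simplification, not SpiderStore's engine: triples are plain tuples instead of pointer-linked nodes, recursion stands in for the explicit execution stack, and all names are ours. What it does show is the core mechanism of a shared variable assignment being extended restriction by restriction, with results streamed out as soon as every restriction on the current path is satisfied, without materializing intermediate results.

```python
# Sketch of depth-first restriction-triple matching over an in-memory triple set.
def match(triples, patterns, bindings=None):
    """Yield one shared variable assignment per result, stream-style."""
    bindings = bindings or {}
    if not patterns:                          # all restrictions satisfied by this path
        yield dict(bindings)
        return
    pattern, rest = patterns[0], patterns[1:]
    for triple in triples:
        step = dict(bindings)                 # extend a copy of the shared assignment
        for term, value in zip(pattern, triple):
            if term.startswith("?"):          # variable: unify with current assignment
                if step.setdefault(term, value) != value:
                    break
            elif term != value:               # constant: must match exactly
                break
        else:
            yield from match(triples, rest, step)   # descend depth-first

data = [
    ("p1", "type", "scientist"),
    ("p1", "bornInLocation", "city1"),
    ("city1", "locatedIn", "Switzerland"),
]
results = list(match(data, [("?p", "type", "scientist"),
                            ("?p", "bornInLocation", "?city"),
                            ("?city", "locatedIn", "Switzerland")]))
```

Placing the most selective pattern first, as the selectivity heuristic described below does, keeps the branching at each recursion level, and hence the number of nodes touched, as small as possible.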
The solid lines in the figure represent the structure of the subgraph. The dot-dashed lines represent the order in which the triple patterns are applied to the stored RDF graph by the query engine.

Figure 2: Example Query

To determine the execution order of the triples, selectivity heuristics for basic graph patterns as described in [16] are exploited. SpiderStore determines the execution order by implicitly storing basic statistical data. This data contains information about the number of property instances and the number of incoming and outgoing edges grouped by the property of each node. Hence this statistical data does not need to be precomputed and can be used to determine the execution order by ordering the restriction triples based on their selectivity. Therefore, the restriction triples containing the most selective nodes, edges or combinations of these are applied first in order to keep the number of nodes which have to be processed as small as possible.

The processing order of the restrictions is crucial, as the restrictions may share variables, which have to be unified and therefore have to be processed in the correct order. The fact that SpiderStore connects all nodes bidirectionally is very beneficial for this computation, as restriction triples can be processed in left-to-right or right-to-left order. After the execution order is defined, a set of seed nodes is determined, which marks the starting point(s) of the search process. Depending on the selectivity, these seeds can either be a restricted subject or object node (e.g. "scientist" in the example query in Figure 2) or nodes which are connected to a very selective predicate. During the next steps, the restrictions are applied in depth-first order. For each matching restriction, the next restriction in order is pushed onto an execution stack. A result is found if all available restrictions are pushed onto this stack and are therefore satisfied by the current path. Subsequently, the last restriction is popped from the stack and the next edge or node at the current position is processed. The iteration continues until the stack is empty and no more nodes are available for processing. The variables defined within a query are tracked by using a shared variable assignment which is used during the graph traversal to allow the unification of variables. Every time a result is detected, the current state of these variable assignments is returned as a result. The result is streamed to the client or saved for further postprocessing (e.g. filter conditions and grouping).

4. EVALUATION
All experiments were conducted on a server equipped with two Intel Xeon L5520 Quad Core CPUs, 2.27 GHz, Linux kernel 2.6.18, CentOS, 64-bit architecture and 48 GB main memory.

4.1 Data Sets
For the evaluation we used two data sets: YAGO [17] and DBpedia [4]. The YAGO ontology is a huge knowledge repository based on Wikipedia and WordNet data. YAGO consists of 39,193,669 triples (93 predicates, 33,951,502 unique subjects and objects). On this data set, we executed the SPARQL queries which already served as a benchmark for the RDF-3X approach [13]. The DBpedia project [4] is concerned with the extraction of triple information contained within Wikipedia pages and infoboxes. The data set contains a total of 94,839,012 triples (39,664 predicates, 23,090,848 unique subjects and objects). For the DBpedia data set, we used the project's SPARQL example queries, extended them and added further complex queries in order to cover common query types (similar to the YAGO query set). An exact formulation of the queries used for the evaluation can be found in the appendix.

4.2 Evaluated Systems
For the evaluation of SpiderStore, we compared it against the main memory RDF store Sesame version 2.3.1 [7] without an inferencer enabled. Additionally, we compared it to the native secondary memory stores Jena TDB version 0.8.6 [20] and a standard installation of RDF-3X version 0.3.4 [13]. All systems were granted a maximum of 16 GB main memory for their computations. However, we were only able to set up Sesame for our tests by granting it 30 GB of main memory. All systems feature single threaded query engines and only use one out of the 16 available cores when executing queries. For the secondary storage systems, the warm cache measurements were conducted by executing the queries five times without dropping the caches and taking the best result for the evaluation.

4.3 Evaluation Results
The query execution times for the YAGO data set can be seen in Table 1, where "dnf" marks a query which was aborted after a run time of 10 minutes without having obtained a result. On the YAGO data set, SpiderStore is able to compute the query results significantly faster than the other systems in 5 out of 8 queries. SpiderStore also performs better than all other systems with regard to the geometric mean of all query execution times. The secondary memory system RDF-3X performs significantly better than the main memory system Sesame. However, the lightweight storage structure of SpiderStore performs better than RDF-3X on warmed caches. Queries consisting of triple patterns which contain many variables can result in a large number of paths which have to be validated. The order in which the restriction triples of such a query are processed is crucial. This fact is beneficial for RDF-3X, as it features a very mature query optimization engine, which is heavily based on statistics precomputed during the import process. SpiderStore currently contains a very naive optimizer and is therefore not able to exploit such facts, which results in slower query execution compared to RDF-3X on queries A2, B3 and C1.

Query        A1      A2      A3       B1      B2      B3      C1      C2      geom. mean
SpiderStore  0.0012  0.0765  0.0126   0.0518  0.0055  0.9984  0.3558  0.0196  0.0352
Sesame       0.0849  0.0387  0.1072   0.2252  0.1591  0.3765  dnf     0.0386  0.3200
RDF-3X       0.0184  0.0166  0.0360   0.0787  0.0381  0.0929  0.3377  0.0534  0.0522
Jena TDB     0.3090  0.9000  125.375  2.5100  2.9310  4.8552  dnf     8.9253  7.1286
#results     1       7       147      56      3,714   74      1,842   1

Table 1: Query Runtimes on the YAGO data set (in seconds).

The query execution times for the experiments on the DBpedia data set are listed in Table 3. The experiments on the DBpedia data set showed that SpiderStore clearly surpasses all other systems in terms of execution time, even though the query engine optimizer is very limited and not very mature. For example, projections on certain variables are not taken into account during the processing of queries. Therefore, the system tries to satisfy all variables occurring within the query, even if they are eventually not even asked for in the query, which again leads to a computational overhead during the execution of queries. Furthermore, queries featuring "union" or "optional" statements are currently split up into two queries, executed sequentially, and the results are joined in the end.

Query        Q1       Q2       Q3      Q4      Q5      Q6      Q7      geom. mean
SpiderStore  0.0378   0.00005  0.0002  0.0287  0.1127  0.0018  0.0005  0.00272
Sesame       80.9006  0.0010   0.0009  0.2184  0.2341  0.0541  0.0099  0.0572
RDF-3X       0.1197   0.0338   0.0273  0.0288  0.6778  0.0351  0.0058  0.0460
Jena TDB     dnf      dnf      94.246  1.128   1.623   6.644   0.200   13.5192
#results     2        4        21      1,560   1,172   1       2

Table 3: Query Runtimes on the DBpedia data set (in seconds).

In addition to the query execution time measurements, the import times for the compared systems were measured. We define the import time as the time required for a system to load all data and build all the required indices. Table 2 lists the import times for the compared systems. SpiderStore outperforms the other systems, as due to the implicit statistical information no additional statistics have to be computed. However, we have to note that in terms of restart time, disk based systems like RDF-3X and Jena TDB naturally outperform main memory systems, as all data structures are kept on secondary memory and therefore do not have to be built on every startup of the system, whereas for main memory systems the import has to be performed on every startup.

System       YAGO   DBpedia
SpiderStore  7:50   29:41
Sesame       25:47  53:51
RDF-3X       16:15  48:21
Jena TDB     38:52  53:49

Table 2: Import Times (in minutes)

5. RELATED WORK
Basically there are two approaches for storing RDF data in triple form: (i) storing triples in a traditional relational database or (ii) using a native triple store. The first, very popular approach maps all RDF triples onto tables of relational databases. This can either be realized by using a potentially very big triple table or by splitting the triples according to their predicates and storing all triples featuring the same predicate in a separate table (property tables). The worst case scenario for property tables is obtaining one table per predicate. However, the clustered property table approach tries to solve this problem by clustering predicates into tables based on their co-occurrence within the data set. These different approaches have been evaluated and benchmarked in [18].

There are several approaches for main memory RDF stores: Brahms [11] is a main memory RDF store which aims at high-performance association discovery based on variable-length queries on the RDF graph. GRIN [19] is concerned with the creation of an RDF index optimized for long path queries. In contrast, SpiderStore aims at efficiently answering all kinds of SPARQL queries and is not specialized on any particular query type.

As for the popular Semantic Web frameworks, Sesame [7] provides relational storage, in-memory storage and a native storage engine. The Sesame in-memory storage engine uses a list of quadruples called statements, where each statement consists of a subject, predicate, object and context node. Each node maintains statement lists of the node's occurrences as subject, predicate, object and context. Jena [20] also provides both a relational store (SDB) and a native triple store (TDB). Virtuoso [8] can be run on top of a native triple store (Virtuoso Triple Store) or on top of a relational database system (Virtuoso RDF Views). Abadi et al. [2] and Sidirourgos et al. [15] exploited column-wise storage of RDF data. YARS2 [10] facilitates native storage and makes use of six (relational) indices for subject, predicate, object and context. The RDF-3X system [13] is also based on extensive, heavily compressed indices. RDF-3X features indices for all possible orderings and subsets of subject, object and predicate. It also provides extensive heuristics about the stored data, which are further exploited for the processing of queries. Furthermore, Neumann and Weikum proposed join order optimizations based on sideways information passing and selectivity heuristics [14]. However, such extensive index structures are neither feasible nor required when dealing with main memory. The BitMat project [3] provides a lightweight RDF index for main-memory RDF stores which is based on a BitMat, a bit-cube index responsible for compressed storage and querying of triples. All of these systems have already been compared extensively in [6, 13]. As for the optimization of RDF storage and querying, Stocker et al. [16, 5] focus on the optimization of Basic Graph Patterns based on selectivity heuristics.

6. CONCLUSION AND FUTURE WORK
In this paper we presented a lightweight in-memory storage layout for RDF graphs which was implemented in a first prototype called SpiderStore. This layout provides fast and efficient in-memory RDF querying and storage without having to perform any complex mapping. Our experiments showed that a simplistic storage layout which is specifically designed for graph data is able to outperform common stores based on the relational model. The performance gain is based on (1) the lightweight storage model, which represents edges as memory pointers and therefore allows direct node jumps in the graph, (2) implicit statistics about selectivity, which can be used to optimize the processing order, and (3) depth-first query processing, which renders intermediate results and expensive join chains unnecessary. Future work on SpiderStore will include the optimization of the query execution order, which currently relies on a naive algorithm, based on the implicit statistical information. Furthermore, we will improve the memory management and layout regarding writing facilities.

7. REFERENCES
[1] VoltDB. Technical report, March 2010. http://www.voltdb.com/_pdf/VoltDBOverview.pdf.
[2] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB '07: Proceedings of the 33rd international conference on Very Large Data Bases, pages 411-422. VLDB Endowment, 2007.
[3] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "bit" loaded: a scalable lightweight join query processor for RDF data. In WWW '10: Proceedings of the 19th international conference on World Wide Web, pages 41-50, New York, NY, USA, 2010. ACM.
[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722-735, 2007.
[5] A. Bernstein, C. Kiefer, and M. Stocker. OptARQ: A SPARQL optimization approach based on triple pattern selectivity estimation. Technical report, Department of Informatics, University of Zurich, 2007.
[6] C. Bizer and A. Schultz. The Berlin SPARQL benchmark. International Journal On Semantic Web and Information Systems, 5(1), 2009.
[7] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. The Semantic Web - ISWC 2002, pages 54-68, 2002.
[8] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. Networked Knowledge - Networked Media, pages 7-24.
[9] G. Gardarin, J. Gruser, and Z. Tang. Cost-based selection of path expression processing algorithms in object-oriented databases. In Proceedings of the international conference on Very Large Data Bases, pages 390-401, 1996.
[10] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. The Semantic Web, pages 211-224.
[11] M. Janik and K. Kochut. Brahms: A workbench RDF store and high performance memory system for semantic association discovery. The Semantic Web - ISWC 2005, pages 431-445.
[12] A. Kemper and T. Neumann. HyPer: Hybrid OLTP & OLAP high performance database system. Technical Report TU-I1010, TU Munich, Institute of Computer Science, Germany, May 2010.
[13] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1):647-659, 2008.
[14] T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of Data, pages 627-640, New York, NY, USA, 2009. ACM.
[15] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and S. Manegold. Column-store support for RDF data management: not all swans are white. Proceedings of the VLDB Endowment, 1(2):1553-1563, 2008.
[16] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In WWW '08: Proceedings of the 17th international conference on World Wide Web, pages 595-604, New York, NY, USA, 2008. ACM.
[17] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In 16th international World Wide Web conference (WWW 2007), New York, NY, USA, 2007. ACM Press.
[18] Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking database representations of RDF/S stores. The Semantic Web - ISWC 2005, pages 685-701, 2005.
[19] O. Udrea, A. Pugliese, and V. Subrahmanian. GRIN: A graph based RDF index. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1465. AAAI Press; MIT Press, 2007.
[20] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds, et al. Efficient RDF storage and retrieval in Jena2. In Proceedings of SWDB, volume 3, pages 7-8. Citeseer, 2003.

APPENDIX
A. QUERIES
A.1 YAGO Data Set
A1: select ?gn ?fn where { ?gn <givenNameOf> ?p. ?fn <familyNameOf> ?p. ?p <type> scientist. ?p <bornInLocation> ?city. ?p <hasDoctoralAdvisor> ?a. ?a <bornInLocation> ?city2. ?city <locatedIn> Switzerland. ?city2 <locatedIn> Germany. }
A2: select ?n where { ?a <isCalled> ?n. ?a <type> actor. ?a <livesIn> ?city. ?a <actedIn> ?m1. ?a <directed> ?m2. ?city <locatedIn> ?s. ?s <locatedIn> United States. ?m1 <type> movie. ?m1 <producedInCountry> Germany. ?m2 <type> movie. ?m2 <producedInCountry> Canada. }
A3: select distinct ?n ?co where { ?p <isCalled> ?n. { ?p <type> actor } union { ?p <type> athlete }. ?p <bornInLocation> ?c. ?c <locatedIn> ?s. ?s <locatedIn> ?co. ?p <type> ?t. filter(?t reaches politician via <subClassOf>) }
B1: select distinct ?n1 ?n2 where { ?a1 <isCalled> ?n1. ?a1 <livesIn> ?c1. ?a1 <actedIn> ?movie. ?a2 <isCalled> ?n2. ?a2 <livesIn> ?c2. ?a2 <actedIn> ?movie. ?c1 <locatedIn> England. ?c2 <locatedIn> England. filter (?a1 != ?a2) }
B2: select ?n1 ?n2 where { ?p1 <isCalled> ?n1. ?p1 <bornInLocation> ?city. ?p1 <isMarriedTo> ?p2. ?p2 <isCalled> ?n2. ?p2 <bornInLocation> ?city. }
B3: select distinct ?n1 ?n2 where { ?n1 <familyNameOf> ?p1. ?n2 <familyNameOf> ?p2. ?p1 <type> scientist. ?p1 <hasWonPrize> ?award. ?p1 <bornInLocation> ?city. ?p2 <type> scientist. ?p2 <hasWonPrize> ?award. ?p2 <bornInLocation> ?city. filter (?p1 != ?p2) }
C1: select distinct ?n1 ?n2 where { ?n1 <familyNameOf> ?p1. ?n2 <familyNameOf> ?p2. ?p1 <type> scientist. ?p1 ?i1 ?city. ?p2 <type> scientist. ?p2 ?i2 ?city. ?city <type> <site>. filter (?p1 != ?p2) }
C2: select distinct ?n where { ?p <isCalled> ?n. ?p ?i1 ?c1. ?p ?i2 ?c2. ?c1 <type> <village>. ?c1 <isCalled> London. ?c2 <type> <site>. ?c2 <isCalled> Paris. }

A.2 DBpedia Data Set
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dbpedia2: <http://dbpedia.org/property/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix dbo: <http://dbpedia.org/ontology/>

Q1: select ?name ?name2 ?place where { ?person dbo:birthPlace ?place. ?person foaf:name ?name . ?place skos:subject <http://dbpedia.org/resource/Category:European_Capitals_of_Culture> . ?person2 dbo:birthPlace ?place . ?person2 foaf:name ?name2 . ?person skos:subject ?type1 . ?person2 skos:subject ?type2 . ?type1 skos:broader <http://dbpedia.org/resource/Category:Musicians> . ?type2 skos:broader <http://dbpedia.org/resource/Category:Musicians> . filter(?name != ?name2) }
Q2: select ?name1 ?name2 where { ?person foaf:name ?name1 . ?person2 foaf:name ?name2 . ?person dbpedia2:influences ?person2 . ?person2 dbpedia2:awards <http://dbpedia.org/resource/Time_100:_The_Most_Important_People_of_the_Century> . }
Q3: select ?name where { ?vehicle skos:subject ?cat . ?vehicle dbpedia2:transmission "6-speed manual" . ?vehicle dbpedia2:name ?name . ?vehicle dbo:manufacturer ?man . ?man dbo:location ?location . ?location dbo:leaderName <http://dbpedia.org/resource/Alfred_Lehmann> . ?cat skos:broader <http://dbpedia.org/resource/Category:Luxury_vehicles> . }
Q4: select distinct ?person ?person2 where { ?person ?z <http://dbpedia.org/resource/Category:IBM_Fellows>. ?person ?prop ?value. ?person2 ?prop2 ?value2. ?person2 ?prop3 <http://dbpedia.org/resource/Category:IBM_Fellows>. filter (?person != ?person2) }
Q5: select distinct ?scientist ?doctor where { ?scientist rdf:type <http://dbpedia.org/ontology/Scientist>. ?scientist <http://dbpedia.org/ontology/doctoralStudent> ?doctor. ?doctor ?property ?value. ?scientist ?prop2 ?value. }
Q6: select distinct ?player where { ?s dbpedia2:name ?player. ?s rdf:type <http://dbpedia.org/ontology/SoccerPlayer>. ?s dbpedia2:position "Goalkeeper". ?s <http://dbpedia.org/property/clubs> ?club. ?club <http://dbpedia.org/ontology/capacity> "32609" . ?s <http://dbpedia.org/ontology/birthPlace> ?place. ?place dbpedia2:populationCensus "49138831". }
Q7: select ?name ?birth ?description ?person ?x where { ?person dbo:birthPlace <http://dbpedia.org/resource/Berlin>. ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians>. ?person dbo:birthDate ?birth. ?person foaf:name ?name. }