=Paper= {{Paper |id=None |storemode=property |title=SpiderStore: A Native Main Memory Approach for Graph Storage |pdfUrl=https://ceur-ws.org/Vol-733/paper_binna.pdf |volume=Vol-733 |dblpUrl=https://dblp.org/rec/conf/gvd/BinnaGZPS11 }} ==SpiderStore: A Native Main Memory Approach for Graph Storage== https://ceur-ws.org/Vol-733/paper_binna.pdf
     SpiderStore: A Native Main Memory Approach for Graph
                            Storage

          Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, Günther Specht
                            Databases and Information Systems, Institute of Computer Science
                                             University of Innsbruck, Austria
                                            {firstname.lastname}@uibk.ac.at


ABSTRACT                                                                   been developed. Despite the highly connected nature of
The ever increasing amount of linked open data results in a                these graphs, the main approaches proposed in this con-
demand for high performance graph databases. In this pa-                   text are facilitating technologies originating from relational
per we therefore introduce a memory layout which is tailored               databases. Even though these represent major and robust
to the storage of large RDF data sets in main memory. We                   technologies, they were not tailored for the scenario of stor-
present the memory layout SpiderStore. This layout features                ing graph based structures. At the same time the ever in-
a node centric design which is in contrast to the prevailing               creasing capacities of main memory and the increasing num-
systems using triple focused approaches. The benefit of this               bers of cores have lead to an architectural shift in the de-
design is a native mapping between the nodes of a graph                    velopment of databases and information systems by mov-
onto memory locations connected to each other. Based on                    ing from hard disk to main memory as the primary stor-
this native mapping an addressing schema which facilitates                 age device. Whereas these architectural changes lead to
relative addressing together with a snapshot mechanism is                  enormous performance improvements, when implementing
presented. Finally a performance evaluation, which demon-                  graph-operations like graph traversals they still have to be
strates the capabilities, of the SpiderStore memory layout                 implemented through costly self join operations. Despite the
is performed using an RDF-data set consisting of about 190                 possibility of supporting these operations with appropriate
mio triples.                                                               index structures they still take O(log(n)) where n denotes
                                                                           the number of index entries. Therefore we present the Spi-
                                                                           derStore storage concept as an in-memory storage approach,
Categories and Subject Descriptors                                         which allow to process edges in O(1). In contrast to previ-
H.2.2 [Database Management]: Physical Design; H.3.2                        ous work [3] the storage layout and space estimations are
[Information Storage and Retrieval]: Information Stor-                     captured in more detail. In addition, a new relative ad-
age; H.3.3 [Information Storage and Retrieval]: Infor-                     dressing scheme is introduced. The successive sections are
mation Search and Retrieval                                                structured as follows. Chapter 2 deals with the memory lay-
                                                                           out and space estimations. Chapter 3 explains the relative
                                                                           addressing scheme used for faster restarts and for snapshot
General Terms                                                              generation. In chapter 4 we present an evaluation of the pre-
Performance, Algorithms, Design, Experimentation                           sented technology using the YAGO2 [8] data set. Chapter
                                                                           5 discusses the related work in the field of RDF-databases.
Keywords                                                                   Finally Chapter 6 draws a conclusion and makes forecasts
                                                                           for possible future work.
RDF, Main Memory, Database, RDF Store, Triple Store,
SPARQL, Addressing Scheme
                                                                           2.   MEMORY LAYOUT
1.    INTRODUCTION                                                            This section represents a detailed description over the Spi-
   Due to the increasing significance of linked open data and              derStore memory layout. The aim of this memory layout
publicly available SPARQL-endpoints, the need for high per-                is to provide an in-memory storage of graphs, where the
formance graph databases has increased. To meet those                      basic operation of navigating between two vertices can be
requirements several approaches for storing and retrieving                 done in O(1). Therefore, the node is the core component
large RDF (Resource Description Framework) graphs have                     of the layout. This is in contrast to the concept favored by
                                                                           most triple stores where the the triple represent the atomic
                                                                           building block. To realize this concept of a node centric
                                                                           layout, two factors have to be fulfilled. The first is that
                                                                           all edges belonging to a node need to be stored in a sin-
                                                                           gle place, which allows to navigate back and forth along all
                                                                           edges. The second factor is that there need to be a direct
                                                                           connection between those nodes that can be resolved within
                                                                           a single operation. These requirements can be fulfilled by
23rd GI-Workshop on Foundations of Databases (Grundlagen von Daten-
banken), 31.05.2011 - 03.06.2011, Obergurgl, Austria.                      an in-memory storage layout, which is designed as follows.
Copyright is held by the author/owner(s).                                  Each node has to be located at a unique location within




                                                                      91
memory. At each of these locations the information about
ingoing and outgoing edges is stored and grouped by the re-
                                                                              wordnet_scientist             icate      
spective edge label. The information about edges itself is not                                       pred
stored directly but implicitly through pointers. Therefore,
                                                                                   outgoing                                .....
the traditional pointer structures, which themselves can be
seen as graphs connecting different memory locations are
used to represent arbitrary graphs.                                                                  object

                                                                                                                       "scientist"
Each node contains a set of ingoing and a set of outgoing
edges (pointers). These pointers are grouped by the predi-
                                                                                                                           .....
cates labeling the corresponding edges. Grouping is done by
the address of the predicates. This address is equal to the
address of the property within the predicate index, which
stores a list of all subjects occuring together with a specific                                        predicate
                                                                                    ingoing                              
property.
Beside the raw pointer data, implicit statistical information
                                                                                                                           .....
is stored. This is the case for the number of nodes, the num-
ber of predicates and the number of strings. Furthermore,
for each subject the number of predicates and for each sub-
ject/predicate combination the number of objects is stored.                                            su
                                                                                                            bje
The same information is stored the other way around (the                                                          ct    Einstein
number of predicates for an object and the number of pred-
icate/subject combinations for an object).                                                                                 .....


To illustrate this layout, Figure 1 visualizes the memory lay-
out of a short example. The main node emphasized in this
example represents the category of a wordnet_scientist.
It can be seen from this example that two different types of                   Figure 1: Memory Layout - Example
edges exist: ingoing and outgoing edges. Both types them-
selves group their destination pointers by the predicate node.
In Figue 1, an outgoing edge with the property hhasLabeli
and an incoming edge with the property htypei is featured.
To simplify matters, URIs are abbreviated and marked with                   m=    (#nodes ∗ (5 + degree ∗ 2) + #edges ∗ 3)
angle bracket while literals are put in quotes. As it can                          ∗ sizeof (pointer) + sizeof (dictionary)
be seen in the example, all nodes independent of their type
                                                                          This formula consists of three parts, i.e. the size of the
(URI, literals, . . .) share a uniform node layout. For example
                                                                       data dictionary, the fraction influenced by the number of
the node "scientist" has the same structure as the node
                                                                       nodes and the fraction influenced by the number of edges.
hwordner_scientisti. The facts described in this example
                                                                       The part of this formula that depends on the number of
are that hEinsteini is of htypei hwordner_scientisti and
                                                                       nodes can be interpreted as follows. For each node a data
that this category has a label with the name "scientist".
                                                                       structure is stored consisting of the number of incoming
The triple notation of the example in Figure 1 is shown in
                                                                       edges and a link to the table containing the incoming edges
the listing below:
                                                                       as well as the number of outgoing edges and a link to the ta-
                                                                       ble containing the outgoing edges. Furthermore a pointer to
...
                                                                       the corresponding entry within the dictionary table is stored
hEinsteini htypei hwordner_scientisti
                                                                       within each node. The term degree ∗ 2 can be explained by
hwordner_scientisti hhasLabeli "scientist"
                                                                       the pointers which group the edges by their corresponding
...
                                                                       predicates. For each predicate there exist a pointer to the
                                                                       edges itself and a counter, which is used for statistics. Be-
                                                                       cause all edges are bidirectional, the estimation is quite ac-
                                                                       curate even though it does not take the number of distinct
2.1    Space Estimations                                               subjects or objects into account.
   Given that SpiderStore is a main memory based graph                 The part of the formula, which depends on the number of
store, space consumption becomes a crucial factor. There-              edges, can be derived of the following facts. For each edge
fore we introduce and discuss a formula that can be used               the destination is stored on both ends. Furthermore, an
to calculate the expected amount of memory needed for a                additional entry is stored within the predicate index which
specific set of data. To describe a specific data set we use           allows to start the query execution not only at subject or
the following variables. The variable #nodes represents the            object nodes but as well at predicate nodes.
total number of distinct nodes within a graph. A node can              As an example the YAGO2 [8] data set used throughout
either be a URL or a character string and can be be used               the evaluation consists of 194,35 million triples and of 28,74
as subject, predicate or object. The second variable used              million distinct nodes. The dictionary size in the case of
for calculating the expected space consumption is the total            the YAGO2 data set is roughly about 1.1 gigabytes. In this
number of triples or facts #notriples. The space consump-              example 1.2 would be an appropriate value for the variable
tion is then calculated as follows:                                    degree. This number can be derived by counting the number




                                                                  92
of all distinct subject-predicate combinations and dividing it        4.    EVALUATION
by the total number of nodes. By apply these values to the              As a platform for the evaluation a server equipped with
formula above a space consumption of about 7,4 gigabytes              two Intel Xeon L5520 Quad Core CPUs, 2.27 GHz, Linux
might be expected. In comparison to the real footprint,               kernel 2.6.18, CentOS, 64-bit architecture and 96 GB main
which is about 7.2 gigabytes this value is quite accurate.            memory was used.
Hence knowing the basic facts about a data set allows to es-
tablish an adequate estimation about the expected memory              4.1    DataSet
footprint.                                                              For the evaluation we used the YAGO2 [8] data set. The
                                                                      YAGO2 data set is the successor of the YAGO [16] data set
                                                                      and represents a large semantic knowledge base which is gen-
                                                                      erated on the basis of Wikipedia, WordNet and GeoNames.
                                                                      The data set consist of of 194,350,853 triples (98 predicates,
3.   RELATIVE ADDRESSES                                               28,744,214 unique subjects and objects). The queries exe-
   Using absolute pointer addresses for the location of nodes         cuted on this data set are derived from the queries on the
enables the navigation between two nodes in O(1) as it has            YAGO data set used in the benchmark presenting the RDF-
been explained in the section before. The drawback of this            3X approach [12].
concept is that storing a persistent state of a whole graph
database can become quite complex. The reason for this is             4.2    Evaluated Systems
that the database can only be dumped either by serializing              For the evaluation of SpiderStore we used Virtuoso [6] in
the whole graph and later deserializing the whole graph or            version 6.1.3, RDF-3X [12] in version 0.3.5 and Jena TDB
by dumping the whole memory snapshot and swizzling the                [19] in version 0.8.9. The decision for these systems was
pointers to their correct representation at load time.                taken to the best of our knowledge. Even though Spider-
Another alternative to these approaches is to arrange the             Store is the only main memory system tested, the decision
whole database content based on an offset address. Hence              for choosing the other systems is accurate. The reason for
all memory locations have to be converted into relative ad-           this is that those systems are assumed to be the currently
dresses based on this offset. Resolving such relative pointers        fastest systems available and that the benchmark results are
yields in an overhead, which is negligible compared to the            measured with warm caches on a system where the whole
total amount of time needed to fetch an arbitrary junk of             database would be able to fit into main memory. All systems
memory. The time for fetching an arbitrary chunk of mem-              were granted a maximum of 40 GB to ensure that sufficient
ory can vary between some cpu cycles, when accessing mem-             space is available.
ory already located within L1-cache, up to some hundreds
cpu cycles, when accessing a random chunk of memory which             4.3    Evaluation Results
is not available in the cache hierarchy yet.                             The test results are separated into two parts. The first
In the context of SpiderStore we decided to use relative ad-          part compares the bulk import times of the different sys-
dressing. This has two advantages. The first advantage                tems. The bulk import time is specified as the time needed
is that the general architecture of SpiderStore is still ap-          to load the data, to create the indexes and to ensure that a
plicable and the overhead introduced through relative ad-             persistent state of the database is written to disk. A sum-
dressing is insignificant in the SpiderStore concept as it has        mary of the load times can be seen in Table 1. As can be
been explained before. The main advantage of this ad-                 seen, SpiderStore is significantly faster than any of the other
dressing scheme is that database restarts can be executed             systems. The reason for this is that SpiderStore, due to its
within a few milliseconds. This is possible by facilitating           implicit statistics, does not need to explicitly create statis-
the Unix memory mapping techniques which does not dis-                tics or indexes.
miss the mapped pages unless another executable allocates
large amounts of memory. Furthermore this concept allows                                System         Load Time
to facilitate the copy on write approaches used by the Hy-
Per project [10]. This approach benefits from the operating                             SpiderStore        1:09:18
system’s memory management, which would allow different                                 Jena               1:36:35
processes to share large memory segments, while preserving                              RDF-3X             1:21:12
an isolated view on the data.
                                                                                        Virtuoso           3:32:16
Due to the lack of a customized memory allocator, the Spi-
derStore snapshot mechanism is currently implemented as               Table 1: Import Times (in hours, minutes and sec-
a read only approach. Each snapshot is split up into five             onds)
parts. One part is responsible for the storage of the node
data structures, another for the indexes between node iden-
tifiers and node addresses. One more file stores the node               The second part of the evaluation compares the query ex-
identifiers. The other files are responsible for the storage          ecution for each query on the YAGO2 data set. Queries with
and indexing of predicate nodes and for the storage of the            an execution time over 15 minutes without producing any
edge information. The separation into several files prevents          output are marked with ”DNF”. For the calculation of the
memory fragmentation and leads to shorter ”addresses” for             geometric mean, a query runtime of 15 minutes is assumed
nodes, predicates and strings. For example all entries within         for each cell which is marked with ”DNF”. Due to large re-
the node, predicate or index files have uniform sizes and can         sult sets for query C-1 this limit is set to 60 minutes. The
therefore be seen as simple array structures.                         results of this evaluation are shown in Table 2.




                                                                 93
       Query               A1         A2        B1         B2             B3          C1         C2          C3   geom. mean
       SpiderStore    0.0016      0.2084    0.3688    0.0119           0.5143       DNF     16,5464    319.0703          0.4521
       Jena            55.203    142.155      DNF        DNF            DNF         DNF        DNF        DNF         578.3126
       RDF-3X          0.0496    0.0524    0.0471      0.0936      0.0482       657.2100    0.2056      2.4741           0.3414
       Virtuoso         0.766      0.127       0.71      0.46           3.223    2197,672     2.401      36.474          3,4420
       #results              3         5        274       128            166    6,876,673      5313       35811

                           Table 2: Query Runtimes on the YAGO2 data set (in seconds).



Considering the average execution time, RDF-3X performs                  two major categories exist: (i) systems which have a main
better than all other stores tested. While SpiderStore in the            memory architecture and (ii) systems which use secondary
average case is the second fastest system, two queries exist             memory storage as their primary storage layer. Systems
where SpiderStore outperforms the other stores. In the case              falling into the first category are for example Brahms [9],
of those queries, the coarse heuristics were able to generate            Grin [18], Swift-OWLIM [11] or BitMat [2] as well as our sys-
a highly efficient execution plan. On the other side consider-           tem SpiderStore [3]. While Brahms is highly optimized for
ing the queries C1-C3 SpiderStore has significant problems               association finding and Grin for answering long path queries
to arrange and optimize the execution order to obtain bet-               the goal of SpiderStore is to provide efficient query process-
ter results. Regarding query C1 even though intermediate                 ing for arbitrary SPARQL queries. Swift-OWLIM is also
results where present the total execution time did exceed                a general purpose RDF-store, which has a strong emphasis
the time limit of one hour. The reason for the performance               on OWL-reasoning. Bitmat on the other hand represents a
shortcomings in the case of those queries is that the selec-             lightweight index structure which uses bitmap indexes to
tivity estimations based on the triple selectivity used for the          store space efficient projections of the three dimensional
generation of the execution plan can in some cases produce               triple space. In contrast to the main memory based sys-
misleading execution orders. This is due to the fact that the            tems YARS2 [7], RDF-3X [12] can be considered as native
cardinality of properties is not considered in the current al-           systems of type (ii). Both systems make heavy use of in-
gorithm. Regarding the other evaluated systems Jena TDB                  dex structures. While YARS facilitates six index structures
seem to have severe problems when executing queries on                   for subject, predicate, object and the context, RDF-3X [12]
such large knowledge bases, because only two queries were                generates index structures for all possible combinations and
able to determine the results within the given time frame.               orderings of subject, predicate and object. Beside these huge
Whereas Virtuoso can be considered as the third fastest sys-             number of index structures, RDF-3x makes heavy use of sta-
tem, which has a stable performance without any negative                 tistical information and has a highly sophisticated query ex-
outliers.                                                                ecution engine, which is described in [13]. While such an
                                                                         enormous effort results in a good query performance, the
5.   RELATED WORK                                                        management of these specific data structures can become
                                                                         quite complex. Neumann et al. therefore describe in [14]
   Several approaches exist for storing graph based data in
                                                                         how a query centric RDF-engine can be extended to pro-
databases. In the particular case of this paper we focus on
                                                                         vide full-fledged support for updates, versioning and trans-
RDF-stores because SpiderStore, the developed system, can
                                                                         actions. Beside these systems which can be clearly dedi-
be considered as part of this category. Hence we give a short
                                                                         cated to either the category of memory-native, secondary
overview about the different approaches available for storing
                                                                         memory-native or relational based systems several semantic
RDF data.
                                                                         web frameworks exist, which provide storage engines fitting
For storing RDF-data, two approaches are prevailing. On
                                                                         in several or all of these categories. Examples of such frame-
the one hand the approach of mapping the RDF-data onto
                                                                         works are Sesame [5], Jena [19] and Virtuoso [6]. For all the
relational schema exists while on the other hand the ap-
                                                                         systems described in this section several benchmarks exist
proach of native RDF-stores exist.
                                                                         [4, 12], which extensively compare those systems.
The mapping of RDF-data onto relational databases is done
either by facilitating a large triple table, where the columns
correspond to the RDF atoms subject, predicate and object                6.     CONCLUSION AND FUTURE WORK
or by clustering the triples according to their predicate into              In this paper we presented the SpiderStore memory lay-
several tables. The latter approach is called property tables.           out, which is able to store arbitrary graphs. The node cen-
Both approaches are less than perfect because both suffer                tric layout has been discussed in detail and a formula for
from severe performance drawbacks imposed by the archi-                  the estimation of the space consumption was described. An
tecture of relational databases. For example in the property             enhancement to the basic layout introducing a relative ad-
tables approach the number of tables is equal to the number              dressing schema was presented. Finally our experiments
of properties in the worst case. Several approaches which ex-            showed that a node centric layout is able to perform ar-
tend these two main approaches when mapping RDF-data                     bitrary SPARQL-queries on knowledge bases of up to 190
onto relational database have been developed and are bench-              mio nodes with a performance comparable to highly sophis-
marked in [17]. Beside mappings to traditional relational                ticated RDF-stores. This promises further performance im-
database systems, mappings which make use of column ori-                 provements because the query optimisation approach, which
ented databases exist [1, 15]. In the case of native stores              has been developed in [3] is rather simple. Therefore future




                                                                  94
work on SpiderStore will emphasize on the query execution                 Management of data, pages 627–640, New York, NY,
engine to achieve excellent performance results on a scale of             USA, 2009. ACM.
up to a billion triple.                                              [14] T. Neumann and G. Weikum. x-rdf-3x: fast querying,
                                                                          high update rates, and consistency for rdf databases.
                                                                          Proceedings of the VLDB Endowment, 3(1-2), Jan
7.   REFERENCES                                                           2010.
 [1] D. J. Abadi, A. Marcus, S. R. Madden, and                       [15] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and
     K. Hollenbach. Scalable semantic web data                            S. Manegold. Column-store support for RDF data
     management using vertical partitioning. In VLDB ’07:                 management: not all swans are white. Proceedings of
     Proceedings of the 33rd international conference on                  the VLDB Endowment, 1(2):1553–1563, 2008.
     Very large data bases, pages 411–422. VLDB                      [16] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A
     Endowment, 2007.                                                     Core of Semantic Knowledge. In 16th international
 [2] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler.                   World Wide Web conference (WWW 2007), New
     Matrix ”bit” loaded: a scalable lightweight join query               York, NY, USA, 2007. ACM Press.
     processor for rdf data. In WWW ’10: Proceedings of              [17] Y. Theoharis, V. Christophides, and
     the 19th international conference on World wide web,                 G. Karvounarakis. Benchmarking database
     pages 41–50, New York, NY, USA, 2010. ACM.                           representations of RDF/S stores. The Semantic
 [3] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and                    Web–ISWC 2005, pages 685–701, 2005.
     G. Specht. Spiderstore: Exploiting main memory for              [18] O. Udrea, A. Pugliese, and V. Subrahmanian. GRIN:
     efficient rdf graph representation and fast querying. In             A graph based RDF index. In Proceedings of the
     Proceedings of the 1st International Workshop on                     National Conference on Articial Intelligence,
     Semantic Data Management (Sem-Data) at the 36th                      volume 22, page 1465. Menlo Park, CA; Cambridge,
     International Conference on Very Large Databases,                    MA; London; AAAI Press; MIT Press; 1999, 2007.
     Singapore, Jan 2010.                                            [19] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds, et al.
 [4] C. Bizer and A. Schultz. The berlin SPARQL                           Efficient RDF storage and retrieval in Jena2. In
     benchmark. International Journal On Semantic Web                     Proceedings of SWDB, volume 3, pages 7–8. Citeseer,
     and Information Systems, 5(1), 2009.                                 2003.
 [5] J. Broekstra, A. Kampman, and F. Van Harmelen.
     Sesame: A generic architecture for storing and
     querying RDF and RDF schema. The Semantic                       APPENDIX
     WebâĂŤISWC 2002, pages 54–68, 2002.                          A. QUERIES
 [6] O. Erling and I. Mikhailov. RDF Support in the
     Virtuoso DBMS. Networked Knowledge-Networked
     Media, pages 7–24.
                                                                     A.1    YAGO data set
 [7] A. Harth, J. Umbrich, A. Hogan, and S. Decker.                  prefix rdfs:hhttp://www.w3.org/2000/01/rdf-schema#i
     YARS2: A federated repository for querying graph                prefix xsd:hhttp://www.w3.org/2001/XMLSchema#i
     structured data from the web. The Semantic Web,                 prefix owl:hhttp://www.w3.org/2002/07/owl#i
     pages 211–224.                                                  prefix rdf:hhttp://www.w3.org/1999/02/22-rdf-syntax-ns#i
                                                                     prefix yago:hhttp://www.mpii.de/yago/resource/i
 [8] J. Hoffart, F. M. Suchanek, K. Berberich, and
     G. Weikum. Yago2: a spatially and temporally
                                                                     A1: SELECT ?GivenName ?FamilyName WHERE { ?p
     enhanced knowledge base from wikipedia. Research
                                                                     yago:hasGivenName ?GivenName. ?p yago:hasFamilyName
     Report MPI-I-2010-5-007, Max-Planck-Institut für
                                                                     ?FamilyName. ?p rdf:type ?scientist. ?scientist rdfs:label
     Informatik, Stuhlsatzenhausweg 85, 66123
                                                                     ”scientist”. ?p yago:wasBornIn ?city. ?city yago:isLocatedIn
     Saarbrücken, Germany, November 2010.
                                                                     ?switzerland. ?switzerland yago:hasPreferredName ”Switzer-
 [9] M. Janik and K. Kochut. Brahms: A workbench RDF                 land”. ?p yago:hasAcademicAdvisor ?a. ?a yago:wasBornIn
     store and high performance memory system for                    ?city2. ?city2 yago:isLocatedIn ?germany. ?germany
     semantic association discovery. The Semantic                    yago:hasPreferredName ”Germany”. }
     Web–ISWC 2005, pages 431–445.
[10] A. Kemper and T. Neumann. Hyper: Hybrid OLTP &                  A2: SELECT ?name WHERE { ?a yago:hasPreferredName
     OLAP high performance database system. Technical                ?name. ?a rdf:type ?actor. 1 ?actor rdfs:label ”actor”. ?a
     Report TU-I1010, TU Munich, Institute of Computer               yago:actedIn ?m1. ?m1 rdf:type ?movie. ?movie rdfs:label
     Science, Germany, May 2010.                                     ”movie”. ?m1 yago:hasWikipediaCategory ”German films”.
[11] A. Kiryakov, D. Ognyanov, and D. Manov. Owlim–a                 ?a yago:directed ?m2. ?m2 rdf:type ?movie2. ?movie2
     pragmatic semantic repository for owl. Web                      rdfs:label ”movie”. ?m2 yago:hasWikipediaCategory ”Cana-
     Information Systems Engineering–WISE 2005                       dian films”. }
     Workshops, pages 182–192, Jan 2005.
[12] T. Neumann and G. Weikum. RDF-3X: a RISC-style                  B1: SELECT ?name1 ?name2 WHERE { ?a1
     engine for RDF. Proceedings of the VLDB                         yago:hasPreferredName ?name1. ?a2
     Endowment, 1(1):647–659, 2008.                                  yago:hasPreferredName ?name2. ?a1 rdf:type
[13] T. Neumann and G. Weikum. Scalable join processing              yago:wikicategory English actors. ?a2 rdf:type
     on very large rdf graphs. In SIGMOD ’09: Proceedings            yago:wikicategory English actors. ?a1 yago:actedIn ?movie.
     of the 35th SIGMOD international conference on                  ?a2 yago:actedIn ?movie. FILTER (?a1 != ?a2) }




                                                                95
B2: SELECT ?name1 ?name2 WHERE { ?p1
yago:hasPreferredName ?name1. ?p2
yago:hasPreferredName ?name2. ?p1 yago:isMarriedTo ?p2.
?p1 yago:wasBornIn ?city. ?p2 yago:wasBornIn ?city. }

B3: SELECT distinct ?name1 ?name2 WHERE { ?p1
yago:hasFamilyName ?name1. ?p2 yago:hasFamilyName
?name2. ?p1 rdf:type ?scientist1. ?p2 rdf:type ?scientist2.
?scientist1 rdfs:label ”scientist”. ?scientist2 rdfs:label ”scien-
tist”. ?p1 yago:hasWonPrize ?award. ?p2 yago:hasWonPrize
?award. ?p1 yago:wasBornIn ?city. ?p2 yago:wasBornIn
?city. FILTER (?p1 != ?p2) }

C1: SELECT DISTINCT ?name1 ?name2 WHERE { ?p1
yago:hasFamilyName ?name1. ?p2 yago:hasFamilyName
?name2. ?p1 rdf:type ?scientist. ?p2 rdf:type ?scientist.
?scientist rdfs:label ”scientist”. ?p1 ?c ?city. ?p2 ?c2 ?city.
?city rdf:type ?cityType. ?cityType rdfs:label ”city”. }

C2: SELECT DISTINCT ?name WHERE { ?p
yago:hasPreferredName ?name. ?p ?any1 ?c1. ?p ?any2 ?c2.
?c1 rdf:type ?city . ?c2 rdf:type ?city2. ?city2 rdfs:label
”city”. ?city rdfs:label ”city”. ?c1 yago:isCalled ”London”.
?c2 yago:isCalled ”Paris”. }

C3: SELECT ?p1 ?predicate ?p2 WHERE { ?p1 ?anypred-
icate1 ?city1. ?city1 yago:isCalled ”Paris”. ?p1 ?predicate
?p2. ?p2 ?anypredicate2 ?city2. ?city2 yago:isCalled ”Hong
Kong”. }




                                                                     96