=Paper=
{{Paper
|id=None
|storemode=property
|title=SpiderStore: A Native Main Memory Approach for Graph Storage
|pdfUrl=https://ceur-ws.org/Vol-733/paper_binna.pdf
|volume=Vol-733
|dblpUrl=https://dblp.org/rec/conf/gvd/BinnaGZPS11
}}
==SpiderStore: A Native Main Memory Approach for Graph Storage==
Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, Günther Specht
Databases and Information Systems, Institute of Computer Science
University of Innsbruck, Austria
{firstname.lastname}@uibk.ac.at
ABSTRACT

The ever increasing amount of linked open data results in a demand for high-performance graph databases. In this paper we therefore introduce a memory layout which is tailored to the storage of large RDF data sets in main memory. We present the memory layout SpiderStore. This layout features a node-centric design, in contrast to the prevailing systems which use triple-focused approaches. The benefit of this design is a native mapping of the nodes of a graph onto memory locations connected to each other. Based on this native mapping, an addressing schema which facilitates relative addressing together with a snapshot mechanism is presented. Finally, a performance evaluation, which demonstrates the capabilities of the SpiderStore memory layout, is performed using an RDF data set consisting of about 190 million triples.

Categories and Subject Descriptors

H.2.2 [Database Management]: Physical Design; H.3.2 [Information Storage and Retrieval]: Information Storage; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Performance, Algorithms, Design, Experimentation

Keywords

RDF, Main Memory, Database, RDF Store, Triple Store, SPARQL, Addressing Scheme

23rd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 31.05.2011 - 03.06.2011, Obergurgl, Austria. Copyright is held by the author/owner(s).

1. INTRODUCTION

Due to the increasing significance of linked open data and publicly available SPARQL endpoints, the need for high-performance graph databases has increased. To meet these requirements, several approaches for storing and retrieving large RDF (Resource Description Framework) graphs have been developed. Despite the highly connected nature of these graphs, the main approaches proposed in this context facilitate technologies originating from relational databases. Even though these represent major and robust technologies, they were not tailored to the scenario of storing graph-based structures. At the same time, the ever increasing capacities of main memory and the increasing numbers of cores have led to an architectural shift in the development of databases and information systems, moving from hard disk to main memory as the primary storage device. Whereas these architectural changes lead to enormous performance improvements, graph operations like graph traversals still have to be implemented through costly self-join operations. Despite the possibility of supporting these operations with appropriate index structures, such lookups still take O(log(n)), where n denotes the number of index entries. Therefore we present the SpiderStore storage concept, an in-memory storage approach which allows edges to be processed in O(1). In contrast to previous work [3], the storage layout and space estimations are described in more detail. In addition, a new relative addressing scheme is introduced. The remaining sections are structured as follows. Section 2 deals with the memory layout and space estimations. Section 3 explains the relative addressing scheme used for faster restarts and for snapshot generation. In Section 4 we present an evaluation of the presented technology using the YAGO2 [8] data set. Section 5 discusses related work in the field of RDF databases. Finally, Section 6 draws a conclusion and gives an outlook on possible future work.

2. MEMORY LAYOUT

This section gives a detailed description of the SpiderStore memory layout. The aim of this memory layout is to provide an in-memory storage of graphs, where the basic operation of navigating between two vertices can be done in O(1). Therefore, the node is the core component of the layout. This is in contrast to the concept favored by most triple stores, where the triple represents the atomic building block. To realize this concept of a node-centric layout, two factors have to be fulfilled. The first is that all edges belonging to a node need to be stored in a single place, which allows navigating back and forth along all edges. The second factor is that there needs to be a direct connection between those nodes that can be resolved within a single operation. These requirements can be fulfilled by an in-memory storage layout which is designed as follows.
Each node has to be located at a unique location within memory. At each of these locations the information about ingoing and outgoing edges is stored, grouped by the respective edge label. The information about the edges itself is not stored directly but implicitly through pointers. Therefore, traditional pointer structures, which can themselves be seen as graphs connecting different memory locations, are used to represent arbitrary graphs.

Each node contains a set of ingoing and a set of outgoing edges (pointers). These pointers are grouped by the predicates labeling the corresponding edges. Grouping is done by the address of the predicates. This address is equal to the address of the property within the predicate index, which stores a list of all subjects occurring together with a specific property.

Besides the raw pointer data, implicit statistical information is stored. This is the case for the number of nodes, the number of predicates and the number of strings. Furthermore, for each subject the number of predicates and for each subject/predicate combination the number of objects is stored. The same information is stored the other way around (the number of predicates for an object and the number of predicate/subject combinations for an object).
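The node-centric layout described above can be sketched in C as follows. All type and field names are our own illustration, not taken from the SpiderStore implementation; the real system additionally keeps the global counters and the dictionary described in the text.

```c
#include <stddef.h>

/* One group of edges sharing the same predicate: the predicate node's
 * address is the grouping key; the counter doubles as implicit statistics. */
struct edge_group {
    struct node  *predicate;   /* address of the predicate node (grouping key) */
    size_t        count;       /* number of edges with this predicate          */
    struct node **targets;     /* destination nodes, one pointer each          */
};

/* A node keeps all of its ingoing and outgoing edges in one place, so the
 * graph can be traversed in both directions by plain pointer dereferences. */
struct node {
    const char        *label;  /* pointer into the dictionary table   */
    size_t             n_out;  /* number of outgoing predicate groups */
    size_t             n_in;   /* number of ingoing predicate groups  */
    struct edge_group *out;
    struct edge_group *in;
};

/* Following an edge is a single pointer dereference: O(1), no index lookup. */
struct node *edge_target(const struct edge_group *g, size_t i) {
    return i < g->count ? g->targets[i] : NULL;
}
```

Because both edge directions are materialized at the node, a traversal step never consults an index; it simply dereferences the stored pointer, which is what yields the O(1) navigation claimed above.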
To illustrate this layout, Figure 1 visualizes the memory layout of a short example. The main node emphasized in this example represents the category of a wordnet_scientist. It can be seen from this example that two different types of edges exist: ingoing and outgoing edges. Both types group their destination pointers by the predicate node. In Figure 1, an outgoing edge with the property ⟨hasLabel⟩ and an incoming edge with the property ⟨type⟩ is featured. To simplify matters, URIs are abbreviated and marked with angle brackets while literals are put in quotes. As can be seen in the example, all nodes, independent of their type (URI, literal, ...), share a uniform node layout. For example, the node "scientist" has the same structure as the node ⟨wordnet_scientist⟩. The facts described in this example are that ⟨Einstein⟩ is of ⟨type⟩ ⟨wordnet_scientist⟩ and that this category has a label with the name "scientist". The triple notation of the example in Figure 1 is shown in the listing below:

...
⟨Einstein⟩ ⟨type⟩ ⟨wordnet_scientist⟩
⟨wordnet_scientist⟩ ⟨hasLabel⟩ "scientist"
...

[Figure 1: Memory Layout - Example]

2.1 Space Estimations

Given that SpiderStore is a main-memory-based graph store, space consumption becomes a crucial factor. Therefore we introduce and discuss a formula that can be used to calculate the expected amount of memory needed for a specific set of data. To describe a specific data set we use the following variables. The variable #nodes represents the total number of distinct nodes within a graph. A node can either be a URI or a character string and can be used as subject, predicate or object. The second variable used for calculating the expected space consumption is the total number of triples or facts, #edges. The space consumption is then calculated as follows:

m = (#nodes * (5 + degree * 2) + #edges * 3) * sizeof(pointer) + sizeof(dictionary)

This formula consists of three parts, i.e. the size of the data dictionary, the fraction influenced by the number of nodes and the fraction influenced by the number of edges. The part of this formula that depends on the number of nodes can be interpreted as follows. For each node a data structure is stored consisting of the number of incoming edges and a link to the table containing the incoming edges, as well as the number of outgoing edges and a link to the table containing the outgoing edges. Furthermore, a pointer to the corresponding entry within the dictionary table is stored within each node. The term degree * 2 can be explained by the pointers which group the edges by their corresponding predicates: for each predicate there exists a pointer to the edges themselves and a counter, which is used for statistics. Because all edges are bidirectional, the estimation is quite accurate even though it does not take the number of distinct subjects or objects into account.

The part of the formula which depends on the number of edges can be derived from the following facts. For each edge the destination is stored on both ends. Furthermore, an additional entry is stored within the predicate index, which allows query execution to start not only at subject or object nodes but also at predicate nodes.

As an example, the YAGO2 [8] data set used throughout the evaluation consists of 194.35 million triples and 28.74 million distinct nodes. The dictionary size in the case of the YAGO2 data set is roughly 1.1 gigabytes. In this example, 1.2 would be an appropriate value for the variable degree. This number can be derived by counting the number of all distinct subject-predicate combinations and dividing it by the total number of nodes.
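Plugging the YAGO2 numbers into the formula of Section 2.1 reproduces the quoted estimate. A minimal sketch, assuming 8-byte pointers; the parameter names simply mirror the symbols of the formula:

```c
/* Space estimate from Section 2.1:
 * m = (#nodes * (5 + degree * 2) + #edges * 3) * sizeof(pointer)
 *     + sizeof(dictionary)                                          */
double estimate_bytes(double nodes, double edges, double degree,
                      double pointer_size, double dictionary_bytes) {
    return (nodes * (5.0 + degree * 2.0) + edges * 3.0) * pointer_size
           + dictionary_bytes;
}
```

For YAGO2 (28.74 million nodes, 194.35 million triples, degree 1.2, a 1.1 GB dictionary) this yields roughly 7.5 * 10^9 bytes, in line with the estimate of about 7.4 gigabytes and the measured footprint of 7.2 gigabytes discussed below.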
By applying these values to the formula above, a space consumption of about 7.4 gigabytes is to be expected. Compared to the real footprint, which is about 7.2 gigabytes, this value is quite accurate. Hence, knowing the basic facts about a data set allows an adequate estimation of the expected memory footprint to be established.

3. RELATIVE ADDRESSES

Using absolute pointer addresses for the location of nodes enables the navigation between two nodes in O(1), as explained in the previous section. The drawback of this concept is that storing a persistent state of a whole graph database can become quite complex. The reason for this is that the database can only be dumped either by serializing the whole graph and later deserializing it, or by dumping the whole memory snapshot and swizzling the pointers to their correct representation at load time.

An alternative to these approaches is to arrange the whole database content based on an offset address. Hence all memory locations have to be converted into relative addresses based on this offset. Resolving such relative pointers yields an overhead which is negligible compared to the total amount of time needed to fetch an arbitrary chunk of memory. The time for fetching an arbitrary chunk of memory can vary between a few CPU cycles, when accessing memory already located within the L1 cache, and some hundreds of CPU cycles, when accessing a random chunk of memory which is not yet available in the cache hierarchy.

In the context of SpiderStore we decided to use relative addressing. This has two advantages. The first advantage is that the general architecture of SpiderStore remains applicable, and the overhead introduced through relative addressing is insignificant in the SpiderStore concept, as explained before. The main advantage of this addressing scheme is that database restarts can be executed within a few milliseconds. This is possible by facilitating the Unix memory mapping techniques, which do not dismiss the mapped pages unless another executable allocates large amounts of memory. Furthermore, this concept makes it possible to facilitate the copy-on-write approaches used by the HyPer project [10]. This approach benefits from the operating system's memory management, which allows different processes to share large memory segments while preserving an isolated view on the data.

Due to the lack of a customized memory allocator, the SpiderStore snapshot mechanism is currently implemented as a read-only approach. Each snapshot is split up into five parts. One part is responsible for the storage of the node data structures, another for the indexes between node identifiers and node addresses. One more file stores the node identifiers. The remaining files are responsible for the storage and indexing of predicate nodes and for the storage of the edge information. The separation into several files prevents memory fragmentation and leads to shorter "addresses" for nodes, predicates and strings. For example, all entries within the node, predicate or index files have uniform sizes and can therefore be seen as simple array structures.

4. EVALUATION

As a platform for the evaluation, a server equipped with two Intel Xeon L5520 quad-core CPUs at 2.27 GHz, Linux kernel 2.6.18, CentOS, 64-bit architecture and 96 GB main memory was used.

4.1 Data Set

For the evaluation we used the YAGO2 [8] data set. The YAGO2 data set is the successor of the YAGO [16] data set and represents a large semantic knowledge base which is generated on the basis of Wikipedia, WordNet and GeoNames. The data set consists of 194,350,853 triples (98 predicates, 28,744,214 unique subjects and objects). The queries executed on this data set are derived from the queries on the YAGO data set used in the benchmark presenting the RDF-3X approach [12].

4.2 Evaluated Systems

For the evaluation of SpiderStore we used Virtuoso [6] in version 6.1.3, RDF-3X [12] in version 0.3.5 and Jena TDB [19] in version 0.8.9. The decision for these systems was taken to the best of our knowledge. Even though SpiderStore is the only main memory system tested, the choice of the other systems is appropriate: those systems are assumed to be the currently fastest systems available, and the benchmark results are measured with warm caches on a system where the whole database is able to fit into main memory. All systems were granted a maximum of 40 GB to ensure that sufficient space is available.

4.3 Evaluation Results

The test results are separated into two parts. The first part compares the bulk import times of the different systems. The bulk import time is specified as the time needed to load the data, to create the indexes and to ensure that a persistent state of the database is written to disk. A summary of the load times can be seen in Table 1. As can be seen, SpiderStore is significantly faster than any of the other systems. The reason for this is that SpiderStore, due to its implicit statistics, does not need to explicitly create statistics or indexes.

System       Load Time
SpiderStore  1:09:18
Jena         1:36:35
RDF-3X       1:21:12
Virtuoso     3:32:16

Table 1: Import Times (in hours, minutes and seconds)

The second part of the evaluation compares the query execution for each query on the YAGO2 data set. Queries with an execution time over 15 minutes without producing any output are marked with "DNF". For the calculation of the geometric mean, a query runtime of 15 minutes is assumed for each cell which is marked with "DNF". Due to large result sets for query C1, this limit is set to 60 minutes. The results of this evaluation are shown in Table 2.
Query        A1      A2       B1      B2      B3      C1         C2       C3        geom. mean
SpiderStore  0.0016  0.2084   0.3688  0.0119  0.5143  DNF        16.5464  319.0703  0.4521
Jena         55.203  142.155  DNF     DNF     DNF     DNF        DNF      DNF       578.3126
RDF-3X       0.0496  0.0524   0.0471  0.0936  0.0482  657.2100   0.2056   2.4741    0.3414
Virtuoso     0.766   0.127    0.71    0.46    3.223   2197.672   2.401    36.474    3.4420
#results     3       5        274     128     166     6,876,673  5313     35811

Table 2: Query Runtimes on the YAGO2 data set (in seconds).
Considering the average execution time, RDF-3X performs better than all other stores tested. While SpiderStore is the second fastest system in the average case, two queries exist where SpiderStore outperforms the other stores. In the case of those queries, the coarse heuristics were able to generate a highly efficient execution plan. On the other hand, considering the queries C1-C3, SpiderStore has significant problems arranging and optimizing the execution order to obtain better results. Regarding query C1, even though intermediate results were present, the total execution time exceeded the time limit of one hour. The reason for the performance shortcomings in the case of those queries is that the selectivity estimations, based on the triple selectivity used for the generation of the execution plan, can in some cases produce misleading execution orders. This is due to the fact that the cardinality of properties is not considered in the current algorithm. Regarding the other evaluated systems, Jena TDB seems to have severe problems when executing queries on such large knowledge bases, because only two queries were able to determine the results within the given time frame. Virtuoso can be considered the third fastest system, with a stable performance without any negative outliers.

5. RELATED WORK

Several approaches exist for storing graph-based data in databases. In the particular case of this paper we focus on RDF stores, because SpiderStore, the developed system, can be considered part of this category. Hence we give a short overview of the different approaches available for storing RDF data.

For storing RDF data, two approaches are prevailing: on the one hand, mapping the RDF data onto a relational schema; on the other hand, native RDF stores. The mapping of RDF data onto relational databases is done either by facilitating a large triple table, where the columns correspond to the RDF atoms subject, predicate and object, or by clustering the triples according to their predicate into several tables. The latter approach is called property tables. Both approaches are less than perfect, because both suffer from severe performance drawbacks imposed by the architecture of relational databases. For example, in the property tables approach the number of tables is in the worst case equal to the number of properties. Several approaches which extend these two main approaches when mapping RDF data onto relational databases have been developed and are benchmarked in [17]. Besides mappings to traditional relational database systems, mappings which make use of column-oriented databases exist [1, 15].

In the case of native stores, two major categories exist: (i) systems which have a main memory architecture and (ii) systems which use secondary memory storage as their primary storage layer. Systems falling into the first category are, for example, Brahms [9], GRIN [18], Swift-OWLIM [11] or BitMat [2], as well as our system SpiderStore [3]. While Brahms is highly optimized for association finding and GRIN for answering long path queries, the goal of SpiderStore is to provide efficient query processing for arbitrary SPARQL queries. Swift-OWLIM is also a general purpose RDF store, which has a strong emphasis on OWL reasoning. BitMat, on the other hand, represents a lightweight index structure which uses bitmap indexes to store space-efficient projections of the three-dimensional triple space. In contrast to the main memory based systems, YARS2 [7] and RDF-3X [12] can be considered native systems of type (ii). Both systems make heavy use of index structures. While YARS facilitates six index structures for subject, predicate, object and the context, RDF-3X [12] generates index structures for all possible combinations and orderings of subject, predicate and object. Besides this huge number of index structures, RDF-3X makes heavy use of statistical information and has a highly sophisticated query execution engine, which is described in [13]. While such an enormous effort results in a good query performance, the management of these specific data structures can become quite complex. Neumann et al. therefore describe in [14] how a query-centric RDF engine can be extended to provide full-fledged support for updates, versioning and transactions. Besides these systems, which can be clearly dedicated either to the category of memory-native, secondary-memory-native or relational based systems, several semantic web frameworks exist which provide storage engines fitting into several or all of these categories. Examples of such frameworks are Sesame [5], Jena [19] and Virtuoso [6]. For all the systems described in this section several benchmarks exist [4, 12], which extensively compare those systems.

6. CONCLUSION AND FUTURE WORK

In this paper we presented the SpiderStore memory layout, which is able to store arbitrary graphs. The node-centric layout has been discussed in detail and a formula for the estimation of the space consumption was described. An enhancement to the basic layout introducing a relative addressing schema was presented. Finally, our experiments showed that a node-centric layout is able to perform arbitrary SPARQL queries on knowledge bases of about 190 million triples with a performance comparable to highly sophisticated RDF stores. This promises further performance improvements, because the query optimisation approach which has been developed in [3] is rather simple. Therefore future
work on SpiderStore will focus on the query execution engine to achieve excellent performance results on a scale of up to a billion triples.

7. REFERENCES

[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 411–422. VLDB Endowment, 2007.
[2] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "bit" loaded: a scalable lightweight join query processor for RDF data. In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 41–50, New York, NY, USA, 2010. ACM.
[3] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: Exploiting main memory for efficient RDF graph representation and fast querying. In Proceedings of the 1st International Workshop on Semantic Data Management (Sem-Data) at the 36th International Conference on Very Large Databases, Singapore, 2010.
[4] C. Bizer and A. Schultz. The Berlin SPARQL benchmark. International Journal on Semantic Web and Information Systems, 5(1), 2009.
[5] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. The Semantic Web–ISWC 2002, pages 54–68, 2002.
[6] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. Networked Knowledge–Networked Media, pages 7–24.
[7] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. The Semantic Web, pages 211–224.
[8] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Research Report MPI-I-2010-5-007, Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany, November 2010.
[9] M. Janik and K. Kochut. BRAHMS: A workbench RDF store and high performance memory system for semantic association discovery. The Semantic Web–ISWC 2005, pages 431–445.
[10] A. Kemper and T. Neumann. HyPer: Hybrid OLTP & OLAP high performance database system. Technical Report TU-I1010, TU Munich, Institute of Computer Science, Germany, May 2010.
[11] A. Kiryakov, D. Ognyanov, and D. Manov. OWLIM – a pragmatic semantic repository for OWL. Web Information Systems Engineering–WISE 2005 Workshops, pages 182–192, 2005.
[12] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1):647–659, 2008.
[13] T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In SIGMOD '09: Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 627–640, New York, NY, USA, 2009. ACM.
[14] T. Neumann and G. Weikum. x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. Proceedings of the VLDB Endowment, 3(1-2), 2010.
[15] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and S. Manegold. Column-store support for RDF data management: not all swans are white. Proceedings of the VLDB Endowment, 1(2):1553–1563, 2008.
[16] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In 16th International World Wide Web Conference (WWW 2007), New York, NY, USA, 2007. ACM Press.
[17] Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking database representations of RDF/S stores. The Semantic Web–ISWC 2005, pages 685–701, 2005.
[18] O. Udrea, A. Pugliese, and V. S. Subrahmanian. GRIN: A graph based RDF index. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1465. AAAI Press / MIT Press, 2007.
[19] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds, et al. Efficient RDF storage and retrieval in Jena2. In Proceedings of SWDB, volume 3, pages 7–8. Citeseer, 2003.

APPENDIX

A. QUERIES

A.1 YAGO data set

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix yago: <http://www.mpii.de/yago/resource/>

A1: SELECT ?GivenName ?FamilyName WHERE { ?p yago:hasGivenName ?GivenName. ?p yago:hasFamilyName ?FamilyName. ?p rdf:type ?scientist. ?scientist rdfs:label "scientist". ?p yago:wasBornIn ?city. ?city yago:isLocatedIn ?switzerland. ?switzerland yago:hasPreferredName "Switzerland". ?p yago:hasAcademicAdvisor ?a. ?a yago:wasBornIn ?city2. ?city2 yago:isLocatedIn ?germany. ?germany yago:hasPreferredName "Germany". }

A2: SELECT ?name WHERE { ?a yago:hasPreferredName ?name. ?a rdf:type ?actor. ?actor rdfs:label "actor". ?a yago:actedIn ?m1. ?m1 rdf:type ?movie. ?movie rdfs:label "movie". ?m1 yago:hasWikipediaCategory "German films". ?a yago:directed ?m2. ?m2 rdf:type ?movie2. ?movie2 rdfs:label "movie". ?m2 yago:hasWikipediaCategory "Canadian films". }

B1: SELECT ?name1 ?name2 WHERE { ?a1 yago:hasPreferredName ?name1. ?a2 yago:hasPreferredName ?name2. ?a1 rdf:type yago:wikicategory_English_actors. ?a2 rdf:type yago:wikicategory_English_actors. ?a1 yago:actedIn ?movie. ?a2 yago:actedIn ?movie. FILTER (?a1 != ?a2) }
B2: SELECT ?name1 ?name2 WHERE { ?p1 yago:hasPreferredName ?name1. ?p2 yago:hasPreferredName ?name2. ?p1 yago:isMarriedTo ?p2. ?p1 yago:wasBornIn ?city. ?p2 yago:wasBornIn ?city. }

B3: SELECT DISTINCT ?name1 ?name2 WHERE { ?p1 yago:hasFamilyName ?name1. ?p2 yago:hasFamilyName ?name2. ?p1 rdf:type ?scientist1. ?p2 rdf:type ?scientist2. ?scientist1 rdfs:label "scientist". ?scientist2 rdfs:label "scientist". ?p1 yago:hasWonPrize ?award. ?p2 yago:hasWonPrize ?award. ?p1 yago:wasBornIn ?city. ?p2 yago:wasBornIn ?city. FILTER (?p1 != ?p2) }

C1: SELECT DISTINCT ?name1 ?name2 WHERE { ?p1 yago:hasFamilyName ?name1. ?p2 yago:hasFamilyName ?name2. ?p1 rdf:type ?scientist. ?p2 rdf:type ?scientist. ?scientist rdfs:label "scientist". ?p1 ?c ?city. ?p2 ?c2 ?city. ?city rdf:type ?cityType. ?cityType rdfs:label "city". }

C2: SELECT DISTINCT ?name WHERE { ?p yago:hasPreferredName ?name. ?p ?any1 ?c1. ?p ?any2 ?c2. ?c1 rdf:type ?city. ?c2 rdf:type ?city2. ?city2 rdfs:label "city". ?city rdfs:label "city". ?c1 yago:isCalled "London". ?c2 yago:isCalled "Paris". }

C3: SELECT ?p1 ?predicate ?p2 WHERE { ?p1 ?anypredicate1 ?city1. ?city1 yago:isCalled "Paris". ?p1 ?predicate ?p2. ?p2 ?anypredicate2 ?city2. ?city2 yago:isCalled "Hong Kong". }