SpiderStore: Exploiting Main Memory for Efficient RDF Graph Representation and Fast Querying

Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, Günther Specht
Databases and Information Systems, Institute of Computer Science
University of Innsbruck, Austria
{firstname.lastname}@uibk.ac.at

ABSTRACT
The constant growth of available RDF data requires fast and efficient querying facilities for graph data. So far, such data sets have been stored by using mapping techniques from graph structures to relational models, secondary memory structures or even complex main memory based models. We present the main memory database SpiderStore, which is capable of efficiently managing large RDF data sets and provides powerful and fast SPARQL processing facilities. The SpiderStore storage concept aims at storing the graph structure in main memory without performing any complex mappings, thereby exploiting the natural web-structure of RDF through the fast, random access of main memory. The abandonment of additional mappings and meta-information leads to a significant performance gain compared to other common RDF stores.

Categories and Subject Descriptors
H.3.2 [Information Storage and Retrieval]: Information Storage; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.2 [Database Management]: Query Processing

General Terms
Performance, Algorithms, Design, Experimentation

Keywords
RDF, Main Memory, Database, RDF Store, Triple Store, SPARQL, Query Processing

To copy without fee all or part of this material is permitted only for private and academic purposes, given that the title of the publication, the authors and its date of publication appear. Copying or use for commercial purposes, or to republish, to post on servers or to redistribute to lists, is forbidden unless an explicit permission is acquired from the copyright owners, the authors of the material. Workshop on Semantic Data Management (SemData@VLDB) 2010, September 17, 2010, Singapore. Copyright 2010: www.semdata.org.

1. INTRODUCTION
Due to the ever growing, vast amounts of RDF data, the storage of this data is mostly realized on persistent media – either in a native store or within a relational database system. Storing huge amounts of RDF data in relational databases has been facilitated by most of the popular RDF stores and Semantic Web frameworks, as very big ontologies did not fit into main memory so far.

However, the relational database model was not intended for the storage of large graph structures. Current solutions for the storage of graphs in relational tables feature a mapping between the graph structure and the relational tables of the database system [2, 13, 20]. This mapping significantly slows down both the storing and the querying process, which causes a significant loss of performance. Another important factor is that the sequential access characteristics of persistent memory require multiple indices for fast data access, which basically requires a duplication of the stored data. Storing all information on persistent media also entails very costly I/O operations. This is especially limiting when performing join operations, which produce potentially large intermediate results that have to be joined. Depending on the type of mapping and the queries executed, these operations over a huge number of triples on secondary media can be an expensive task in terms of execution time or memory consumption.

Recent developments in the area of high-performance main memory OLTP and OLAP databases have minimized the I/O bottleneck by moving the whole database system into main memory, which has become appropriate in terms of space capacity even for large datasets. Systems facilitating such an architecture – like VoltDB [1] or HyPer [12] – process several 100,000 transactions per second while providing full ACID properties. Though they have moved to main memory, these approaches are still based on the relational paradigm and therefore suffer from an impedance mismatch when it comes to mapping graph-based data into a tabular layout. Hence these approaches have to deal with costly join operations and potentially large intermediate results when processing graph structures.

We present SpiderStore, a main memory based RDF database which overcomes these limitations, exploits the fast random read and write operations of main memory, and provides time and space efficient SPARQL query processing facilities.

The remainder of this paper is structured as follows. Section 2 outlines the memory layout of SpiderStore. Subsequently, Section 3 is concerned with the fast and efficient processing of SPARQL queries on the proposed main memory database. Section 4 features the evaluation of the SpiderStore prototype and Section 5 discusses related work. The paper is concluded by a summary and the description of future work in Section 6.

2. MEMORY LAYOUT
In the following section we sketch the lightweight storage layout of SpiderStore, which is optimized for fast and efficient SPARQL query processing on RDF graphs. In order to utilize the nature of the main memory architecture, the SpiderStore approach stores a graph natively as a set of nodes and pointers (edges). Due to the fact that main memory is more expensive than disk based memory and therefore limited, a very lightweight layout – without any complex mappings, index structures or additional meta-information – is required.
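The node-and-pointer idea can be made concrete with a small model. The following Python sketch is purely illustrative and not the SpiderStore implementation (the paper's layout uses dense per-node memory blocks; all class and method names here are ours): every subject, predicate and object becomes a node, a triple is wired as direct references between those nodes, and the references are grouped by predicate in both directions so the graph can be traversed from any starting point without a join.

```python
# Illustrative model of a node-and-pointer RDF store (names are ours, not SpiderStore's).
from collections import defaultdict

class Node:
    def __init__(self, identifier):
        self.identifier = identifier       # the identifier is stored in the node itself
        self.outgoing = defaultdict(list)  # predicate node -> list of object nodes
        self.incoming = defaultdict(list)  # predicate node -> list of subject nodes

class GraphStore:
    def __init__(self):
        self.index = {}                    # dictionary index: string identifier -> node

    def node(self, identifier):
        if identifier not in self.index:
            self.index[identifier] = Node(identifier)
        return self.index[identifier]

    def add_triple(self, s, p, o):
        subj, pred, obj = self.node(s), self.node(p), self.node(o)
        subj.outgoing[pred].append(obj)    # forward edge: subject -> object
        obj.incoming[pred].append(subj)    # backward edge enables right-to-left traversal

store = GraphStore()
store.add_triple("Node 1", "Node 2", "Node 3")
store.add_triple("Node 4", "Node 3", "Node 1")
# Traversal is pure pointer chasing: all objects reachable from "Node 1" via "Node 2".
objs = [o.identifier for o in store.node("Node 1").outgoing[store.node("Node 2")]]
```

Grouping the adjacency lists by predicate mirrors the clustering of same-predicate edges inside a node's memory block, and keeping both `outgoing` and `incoming` mirrors the bidirectional edges the layout relies on.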
These conditions are satisfied by the following memory layout, which is based on two basic building blocks:

Nodes within an RDF graph are either subjects, objects or predicates, where each subject is connected to an object by an appropriate edge, such that the predicate annotating the edge is again a node. Furthermore, the identifier of each node is stored within the node itself.

Edges connect two nodes, one serving as subject and one as object. Edges are implicitly realized by pointers from the subject node to the object node.

To be able to browse through the graph structure in order to answer queries efficiently, the memory layout is optimized for graph traversal operations. This is implemented by storing all edges belonging to a certain node in a dense memory block. This implies that scanning all edges of a certain node only requires a linear scan of a contiguous memory block. Within this block, all edges of the same predicate are clustered and linked to the predicate node itself. Furthermore, all predicates are stored as nodes as well. The RDF data model is a directed graph where all edges point from the subject to the object. However, our approach features bidirectional edges, since query processing performance can be increased significantly by the possibility of browsing through the graph in any direction. The benefit of such a structure – where subjects, objects and predicates are nodes connected by a pointer structure – is that the graph can be traversed efficiently in any direction from any starting point within the graph. An additional index which consists of all predicates and lists of their sources (subjects) even makes it possible to start the traversal at a predicate. Access to nodes via their identifier is guaranteed within a time complexity of O(log(n)) by using an additional index. This index is used for the fast and efficient conversion between strings (e.g. URIs) and main memory nodes.

Figure 1 sketches the memory model for the storage of RDF data.

Figure 1: RDF Store Main Memory Layout

The illustrated example features the following nodes: Node 1 (subject) is connected by Node 2 (predicate) to the object Node 3. Node 3 itself is also the predicate of the connection between Node 4 and Node 1, where Node 4 is the subject and Node 1 is the object. The following listing shows the example data from Figure 1 represented in triple notation:

<Node 1> <Node 2> <Node 3> .
<Node 4> <Node 3> <Node 1> .

The estimated memory consumption (number of bytes m) for a given RDF data set stored in the described memory layout can be calculated by applying this formula:

    m = (#nodes * 5 + #edges * 3) * sizeof(pointer)

where #nodes is the overall number of nodes and #edges is the number of edges (triples) within the RDF data set. The subformula #edges * 3 contains the space estimate for the incoming and the outgoing edge (bidirectional) and the link between the edge and the corresponding entry node. For each node, space for a pointer to all its incoming edges and a pointer to all its outgoing edges, together with their degrees and a pointer from the string index (dictionary), has to be allocated, which accounts for the subformula #nodes * 5. The default pointer size on a 64-bit architecture is 8 bytes. The formula does not consider the strings (identifiers) themselves, which are stored in a dictionary-like string index. Strings (URIs, literals) are stored only once, which results in an additional memory consumption of the sum of length_in_bytes(unique_string_i) over all unique strings, where the number of unique strings within the data set is denoted by n.

3. QUERY ENGINE
In the SpiderStore system, the benefits of storing all graph data within main memory are exploited for the query process over the stored data.

Secondary memory-based RDF stores prefer breadth-first search, because the huge amount of random memory accesses incurred by graph traversal would be disadvantageous on disk. This approach naturally leads to intermediate results. Due to the typically high volume of RDF data sets, intermediate results tend to be very large [3], which is a limiting factor for the performance of the search algorithm, as writing and reading of intermediate results is very expensive in terms of time and space. In the context of a main memory store, the limited amount of memory available is a crucial factor when handling intermediate results.

A memory efficient solution is accomplished by applying depth-first search instead of breadth-first search. As argued by Gardarin et al. [9], depth-first search is an efficient operation for traversing paths as long as all data is kept in memory, which is true for SpiderStore by design. Furthermore, they point out that by using depth-first search no intermediate results are required to traverse existing paths. Therefore, the complexity of a query is only restricted by the amount of main memory used to store the overall result. For queries which do not require any postprocessing operations like sorting or grouping, SpiderStore provides the results in a stream-based way and therefore does not require any additional memory.

The SpiderStore query engine splits SPARQL queries into so-called "restriction triples", which are similar to Basic Graph Patterns [16]. The result set of a query is defined by all variable assignments (paths) satisfying a given set of restrictions. Therefore, these restriction triples can be used to traverse the subgraph that contains the result data in depth-first order. Figure 2 illustrates an exemplary query which selects all scientists who were born in a Swiss city and whose doctoral advisor was born in a German city. Nodes which are marked by a "?" represent variables and are bound by the query engine during the graph walk.
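The depth-first evaluation of restriction triples can be sketched as follows. This is a hypothetical simplification, not SpiderStore's engine: triples are plain tuples instead of pointer-linked nodes, recursion stands in for the explicit execution stack, and all names are ours. What it does show is the core mechanism of a shared variable assignment being extended restriction by restriction, with results streamed out as soon as every restriction on the current path is satisfied, without materializing intermediate results.

```python
# Sketch of depth-first restriction-triple matching over an in-memory triple set.
def match(triples, patterns, bindings=None):
    """Yield one shared variable assignment per result, stream-style."""
    bindings = bindings or {}
    if not patterns:                          # all restrictions satisfied by this path
        yield dict(bindings)
        return
    pattern, rest = patterns[0], patterns[1:]
    for triple in triples:
        step = dict(bindings)                 # extend a copy of the shared assignment
        for term, value in zip(pattern, triple):
            if term.startswith("?"):          # variable: unify with current assignment
                if step.setdefault(term, value) != value:
                    break
            elif term != value:               # constant: must match exactly
                break
        else:
            yield from match(triples, rest, step)   # descend depth-first

data = [
    ("p1", "type", "scientist"),
    ("p1", "bornInLocation", "city1"),
    ("city1", "locatedIn", "Switzerland"),
]
results = list(match(data, [("?p", "type", "scientist"),
                            ("?p", "bornInLocation", "?city"),
                            ("?city", "locatedIn", "Switzerland")]))
```

Placing the most selective pattern first, as the selectivity heuristic described below does, keeps the branching at each recursion level, and hence the number of nodes touched, as small as possible.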
The solid lines in the figure represent the structure of the subgraph. The dot-dashed lines represent the order in which the triple patterns are applied to the stored RDF graph by the query engine.

Figure 2: Example Query

To determine the execution order of the triples, selectivity heuristics for basic graph patterns as described in [16] are exploited. SpiderStore determines the execution order by implicitly storing basic statistical data. This data contains information about the number of property instances and the number of incoming and outgoing edges grouped by the property of each node. Hence this statistical data does not need to be precomputed and can be used to determine the execution order by ordering the restriction triples based on their selectivity. Therefore, the restriction triples containing the most selective nodes, edges or combinations of these are applied first in order to keep the number of nodes which have to be processed as small as possible.

The processing order of the restrictions is crucial, as the restrictions may share variables, which have to be unified and therefore have to be processed in the correct order. The fact that SpiderStore connects all nodes bidirectionally is very beneficial for this computation, as restriction triples can be processed in left-to-right or right-to-left order. After the execution order is defined, a set of seed nodes is determined, which marks the starting point(s) of the search process. Depending on the selectivity, these seeds can either be a restricted subject or object node (e.g. "scientist" in the example query in Figure 2) or nodes which are connected to a very selective predicate. During the next steps, the restrictions are applied in depth-first order. For each matching restriction, the next restriction in order is pushed onto an execution stack. A result is found if all available restrictions are pushed onto this stack and are therefore satisfied by the current path. Subsequently, the last restriction is popped from the stack and the next edge or node at the current position is processed. The iteration continues until the stack is empty and no more nodes are available for processing. The variables defined within a query are tracked by using a shared variable assignment which is used during the graph traversal to allow the unification of variables. Every time a result is detected, the current state of these variable assignments is returned as a result. The result is streamed to the client or saved for further postprocessing (e.g. filter conditions and grouping).

4. EVALUATION
All experiments were conducted on a server equipped with two Intel Xeon L5520 Quad Core CPUs, 2.27 GHz, Linux kernel 2.6.18, CentOS, 64-bit architecture and 48 GB main memory.

4.1 Data Sets
For the evaluation we used two data sets: YAGO [17] and DBpedia [4]. The YAGO ontology is a huge knowledge repository based on Wikipedia and WordNet data. YAGO consists of 39,193,669 triples (93 predicates, 33,951,502 unique subjects and objects). On this data set, we executed the SPARQL queries which already served as a benchmark for the RDF-3X approach [13]. The DBpedia project [4] is concerned with the extraction of triple information contained within Wikipedia pages and infoboxes. The data set contains a total of 94,839,012 triples (39,664 predicates, 23,090,848 unique subjects and objects). For the DBpedia data set, we used the project's SPARQL example queries, extended them and added further complex queries in order to cover common query types (similar to the YAGO query set). An exact formulation of the queries used for the evaluation can be found in the appendix.

4.2 Evaluated Systems
For the evaluation of SpiderStore, we compared it against the main memory RDF store Sesame version 2.3.1 [7] without an inferencer enabled. Additionally, we compared it to the native secondary memory stores Jena TDB version 0.8.6 [20] and a standard installation of RDF-3X version 0.3.4 [13]. All systems were granted a maximum of 16 GB main memory for their computations. However, we were only able to set up Sesame for our tests by granting it 30 GB of main memory. All systems feature single threaded query engines and only use one out of the 16 available cores when executing queries. For the secondary storage systems, the warm cache measurements were conducted by executing the queries five times without dropping the caches and taking the best result for the evaluation.

4.3 Evaluation Results
The query execution times for the YAGO data set can be seen in Table 1, where "dnf" marks a query which was aborted after a run time of 10 minutes without having obtained a result. On the YAGO data set, SpiderStore is able to compute the query results significantly faster than the other systems in 5 out of 8 queries. SpiderStore also performs better than all other systems with regard to the geometric mean of all query execution times. The secondary memory system RDF-3X performs significantly better than the main memory system Sesame. However, the lightweight storage structure of SpiderStore performs better than RDF-3X on warmed caches. Queries consisting of triple patterns which contain many variables can result in a large number of paths which have to be validated. The order in which the restriction triples of such a query are processed is crucial. This fact is beneficial for RDF-3X, as it features a very mature query optimization engine, which is heavily based on statistics precomputed during the import process. SpiderStore currently contains a very naive optimizer and is therefore not able to exploit such facts, which results in slower query execution compared to RDF-3X on queries A2, B3 and C1.

Query        A1      A2      A3       B1      B2      B3      C1      C2      geom. mean
SpiderStore  0.0012  0.0765  0.0126   0.0518  0.0055  0.9984  0.3558  0.0196  0.0352
Sesame       0.0849  0.0387  0.1072   0.2252  0.1591  0.3765  dnf     0.0386  0.3200
RDF-3X       0.0184  0.0166  0.0360   0.0787  0.0381  0.0929  0.3377  0.0534  0.0522
Jena TDB     0.3090  0.9000  125.375  2.5100  2.9310  4.8552  dnf     8.9253  7.1286
#results     1       7       147      56      3,714   74      1,842   1

Table 1: Query Runtimes on the YAGO data set (in seconds).

The query execution times for the experiments on the DBpedia data set are listed in Table 3. The experiments on the DBpedia data set showed that SpiderStore clearly surpasses all other systems in terms of execution time, even though the query engine optimizer is very limited and not very mature. For example, projections on certain variables are not taken into account during the processing of queries. Therefore, the system tries to satisfy all variables occurring within the query, even if they are eventually not even asked for in the query, which again leads to a computational overhead during the execution of queries. Furthermore, queries featuring "union" or "optional" statements are currently split up into two queries, executed sequentially, and the results are joined in the end.

Query        Q1       Q2       Q3      Q4      Q5      Q6      Q7      geom. mean
SpiderStore  0.0378   0.00005  0.0002  0.0287  0.1127  0.0018  0.0005  0.00272
Sesame       80.9006  0.0010   0.0009  0.2184  0.2341  0.0541  0.0099  0.0572
RDF-3X       0.1197   0.0338   0.0273  0.0288  0.6778  0.0351  0.0058  0.0460
Jena TDB     dnf      dnf      94.246  1.128   1.623   6.644   0.200   13.5192
#results     2        4        21      1,560   1,172   1       2

Table 3: Query Runtimes on the DBpedia data set (in seconds).

In addition to the query execution time measurements, the import times for the compared systems were measured. We define the import time as the time required for a system to load all data and build all the required indices. Table 2 lists the import times for the compared systems. SpiderStore outperforms the other systems, as due to the implicit statistical information no additional statistics have to be computed. However, we have to note that in terms of restart time, disk based systems like RDF-3X and Jena TDB naturally outperform main memory systems, as all data structures are kept on secondary memory and therefore do not have to be built on every startup of the system, whereas for main memory systems the import has to be performed on every startup.

System       YAGO   DBpedia
SpiderStore  7:50   29:41
Sesame       25:47  53:51
RDF-3X       16:15  48:21
Jena TDB     38:52  53:49

Table 2: Import Times (in minutes)

5. RELATED WORK
Basically there are two approaches for storing RDF data in triple form: (i) storing triples in a traditional relational database or (ii) using a native triple store. The first, very popular approach maps all RDF triples onto tables of relational databases. This can either be realized by using a potentially very big triple table or by splitting the triples according to their predicates and storing all triples featuring the same predicate in a separate table (property tables). The worst case scenario for property tables is obtaining one table per predicate. However, the clustered property table approach tries to solve this problem by clustering predicates into tables based on their co-occurrence within the data set. These different approaches have been evaluated and benchmarked in [18].

There are several approaches for main memory RDF stores: Brahms [11] is a main memory RDF store which aims at high-performance association discovery based on variable-length queries on the RDF graph. GRIN [19] is concerned with the creation of an RDF index optimized for long path queries. In contrast, SpiderStore aims at efficiently answering all kinds of SPARQL queries and is not specialized on any particular query type.

As for the popular Semantic Web frameworks, Sesame [7] provides relational storage, in-memory storage and a native storage engine. The Sesame in-memory storage engine uses a list of quadruples called statements, where each statement consists of a subject, predicate, object and context node. Each node maintains statement lists of the node's occurrences as subject, predicate, object and context. Jena [20] also provides both a relational store (SDB) and a native triple store (TDB). Virtuoso [8] can be run on top of a native triple store (Virtuoso Triple Store) or on top of a relational database system (Virtuoso RDF Views). Abadi et al. [2] and Sidirourgos et al. [15] exploited column-wise storage of RDF data. YARS2 [10] facilitates native storage and makes use of six (relational) indices for subject, predicate, object and context. The RDF-3X system [13] is also based on extensive, heavily compressed indices. RDF-3X features indices for all possible orderings and subsets of subject, object and predicate. It also provides extensive heuristics about the stored data, which are further exploited for the processing of queries. Furthermore, Neumann and Weikum proposed join order optimizations based on sideways information passing and selectivity heuristics [14]. However, such extensive index structures are neither feasible nor required when dealing with main memory. The BitMat project [3] provides a lightweight RDF index for main-memory RDF stores which is based on a BitMat, a bit-cube index responsible for compressed storage and querying of triples. All of these systems have already been compared extensively in [6, 13]. As for the optimization of RDF storage and querying, Stocker et al. [16, 5] focus on the optimization of Basic Graph Patterns based on selectivity heuristics.

6. CONCLUSION AND FUTURE WORK
In this paper we presented a lightweight in-memory storage layout for RDF graphs which was implemented in a first prototype called SpiderStore. This layout provides fast and efficient in-memory RDF querying and storage without having to perform any complex mapping. Our experiments showed that a simplistic storage layout which is specifically designed for graph data is able to outperform common stores based on the relational model. The performance gain is based on (1) the lightweight storage model, which represents edges as memory pointers and therefore allows direct node jumps in the graph, (2) implicit statistics about selectivity, which can be used to optimize the processing order, and (3) depth-first query processing, which renders intermediate results and expensive join chains unnecessary. Future work on SpiderStore will include the optimization of the query execution order, which currently relies on a naive algorithm, based on the implicit statistical information. Furthermore, we will improve the memory management and layout regarding writing facilities.

7. REFERENCES
[1] VoltDB. Technical report, March 2010. http://www.voltdb.com/_pdf/VoltDBOverview.pdf.
[2] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB '07: Proceedings of the 33rd international conference on Very Large Data Bases, pages 411-422. VLDB Endowment, 2007.
[3] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "bit" loaded: a scalable lightweight join query processor for RDF data. In WWW '10: Proceedings of the 19th international conference on World Wide Web, pages 41-50, New York, NY, USA, 2010. ACM.
[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. The Semantic Web, pages 722-735, 2007.
[5] A. Bernstein, C. Kiefer, and M. Stocker. OptARQ: A SPARQL optimization approach based on triple pattern selectivity estimation. Technical report, Department of Informatics, University of Zurich, 2007.
[6] C. Bizer and A. Schultz. The Berlin SPARQL benchmark. International Journal On Semantic Web and Information Systems, 5(1), 2009.
[7] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. The Semantic Web - ISWC 2002, pages 54-68, 2002.
[8] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. Networked Knowledge - Networked Media, pages 7-24.
[9] G. Gardarin, J. Gruser, and Z. Tang. Cost-based selection of path expression processing algorithms in object-oriented databases. In Proceedings of the international conference on Very Large Data Bases, pages 390-401, 1996.
[10] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. The Semantic Web, pages 211-224.
[11] M. Janik and K. Kochut. Brahms: A workbench RDF store and high performance memory system for semantic association discovery. The Semantic Web - ISWC 2005, pages 431-445.
[12] A. Kemper and T. Neumann. HyPer: Hybrid OLTP & OLAP high performance database system. Technical Report TU-I1010, TU Munich, Institute of Computer Science, Germany, May 2010.
[13] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1):647-659, 2008.
[14] T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of Data, pages 627-640, New York, NY, USA, 2009. ACM.
[15] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and S. Manegold. Column-store support for RDF data management: not all swans are white. Proceedings of the VLDB Endowment, 1(2):1553-1563, 2008.
[16] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In WWW '08: Proceedings of the 17th international conference on World Wide Web, pages 595-604, New York, NY, USA, 2008. ACM.
[17] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In 16th international World Wide Web conference (WWW 2007), New York, NY, USA, 2007. ACM Press.
[18] Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking database representations of RDF/S stores. The Semantic Web - ISWC 2005, pages 685-701, 2005.
[19] O. Udrea, A. Pugliese, and V. Subrahmanian. GRIN: A graph based RDF index. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1465. AAAI Press; MIT Press, 2007.
[20] K. Wilkinson, C. Sayers, H. Kuno, D. Reynolds, et al. Efficient RDF storage and retrieval in Jena2. In Proceedings of SWDB, volume 3, pages 7-8. Citeseer, 2003.

APPENDIX
A. QUERIES
A.1 YAGO Data Set
A1: select ?gn ?fn where { ?gn <givenNameOf> ?p. ?fn <familyNameOf> ?p. ?p <type> scientist. ?p <bornInLocation> ?city. ?p <hasDoctoralAdvisor> ?a. ?a <bornInLocation> ?city2. ?city <locatedIn> Switzerland. ?city2 <locatedIn> Germany. }
A2: select ?n where { ?a <isCalled> ?n. ?a <type> actor. ?a <livesIn> ?city. ?a <actedIn> ?m1. ?a <directed> ?m2. ?city <locatedIn> ?s. ?s <locatedIn> United States. ?m1 <type> movie. ?m1 <producedInCountry> Germany. ?m2 <type> movie. ?m2 <producedInCountry> Canada. }
A3: select distinct ?n ?co where { ?p <isCalled> ?n. { ?p <type> actor } union { ?p <type> athlete }. ?p <bornInLocation> ?c. ?c <locatedIn> ?s. ?s <locatedIn> ?co. ?p <type> ?t. filter(?t reaches politician via <subClassOf>) }
B1: select distinct ?n1 ?n2 where { ?a1 <isCalled> ?n1. ?a1 <livesIn> ?c1. ?a1 <actedIn> ?movie. ?a2 <isCalled> ?n2. ?a2 <livesIn> ?c2. ?a2 <actedIn> ?movie. ?c1 <locatedIn> England. ?c2 <locatedIn> England. filter (?a1 != ?a2) }
B2: select ?n1 ?n2 where { ?p1 <isCalled> ?n1. ?p1 <bornInLocation> ?city. ?p1 <isMarriedTo> ?p2. ?p2 <isCalled> ?n2. ?p2 <bornInLocation> ?city. }
B3: select distinct ?n1 ?n2 where { ?n1 <familyNameOf> ?p1. ?n2 <familyNameOf> ?p2. ?p1 <type> scientist. ?p1 <hasWonPrize> ?award. ?p1 <bornInLocation> ?city. ?p2 <type> scientist. ?p2 <hasWonPrize> ?award. ?p2 <bornInLocation> ?city. filter (?p1 != ?p2) }
C1: select distinct ?n1 ?n2 where { ?n1 <familyNameOf> ?p1. ?n2 <familyNameOf> ?p2. ?p1 <type> scientist. ?p1 ?i1 ?city. ?p2 <type> scientist. ?p2 ?i2 ?city. ?city <type> <site>. filter (?p1 != ?p2) }
C2: select distinct ?n where { ?p <isCalled> ?n. ?p ?i1 ?c1. ?p ?i2 ?c2. ?c1 <type> <village>. ?c1 <isCalled> London. ?c2 <type> <site>. ?c2 <isCalled> Paris. }

A.2 DBpedia Data Set
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dbpedia2: <http://dbpedia.org/property/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix dbo: <http://dbpedia.org/ontology/>

Q1: select ?name ?name2 ?place where { ?person dbo:birthPlace ?place. ?person foaf:name ?name . ?place skos:subject <http://dbpedia.org/resource/Category:European_Capitals_of_Culture> . ?person2 dbo:birthPlace ?place . ?person2 foaf:name ?name2 . ?person skos:subject ?type1 . ?person2 skos:subject ?type2 . ?type1 skos:broader <http://dbpedia.org/resource/Category:Musicians> . ?type2 skos:broader <http://dbpedia.org/resource/Category:Musicians> . filter(?name != ?name2) }
Q2: select ?name1 ?name2 where { ?person foaf:name ?name1 . ?person2 foaf:name ?name2 . ?person dbpedia2:influences ?person2 . ?person2 dbpedia2:awards <http://dbpedia.org/resource/Time_100:_The_Most_Important_People_of_the_Century> . }
Q3: select ?name where { ?vehicle skos:subject ?cat . ?vehicle dbpedia2:transmission "6-speed manual" . ?vehicle dbpedia2:name ?name . ?vehicle dbo:manufacturer ?man . ?man dbo:location ?location . ?location dbo:leaderName <http://dbpedia.org/resource/Alfred_Lehmann> . ?cat skos:broader <http://dbpedia.org/resource/Category:Luxury_vehicles> . }
Q4: select distinct ?person ?person2 where { ?person ?z <http://dbpedia.org/resource/Category:IBM_Fellows>. ?person ?prop ?value. ?person2 ?prop2 ?value2. ?person2 ?prop3 <http://dbpedia.org/resource/Category:IBM_Fellows>. filter (?person != ?person2) }
Q5: select distinct ?scientist ?doctor where { ?scientist rdf:type <http://dbpedia.org/ontology/Scientist>. ?scientist <http://dbpedia.org/ontology/doctoralStudent> ?doctor. ?doctor ?property ?value. ?scientist ?prop2 ?value. }
Q6: select distinct ?player where { ?s dbpedia2:name ?player. ?s rdf:type <http://dbpedia.org/ontology/SoccerPlayer>. ?s dbpedia2:position "Goalkeeper". ?s <http://dbpedia.org/property/clubs> ?club. ?club <http://dbpedia.org/ontology/capacity> "32609" . ?s <http://dbpedia.org/ontology/birthPlace> ?place. ?place dbpedia2:populationCensus "49138831". }
Q7: select ?name ?birth ?description ?person ?x where { ?person dbo:birthPlace <http://dbpedia.org/resource/Berlin>. ?person skos:subject <http://dbpedia.org/resource/Category:German_musicians>. ?person dbo:birthDate ?birth. ?person foaf:name ?name. }