=Paper=
{{Paper
|id=Vol-1810/GraphQ_paper_01
|storemode=property
|title=On the Performance of Analytical and Pattern Matching Graph Queries in Neo4j and a Relational Database
|pdfUrl=https://ceur-ws.org/Vol-1810/GraphQ_paper_01.pdf
|volume=Vol-1810
|authors=Jürgen Hölsch,Tobias Schmidt,Michael Grossniklaus
|dblpUrl=https://dblp.org/rec/conf/edbt/HolschSG17
}}
==On the Performance of Analytical and Pattern Matching Graph Queries in Neo4j and a Relational Database==
On the Performance of Analytical and Pattern Matching Graph Queries in Neo4j and a Relational Database Jürgen Hölsch Tobias Schmidt Michael Grossniklaus Department of Computer and Information Science, University of Konstanz P.O. Box 188, 78457 Konstanz, Germany {juergen.hoelsch, tobias.2.schmidt, michael.grossniklaus}@uni-konstanz.de ABSTRACT In this paper, we bridge this gap by defining a set of Graph databases with a custom non-relational backend pro- analytical queries (Section 2). In addition, compared to re- mote themselves to outperform relational databases in an- lated work, we introduce a more fine-grained categorization swering queries on large graphs. Recent empirical studies of pattern matching queries including (i) paths that filter show that this claim is not always true. However, these on specifc node labels without restricting edge types, (ii) studies focus only on pattern matching queries and neglect paths that filter on specific edge types without restricting analytical queries used in practice such as shortest path, di- node labels, (iii) paths that filter on node labels and edge ameter, degree centrality or closeness centrality. In addition, types, and (iv) paths containing cycles. This categoriza- there is no distinction between different types of pattern tion provides the basis to systematically breaking down the matching queries. In this paper, we introduce a set of ana- reasons why one system is outperforming another system. lytical and pattern matching queries, and evaluate them in Since Neo4j is the most widely used native graph database, Neo4j and a market-leading commercial relational database we compare the execution times of Cypher queries in Neo4j system. We show that the relational database system out- to the execution times of corresponding SQL queries in a performs Neo4j for our analytical queries and that Neo4j is market-leading relational database system (Section 3). We faster for queries that do not filter on specific edge types. show that the relational database system outperforms Neo4j for our analytical queries. We argue that analytical queries, which access most or all nodes of the graph, benefit from the 1. INTRODUCTION more advanced disk and buffer management of the relational Application domains such as social media, biology, and database system and that there is room for improvement transportation planning produce large amounts of graph data. in this respect in Neo4j. In contrast, we demonstrate that Therefore, the management and processing of graph data Neo4j is more efficient for path queries that do not filter on is gaining importance. As a result, there are several graph specific edge types. Section 4 summarizes related work and databases such as Neo4j, DEX/Sparksee, and OrientDB. In Section 5 gives an outlook on future work. general, graph databases can be categorized into systems using an existing relational database as a backend or systems 2. QUERIES implementing a custom non-relational backend, which are We begin by introducing the set of queries that we use to often called “native graph databases”. Although a relational compare Neo4j to a relational database system. We catego- backend offers numerous advantages such as transactions, rize our queries in analytical and pattern matching queries. reliable storage and access control, native graph databases at- Analytical queries process large parts of the graph at a time, tract customers with the promise of outperforming traditional whereas pattern matching queries access only small parts relational databases. Recent empirical studies [4, 9], however, of the graph in most cases. In order to formally define our demonstrate that this promise is not always kept. Gubichev queries, we first have to introduce the property graph data and Then [4] show that several queries of their benchmark model on which Neo4j and its query language Cypher is timed out in Neo4j and DEX/Sparksee. While these studies based [6]. are a good starting point for comparing graph databases with a relational and a custom non-relational backend, they Definition 1. Assume a domain V of nodes, a domain E of only focus on pattern matching queries and neglect analytical directed edges, and a domain A of attributes. Additionally, queries such as shortest path and centrality measures. suppose a domain DA of atomic attribute values, a domain DL of node labels, and a domain DT of edge types. A property graph is a finite structure G = (V, E, A, λ, α, β, γ), where • V ⊆ V is a finite set of nodes, • E ⊆ E is a finite set of edges, c 2017, Copyright is with the authors. Published in the Workshop Proceed- ings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is • A ⊆ A is a finite set of attributes, permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0 • λ : E → V × V is a function assigning nodes to edges, • α : (V ∪ E) × A → DA is a partial function assigning SQL values to attributes of nodes and edges, SELECT n.id, CAST(COUNT(*) AS FLOAT)/ • β : V → P(DL ) is a partial function assigning labels (SELECT COUNT(*) FROM nodes) AS degree centrality to nodes, and FROM nodes n LEFT JOIN rel all bidirectional r ON n.id = r.sID GROUP BY n.id • γ : E → DT is a partial function assigning a type to edges. 2.1.3 QA3 : Connectedness Query QA3 returns all pairs of nodes u, v ∈ V which are Besides the property graph data model, we also define the connected by a path u →∗ v. notion of a path in a property graph. Cypher Definition 2. Suppose the nodes v1 , vn ∈ V . A path from MATCH (n), (m) v1 to vn , denoted as v1 →∗ vn , is a sequence of nodes and RETURN n, m, EXISTS((n)-[*]->(m)) AS is connected edges hv1 , e1 , v2 , . . . , vn−1 , en−1 , vn i, where ∀1 ≤ i ≤ n − 1 : λ(ei ) = (vi , vi+1 ). SQL 2.1 Analytical Queries SELECT a.id1, a.id2, In this subsection, we formally define the set of analytical CASE WHEN tc.sID IS NULL THEN 0 ELSE 1 END AS connected queries that have been proposed in the graph benchmark of FROM (SELECT n.id AS id1, m.id AS id2 FROM nodes n Grossniklaus et al. [3]. Since the focus of this paper is to CROSS JOIN nodes m) AS a compare Neo4j with a relational database system, we only LEFT JOIN transitive closure tc ON a.id1 = tc.sID use queries that can be expressed in Cypher. Therefore, we AND a.id2 = tc.dID cannot cover all analytical graph queries that are used in real world applications. For each query of this subsection, we 2.1.4 QA4 : Shortest Paths also derive the corresponding Cypher and SQL statements Suppose P (u, v) is a set containing all paths u →∗ v, where that are based on the dataset of the LUBM [5] benchmark u, v ∈ V , and length(p) is a function returning the number of (cf. Section 3). The tables and views that are used in the edges of a path p. A shortest path between two nodes u and SQL statements are described in the Appendix. Some of the v is a path p where ∀p0 ∈ P (u, v) : length(p) ≤ length(p0 ). following queries are limited in the number of result tuples The length of a shortest path between u and v is denoted as because Neo4j is not able to execute these queries without δ(u, v). Query QA4 returns for all pairs of nodes u, v ∈ V a restricting the result size. shortest path between u and v. 2.1.1 QA1 : Node Degree Cypher Query QA1 returns for each node v ∈ V its degree, which MATCH p = shortestPath((n)-[*]->(m)) is the number of ingoing and outgoing edges and formally RETURN n, m, p defined as follows: SQL deg(v) := {e|e ∈ E ∧ λ(e) = (x, y) ∧ (v = x ∨ v = y)} . SELECT * Cypher FROM shortest path MATCH (n)--(m) RETURN n, COUNT(m) 2.1.5 QA5 : Closeness Centrality Query QA5 returns for each node v ∈ V its closeness SQL centrality which is defined as follows: SELECT n.id, COUNT(*) AS degree FROM nodes n LEFT JOIN rel all bidirectional r ON n.id = r.sID 1 C C (v) := P . GROUP BY n.id u∈V δ(v, u) 2.1.2 QA2 : Degree Centrality Cypher Query QA2 returns for each node v ∈ V its degree cen- MATCH p = shortestPath((n)-[*]-(m)) trality, which is the number of ingoing and outgoing edges WHERE n <> m WITH n, p LIMIT 60000000 normalized by the total number of nodes in the graph. For- RETURN n, 1.0/SUM(LENGTH(p)) AS centrality mally, the degree centrality of a node v is defined as follows: deg(v) SQL C D (v) := . |V | SELECT TOP 60000000 sID AS node, 1.0 / SUM(length) AS centrality Cypher FROM (SELECT sID, dID, MIN(length) AS length MATCH (a) FROM shortest connection GROUP BY sID, dID WITH TOFLOAT(COUNT(*)) AS total UNION ALL SELECT dID, sID, MIN(length) AS length MATCH (n)-[e]-() FROM shortest connection GROUP BY sID, dID) AS t RETURN n, COUNT(e)/total AS centrality GROUP BY sID 2.1.6 QA6 : Diameter Note that we do not restrict the edge direction. Both Query QA6 returns the diameter of the graph which is the outgoing and ingoing edges are matched by the pattern length of the longest shortest path between any two nodes above. u, v ∈ V : Cypher d := max δ(u, v). u,v∈V MATCH (a:University)--(b:FullProfessor) RETURN a, b Cypher MATCH p = shortestPath((n)-[*]-(m)) SQL WITH p LIMIT 40000000 SELECT n1.id, n2.id RETURN p, LENGTH(p) FROM rel all bidirectional r ORDER BY LENGTH(p) DESC INNER JOIN node label n1 ON n1.id = r.sID LIMIT 1 INNER JOIN node label n2 ON n2.id = r.dID SQL WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’ SELECT MAX(length) 2.2.2 QP2 : Cycles FROM (SELECT TOP 40000000 * Queries of type QP2 extend queries of type QP1 with FROM shortest connection) AS t cycles. 2.1.7 QA7 : Grouping Query QA7 returns for each existing value x of an attribute l1 ... ln a ∈ A the number of nodes v ∈ V where α(v, a) = x: γ(a) := {(x, i) | v ∈ V ∧ a ∈ A ∧ x = α(v, a) ∧ i = |{u | u ∈ V ∧ α(u, a) = x}|}. Cypher Cypher MATCH (a:UndergraduateStudent)--(b:FullProfessor) --(c:Department)--(a) MATCH (a:Course) RETURN a,b,c RETURN a.name, COUNT(a) SQL SQL SELECT name.value, COUNT(*) AS count SELECT n1.id, n2.id, n3.id FROM node label node LEFT JOIN name ON name.id = node.id FROM rel all bidirectional r1 WHERE node.value = ’Course’ INNER JOIN node label n1 ON n1.id = r1.sID GROUP BY name.value INNER JOIN node label n2 ON n2.id = r1.dID ORDER BY count DESC INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id INNER JOIN node label n3 ON n3.id = r2.dID INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID 2.2 Pattern Matching Queries WHERE n1.value = ’UndergraduateStudent’ In the following, we define the types of pattern matching AND n2.value = ’FullProfessor’ queries used in our evaluation. We categorize our queries in AND n3.value = ’Department’ paths that filter on specific node labels without restricting AND r3.dID = n1.id edge types (QP1 ), paths that filter on specific edge types without restricting node labels (QP3 ), paths that filter on 2.2.3 QP3 : Paths filtering on edge types node labels and edge types (QP4 ), and paths containing Queries of type QP3 return paths with specific edge types cycles (QP2 ). This categorization provides a way to deter- as it is shown in the figure below, where t1 , . . . , tn ∈ DT are mine how efficient Neo4j and the relational database system edge types. can filter on specific node labels or edge types. For each pattern matching query type, we derive Cypher and SQL t2 t1 statements. However, in this subsection we only present the ... actual code of queries with patterns of length 1 (denoted as QPx,1 in Section 3, where x is the query type). The queries with longer patterns used in the evaluation are found in the Cypher appendix. MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c) RETURN a, b, c 2.2.1 QP1 : Paths filtering on node labels The first type of pattern matching queries in our evaluation SQL are paths filtering on specific node labels. This query type is illustrated below, where l1 , . . . , ln ∈ DL are labels. SELECT n1.id AS teacher, n2.id AS course, n3.id AS student FROM nodes n1, rel teacherOf rto, nodes n2, rel takesCourse rtc, nodes n3 l1 ... ln WHERE n1.id = rto.sID AND rto.dID = n2.id AND n2.id = rtc.dID AND n3.id = rtc.sID 2.2.4 QP4 : Paths filtering on node labels and edge types Queries of type QP4 are a combination of queries of type Neo4j RDBMS A RDBMS B 19 05 7 QP1 and of type QP3 . 1, 2 5 9 ,3 31 01 ,7 ,4 13 21 14 2, t2 t1 l2 ... l1 Runtime in sec 600 Cypher 400 0 25 MATCH (a:FullProfessor)-[:teacherOf]->(b:Course) 6 200 2 13 12 56 72 19 06 38 08 06 <-[:takesCourse]-(c:UndergraduateStudent) 01 03 01 29 14 0. 0. 0. 0. 0. 0. 0. 0. 0. 8 0 RETURN a, b, c QA1 QA2 QA3 QA4 QA5 QA6 QA7 SQL (a) Analytical queries 1 7 39 66 SELECT a.id, b.id, c.id FROM node label a, rel teacherOf r1, node label b, 60 rel takesCourse r2, node label c Runtime in sec WHERE a.value = ’FullProfessor’ AND b.value = ’Course’ .9 40 34 AND c.value = ’UndergraduateStudent’ AND a.id = r1.sID AND r1.dID = b.id AND b.id = r2.sID AND r2.dID = c.id 20 .4 11 8 9 9. 8. 0. 8 03 09 02 03 41 57 3. EVALUATION 0. 0. 0. 0. 0. 0 Having presented the queries used in our performance QP1,1 QP1,2 QP1,3 QP1,4 evaluation, we now present the experimental setup and the (b) Paths filtering on node labels obtained results. 00 28 0 3 0 1 3.1 Experimental Setup ,5 ,4 ,9 ,8 0 10 16 38 73 98 All experiments presented in this paper were performed 800 5 71 on a Mac Pro with a 3.5 GHz 6-Core Intel Xeon E5 pro- Runtime in ms 600 cessor with 64 GB main memory. In our evaluation, we compared Neo4j Version 3.0.3 with a market-leading commer- 400 0 cial database system. Due to license terms, the commercial 24 0 0 16 200 system is referred to as “RDBMS”. Neo4j and RDBMS are 14 70 57 30 installed on a 64-bit Windows 8.1 virtual machine with 32 0 GB of main memory. Both database systems run in their QP2,1 QP2,2 QP2,3 QP2,4 standard configuration. The data sets used in our evalu- (c) Cycles ation were generated by the data generator of the LUBM benchmark [5]. We generated a data set with one university 600 (LUBM-1) and a data set with two universities (LUBM-2). 487 463 Runtime in ms 404 To store the graph in RDBMS, we used the decomposed stor- 400 379 age model (DSM) described by Sakr et al. [10]. All runtime measurements were repeated five times in random order. The 200 reported averages discard the smallest and the largest value. 50 50 66 69 Figure 1 shows the runtimes of our queries on the LUBM- 0 8.7 8.2 8.5 8.6 1 data set, whereas Figure 2 shows the runtimes of our QP3,1 QP3,2 QP3,3 QP3,4 queries on the LUBM-2 data set. In our first experiments, (d) Paths filtering on edge types we have noticed that the performance of specific query types in RDBMS heavily depends on how efficiently edges can 100 86 be retrieved. Therefore, we created two different schemata Runtime in ms 80 in RDBMS that are both based on the decomposed storage 60 model. Schema “A” has tables for each edge type. In addition, 41 a view is computed containing edges of all types. In contrast, 40 26 29 23 Schema “B” has a table containing edges of all types. In 20 8.3 17 8.6 8.3 addition, there are views for each edge type. In the following, 0 RDBMS with Schema A is denoted as “RDBMS A” and QP4,1 QP4,2 QP4,3 RDBMS with Schema B is denoted as “RDBMS B”. We (e) Paths filtering node labels and edge types use the term “RDBMS” to refer to both RDBMS A and RDBMS B. 3.2 Results Figure 1: Runtimes of queries on LUBM-1. We first present the results obtained on data set LUBM-1. Figure 1(a) shows that for most of the analytical queries RDBMS is several orders of magnitude faster than Neo4j. Neo4j RDBMS A RDBMS B The largest difference can be seen for query QA5 which 2 39 computes the closeness centrality where RDBMS B is over 2, 1500× faster than Neo4j. The computation of the shortest 1,000 path is almost 10× faster in RDBMS B than in Neo4j. At a Runtime in sec 800 first glance, this is surprising since computing the shortest 4 67 1 5 63 60 600 path is a core functionality of Neo4j and is also provided 400 as a construct in the Cypher query language. Figure 1(b) gives the runtimes of our pattern matching queries that 6 200 11 .6 .7 filter specific node labels without restricting edge types. The .7 57 58 12 12 77 09 12 02 02 35 17 0. 0. 0. 0. 0. 0. 0. 0. 0 largest difference in the runtime of Neo4j and RDBMS can be QA1 QA2 QA3 QA4 QA5 QA6 QA7 observed for query QP1,4 which contains the largest number (a) Analytical queries of edges of all queries in (b). We can also see that the queries QP1,1 and QP1,3 are faster in RDBMS B than in 8 0 07 91 3, 1, Neo4j. The runtimes of cyclic pattern matching queries that filter specific node labels without restricting edge types are .1 94 100 shown in Figure 1(c). It can be seen that Neo4j outperforms Runtime in sec RDBMS. For all queries of Figure 1(c), Neo4j is almost 50 an order of magnitude faster than RDBMS. The runtimes of pattern matching queries on filtering specific edge types .6 .4 24 23 without restricting node labels are given in Figure 1(d). 13 42 1 5 2 8 91 02 06 03 06 4. 1. In contrast to the previous pattern matching query types, 0. 0. 0. 0. 0 QP1,1 QP1,2 QP1,3 QP1,4 RDBMS answers queries of this type more efficiently than (b) Paths filtering on node labels Neo4j. This is due to the fact that each relationship type has its own table (or view) in RDBMS. Finally, Figure 1(e) plots the runtimes of pattern matching queries that filter on node 0 0 0 90 13 50 70 91 0 0 96 ,9 56 7, ,0 5, 3, labels and edge types. As before, RDBMS B outperforms 16 19 86 81 37 7, 3, 400 Neo4j and RDBMS A. Moving on to data set LUBM-2, Figure 2(a) shows the run- 0 Runtime in ms 29 300 time of our analytical queries. In Neo4j, only query QA1 and QA2 could be executed in reasonable time. The other queries 200 were aborted after 24 hours or ran out of memory (denoted 7 10 100 by in Figure 2(a)). The runtimes of our pattern matching 70 60 41 queries that filter specific node labels without restricting edge 0 QP2,1 QP2,2 QP2,3 QP2,4 types are given in Figure 2(b). For query QP1,3 RDBMS A is faster than Neo4j, otherwise Neo4j outperforms RDBMS. (c) Cycles Figure 2(c) plots the runtimes of cyclic pattern matching 1,071 932 1,203 1,732 queries that filter specific node labels without restricting edge types. As for the results obtained on the LUBM-1 data 600 549 set (Figure 1(c)), Neo4j is consistently faster than RDBMS for this type of pattern matching query. The runtimes of Runtime in ms 400 349 pattern matching queries filtering specific edge types without restricting node labels that are plotted in Figure 2(d) show 200 138 169 that RDBMS consistently outperforms Neo4j for this type of 106 105 82 76 query. Finally, the runtimes of pattern matching queries that 0 filter on node labels and edge types are shown in Figure 2(e). QP3,1 QP3,2 QP3,3 QP3,4 For these queries, RDBMS A consistently outperforms Neo4j. (d) Paths filtering on edge types 202 3.3 Discussion and Limitations To conclude this section, we summarize our results and 100 83 discuss limitations of our evaluation. We have shown that RDBMS outperforms Neo4j for our analytical queries. For Runtime in ms 80 60 50 most of the analytical queries, RDBMS is several orders of 47 45 40 33 43 39 38 magnitude faster than Neo4j. A possible reason for this result is the fact that the analytical queries have to access most or all 20 nodes of the graph. Therefore, they profit from the advanced 0 disk and memory management of relational database systems. QP4,1 QP4,2 QP4,3 We also showed that RDBMS A and RDBMS B both perform (e) Paths filtering node labels and edge types badly for queries that need to join the whole edge table (or view) multiple times for longer patterns. For queries with cycles, the performance is even worse. However, if we only Figure 2: Runtimes of queries on LUBM-2. query specific edge types, RDBMS outperforms Neo4j as can be seen in Figure 1(d) and Figure 2(d). Comparing RDBMS A with RDBMS B on LUBM-1, we can observe that Acknowledgement for queries that filter on specific edge types RDBMS B is more This work is supported by Grant No. GR 4497/2 of the efficient than RDBMS A. However, for queries with longer Deutsche Forschungsgemeinschaft (DFG). patterns that do not filter on specific edge types RDBMS A outperforms RDBMS B. As for LUBM-1, there is again no clear winner on LUBM-2. 6. REFERENCES Since the work presented in this paper is only the first [1] R. Angles, A. Prat-Pérez, D. Dominguez-Sal, and J.-L. step towards a more extensive empirical study, it has some Larriba-Pey. Benchmarking Database Systems for shortcomings and open issues. First, we only use the de- Social Network Applications. In Proc. Workshop on composed storage model [10] to map a property graph to Graph Data Management Experiences and Systems relations. Since the choice of the graph-to-relation mapping (GRADES), pages 1–7, 2013. strongly impacts the performance of queries, we need to [2] M. Ciglan, A. Averbuch, and L. Hluchy. Benchmarking evaluate alternative mappings such as the hybrid schema Traversal Operations over Graph Databases. In Proc. described by Sun et al. [11]. Second, the performance of a ICDE Workshops (ICDEW), pages 186–189, 2012. query also depends on the characteristics of the graph, e.g., [3] M. Grossniklaus, S. Leone, and T. Zäschke. Towards a the number of distinct labels or average node degree. Hence, Benchmark for Graph Data Management and further data sets with different graph characteristics need to Processing. Technical report, University of Konstanz, be evaluated. Finally, pattern matching queries that filter 2013. on specific attribute values should be included as our current http://nbn-resolving.de/urn:nbn:de:bsz:352-242536. queries filter on node labels and edge types only. [4] A. Gubichev and M. Then. Graph Pattern Matching: Do We Have to Reinvent the Wheel? In Proc. 4. RELATED WORK Workshop on Graph Data Management Experiences and There are several empirical studies on the query perfor- Systems (GRADES), pages 1–7, 2014. mance of graph database systems. To the best of our knowl- [5] Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark edge, none of these studies cover analytical queries formulated for OWL Knowledge Base Systems. J. Web Sem., in a declarative graph query language. The work of Vick- 3(2-3):158–182, 2005. nair et al. [12] benchmarks Neo4j and MySQL. The authors [6] J. Hölsch and M. Grossniklaus. An Algebra and define three simple traversal queries and queries counting the Equivalences to Transform Graph Patterns in Neo4j. In number of nodes with a specific value. Ciglan et al. [2] im- Proc. Intl. Workshop on Querying Graph Structured plement graph traversal operations in Neo4j, DEX/Sparksee, Data (GraphQ), 2016. OrientDB, NativeSail, and SGDB, and compare their per- [7] S. Jouili and V. Vansteenberghe. An Empirical formance. Grossniklaus et al. [3] benchmark Neo4j and a Comparison of Graph Databases. In Proc. Intl. Conf. relational database system by implementing a set of analyti- on Social Computing (SOCIALCOM), pages 708–715, cal queries using the Neo4j API and SQL. Welc et al. [13] 2013. evaluate the performance of computing the shortest path in [8] R. McColl, D. Ediger, J. Poovey, D. Campbell, and Neo4j, Oracle and the Green-Marl language infrastructure. D. A. Bader. A Performance Evaluation of Open Angles et al. [1] benchmark graph databases in the context Source Graph Databases. In Proc. Workshop on of social networks. They define a set of pattern matching Parallel Programming for Analytics Applications and reachability queries that are often used to analyse social (PPAA), pages 11–18, 2014. networks. Jouili and Vansteenberghe [7] use traversal and [9] N. Pobiedina, S. Rümmele, S. Skritek, and H. Werthner. shortest path queries to benchmark Neo4j, DEX/Sparksee, Benchmarking Database Systems for Graph Pattern OrientDB and Titan. McColl et al. [8] evaluate open-source Matching. In Proc. Intl. Conf. on Database and Expert graph databases using implementations of the graph algo- Systems Applications (DEXA), pages 226–241, 2014. rithms shortest path, connected components and PageRank. [10] S. Sakr, S. Elnikety, and Y. He. G-SPARQL: A Hybrid Pobiedina et al. [9] evaluate pattern matching queries in Engine for Querying Large Attributed Graphs. In Proc. Neo4j (Cypher), PostgreSQL (SQL), Jena TDB (SPARQL) Intl. Conf. on Information and Knowledge Management and Clingo (ASP). (CIKM), pages 335–344, 2012. [11] W. Sun, A. Fokoue, K. Srinivas, A. Kementsietsidis, 5. CONCLUSION AND FUTURE WORK G. Hu, and G. Xie. SQLGraph: An Efficient In this paper, we defined a set of analytical and pattern Relational-Based Property Graph Store. In Proc. Intl. matching queries in Cypher and SQL, and evaluated them in Conf. on Management of Data (SIGMOD), pages Neo4j and a market-leading commercial relational database 1887–1901, 2015. system. We showed that the relational database system [12] C. Vicknair, M. Macias, Z. Zhao, X. Nan, Y. Chen, and outperforms Neo4j for our analytical queries and that Neo4j D. Wilkins. A Comparison of a Graph Database and a is faster for queries that do not filter on specific edge types. Relational Database: A Data Provenance Perspective. As future work we plan to build on this initial study In Proc. ACM Southeast Regional Conference, page 42, in order to better understand the implications of different 2010. backends on graph query performance. First, we will study [13] A. Welc, R. Raman, Z. Wu, S. Hong, H. Chafi, and the influence of different graph-to-relation mappings on the J. Banerjee. Graph Analysis: Do We Have to Reinvent performance of relational backends. Additionally, we will the Wheel? In Proc. Workshop on Graph Data include further data sets with different graph characteristics Management Experiences and Systems (GRADES), and pattern matching queries that filter on attribute values. pages 1–6, 2013. APPENDIX t.step from AS step to FROM ShortestPathCTE t A. TABLES AND VIEWS INNER JOIN shortest connection s ON s.dID = t.step from AND s.sID = t.sID Below, we introduce the tables and views that were created ) SELECT sID, dID, for Schema A and Schema B in order to store a property coalesce(step from, sID) AS step from, step to FROM ShortestPathCTE graph in a relational database system. A node has a unique id and is stored in the node_label relation with its corre- sponding labels. Therefore, for each label of a node, there is B. PATTERN MATCHING QUERIES an entry in the node_label table. For each attribute, there is a relation containing entries of the node id and the attribute B.1 Paths restricted on node labels value. The node_label table and each attribute table have a QP1,2 Cypher clustered index on the node id. In Schema A, there is a table MATCH (a:University)--(b:FullProfessor)--(c:Department) for each edge type containing the edge id and the ids of the RETURN a, b, c source and target nodes. The edge tables have a clustered index on the id of the source and target node. In Schema B, QP1,2 SQL there is a table containing edges of all types. This table has SELECT n1.id, n2.id, n3.id FROM rel all bidirectional r1 a clustered index on the id of the source and target node as INNER JOIN node label n1 ON n1.id = r1.sID INNER JOIN node label n2 ON n2.id = r1.dID well as an unclustered index on the column indicating the INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id edge type. In addition, we create the following views, where INNER JOIN node label n3 ON n3.id = r2.dID the prefix “rel ” denotes an edge table (or view). However, WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’ AND n3.value = ’Department’ in Schema B we have a table rel_all instead of a view. CREATE VIEW nodes (id) AS QP1,3 Cypher SELECT DISTINCT id FROM node label MATCH (a:University)--(b:FullProfessor)--(c:Department) --(d:UndergraduateStudent) CREATE VIEW rel all (sID, dID) AS RETURN a, b, c, d SELECT sID, dID FROM rel advisor UNION ALL SELECT sID, dID FROM rel doctoralDegreeFrom QP1,3 SQL UNION AL SELECT n1.id, n2.id, n3.id, n4.id FROM rel all bidirectional r1 SELECT sID, dID FROM rel headOf INNER JOIN node label n1 ON n1.id = r1.sID UNION ALL INNER JOIN node label n2 ON n2.id = r1.dID SELECT sID, dID FROM rel mastersDegreeFrom INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id UNION ALL INNER JOIN node label n3 ON n3.id = r2.dID SELECT sID, dID FROM rel memberOf INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID UNION ALL INNER JOIN node label n4 ON n4.id = r3.dID SELECT sID, dID FROM rel publicationAuthor WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’ UNION ALL AND n3.value = ’Department’ SELECT sID, dID FROM rel subOrganizationOf AND n4.value = ’UndergraduateStudent’ UNION ALL SELECT sID, dID FROM rel takesCourse UNION ALL QP1,4 Cypher SELECT sID, dID FROM rel teacherOf MATCH (a:University)--(b:FullProfessor)--(c:Department) UNION ALL --(d:UndergraduateStudent)--(e:Course) SELECT sID, dID FROM rel teachingAssistantOf RETURN a, b, c, d, e UNION ALL SELECT sID, dID FROM rel undergraduateDegreeFrom UNION ALL QP1,4 SQL SELECT sID, dID FROM rel worksFor SELECT n1.id, n2.id, n3.id, n4.id, n5.id FROM rel all bidirectional r1 CREATE VIEW rel all bidirectional (sID, dID) AS INNER JOIN node label n1 ON n1.id = r1.sID SELECT sID, dID FROM rel all INNER JOIN node label n2 ON n2.id = r1.dID UNION ALL INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id SELECT dID AS sID, sID AS dID FROM rel all INNER JOIN node label n3 ON n3.id = r2.dID INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID CREATE VIEW transitive closure AS INNER JOIN node label n4 ON n4.id = r3.dID WITH BacktraceCTE(sID, parent, dID, length) AS INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID (SELECT sID, NULL, dID, 1 INNER JOIN node label n5 ON n5.id = r4.dID FROM dbo.rel all WHERE n1.value = ’University’ AND n2.value = ’FullProfessor’ UNION ALL AND n3.value = ’Department’ SELECT b.sID, b.dID, r.dID, b.length + 1 AND n4.value = ’UndergraduateStudent’ AND n5.value = ’Course’ FROM BacktraceCTE AS b INNER JOIN dbo.rel all AS r ON b.dID = r.sID) SELECT sID, parent, dID, length FROM BacktraceCTE B.2 Cycles CREATE VIEW shortest connection AS QP2,2 Cypher SELECT DISTINCT t.* FROM dbo.transitive closure t MATCH (a:UndergraduateStudent)--(b:AssociateProfessor) INNER JOIN (SELECT sID, dID, MIN(length) as minLength --(c:GraduateStudent)--(d:Department)--(a) FROM dbo.transitive closure GROUP BY sID, dID) m RETURN a,b,c,d ON m.sID = t.sID AND m.dID = t.dID AND m.minLength = t.length CREATE VIEW shortest path AS WITH QP2,2 SQL ShortestPathCTE (sID, dID, step from, step to) AS ( SELECT n1.id, n2.id, n3.id, n4.id FROM rel all bidirectional r1 SELECT sID, dID, parent, dID FROM dbo.shortest connection INNER JOIN node label n1 ON n1.id = r1.sID UNION ALL INNER JOIN node label n2 ON n2.id = r1.dID SELECT t.sID, t.dID, coalesce(s.parent, t.sID) AS step from, INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id INNER JOIN node label n3 ON n3.id = r2.dID QP3,3 Cypher INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c) INNER JOIN node label n4 ON n4.id = r3.dID -[:advisor]->(d)-[:mastersDegreeFrom]->(e) INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID RETURN a, b, c, d, e WHERE n1.value = ’UndergraduateStudent’ AND n2.value = ’AssociateProfessor’ AND n3.value = ’GraduateStudent’ QP3,3 SQL AND n4.value = ’Department’ AND r4.dID = n1.id SELECT n1.id AS teacher, n2.id AS course, n3.id AS student, n4.id AS advisor QP2,3 Cypher FROM nodes n1, rel teacherOf rto, nodes n2, rel takesCourse rtc, nodes n3, rel advisor ra, nodes n4, MATCH (a:UndergraduateStudent)--(b:Course) rel mastersDegreeFrom rmdf, nodes n5 --(c:AssociateProfessor)--(d:UndergraduateStudent) WHERE n1.id = rto.sID AND rto.dID = n2.id --(e:Course)--(a) WHERE b <> e AND a<>d AND n2.id = rtc.dID AND n3.id = rtc.sID RETURN a,b,c,d,e AND n3.id = ra.sID AND n4.id = ra.dID AND n4.id = rmdf.sID AND n5.id = rmdf.dID QP2,3 SQL QP3,4 Cypher SELECT n1.id, n2.id, n3.id, n4.id, n5.id FROM rel all bidirectional r1 MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c) INNER JOIN node label n1 ON n1.id = r1.sID -[:advisor]->(d)-[:mastersDegreeFrom]->(e) INNER JOIN node label n2 ON n2.id = r1.dID <-[:doctoralDegreeFrom]-(f) INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id RETURN a, b, c, d, e, f INNER JOIN node label n3 ON n3.id = r2.dID INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID QP3,4 SQL INNER JOIN node label n4 ON n4.id = r3.dID INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID SELECT n1.id AS teacher, n2.id AS course, n3.id AS student, INNER JOIN node label n5 ON n5.id = r4.dID n4.id AS advisor INNER JOIN rel all bidirectional r5 ON n5.id = r5.sID FROM nodes n1, rel teacherOf rto, nodes n2, rel takesCourse rtc, WHERE n1.value = ’UndergraduateStudent’ AND n2.value = ’Course’ nodes n3, rel advisor ra, nodes n4, rel mastersDegreeFrom rmdf, AND n3.value = ’AssociateProfessor’ nodes n5, rel doctoralDegreeFrom rddf, nodes n6 AND n4.value = ’UndergraduateStudent’ AND n5.value = ’Course’ WHERE n1.id = rto.sID AND rto.dID = n2.id AND r5.dID = n1.id AND n1.id != n4.id AND n2.id != n5.id AND n2.id = rtc.dID AND n3.id = rtc.sID AND n3.id = ra.sID AND n4.id = ra.dID AND n4.id = rmdf.sID AND n5.id = rmdf.dID QP2,4 Cypher AND n5.id = rddf.dID AND n6.id = rddf.sID MATCH (a:UndergraduateStudent)--(b:Course) --(c:AssociateProfessor)--(d:UndergraduateStudent) B.4 Paths restricted on node labels and edge --(e:Course) --(f:AssistantProfessor)--(a) types WHERE b <> e AND a<>d RETURN a,b,c,d,e,f QP4,2 Cypher MATCH (a:FullProfessor)-[:teacherOf]->(b:Course) QP2,4 SQL <-[:takesCourse]-(c:UndergraduateStudent) -[:advisor]->(d:AssociateProfessor) SELECT n1.id, n2.id, n3.id, n4.id, n5.id, n6.id RETURN a, b, c, d FROM rel all bidirectional r1 INNER JOIN node label n1 ON n1.id = r1.sID INNER JOIN node label n2 ON n2.id = r1.dID QP4,2 SQL INNER JOIN rel all bidirectional r2 ON r2.sID = n2.id SELECT a.id, b.id, c.id, d.id INNER JOIN node label n3 ON n3.id = r2.dID FROM node label a, rel teacherOf r1, node label b, INNER JOIN rel all bidirectional r3 ON n3.id = r3.sID rel takesCourse r2, node label c, rel advisor r3, node label d INNER JOIN node label n4 ON n4.id = r3.dID WHERE a.value = ’FullProfessor’ AND b.value = ’Course’ INNER JOIN rel all bidirectional r4 ON n4.id = r4.sID AND c.value = ’UndergraduateStudent’ INNER JOIN node label n5 ON n5.id = r4.dID AND d.value = ’AssociateProfessor’ INNER JOIN rel all bidirectional r5 ON n5.id = r5.sID AND a.id = r1.sID AND r1.dID = b.id AND b.id = r2.dID INNER JOIN node label n6 ON n6.id = r5.dID AND r2.sID = c.id AND c.id = r3.sID AND r3.dID = d.id INNER JOIN rel all bidirectional r6 ON n6.id = r6.sID WHERE n1.value = ’UndergraduateStudent’ AND n2.value = ’Course’ AND n3.value = ’AssociateProfessor’ QP4,3 Cypher AND n4.value = ’UndergraduateStudent’ AND n5.value = ’Course’ MATCH (a:FullProfessor)-[:teacherOf]->(b:Course) AND n6.value = ’AssistantProfessor’ AND r6.dID = n1.id <-[:takesCourse]-(c:UndergraduateStudent) AND n1.id != n4.id AND n2.id != n5.id -[:advisor]->(d:AssociateProfessor) -[:mastersDegreeFrom]->(e:University) RETURN a, b, c, d, e B.3 Paths restricted on edge types QP3,2 Cypher QP4,3 SQL MATCH (a)-[:teacherOf]->(b)<-[:takesCourse]-(c) SELECT a.id, b.id, c.id, d.id, e.id -[:advisor]->(d) FROM node label a, rel teacherOf r1, node label b, RETURN a, b, c, d rel takesCourse r2, node label c, rel advisor r3, node label d, rel mastersDegreeFrom r4, node label e WHERE a.value = ’FullProfessor’ AND b.value = ’Course’ QP3,2 SQL AND c.value = ’UndergraduateStudent’ AND d.value = ’AssociateProfessor’ AND e.value = ’University’ SELECT n1.id AS teacher, n2.id AS course, n3.id AS student, AND a.id = r1.sID AND r1.dID = b.id AND b.id = r2.dID n4.id AS advisor AND r2.sID = c.id AND c.id = r3.sID AND r3.dID = d.id FROM nodes n1, rel teacherOf rto, nodes n2, AND d.id = r4.sID AND r4.dID = e.id rel takesCourse rtc, nodes n3, rel advisor ra, nodes n4 WHERE n1.id = rto.sID AND rto.dID = n2.id AND n2.id = rtc.dID AND n3.id = rtc.sID AND n3.id = ra.sID AND n4.id = ra.dID