<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An approach to efficiently storing property graphs in relational databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthias Schmid</string-name>
          <email>Matthias.Schmid@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Passau, Germany</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Graph-structured data can be found in an increasing number of use-cases. While there exists a considerable number of solutions to store graphs in NoSQL databases, the combined storage of relationally stored data with graph-structured data within the same system is not well researched. We present a relational approach to storing and processing huge property graphs, which is optimized for read-only queries. This way, all the advantages of full-fledged relational database systems can be used and the seamless integration of classical relational data with graph-structured data is possible.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The increasing appearance of data graphs reinforces the need for adequate graph data management. To meet this requirement of property graph storage, several graph database engines have been developed. The most popular examples of databases with graph storage capability include Neo4j, Microsoft Azure Cosmos DB and OrientDB. While these graph databases offer good performance as long as the data fits into main memory, one cannot always assume that this requirement can be fulfilled. In addition, the seamless integration of those solutions with relational-based information systems remains a problem.</p>
      <p>Motivated by our own use-case, in which we need to store huge graphs that represent buildings, RDF data that semantically describes those buildings and relational data in a single information system, we were looking for a relational-based solution that offers efficient read-focused performance.</p>
      <p>(Popularity according to https://db-engines.com/de/ranking/graph+dbms; if not explicitly stated, all websites were accessed in February 2019. Product websites: https://neo4j.com/, https://azure.microsoft.com/de-de/services/cosmos-db/, https://orientdb.com/)</p>
      <p>
        In SQLGraph: An Efficient Relational-Based Property Graph Store [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] Sun et al. introduce an efficient approach to store property graph data in relational databases, named SQLGraph. While the original approach seems to offer good performance and promises to also perform well on datasets that do not fit into main memory, we were able to increase its performance for read-only queries by adapting the adjacency list concept.
      </p>
      <p>
        The contributions of this paper are:
1. We propose a new adaptation of the relational schema presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
2. we show that the adapted schema performs better on read-only queries,
3. and we show that the new schema requires less disk space.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        In this work we focus on the relational-based storage of property graph data. Bornea et al. first proposed the approach of shredding graph edges into adjacency tables in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Sun et al. base their work in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] on the previously mentioned work of Bornea et al. and outline a novel schema layout for the storage of property graph data in relational databases, which is generalized from the approach to store RDF data in relational databases. They combine the shredding of edges into adjacency tables with the use of JSON-based attribute storage to overcome the limitations of fixed columns. To make the retrieval of data more convenient, they propose a translation mechanism that converts Gremlin [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] queries to SQL queries, which can be run on the proposed relational schema.
      </p>
      <p>
        In the field of RDF there exist numerous benchmarking efforts, but to the best of our knowledge the benchmarks provided by the Linked Data Benchmark Council (LDBC) are the only benchmarks that directly address the problem of benchmarking property graph stores. We do not consider graph processing frameworks like Apache Giraph (http://giraph.apache.org/) in our work, since we focus on the storage and retrieval of graph data, not its parallel processing. The LDBC is an EU project with the goal to develop benchmarks for graph-structured data. They strive to reach the acceptance that benchmarks like the TPC [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have achieved. The Linked Data Benchmark Council - Social Network Benchmark (LDBC-SNB) is one approach being developed by the LDBC that uses a generated social network graph as its data set and represents the data as a property graph. The benchmark suite also provides a data set generator that can create test data with different scale factors. An overview of different scale factors and the corresponding numbers of vertices and edges is depicted in Table 1. Work on the Social Network Benchmark is not finished yet, but the Interactive Workload has been released in draft stage [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. SQLGRAPH SCHEMA ADAPTATION</title>
      <p>For this paper we use the following definition of a property graph.</p>
      <p>Definition 1. A property graph consists of a set of vertices and directed edges. Every edge has a label assigned to it. Each vertex or edge can additionally have multiple key-value pairs assigned that serve as attributes.</p>
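<p>As a minimal illustration of Definition 1 (plain Python; the class and field names are our own and not part of any schema):</p>

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: int
    attributes: dict = field(default_factory=dict)  # key-value pairs

@dataclass
class Edge:
    eid: int
    source: int   # vid of the source vertex
    target: int   # vid of the target vertex (edges are directed)
    label: str    # every edge carries exactly one label
    attributes: dict = field(default_factory=dict)

# a tiny property graph, using the names from the vertex table example
v1 = Vertex(1, {"firstName": "David", "lastName": "Mueller"})
v2 = Vertex(2, {"firstName": "Jae-Jin", "lastName": "Choi"})
e4 = Edge(4, source=1, target=2, label="knows")
```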
    </sec>
    <sec id="sec-4">
      <title>Proposed Schema</title>
      <p>
        Our approach utilizes the combined approach of relational and JSON-based storage developed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We adapted the schema of adjacency lists and as a result, our proposed schema reduces the number of required tables from six (as in the original SQLGraph schema) to four.
      </p>
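<p>A minimal sketch of such a four-table layout (SQLite standing in for PostgreSQL, with JSON text standing in for jsonb and array columns; all table and column names here are hypothetical illustrations, not the exact schema of our prototype):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- vertex id plus a JSON attribute document
CREATE TABLE vertex (vid INTEGER PRIMARY KEY, attributes TEXT);
-- one row per vertex (and hash bucket): column triples of
-- edge-id array, label and target-vertex array (JSON text here)
CREATE TABLE outgoing_adjacency (vid INTEGER, eids1 TEXT, label1 TEXT, targets1 TEXT);
CREATE TABLE incoming_adjacency (vid INTEGER, eids1 TEXT, label1 TEXT, sources1 TEXT);
-- redundant edge table: source, target, label and JSON attributes
CREATE TABLE edge (eid INTEGER PRIMARY KEY, source INTEGER,
                   target INTEGER, label TEXT, attributes TEXT);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```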
      <p>The vertex table stores the internal vertex id and the attributes for each vertex. In our prototypical implementation the attributes are stored using the jsonb data type provided by PostgreSQL.</p>
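<p>The round trip through the attribute column can be sketched as follows (SQLite with plain JSON text instead of PostgreSQL's jsonb, purely for illustration):</p>

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (vid INTEGER PRIMARY KEY, attributes TEXT)")

# attribute sets of arbitrary shape are serialized into the JSON column
rows = [
    (1, json.dumps({"lastName": "Mueller", "firstName": "David"})),
    (2, json.dumps({"lastName": "Choi", "firstName": "Jae-Jin"})),
]
conn.executemany("INSERT INTO vertex VALUES (?, ?)", rows)

# retrieval: load the tuple and deserialize the attribute document
attrs = json.loads(conn.execute(
    "SELECT attributes FROM vertex WHERE vid = 1").fetchone()[0])
```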
      <p>VID | Attributes
1 | {"lastName": "Mueller", "firstName": "David"}
2 | {"lastName": "Choi", "firstName": "Jae-Jin"}</p>
      <sec id="sec-4-1">
        <title>Adjacency Tables</title>
        <p>
          Our approach differs from the SQLGraph schema with respect to the definition of the adjacency tables. In their work Sun et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] show that shredding vertex adjacency lists into a relational schema provides a significant advantage over other mechanisms, for example mechanisms that store all edge information in a single table. To this end a hash function has to be defined that matches edge labels to a respective column triple of the adjacency table. We directly apply the approach to compute a hash function through the use of coloring heuristics as described in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Note that, since the LDBC-SNB data set includes a data model, it is possible to compute a conflict-free hash function for this specific data set. In the original approach in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] the edges are split into two tables: If only one edge of a label exists for the vertex, its edge id, label and target node id are stored in the outgoing primary adjacency (OPA) and incoming primary adjacency (IPA) tables respectively. By studying available data sets and use cases (e.g. the LDBC-SNB) one can see that it is usually not the case that only one edge of a specific label exists for a single node. Therefore, if the vertex has multiple outgoing edges of the same label, the edge ids and target vertices are stored in the outgoing secondary adjacency (OSA) and incoming secondary adjacency (ISA) tables respectively, while the target vertex id of the OPA or IPA is set to an artificial value that serves as the join partner for the OSA and ISA table. A query to retrieve all outgoing or incoming neighbours can be written with the use of an outer join. If there exists only a single edge, the outer join will not find a corresponding partner in the secondary table and therefore uses the data in the primary table. If more edges do exist, the join partners will be the resulting edges. This query works independently of the number of edges per label and the hash function. We will present an example of this query structure for our schema adaptation, which omits the outer join, later in this section.
        </p>
        <p>Nevertheless, this means every edge hop of a complex query requires an (outer) join operation between the primary and secondary adjacency tables. This is the case even if only a single edge of the required type is present, since the construction of the query should be independent of the hash function. Because the retrieval of the direct neighbours of a vertex is one of the most important operations in any graph application, this join poses a major bottleneck for most queries.</p>
        <p>In our approach we eliminate the need for a join operation between OPA and OSA by storing all edges of one type in the corresponding columns using arrays. Table 3 shows an example of our adjacency tables: In the graph there are two knows-edges that point from the vertex with id 1 to vertices 2 and 3. Since in this example the knows and the likes label are hashed to the same column triple, a second row for vertex 1 has to be inserted. Note that this can be avoided with a better hash function or by providing more columns. An evaluation of the optimal number of columns, for the case that no hash function with a low number of resulting conflicts for a given number of columns can be found, has not been part of our research yet and will have to be addressed in the future.</p>
        <p>Since we can provide a conflict-free hash function for the LDBC-SNB data set, an outgoing neighbourhood query for a specific vertex can be answered by loading a single tuple from the database. By eliminating the need of an outer join for any edge hop, a major drawback of the original schema is overcome.</p>
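<p>The mechanism just described can be sketched in a few lines (plain Python dictionaries stand in for the relational adjacency table; the label-to-column mapping is a hypothetical stand-in for the coloring-based hash of [1], and target ids other than 2 and 3 are invented for illustration):</p>

```python
# hypothetical label -> column assignment; a conflict-free hash would map
# every label of a vertex to a distinct column triple ("knows"/"likes" collide)
def column_of(label: str) -> int:
    return {"knows": 0, "likes": 0, "creator_of": 1}[label]

# outgoing adjacency rows of vertex 1: each row holds k = 2 column triples
# of (edge-id array, label, target-vertex array), or None for an empty slot
outgoing = {
    1: [
        [([4, 5], "knows", [2, 3]), ([11], "creator_of", [7])],
        [([12], "likes", [9]), None],  # extra row forced by the collision
    ],
}

def out_neighbours(vid: int, label: str) -> list[int]:
    # a neighbourhood query scans the (few) rows of one vertex; with a
    # conflict-free hash a single tuple answers it, no outer join needed
    col = column_of(label)
    hits = []
    for row in outgoing.get(vid, []):
        slot = row[col]
        if slot is not None and slot[1] == label:
            hits.extend(slot[2])
    return hits
```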
        <p>EID1 | Label1 | EIDk | Labelk
[4, 5] | knows | [11] | creator of
[12] | likes | null | null
[6] | likes | null | null</p>
      </sec>
      <sec id="sec-4-4">
        <title>Query Structure and Evaluation</title>
        <p>An example of the general query structure implemented in the PostgreSQL dialect is depicted in Listing 1. The first common table expression, unshred edges, converts the list of tuples belonging to one vertex, of which every row contains k edge types, into a list of rows that contain one edge type per row. The second common table expression, gather edges, then splits those tuples into a list of edges that represents a well-known edge structure.</p>
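<p>Listing 1 itself is not reproduced in this extract; the following sketch mimics in Python what the two common table expressions do under the assumptions above: the first step flattens the k column triples of each row into one edge type per row, the second unnests the arrays into single edges:</p>

```python
# one shredded adjacency row: (vid, [k slots of (eids, label, targets) or None])
rows = [
    (1, [([4, 5], "knows", [2, 3]), ([11], "creator_of", [7])]),
    (1, [([12], "likes", [9]), None]),
]

def unshred_edges(rows):
    # one (vid, eids, label, targets) tuple per non-empty column triple
    for vid, slots in rows:
        for slot in slots:
            if slot is not None:
                eids, label, targets = slot
                yield vid, eids, label, targets

def gather_edges(unshredded):
    # unnest the arrays into a conventional (source, eid, label, target) list
    for vid, eids, label, targets in unshredded:
        for eid, target in zip(eids, targets):
            yield (vid, eid, label, target)

edges = list(gather_edges(unshred_edges(rows)))
```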
        <p>
          Analogous to the storage of vertex attributes, we store edge attributes. As in the original schema, we do not only store attributes in this table, but also additional information about the edges: namely, the source vertex, the target vertex and the label of the edge are recorded. In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] it is claimed that queries that need to trace along a variable number of edges perform much more efficiently when the edge table is used instead of the adjacency tables. Our preliminary evaluations support this claim.
        </p>
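<p>The role of the edge table for variable-length traces can be illustrated with a sketch (an iterative Python equivalent of a recursive SQL query over a hypothetical edge table):</p>

```python
# edge table rows: (eid, source, target, label) -- plus an attribute column
edge_table = [
    (4, 1, 2, "knows"),
    (5, 2, 3, "knows"),
    (6, 3, 4, "knows"),
]

def reachable(start: int, label: str, max_hops: int) -> set[int]:
    # iterative equivalent of a recursive CTE: expand the frontier hop by hop
    frontier, seen = {start}, set()
    for _ in range(max_hops):
        frontier = {t for (_, s, t, l) in edge_table
                    if s in frontier and l == label} - seen
        seen |= frontier
    return seen
```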
        <p>
          Our goal was to compare the read-only query performance of our adaptation with the original SQLGraph schema. To this end we used part of the Interactive Workload of the LDBC-SNB [
          <xref ref-type="bibr" rid="ref10 ref2 ref4">2, 4, 10</xref>
          ].
        </p>
        <p>
          We generated different data set sizes using the generator and performed tests with data sets that fit into the main memory of the system as well as data set sizes that are too huge to fit into main memory. To confirm the requirement of redundant representation of edges, we defined path queries with different fixed lengths and also path queries of variable lengths. After we confirmed the necessity of the edge tables, we also chose several queries provided by the Interactive Workload of the LDBC-SNB to evaluate the performance of our adjacency implementation against the original schema. We chose test queries applying the following criteria: The main pattern of the query uses paths of fixed length and edges of known label. This criterion is desired, because preliminary evaluations (see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]) have shown that in other cases the use of the attribute table performs better than the adjacency tables.
        </p>
        <p>The set of queries requires the use of incoming, outgoing and undirected edges to cover all combinations of the adjacency tables.</p>
        <p>
          Different lengths of paths are contained in the set of queries to evaluate whether the difference in path length makes one of the two schema versions preferable. Queries with big intermediate result sets are part of the query set, since [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] states this type of query as a potential bottleneck of the original approach.
        </p>
        <p>The parameters required by the queries were randomly chosen from the generated data set files and generated parameters. The same parameters were used for both approaches.</p>
        <p>To make the performance of the two schemas as comparable as possible, we first implemented all queries for the adapted schema. Then we replaced only the necessary subqueries that concern the differences between the schemas, changing as little of the query as possible. We confirmed the correctness of our query implementations by comparing the results with results returned by a reference implementation provided for Neo4j (https://github.com/PlatformLab/ldbc-snbimpls/tree/master/snb-interactive-neo4j), whose correctness has been validated.</p>
        <p>
          We evaluated the previously chosen queries on the same hardware using the same data set. The evaluation was conducted on a dedicated server with two Intel Xeon 2.4GHz CPUs (16 cores in total), 64GB memory and a single 240GB SSD running 64-bit Linux. We used PostgreSQL 10 on the aforementioned server, while we ran the client program of the benchmark on a standard laptop that connected to the database over LAN. All queries that were performed for performance evaluation purposes were preceded by several warm-up queries, as in most state-of-the-art benchmarks and as advised in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>All examples previously shown in this paper are a simplified version of the test data provided by the LDBC-SNB data generator.</p>
        <p>
          Our findings support the need for redundant storage of edge data in the edge attributes table as well as the adjacency table, as claimed by [
          <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
          ]. We find a general tendency for queries to perform significantly better with the use of adjacency tables if the queried path is of fixed length. While this is true for paths of fixed length, the opposite holds for paths of variable length. This type of query requires recursive SQL queries. Recursive queries with the use of the edge attributes table outperform any recursive query that uses adjacency tables. In her Master's thesis Kornev [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] conducted an extensive evaluation that shows when to use the edge attributes table in favor of adjacency lists.
        </p>
        <p>[Figure 2: required disk space in GB.]</p>
        <sec id="sec-4-4-1">
          <title>Required Disk Space</title>
          <p>Since we reduced the number of tables, we expected a reduction in required storage capacity. To this end we imported different sizes of generated data sets provided by the LDBC-SNB data set generator. The generator creates data sets according to an input scale factor (1, 3, 10, 30, ...). We then sampled the required storage capacity using PostgreSQL functions. The numbers shown in Figure 2 represent the complete graph schema, including all indexes and constraints. Due to hardware constraints we could not yet import data sets with a scale factor greater than 10. We will address higher scale factors in future work. Nevertheless our findings show that our adaptation of the schema reduces the required storage space by over 10%. This is mainly due to the reduced overhead from removing one table and the associated index structure that is needed to efficiently join the OPA/IPA and OSA/ISA tables.</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>Query Performance</title>
          <p>
            We expected a reduction in query execution time for retrieval queries that use the adjacency lists. To confirm this we evaluated our approach with the use of queries defined by the Interactive Workload. First we chose four queries of the short query set. This type of query usually requires the system to evaluate a relatively low amount of vertices and edges to compute the answer, typically the neighbours of one entity of the data set [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. The results shown here were obtained against the generated data set with scale factor 10. This data set size does not fit in the main memory of the server.
          </p>
          <p>[Figure 3: query execution time in ms.]</p>
          <p>Figure 3 shows the results of the conducted tests in a boxplot. The upper bound of the box shows the 75th percentile, the line within the box shows the 50th percentile and the lower bound of the box shows the 25th percentile, while the whiskers show the maximal and minimal execution times respectively. The results of the simpler queries confirm our expectation regarding execution time. The schema adaptation improved query performance across the board for these queries. As expected, the more edge hops a query performs, the bigger the difference in execution time we can observe. Note that short query 3 (SQ3) uses an undirected edge. Since our schema does not directly support undirected edges, this query results in a SQL query that uses both the incoming adjacency and the outgoing adjacency tables and is therefore similar to a 2-edge-hop.</p>
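<p>The handling of the undirected edge in SQ3 can be sketched as follows (hypothetical minimal lookup structures, not our actual SQL):</p>

```python
# minimal stand-ins for the outgoing and incoming adjacency lookups
outgoing = {1: {"knows": [2, 3]}}
incoming = {1: {"knows": [5]}}

def undirected_neighbours(vid: int, label: str) -> set[int]:
    # one lookup per direction -- roughly the cost of a 2-edge-hop
    return set(outgoing.get(vid, {}).get(label, [])) | \
           set(incoming.get(vid, {}).get(label, []))
```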
          <p>
            We then additionally conducted experiments on four complex read-only queries of the Interactive Workload. Complex queries touch a significant amount of data and often include aggregation [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. The chosen queries differ in the number of required edge-hops, with CQ 12 containing the highest number of hops.
          </p>
          <p>
            Once again our findings confirm an increase in read performance of our schema adaptation. Considering complex queries, the difference in execution time becomes even more apparent, which confirms the finding in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] that big intermediate join results, which are a point of focus for the set of complex queries, can be a bottleneck for their approach. Our approach significantly increases query performance for these types of queries.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. CONCLUSION</title>
      <p>In this paper we have described our approach to store property graph data in a relational database. Our approach reduces the number of joins that are required for queries that contain paths of fixed length compared to earlier approaches. We have conducted a preliminary evaluation of our approach using part of a standardized benchmark for property graphs, namely the LDBC-SNB. The evaluation results show that the adaptation is a more efficient version of the database schema in regards to read-only queries. Due to the reduction in the number of required tables and therefore also index structures, we were also able to reduce the required disk space.</p>
      <p>[Figure: execution times for the complex queries CQ2, CQ5, CQ8 and CQ12, LDBC SNB queries at scale factor 10.]</p>
      <p>
        Therefore we have found an approach that enables us to efficiently store the property graph data from our use-case in a relational database and link it with the already existing relational data. In addition, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] shows an approach to store RDF data in a property graph model. Thus, we can efficiently integrate all data of our use-case in a single relational database.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5. FUTURE WORK</title>
      <p>Future evaluations will have to show the validity of the approach compared to native graph database systems like Neo4j. Since most NoSQL and graph database systems focus on main memory computation, we expect those to perform comparably or better than our approach on data sets that can be handled in main memory. On the other hand, we expect our approach to outperform non-relational systems on data sets that require regular disk accesses due to their size. These points will be addressed in future evaluations.</p>
      <p>Additionally, we will evaluate the approach more extensively using the complete LDBC-SNB Interactive Workload on bigger data sets and more realistic server hardware. The workload also contains queries that insert data. We expect updates to also perform efficiently.</p>
      <p>
        One major drawback of this approach is the complexity of the queries required to retrieve the data. To that end, systems like Neo4j offer a very abstract and convenient way to query data with graph query languages like Cypher [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We propose a translation mechanism that converts Cypher queries to SQL queries that can be evaluated by our implementation on PostgreSQL. Kornev [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] has already shown that a translation of Cypher queries to SQL is possible. Unfortunately the translated queries do not in all cases achieve the efficiency that can be achieved with queries written by an expert. Therefore more optimization possibilities for the translation mechanism need to be explored. This will be part of our future research.
      </p>
    </sec>
    <sec id="sec-7">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>This research was partially supported by the "Bayrische Staatsministerium für Wirtschaft, Energie und Technologie" in the context of "BaBeDo" (IUK568/001), which we conduct in partnership with VEIT Energie GmbH.</p>
    </sec>
    <sec id="sec-8">
      <title>APPENDIX A. EVALUATION QUERIES: INTERACTIVE WORKLOAD</title>
      <p>Cypher representations of the evaluation queries:</p>
      <p>SQ 1:
MATCH (n:Person {id: {id}})-[:IS_LOCATED_IN]-&gt;(p:Place)
RETURN
  n.firstName AS firstName, n.lastName AS lastName,
  n.birthday AS birthday, n.locationIP AS locationIp,
  n.browserUsed AS browserUsed, n.gender AS gender,
  n.creationDate AS creationDate, p.id AS cityId</p>
      <p>SQ 3:
MATCH (n:Person {id: {id}})-[r:KNOWS]-(friend)
RETURN
  friend.id AS personId, friend.firstName AS firstName,
  friend.lastName AS lastName, r.creationDate AS friendshipCreationDate
ORDER BY friendshipCreationDate DESC, toInt(personId) ASC;</p>
      <p>SQ 5:
MATCH (m:Message {id: {id}})-[:HAS_CREATOR]-&gt;(p:Person)
RETURN
  p.id AS personId, p.firstName AS firstName,
  p.lastName AS lastName;</p>
      <p>SQ 7:
MATCH (m:Message {id: {id}})&lt;-[:REPLY_OF]-(c:Comment)-[:HAS_CREATOR]-&gt;(p:Person)
OPTIONAL MATCH (m)-[:HAS_CREATOR]-&gt;(a:Person)-[r:KNOWS]-(p)
RETURN
  c.id AS commentId, c.content AS commentContent,
  c.creationDate AS commentCreationDate, p.id AS replyAuthorId,
  p.firstName AS replyAuthorFirstName, p.lastName AS replyAuthorLastName,
  CASE r
    WHEN null THEN false
    ELSE true
  END AS replyAuthorKnowsOriginalMessageAuthor
ORDER BY commentCreationDate DESC, toInt(replyAuthorId) ASC;</p>
      <p>CQ 2:
MATCH (:Person {id: {1}})-[:KNOWS]-(friend:Person)&lt;-[:HAS_CREATOR]-(message)
WHERE message.creationDate &lt;= {2} AND (message:Post OR message:Comment)
RETURN
  friend.id AS personId, friend.firstName AS personFirstName,
  friend.lastName AS personLastName, message.id AS messageId,
  CASE exists(message.content)
    WHEN true THEN message.content
    ELSE message.imageFile
  END AS messageContent,
  message.creationDate AS messageDate
ORDER BY messageDate DESC, toInt(messageId) ASC
LIMIT {3};</p>
      <p>CQ 5:
MATCH (person:Person {id: {1}})-[:KNOWS*1..2]-(friend:Person)&lt;-[membership:HAS_MEMBER]-(forum:Forum)
WHERE membership.joinDate &gt; {2} AND not(person=friend)
WITH DISTINCT friend, forum
OPTIONAL MATCH (friend)&lt;-[:HAS_CREATOR]-(post:Post)&lt;-[:CONTAINER_OF]-(forum)
WITH forum, count(post) AS postCount
RETURN
  forum.title AS forumName, postCount
ORDER BY postCount DESC, toInt(forum.id) ASC
LIMIT {3};</p>
      <p>CQ 8:
MATCH (start:Person {id: {1}})&lt;-[:HAS_CREATOR]-()&lt;-[:REPLY_OF]-(comment:Comment)-[:HAS_CREATOR]-&gt;(person:Person)
RETURN
  person.id AS personId, person.firstName AS personFirstName,
  person.lastName AS personLastName, comment.creationDate AS commentCreationDate,
  comment.id AS commentId, comment.content AS commentContent
ORDER BY commentCreationDate DESC, toInt(commentId) ASC
LIMIT {2};</p>
      <p>CQ 12:
MATCH (:Person {id: {1}})-[:KNOWS]-(friend:Person)
OPTIONAL MATCH (friend)&lt;-[:HAS_CREATOR]-(comment:Comment)-[:REPLY_OF]-&gt;(:Post)-[:HAS_TAG]-&gt;(tag:Tag),
  (tag)-[:HAS_TYPE]-&gt;(tagClass:TagClass)-[:IS_SUBCLASS_OF*0..]-&gt;(baseTagClass:TagClass)
WHERE tagClass.name = {2} OR baseTagClass.name = {2}
RETURN
  friend.id AS friendId, friend.firstName AS friendFirstName,
  friend.lastName AS friendLastName, collect(DISTINCT tag.name),
  count(DISTINCT comment) AS count
ORDER BY count DESC, toInt(friendId) ASC
LIMIT {3};</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Bornea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dolby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dantressangle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Udrea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhattacharjee</surname>
          </string-name>
          .
          <article-title>Building an efficient RDF store over a relational database</article-title>
          .
          <source>pages 121–132</source>
          , 06
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <article-title>LDBC: benchmarks for graph and RDF data management</article-title>
          .
          <source>In 17th International Database Engineering &amp; Applications Symposium</source>
          , IDEAS '13,
          Barcelona
          , Spain - October 09 -
          <issue>11</issue>
          ,
          <year>2013</year>
          , pages
          1–2
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dominguez-Sal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Martinez-Bazan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muntes-Mulero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Baleta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Larriba-Pey</surname>
          </string-name>
          .
          <article-title>A discussion on the design of graph database benchmarks</article-title>
          .
          <source>In Technology Conference on Performance Evaluation and Benchmarking</source>
          , pages
          <volume>25</volume>
          –
          <fpage>40</fpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Erling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Averbuch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Larriba-Pey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gubichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-D.</given-names>
            <surname>Pham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <article-title>The LDBC social network benchmark: Interactive workload</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15</source>
          , pages
          <fpage>619</fpage>
          -
          <lpage>630</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Guagliardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lindaaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Marsault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Plantikow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rydberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Selmer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Cypher: An evolving query language for property graphs</article-title>
          .
          <source>In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18</source>
          , pages
          <fpage>1433</fpage>
          -
          <lpage>1445</lpage>
          , New York, NY, USA,
          <year>2018</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Kornev</surname>
          </string-name>
          . Cypher fur sqlgraph.
          <source>Master's thesis</source>
          , University of Passau,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poess</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Floyd</surname>
          </string-name>
          .
          <article-title>New TPC benchmarks for decision support and web commerce</article-title>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>29</volume>
          (
          <issue>4</issue>
          ):
          <fpage>64</fpage>
          -
          <lpage>71</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          .
          <article-title>The Gremlin graph traversal machine and language (invited talk)</article-title>
          .
          <source>In Proceedings of the 15th Symposium on Database Programming Languages, DBPL 2015</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fokoue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <article-title>SQLGraph: An efficient relational-based property graph store</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <fpage>1887</fpage>
          -
          <lpage>1901</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Szarnyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prat-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Averbuch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paradies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Erling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Boncz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Haprian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Antal</surname>
          </string-name>
          .
          <article-title>An early look at the LDBC social network benchmark's business intelligence workload</article-title>
          .
          <source>In Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences &amp; Systems (GRADES) and Network Data Analytics (NDA)</source>
          , Houston, TX, USA, June 10, 2018, pages
          <fpage>9:1</fpage>
          -
          <lpage>9:11</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>