Large Scale Querying and Processing for Property Graphs PhD Symposium∗ Mohamed Ragab Data Systems Group, University of Tartu Tartu, Estonia mohamed.ragab@ut.ee ABSTRACT Recently, large scale graph data management, querying and pro- cessing have experienced a renaissance in several timely applica- tion domains (e.g., social networks, bibliographical networks and knowledge graphs). However, these applications still introduce new challenges with large-scale graph processing. Therefore, recently, we have witnessed a remarkable growth in the preva- lence of work on graph processing in both academia and industry. Querying and processing large graphs is an interesting and chal- lenging task. Recently, several centralized/distributed large-scale graph processing frameworks have been developed. However, they mainly focus on batch graph analytics. On the other hand, the state-of-the-art graph databases can’t sustain for distributed Figure 1: A simple example of a Property Graph efficient querying for large graphs with complex queries. In par- ticular, online large scale graph querying engines are still limited. In this paper, we present a research plan shipped with the state- graph data following the core principles of relational database of-the-art techniques for large-scale property graph querying and systems [10]. Popular Graph databases include Neo4j1 , Titan2 , processing. We present our goals and initial results for querying ArangoDB3 and HyperGraphDB4 among many others. and processing large property graphs based on the emerging and In general, graphs can be represented in different data mod- promising Apache Spark framework, a defacto standard platform els [1]. In practice, the two most commonly-used graph data for big data processing. In principle, the design of this research models are: Edge-Directed/Labelled graph (e.g. Resource Descrip- plan is revolving around two main goals. The first goal focuses on tion Framework (RDF5 )) for representing data in triples (Subject, designing an adequate and efficient graph-based storage backend Predicate, and Object), and the Property Graph (PG) data model [9]. that can be integrated with the Apache Spark framework. The The PG model extends edge-directed/labelled graphs by adding second goal focuses on developing various Graph-aware opti- (multiple) labels for the nodes and types for the edges, as well as mization techniques (e.g., graph indexing, graph materialized adding (multiple) key-value proprieties for both nodes and edges views), and extending the default relational Spark Catalyst op- of the graph. In this paper, we focus on the PG model, as it is timizer with Graph-aware cost-based optimizations. Achieving currently the most widely used and supported graph data model these contributions can significantly enhance the performance in industry as well as in the academia. In particular, most of the of executing graph queries on top of Apache Spark. current top and widely used graph databases use the property graph data model [8]. This great success and wide spread of PG model is due to its great balance between conceptual and intuitive 1 INTRODUCTION simplicity, in addition to its rich expressiveness [20]. Figure 1 Graphs are everywhere. They are intuitive and rich data mod- illustrates an example of a simple property graph. els that can represent strong connectivity within the data. Due Recently, several graph query languages have been proposed to their rich expressivity, graphs are widely used in several ap- to support querying different kinds of graph data models [1]. For plication domains including the Internet of Things (IoT), social example, the W3C community has developed and standardized networks, knowledge graphs, transportation networks, Semantic SPARQL, a query language for querying the RDF-typed graphs [16]. Web, and Linked Open Data (LOD) among many others [17]. In Gremlin [19] has been proposed as a functional programming principle, graph processing is not a new problem. However, re- graph query language that supports the property graph model, cently, it gained an increasing attention and momentum, more and optimized for supporting graph navigational/traversal queries. than before, in several timely applications [22]. This is due to the Oracle has designed PGQL [21], an SQL-like graph query lan- ongoing huge explosion in graph data alongside with a great avail- guage which also supports querying the property graph data ability of computational power to process this data. Nowadays, model. Facebook also presented GraphQl [13], a REST-API like several enterprises have or planned to use graph technologies for graph query language for accessing the web data as a graph their data storage and processing applications. Moreover, Graph of objects. Neo4j designed Cypher [9] as its main query lan- databases are currently widely used in the industry to manage guage which targets querying the property graph data model in a natural and intuitive way. In practice, Cypher is currently the ∗ The supervisor of this work is Sherif Sakr 1 https://neo4j.com/ 2 https://github.com/thinkaurelius/titan © Copyright 2020 for this paper held by its author(s). Published in the proceedings of 3 https://www.arangodb.com/ DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 4 http://www.hypergraphdb.org/ 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 5 https://www.w3.org/RDF/ most popular graph query language, and has been supported by core for improving query executions [4]. Catalyst optimizer is several other graph-based projects and graph databases includ- mainly a rule-based optimizer, adding optimization rules based on ing SAP HANA6 , RedisGraph7 , Agens Graph8 , MemGraph9 , and functional programming constructs in Scala language. It can also Morpheus10 (Cypher for Apache Spark) [7]. apply variety of relational cost-based optimizations for improving Problem Statement: With the continuous increase in graph the quality of multiple alternative query execution plans. data, processing large graphs introduces several challenges and GraphX extends the low level RDD abstraction, and introduces detrimental issues to the performance of graph-based applica- a new abstraction called Resilient Distributed Graphs (RDG). In tions [12]. In particular, one of the common challenges of large- a graph, RDG relates records with vertices and edges and pro- scale graph processing is the efficient evaluation of graph queries. duces an expressive computational primitives’ collection. GraphX In particular, the evaluation of a graph query mainly depends on chains the benefits of graph-parallel and data-parallel systems. the graph scope (i.e. the number of nodes and edges it touches) [20]. However, GraphX is not currently actively maintained. Besides, Therefore, real-world complex graph queries may unexpectedly GraphX is based on the low level RDGs. Thus, it cannot exploit take a long time to be answered [18]. In practice, most of cur- the Spark 2’s Catalyst query optimizer that supports only Spark rent graph databases architecture are typically designed to work DataFrames API. Moreover, GraphX is only available to Scala on a single-machine (non-clustered). Therefore, graph querying users. solutions can only handle the Online Transactional Processing GraphFrames is a graph package built on DataFrames [5]. (OLTP-style) query workload, which defines relatively simple GraphFrames benefits from the scalability and high performance computational retrievals on a limited subset of the graph data. of DataFrames. They provide a uniform API for graph processing For instance, Neo4j is optimized for subgraph traversals and for available from Scala, Java, and Python. GraphFrames API imple- medium-sized OLTP query workloads. Whereas, for complex On- ments DataFrame-based graph algorithms, and also incorporates line Analytical Processing (OLAP-style) query workload (where simple graph pattern matching with fixed length patterns (called the query needs to touch huge parts of the graph, and complex ’motifs’). Although GraphFrames are based on Spark DataFrames joins and aggregations are required), graph databases are not the API, they have a semantically-weak graph data model (i.e. based best solution. on un-typed edges and vertices). Moreover, The motif pattern In this paper, we provide an overview of the current state-of- matching facility is very limited in comparison to other well- the-art efforts in solving the large scale graph querying along established graph query languages like Cypher. Besides, other side with their limitations (Section 2). We present our planned important features which are present in Spark GraphX such as contributions based on one of the emerging distributed process- partitioning are missing in the GraphFrames package. ing platforms for querying large graph data, Morpheus11 (Section In practice, by default, Spark does not support processing 3). We present our initial results in Section 4, before we conclude and querying of property graph data model, despite is wide- the paper in Section 5. spread use. To this extent, the Morpheus project has come to the scene. In particular, the Morpheus project has been designed to 2 STATE OF THE ART enable the evaluation of Cypher over large property graphs using Distributed processing frameworks can be utilized to solve the DataFrames on top of Apache Spark framework. In practice, this graph scalability issues with query evaluation. Apache Spark framework enables combining the scalability of the Spark frame- represents the defacto standard for distributed big data process- work with the features and capabilities of Neo4j by enabling ing [2]. Unlike MapReduce model, Spark uses the main mem- the Cypher language to be integrated into the Spark analytics ory for parallel computations over large datasets. Thus, it can pipeline. Interestingly, graph processing and querying can be be up to 100 times faster than Hadoop [24]. Spark maintains then easily interwoven with other Spark processing analytics this level of efficiency due to its core data abstraction which is libraries such as Spark GraphX, Spark ML or Spark-SQL. More- known as Resilient Distributed Datasets(RDDs). An RDD is an over, this enables easy merging of graphs from Morpheus into immutable, distributed and fault tolerant collection of data el- Neo4j. Besides more advanced capabilities of Morpheus such as ements which can be partitioned across the memory of nodes the ability to handle multiple graphs (i.e. graph Composability) in the cluster. Another efficient data abstraction of Spark is the from different data sources even if they are not graph sources (i.e Spark DataFrames. DataFrames are organized according to a spe- relational data sources), it has the ability to create graph views cific schema into named and data-typed columns like a table in on the data as well. the relational databases. Spark proposes various higher level li- Figure 2 illustrates the architecture of Morpheus framework. braries on top of RDDs and DataFrames abstractions, GraphX [11] In Morpheus, Cypher queries are translated into Abstract Syntax and SparkSQL [3] for processing structured and semi-structured Tree (AST ). Then, Morpheus core system translates this AST into large data. DataFrame operations with schema and Data-Type handling. It Spark-SQL is a high-level library for processing structured as is worth noting that DataFrames in Spark use schema, while well as semi-structured large datasets. It enables querying these Neo4j or generally property graphs optionally use a schema (i.e. datasets stored in DataFrames abstraction using SQL. Spark-SQL schema free data model). Therefore, Morpheus provides a Graph acts as a distributed SQL query engine over large structured Data Definition Language GDDL for handling schema mapping. datasets. In addition, SparkSQL offers a Catalyst optimizer in its Particularly, GDDL expresses property graph types and maps between those types and the relational data sources. Moreover, 6 https://s3.amazonaws.com/artifacts.opencypher.org/website/ocim1/slides/ the Morpheus core system manages importing graph data that Graph+Pattern+Matching+in+SAP+HANA.pdf can reside in different Spark storage backends such as HDFS (i.e. 7 https://oss.redislabs.com/redisgraph/ 8 https://bitnine.net/agensgraph/ in different file formats), Hive, relational databases using JDBC, 9 https://memgraph.com/ or Neo4j (i.e. Morpheus Property graph data sources PGDs), and 10 https://github.com/opencypher/morpheus exporting these property graphs directly back to those Spark 11 https://github.com/opencypher/morpheus storage backends. This interestingly means that graph data can Figure 3: Comparison of Spark PG storage backends insights for the subsequent optimizations (i.e. graph Indexing and Partitioning). Further, figuring out the top performing backends helps optimizing the query plan afterwards. For instance, if the best performing storage backend is a Columnar-Oriented backend (e.g. ORC), it is better for making more pushing projections down in the query plan. Whereas, if it is a Row-Oriented backend, it is Figure 2: Morpheus Architecture better to make more pushing selections down in the plan. RQ2: Graph Indexes (How can we use Graph Indexing for Better Performance?): The default method for processing be read from these native Spark data sources without altering nor graph queries is to perform a subgraph matching search against copying the original data sources to Morpheus. Particularly, it is the graph dataset [14]. Several graph indexing techniques have like you plug-in these storage backends to the Morpheus frame- been proposed in the literature. In practice, building a graph work as shown in Figure 2. The native Spark Catalyst optimizer index is a multi-faceted process. That is, it depends on using is used in Morpheus pipeline for making various core query op- the graph structural information for enumerating and extracting timizations for the generated relational plan of operations. Last the most frequent features (i.e. graph sub-structures), and then but not least, Morpheus runs these optimized query plans on the building a data structure of these features. These data structures Spark cluster using distributed Spark runtime environment. are such as Hash Tables , Lattices, Tries or Trees [14]. The indexed features can be in the form of simple graph patterns/paths, trees, 3 RESEARCH PLAN graphs, or a mix of graphs and trees. Further, selecting these In general, Morpheus has been designed to enable executing features to be indexed can be done exhaustively via enumerating Cypher queries on top of Spark. However, on the backend, the all such features across the whole graph data set [15], or via property graphs are represented and maintained using Spark mining the graph data set for frequent patterns or features (i.e. relational DataFrames. Therefore, Cypher graph-based queries Discriminative Features) [23, 26]. This mining is recommended are internally translated into relational operations over these in building a graph index as the size of the created graph index DataFrames. Therefore, Spark still performs operations on the should be reasonable. It is also worth noting that, most of the property graph as tabular data and views with specified schema. existing graph indexing algorithms are only able to handle un- Thus, adding a graph-aware optimization layer for Spark can directed graphs with labelled vertices [14]. significantly enhance the performance of graph query execution Currently, Morpheus doesn’t use any indexing mechanism for on the property graph data. In this research plan, we are focus- property graphs while executing graph queries. To this end, we ing on two main aspects for enhancing querying and processing aim to build an efficient indexing scheme of the property graphs large property graphs in the context of the Morpheus project. The in an offline mode (taking into consideration its schema, as well as first aspect is to design an efficient Spark-based storage backend its storage backend). Then, this index will be used for reducing the for persisting property graphs. The other aspect is to provide search space for the complex graph pattern matching task. Thus, graph-aware optimizations for query processing inside Morpheus consulting this index for executing the graph query workload is such as Graph indexing, Graph Materialized Views, and last but better than exhaustive vertex-to-vertex correspondence checking not least graph cost-based optimizations on top of the default from the query to the graph which involves a lot of expensive join Spark Catalyst optimizer. In order to achieve these aspects in operations in the relational representation. In our plan, we don’t our research plan, we focus on answering the following Research consider the overhead of updating the built index, as Morpheus is Questions (RQs): currently supporting only read operations, and thus no insertions RQ1: Graph Persistence(Which storage backend achieves and deletions happen to the already generated property graph. better performance ?): Large graphs require well-suited per- RQ3: Graph Materialized Views (How can we use Graph sistence solutions that are efficient for query evaluations and Views for better performance?): Morpheus and most of graph processing [20]. As mentioned earlier, graph data in Morpheus databases tend to compute each query from scratch without be- can settle on multiple different data sources such as HDFS with ing aware of the previous query workloads [25]. Particularly, its different file formats (e.g. Avro, Parquet, and CSV), Neo4j, if we repeatedly execute the same query using Morpheus, the Hive, or other kinds of relational DBs (Figure 3). Therefore, first execution plan always stays the same and yields no changes in of all, we need to investigate which Spark storage backend for the execution plan nor in time improvement. Moreover, Spark- the large property graph data is the best performing one in the SQL registered DataFrames are (by default) non-materialized context of Morpheus. Deciding on the best performing storage views [3]. Spark-SQL can materialize/cache DataFrames in mem- backend with large property graphs plays a major role in enhanc- ory, but this cannot well capture the graph structural information. ing the performance of Morpheus, and further gives us useful To this direction, we aim to provide a solution for this limitation, leveraging the potentials of graph materialized views and the on Intel(R) Core(TM) i5-8250U 1.60 GHzX64-based CPU and 24 previous graph query workload. In Particular, we aim to use the GB DDR3 of physical memory. We also used a 64GB virtual information from the previous query workload to list and mate- hard drive for our VM. We used Spark V2.3 parcel on Cloud- rialize the most frequent substructures and proprieties that will era VM to fully support Spark-SQL capabilities. We used the be stored (preferably in memory) for accelerating the incoming already installed Hive service on Cloudera VM (version:hive- graph queries. 1.1.0+cdh5.16.1+1431), and neo4j V3.5.8. It is worthy to note that, graph materialization has its side Benchmark Datasets: Using the LDBC SNB data generator 13 , effects regarding the memory space that you need to sacrifice we generated a graph data set (in CSV format) of Scale Factor for keeping those views. This also comes with another challenge (SF=1). We used this data to create a property graph in Neo4j concerning the selection of best proper views (i.e. graph sub- using Neo4j import tool14 . The generated property graph has structures and properties of interest) to materialize and keep in more than 3M nodes, and more than 17M relationships. We also memory [25]. This means that Materialization in our case, will created a graph of tables and views of the same schema inside take into consideration the graph structure. Thus materialization Hive. Further, we used Morpheus Framework to read this property will be only for specific ’frequent/hot’ sub-structures rather than graph either from Hive or Neo4j to store the same graph into materializing the entire query results. HDFS into Morpheus supported file formats (CSV, ORC, Parquet). RQ4: Graph Cost-Based Optimization (How can we use For both experiments (Micro and Macro Benchmarking), we graph CBO for better performance?): In general, Spark SQL run the experiments for all queries five times (excluding the first uses the Catalyst optimizer to optimize all the queries written cold-start run time, to avoid the warm-up bias, and computed an both in explicit SQL or in a DataFrame Domain Specific Language average of the other four run times). Notably, we take the (ln) (DSL). Basically, Catalyst is a Spark library built as a relational- function of average run times in the Macro-Benchmark experi- based optimization engine. Each rule in the rule-based part of ment15 . Catalyst focuses on a specific optimization. Catalyst can also Morpheus Macro-Benchmark: For the Macro Benchmark- apply various relational cost-based optimizations for improving ing experiment, we selected 21 BI queries (i.e. which are valid to the quality of multiple alternative query physical execution plans. run in the current Morpheus Cypher constructs)16 . The results of Although there are several efforts for optimizing the cost-based Figure 4 show that Hive has the lowest performance in general techniques in Spark-SQL such the work proposed recently in [6] for running most of the queries even those that are not complex optimizing (Generalized Projection/Selection/Join) queries, these with 70% of low performance than others. HDFS backends in gen- optimizations are not graph-aware cost optimizations. To this ex- eral outperform Neo4j and Hive with 100% of better performance. tent, providing Graph-aware Cost-Based-Optimizations (GCBOs) In particular, Parquet format in HDFS has the best performance. for selecting the best execution plan of the graph query (using It outperforms ORC and CSV format in most cases of running the best guess approach that takes into account the important graph queries with 42%. While both CSV and ORC achieve only 28.5% structural information/statistics about the graph dataset instead of higher performance. of basic relational statistics) will have better optimization and Morpheus Micro-Benchmark: In our Micro-benchmark ex- performance for addressing such graph queries in Spark. periment, we run 18 Atomic/micro level BI queries17 . The results To tackle this challenge, we aim to provide a graph-aware of Figure 5 show that Neo4j has the lowest performance in general query planner which will be implemented as a layer on top of the for running the first 12 queries with 66% of low performance than default Spark Catalyst for providing a GCBO query plan, taking others. Hive starts to perform worse than Neo4j (and all other into account the statistics of the property graph that resides in systems) only when the number of joins increase and sorting Morpheus storage backend. Particularly, the new graph plan- being applied on queries from Q13 to Q18. While, HDFS back- ner/optimizer can select the best join of tables order based on ends in general outperform Neo4j and Hive with 94.4% of better selectivity and cardinality estimations of graph patterns in the performance. In particular, Parquet format in HDFS has the best graph query for filter and join operators. Therefore, at the query performance, it outperforms ORC and CSV in most queries with time, the new GCBO can suggest a more optimized query plan 55%. While CSV and ORC only outperform with 22.2% and 16.6%, for the Catalyst to follow. respectively. 4 PRELIMINARY RESULTS 5 CONCLUSIONS AND FUTURE WORK In this section, we describe our initial experimental results for We are living in an era of continuous huge growth of more and answering RQ1. In particular, we have designed a set of micro and more connected data. Querying and processing large graphs is an macro benchmarking experiments for evaluating the performance interesting and challenging task. The Morpheus framework aims of different Spark storage backends supported by the Morpheus at integrating Cypher query language to work as a graph query framework. These storage backends are: Neo4j, Hive, and HDFS language on top of Spark. Morpheus translates Cypher queries with its different file formats (CSV,Parquet, and ORC). Notably, into relational Dataframes operations that can fit in the Spark- we don’t copy data from these Spark storage backends, we only SQL environment. Morpheus depends mainly on default Spark evaluate Morpheus performance with the data already resides in Catalyst optimizer for optimizing those relational operators. No them. We have used the Cypher LDBC Social Network Benchmark graph indexing nor graph materialized views are maintained (SNB) BI benchmark query workload12 . Our selected queries are in Morpheus or Spark SQL framework for optimizing property read only queries (i.e. no updates are supported by Morpheus). 13 https://github.com/ldbc/ldbc_snb_datagen Hardware and Software Configurations: Our experiments 14 https://neo4j.com/docs/operations-manual/current/tools/import/ have been performed on a Desktop PC running a Cloudera Vir- 15 The code and results of our initial experiments is available on https://github.com/ tual Machine (VM) v.5.13 with Centos v7.3 Linux system, running DataSystemsGroupUT/MorephusStorageBenchmarking 16 http://bit.ly/2W5b01N Macro LDBC SNB BI Queries 12 https://github.com/ldbc/ldbc_snb_implementations/tree/master/cypher/queries 17 http://bit.ly/2Pa4TrF Micro/Atomic level queries Figure 4: SNB-BI query results [7] Angela Bonifati, Peter Furniss, Alastair Green, Russ Harmer, Eugenia Oshurko, and Hannes Voigt. 2019. Schema validation and evolution for graph databases. arXiv preprint arXiv:1902.06427 (2019). [8] Hirokazu Chiba, Ryota Yamanaka, and Shota Matsumoto. 2019. Property Graph Exchange Format. arXiv preprint arXiv:1907.03936 (2019). [9] Nadime Francis et al. 2018. Cypher: An evolving query language for property graphs. In SIGMOD. [10] Abel Gómez, Amine Benelallam, and Massimo Tisi. 2015. Decentralized model persistence for distributed computing. [11] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. Graphx: Graph processing in a distributed dataflow framework. In 11th {USENIX } Symposium on Operating Systems Design and Implementation ( {OSDI } 14). 599–613. [12] Chuang-Yi Gui et al. 2019. A survey on graph processing accelerators: Chal- lenges and opportunities. Journal of Computer Science and Technology 34, 2 (2019). [13] Olaf Hartig and Jorge Pérez. 2018. Semantics and complexity of GraphQL. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 1155–1164. [14] Foteini Katsarou, Nikos Ntarmos, and Peter Triantafillou. 2015. Performance Figure 5: Atomic level query results and scalability of indexed subgraph query processing methods. Proceedings of the VLDB Endowment 8, 12 (2015), 1566–1577. [15] Karsten Klein, Nils Kriege, and Petra Mutzel. 2011. CT-index: Fingerprint- graph data querying and processing. In this PhD work, we focus based graph indexing combining cycles and trees. In ICDE. on tackling these challenges by designing an efficient storage [16] Egor V Kostylev et al. 2015. SPARQL with property paths. In ISWC. backend for persisting property graphs for Morpheus. In addition, [17] Yike Liu, Tara Safavi, Abhilash Dighe, and Danai Koutra. 2018. Graph summa- rization methods and applications: A survey. ACM Computing Surveys (CSUR) we aim at providing graph-aware techniques (e.g., indexes, materi- 51, 3 (2018), 62. alized views) for Spark to optimize the graph queries, in addition [18] Bingqing Lyu, Lu Qin, Xuemin Lin, Lijun Chang, and Jeffrey Xu Yu. 2016. Scalable supergraph search in large graph databases. In 2016 IEEE 32nd Inter- to other graph-aware CBO for Spark Catalyst Optiomizer. We national Conference on Data Engineering (ICDE). IEEE, 157–168. believe that achieving these contributions as our future research [19] Marko A Rodriguez. 2015. The Gremlin graph traversal machine and language plan can significantly enhance the performance of executing (invited talk). In PDPL. [20] Gábor Szárnyas. 2019. Query, analysis, and benchmarking techniques for graph queries using the Morpheus framework. evolving property graphs of software systems. (2019). [21] Oskar van Rest et al. 2016. PGQL: a property graph query language. In Proceedings of the Fourth International Workshop on Graph Data Management REFERENCES Experiences and Systems. [1] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter, [22] Da Yan, Yingyi Bu, Yuanyuan Tian, Amol Deshpande, et al. 2017. Big graph and Domagoj Vrgoč. 2017. Foundations of modern query languages for graph analytics platforms. Foundations and Trends® in Databases 7, 1-2 (2017), 1–195. databases. ACM Computing Surveys (CSUR) 50, 5 (2017), 68. [23] Xifeng Yan, Philip S Yu, and Jiawei Han. 2004. Graph indexing: a frequent [2] Michael Armbrust et al. 2015. Scaling spark in the real world: performance structure-based approach. In Proceedings of the 2004 ACM SIGMOD interna- and usability. PVLDB 8, 12 (2015). tional conference on Management of data. ACM, 335–346. [3] Michael Armbrust et al. 2015. Spark sql: Relational data processing in spark. [24] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and In SIGMOD. Ion Stoica. [n.d.]. Spark: Cluster computing with working sets. ([n. d.]). [4] Michael Armbrust, Doug Bateman, Reynold Xin, and Matei Zaharia. 2016. [25] Yan Zhang. 2017. Efficient Structure-aware OLAP Query Processing over Large Introduction to spark 2.0 for database researchers. In Proceedings of the 2016 Property Graphs. Master’s thesis. University of Waterloo. International Conference on Management of Data. ACM, 2193–2194. [26] Peixiang Zhao, Jeffrey Xu Yu, and Philip S Yu. 2007. Graph indexing: tree+ [5] Ramazan Ali Bahrami, Jayati Gulati, and Muhammad Abulaish. 2017. Effi- delta<= graph. In PVLDB. cient processing of SPARQL queries over graphframes. In Proceedings of the International Conference on Web Intelligence. ACM, 678–685. [6] Lorenzo Baldacci and Matteo Golfarelli. 2018. A Cost Model for SPARK SQL. IEEE Transactions on Knowledge and Data Engineering 31, 5 (2018), 819–832.