Large Scale Querying and Processing for Property Graphs
                         PhD Symposium∗
                                                                           Mohamed Ragab
                                                           Data Systems Group, University of Tartu
                                                                       Tartu, Estonia
                                                                   mohamed.ragab@ut.ee

ABSTRACT
Recently, large scale graph data management, querying and pro-
cessing have experienced a renaissance in several timely applica-
tion domains (e.g., social networks, bibliographical networks and
knowledge graphs). However, these applications still introduce
new challenges with large-scale graph processing. Therefore,
recently, we have witnessed a remarkable growth in the preva-
lence of work on graph processing in both academia and industry.
Querying and processing large graphs is an interesting and chal-
lenging task. Recently, several centralized/distributed large-scale
graph processing frameworks have been developed. However,
they mainly focus on batch graph analytics. On the other hand,
the state-of-the-art graph databases can’t sustain for distributed                             Figure 1: A simple example of a Property Graph
efficient querying for large graphs with complex queries. In par-
ticular, online large scale graph querying engines are still limited.
In this paper, we present a research plan shipped with the state-                        graph data following the core principles of relational database
of-the-art techniques for large-scale property graph querying and                        systems [10]. Popular Graph databases include Neo4j1 , Titan2 ,
processing. We present our goals and initial results for querying                        ArangoDB3 and HyperGraphDB4 among many others.
and processing large property graphs based on the emerging and                               In general, graphs can be represented in different data mod-
promising Apache Spark framework, a defacto standard platform                            els [1]. In practice, the two most commonly-used graph data
for big data processing. In principle, the design of this research                       models are: Edge-Directed/Labelled graph (e.g. Resource Descrip-
plan is revolving around two main goals. The first goal focuses on                       tion Framework (RDF5 )) for representing data in triples (Subject,
designing an adequate and efficient graph-based storage backend                          Predicate, and Object), and the Property Graph (PG) data model [9].
that can be integrated with the Apache Spark framework. The                              The PG model extends edge-directed/labelled graphs by adding
second goal focuses on developing various Graph-aware opti-                              (multiple) labels for the nodes and types for the edges, as well as
mization techniques (e.g., graph indexing, graph materialized                            adding (multiple) key-value proprieties for both nodes and edges
views), and extending the default relational Spark Catalyst op-                          of the graph. In this paper, we focus on the PG model, as it is
timizer with Graph-aware cost-based optimizations. Achieving                             currently the most widely used and supported graph data model
these contributions can significantly enhance the performance                            in industry as well as in the academia. In particular, most of the
of executing graph queries on top of Apache Spark.                                       current top and widely used graph databases use the property
                                                                                         graph data model [8]. This great success and wide spread of PG
                                                                                         model is due to its great balance between conceptual and intuitive
1     INTRODUCTION                                                                       simplicity, in addition to its rich expressiveness [20]. Figure 1
Graphs are everywhere. They are intuitive and rich data mod-                             illustrates an example of a simple property graph.
els that can represent strong connectivity within the data. Due                              Recently, several graph query languages have been proposed
to their rich expressivity, graphs are widely used in several ap-                        to support querying different kinds of graph data models [1]. For
plication domains including the Internet of Things (IoT), social                         example, the W3C community has developed and standardized
networks, knowledge graphs, transportation networks, Semantic                            SPARQL, a query language for querying the RDF-typed graphs [16].
Web, and Linked Open Data (LOD) among many others [17]. In                               Gremlin [19] has been proposed as a functional programming
principle, graph processing is not a new problem. However, re-                           graph query language that supports the property graph model,
cently, it gained an increasing attention and momentum, more                             and optimized for supporting graph navigational/traversal queries.
than before, in several timely applications [22]. This is due to the                     Oracle has designed PGQL [21], an SQL-like graph query lan-
ongoing huge explosion in graph data alongside with a great avail-                       guage which also supports querying the property graph data
ability of computational power to process this data. Nowadays,                           model. Facebook also presented GraphQl [13], a REST-API like
several enterprises have or planned to use graph technologies for                        graph query language for accessing the web data as a graph
their data storage and processing applications. Moreover, Graph                          of objects. Neo4j designed Cypher [9] as its main query lan-
databases are currently widely used in the industry to manage                            guage which targets querying the property graph data model in
                                                                                         a natural and intuitive way. In practice, Cypher is currently the
∗ The supervisor of this work is Sherif Sakr
                                                                                         1 https://neo4j.com/
                                                                                         2 https://github.com/thinkaurelius/titan
© Copyright 2020 for this paper held by its author(s). Published in the proceedings of
                                                                                         3 https://www.arangodb.com/
DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT
                                                                                         4 http://www.hypergraphdb.org/
2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution
4.0 International (CC BY 4.0).                                                           5 https://www.w3.org/RDF/
most popular graph query language, and has been supported by                core for improving query executions [4]. Catalyst optimizer is
several other graph-based projects and graph databases includ-              mainly a rule-based optimizer, adding optimization rules based on
ing SAP HANA6 , RedisGraph7 , Agens Graph8 , MemGraph9 , and                functional programming constructs in Scala language. It can also
Morpheus10 (Cypher for Apache Spark) [7].                                   apply variety of relational cost-based optimizations for improving
    Problem Statement: With the continuous increase in graph                the quality of multiple alternative query execution plans.
data, processing large graphs introduces several challenges and                GraphX extends the low level RDD abstraction, and introduces
detrimental issues to the performance of graph-based applica-               a new abstraction called Resilient Distributed Graphs (RDG). In
tions [12]. In particular, one of the common challenges of large-           a graph, RDG relates records with vertices and edges and pro-
scale graph processing is the efficient evaluation of graph queries.        duces an expressive computational primitives’ collection. GraphX
In particular, the evaluation of a graph query mainly depends on            chains the benefits of graph-parallel and data-parallel systems.
the graph scope (i.e. the number of nodes and edges it touches) [20].       However, GraphX is not currently actively maintained. Besides,
Therefore, real-world complex graph queries may unexpectedly                GraphX is based on the low level RDGs. Thus, it cannot exploit
take a long time to be answered [18]. In practice, most of cur-             the Spark 2’s Catalyst query optimizer that supports only Spark
rent graph databases architecture are typically designed to work            DataFrames API. Moreover, GraphX is only available to Scala
on a single-machine (non-clustered). Therefore, graph querying              users.
solutions can only handle the Online Transactional Processing                  GraphFrames is a graph package built on DataFrames [5].
(OLTP-style) query workload, which defines relatively simple                GraphFrames benefits from the scalability and high performance
computational retrievals on a limited subset of the graph data.             of DataFrames. They provide a uniform API for graph processing
For instance, Neo4j is optimized for subgraph traversals and for            available from Scala, Java, and Python. GraphFrames API imple-
medium-sized OLTP query workloads. Whereas, for complex On-                 ments DataFrame-based graph algorithms, and also incorporates
line Analytical Processing (OLAP-style) query workload (where               simple graph pattern matching with fixed length patterns (called
the query needs to touch huge parts of the graph, and complex               ’motifs’). Although GraphFrames are based on Spark DataFrames
joins and aggregations are required), graph databases are not the           API, they have a semantically-weak graph data model (i.e. based
best solution.                                                              on un-typed edges and vertices). Moreover, The motif pattern
    In this paper, we provide an overview of the current state-of-          matching facility is very limited in comparison to other well-
the-art efforts in solving the large scale graph querying along             established graph query languages like Cypher. Besides, other
side with their limitations (Section 2). We present our planned             important features which are present in Spark GraphX such as
contributions based on one of the emerging distributed process-             partitioning are missing in the GraphFrames package.
ing platforms for querying large graph data, Morpheus11 (Section               In practice, by default, Spark does not support processing
3). We present our initial results in Section 4, before we conclude         and querying of property graph data model, despite is wide-
the paper in Section 5.                                                     spread use. To this extent, the Morpheus project has come to the
                                                                            scene. In particular, the Morpheus project has been designed to
2     STATE OF THE ART                                                      enable the evaluation of Cypher over large property graphs using
Distributed processing frameworks can be utilized to solve the              DataFrames on top of Apache Spark framework. In practice, this
graph scalability issues with query evaluation. Apache Spark                framework enables combining the scalability of the Spark frame-
represents the defacto standard for distributed big data process-           work with the features and capabilities of Neo4j by enabling
ing [2]. Unlike MapReduce model, Spark uses the main mem-                   the Cypher language to be integrated into the Spark analytics
ory for parallel computations over large datasets. Thus, it can             pipeline. Interestingly, graph processing and querying can be
be up to 100 times faster than Hadoop [24]. Spark maintains                 then easily interwoven with other Spark processing analytics
this level of efficiency due to its core data abstraction which is          libraries such as Spark GraphX, Spark ML or Spark-SQL. More-
known as Resilient Distributed Datasets(RDDs). An RDD is an                 over, this enables easy merging of graphs from Morpheus into
immutable, distributed and fault tolerant collection of data el-            Neo4j. Besides more advanced capabilities of Morpheus such as
ements which can be partitioned across the memory of nodes                  the ability to handle multiple graphs (i.e. graph Composability)
in the cluster. Another efficient data abstraction of Spark is the          from different data sources even if they are not graph sources (i.e
Spark DataFrames. DataFrames are organized according to a spe-              relational data sources), it has the ability to create graph views
cific schema into named and data-typed columns like a table in              on the data as well.
the relational databases. Spark proposes various higher level li-              Figure 2 illustrates the architecture of Morpheus framework.
braries on top of RDDs and DataFrames abstractions, GraphX [11]             In Morpheus, Cypher queries are translated into Abstract Syntax
and SparkSQL [3] for processing structured and semi-structured              Tree (AST ). Then, Morpheus core system translates this AST into
large data.                                                                 DataFrame operations with schema and Data-Type handling. It
   Spark-SQL is a high-level library for processing structured as           is worth noting that DataFrames in Spark use schema, while
well as semi-structured large datasets. It enables querying these           Neo4j or generally property graphs optionally use a schema (i.e.
datasets stored in DataFrames abstraction using SQL. Spark-SQL              schema free data model). Therefore, Morpheus provides a Graph
acts as a distributed SQL query engine over large structured                Data Definition Language GDDL for handling schema mapping.
datasets. In addition, SparkSQL offers a Catalyst optimizer in its          Particularly, GDDL expresses property graph types and maps
                                                                            between those types and the relational data sources. Moreover,
6 https://s3.amazonaws.com/artifacts.opencypher.org/website/ocim1/slides/   the Morpheus core system manages importing graph data that
Graph+Pattern+Matching+in+SAP+HANA.pdf                                      can reside in different Spark storage backends such as HDFS (i.e.
7 https://oss.redislabs.com/redisgraph/
8 https://bitnine.net/agensgraph/
                                                                            in different file formats), Hive, relational databases using JDBC,
9 https://memgraph.com/                                                     or Neo4j (i.e. Morpheus Property graph data sources PGDs), and
10 https://github.com/opencypher/morpheus                                   exporting these property graphs directly back to those Spark
11 https://github.com/opencypher/morpheus
                                                                            storage backends. This interestingly means that graph data can
                                                                         Figure 3: Comparison of Spark PG storage backends


                                                                      insights for the subsequent optimizations (i.e. graph Indexing and
                                                                      Partitioning). Further, figuring out the top performing backends
                                                                      helps optimizing the query plan afterwards. For instance, if the
                                                                      best performing storage backend is a Columnar-Oriented backend
                                                                      (e.g. ORC), it is better for making more pushing projections down
                                                                      in the query plan. Whereas, if it is a Row-Oriented backend, it is
              Figure 2: Morpheus Architecture                         better to make more pushing selections down in the plan.
                                                                          RQ2: Graph Indexes (How can we use Graph Indexing
                                                                      for Better Performance?): The default method for processing
be read from these native Spark data sources without altering nor     graph queries is to perform a subgraph matching search against
copying the original data sources to Morpheus. Particularly, it is    the graph dataset [14]. Several graph indexing techniques have
like you plug-in these storage backends to the Morpheus frame-        been proposed in the literature. In practice, building a graph
work as shown in Figure 2. The native Spark Catalyst optimizer        index is a multi-faceted process. That is, it depends on using
is used in Morpheus pipeline for making various core query op-        the graph structural information for enumerating and extracting
timizations for the generated relational plan of operations. Last     the most frequent features (i.e. graph sub-structures), and then
but not least, Morpheus runs these optimized query plans on the       building a data structure of these features. These data structures
Spark cluster using distributed Spark runtime environment.            are such as Hash Tables , Lattices, Tries or Trees [14]. The indexed
                                                                      features can be in the form of simple graph patterns/paths, trees,
3   RESEARCH PLAN                                                     graphs, or a mix of graphs and trees. Further, selecting these
In general, Morpheus has been designed to enable executing            features to be indexed can be done exhaustively via enumerating
Cypher queries on top of Spark. However, on the backend, the          all such features across the whole graph data set [15], or via
property graphs are represented and maintained using Spark            mining the graph data set for frequent patterns or features (i.e.
relational DataFrames. Therefore, Cypher graph-based queries          Discriminative Features) [23, 26]. This mining is recommended
are internally translated into relational operations over these       in building a graph index as the size of the created graph index
DataFrames. Therefore, Spark still performs operations on the         should be reasonable. It is also worth noting that, most of the
property graph as tabular data and views with specified schema.       existing graph indexing algorithms are only able to handle un-
Thus, adding a graph-aware optimization layer for Spark can           directed graphs with labelled vertices [14].
significantly enhance the performance of graph query execution            Currently, Morpheus doesn’t use any indexing mechanism for
on the property graph data. In this research plan, we are focus-      property graphs while executing graph queries. To this end, we
ing on two main aspects for enhancing querying and processing         aim to build an efficient indexing scheme of the property graphs
large property graphs in the context of the Morpheus project. The     in an offline mode (taking into consideration its schema, as well as
first aspect is to design an efficient Spark-based storage backend    its storage backend). Then, this index will be used for reducing the
for persisting property graphs. The other aspect is to provide        search space for the complex graph pattern matching task. Thus,
graph-aware optimizations for query processing inside Morpheus        consulting this index for executing the graph query workload is
such as Graph indexing, Graph Materialized Views, and last but        better than exhaustive vertex-to-vertex correspondence checking
not least graph cost-based optimizations on top of the default        from the query to the graph which involves a lot of expensive join
Spark Catalyst optimizer. In order to achieve these aspects in        operations in the relational representation. In our plan, we don’t
our research plan, we focus on answering the following Research       consider the overhead of updating the built index, as Morpheus is
Questions (RQs):                                                      currently supporting only read operations, and thus no insertions
    RQ1: Graph Persistence(Which storage backend achieves             and deletions happen to the already generated property graph.
better performance ?): Large graphs require well-suited per-              RQ3: Graph Materialized Views (How can we use Graph
sistence solutions that are efficient for query evaluations and       Views for better performance?): Morpheus and most of graph
processing [20]. As mentioned earlier, graph data in Morpheus         databases tend to compute each query from scratch without be-
can settle on multiple different data sources such as HDFS with       ing aware of the previous query workloads [25]. Particularly,
its different file formats (e.g. Avro, Parquet, and CSV), Neo4j,      if we repeatedly execute the same query using Morpheus, the
Hive, or other kinds of relational DBs (Figure 3). Therefore, first   execution plan always stays the same and yields no changes in
of all, we need to investigate which Spark storage backend for        the execution plan nor in time improvement. Moreover, Spark-
the large property graph data is the best performing one in the       SQL registered DataFrames are (by default) non-materialized
context of Morpheus. Deciding on the best performing storage          views [3]. Spark-SQL can materialize/cache DataFrames in mem-
backend with large property graphs plays a major role in enhanc-      ory, but this cannot well capture the graph structural information.
ing the performance of Morpheus, and further gives us useful          To this direction, we aim to provide a solution for this limitation,
leveraging the potentials of graph materialized views and the                    on Intel(R) Core(TM) i5-8250U 1.60 GHzX64-based CPU and 24
previous graph query workload. In Particular, we aim to use the                  GB DDR3 of physical memory. We also used a 64GB virtual
information from the previous query workload to list and mate-                   hard drive for our VM. We used Spark V2.3 parcel on Cloud-
rialize the most frequent substructures and proprieties that will                era VM to fully support Spark-SQL capabilities. We used the
be stored (preferably in memory) for accelerating the incoming                   already installed Hive service on Cloudera VM (version:hive-
graph queries.                                                                   1.1.0+cdh5.16.1+1431), and neo4j V3.5.8.
   It is worthy to note that, graph materialization has its side                     Benchmark Datasets: Using the LDBC SNB data generator 13 ,
effects regarding the memory space that you need to sacrifice                    we generated a graph data set (in CSV format) of Scale Factor
for keeping those views. This also comes with another challenge                  (SF=1). We used this data to create a property graph in Neo4j
concerning the selection of best proper views (i.e. graph sub-                   using Neo4j import tool14 . The generated property graph has
structures and properties of interest) to materialize and keep in                more than 3M nodes, and more than 17M relationships. We also
memory [25]. This means that Materialization in our case, will                   created a graph of tables and views of the same schema inside
take into consideration the graph structure. Thus materialization                Hive. Further, we used Morpheus Framework to read this property
will be only for specific ’frequent/hot’ sub-structures rather than              graph either from Hive or Neo4j to store the same graph into
materializing the entire query results.                                          HDFS into Morpheus supported file formats (CSV, ORC, Parquet).
   RQ4: Graph Cost-Based Optimization (How can we use                                For both experiments (Micro and Macro Benchmarking), we
graph CBO for better performance?): In general, Spark SQL                        run the experiments for all queries five times (excluding the first
uses the Catalyst optimizer to optimize all the queries written                  cold-start run time, to avoid the warm-up bias, and computed an
both in explicit SQL or in a DataFrame Domain Specific Language                  average of the other four run times). Notably, we take the (ln)
(DSL). Basically, Catalyst is a Spark library built as a relational-             function of average run times in the Macro-Benchmark experi-
based optimization engine. Each rule in the rule-based part of                   ment15 .
Catalyst focuses on a specific optimization. Catalyst can also                       Morpheus Macro-Benchmark: For the Macro Benchmark-
apply various relational cost-based optimizations for improving                  ing experiment, we selected 21 BI queries (i.e. which are valid to
the quality of multiple alternative query physical execution plans.              run in the current Morpheus Cypher constructs)16 . The results of
Although there are several efforts for optimizing the cost-based                 Figure 4 show that Hive has the lowest performance in general
techniques in Spark-SQL such the work proposed recently in [6]                   for running most of the queries even those that are not complex
optimizing (Generalized Projection/Selection/Join) queries, these                with 70% of low performance than others. HDFS backends in gen-
optimizations are not graph-aware cost optimizations. To this ex-                eral outperform Neo4j and Hive with 100% of better performance.
tent, providing Graph-aware Cost-Based-Optimizations (GCBOs)                     In particular, Parquet format in HDFS has the best performance.
for selecting the best execution plan of the graph query (using                  It outperforms ORC and CSV format in most cases of running the
best guess approach that takes into account the important graph                  queries with 42%. While both CSV and ORC achieve only 28.5%
structural information/statistics about the graph dataset instead                of higher performance.
of basic relational statistics) will have better optimization and                    Morpheus Micro-Benchmark: In our Micro-benchmark ex-
performance for addressing such graph queries in Spark.                          periment, we run 18 Atomic/micro level BI queries17 . The results
   To tackle this challenge, we aim to provide a graph-aware                     of Figure 5 show that Neo4j has the lowest performance in general
query planner which will be implemented as a layer on top of the                 for running the first 12 queries with 66% of low performance than
default Spark Catalyst for providing a GCBO query plan, taking                   others. Hive starts to perform worse than Neo4j (and all other
into account the statistics of the property graph that resides in                systems) only when the number of joins increase and sorting
Morpheus storage backend. Particularly, the new graph plan-                      being applied on queries from Q13 to Q18. While, HDFS back-
ner/optimizer can select the best join of tables order based on                  ends in general outperform Neo4j and Hive with 94.4% of better
selectivity and cardinality estimations of graph patterns in the                 performance. In particular, Parquet format in HDFS has the best
graph query for filter and join operators. Therefore, at the query               performance, it outperforms ORC and CSV in most queries with
time, the new GCBO can suggest a more optimized query plan                       55%. While CSV and ORC only outperform with 22.2% and 16.6%,
for the Catalyst to follow.                                                      respectively.

4    PRELIMINARY RESULTS                                                         5     CONCLUSIONS AND FUTURE WORK
In this section, we describe our initial experimental results for                We are living in an era of continuous huge growth of more and
answering RQ1. In particular, we have designed a set of micro and                more connected data. Querying and processing large graphs is an
macro benchmarking experiments for evaluating the performance                    interesting and challenging task. The Morpheus framework aims
of different Spark storage backends supported by the Morpheus                    at integrating Cypher query language to work as a graph query
framework. These storage backends are: Neo4j, Hive, and HDFS                     language on top of Spark. Morpheus translates Cypher queries
with its different file formats (CSV,Parquet, and ORC). Notably,                 into relational Dataframes operations that can fit in the Spark-
we don’t copy data from these Spark storage backends, we only                    SQL environment. Morpheus depends mainly on default Spark
evaluate Morpheus performance with the data already resides in                   Catalyst optimizer for optimizing those relational operators. No
them. We have used the Cypher LDBC Social Network Benchmark                      graph indexing nor graph materialized views are maintained
(SNB) BI benchmark query workload12 . Our selected queries are                   in Morpheus or Spark SQL framework for optimizing property
read only queries (i.e. no updates are supported by Morpheus).
                                                                                 13 https://github.com/ldbc/ldbc_snb_datagen
   Hardware and Software Configurations: Our experiments
                                                                                 14 https://neo4j.com/docs/operations-manual/current/tools/import/
have been performed on a Desktop PC running a Cloudera Vir-                      15 The code and results of our initial experiments is available on https://github.com/
tual Machine (VM) v.5.13 with Centos v7.3 Linux system, running                  DataSystemsGroupUT/MorephusStorageBenchmarking
                                                                                 16 http://bit.ly/2W5b01N Macro LDBC SNB BI Queries
12 https://github.com/ldbc/ldbc_snb_implementations/tree/master/cypher/queries   17 http://bit.ly/2Pa4TrF Micro/Atomic level queries
                                                                 Figure 4: SNB-BI query results

                                                                                       [7] Angela Bonifati, Peter Furniss, Alastair Green, Russ Harmer, Eugenia Oshurko,
                                                                                           and Hannes Voigt. 2019. Schema validation and evolution for graph databases.
                                                                                           arXiv preprint arXiv:1902.06427 (2019).
                                                                                       [8] Hirokazu Chiba, Ryota Yamanaka, and Shota Matsumoto. 2019. Property
                                                                                           Graph Exchange Format. arXiv preprint arXiv:1907.03936 (2019).
                                                                                       [9] Nadime Francis et al. 2018. Cypher: An evolving query language for property
                                                                                           graphs. In SIGMOD.
                                                                                      [10] Abel Gómez, Amine Benelallam, and Massimo Tisi. 2015. Decentralized model
                                                                                           persistence for distributed computing.
                                                                                      [11] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J
                                                                                           Franklin, and Ion Stoica. 2014. Graphx: Graph processing in a distributed
                                                                                           dataflow framework. In 11th {USENIX } Symposium on Operating Systems
                                                                                           Design and Implementation ( {OSDI } 14). 599–613.
                                                                                      [12] Chuang-Yi Gui et al. 2019. A survey on graph processing accelerators: Chal-
                                                                                           lenges and opportunities. Journal of Computer Science and Technology 34, 2
                                                                                           (2019).
                                                                                      [13] Olaf Hartig and Jorge Pérez. 2018. Semantics and complexity of GraphQL. In
                                                                                           Proceedings of the 2018 World Wide Web Conference. International World Wide
                                                                                           Web Conferences Steering Committee, 1155–1164.
                                                                                      [14] Foteini Katsarou, Nikos Ntarmos, and Peter Triantafillou. 2015. Performance
               Figure 5: Atomic level query results                                        and scalability of indexed subgraph query processing methods. Proceedings of
                                                                                           the VLDB Endowment 8, 12 (2015), 1566–1577.
                                                                                      [15] Karsten Klein, Nils Kriege, and Petra Mutzel. 2011. CT-index: Fingerprint-
graph data querying and processing. In this PhD work, we focus                             based graph indexing combining cycles and trees. In ICDE.
on tackling these challenges by designing an efficient storage                        [16] Egor V Kostylev et al. 2015. SPARQL with property paths. In ISWC.
backend for persisting property graphs for Morpheus. In addition,                     [17] Yike Liu, Tara Safavi, Abhilash Dighe, and Danai Koutra. 2018. Graph summa-
                                                                                           rization methods and applications: A survey. ACM Computing Surveys (CSUR)
we aim at providing graph-aware techniques (e.g., indexes, materi-                         51, 3 (2018), 62.
alized views) for Spark to optimize the graph queries, in addition                    [18] Bingqing Lyu, Lu Qin, Xuemin Lin, Lijun Chang, and Jeffrey Xu Yu. 2016.
                                                                                           Scalable supergraph search in large graph databases. In 2016 IEEE 32nd Inter-
to other graph-aware CBO for Spark Catalyst Optiomizer. We                                 national Conference on Data Engineering (ICDE). IEEE, 157–168.
believe that achieving these contributions as our future research                     [19] Marko A Rodriguez. 2015. The Gremlin graph traversal machine and language
plan can significantly enhance the performance of executing                                (invited talk). In PDPL.
                                                                                      [20] Gábor Szárnyas. 2019. Query, analysis, and benchmarking techniques for
graph queries using the Morpheus framework.                                                evolving property graphs of software systems. (2019).
                                                                                      [21] Oskar van Rest et al. 2016. PGQL: a property graph query language. In
                                                                                           Proceedings of the Fourth International Workshop on Graph Data Management
REFERENCES                                                                                 Experiences and Systems.
 [1] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter,          [22] Da Yan, Yingyi Bu, Yuanyuan Tian, Amol Deshpande, et al. 2017. Big graph
     and Domagoj Vrgoč. 2017. Foundations of modern query languages for graph              analytics platforms. Foundations and Trends® in Databases 7, 1-2 (2017), 1–195.
     databases. ACM Computing Surveys (CSUR) 50, 5 (2017), 68.                        [23] Xifeng Yan, Philip S Yu, and Jiawei Han. 2004. Graph indexing: a frequent
 [2] Michael Armbrust et al. 2015. Scaling spark in the real world: performance            structure-based approach. In Proceedings of the 2004 ACM SIGMOD interna-
     and usability. PVLDB 8, 12 (2015).                                                    tional conference on Management of data. ACM, 335–346.
 [3] Michael Armbrust et al. 2015. Spark sql: Relational data processing in spark.    [24] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and
     In SIGMOD.                                                                            Ion Stoica. [n.d.]. Spark: Cluster computing with working sets. ([n. d.]).
 [4] Michael Armbrust, Doug Bateman, Reynold Xin, and Matei Zaharia. 2016.            [25] Yan Zhang. 2017. Efficient Structure-aware OLAP Query Processing over Large
     Introduction to spark 2.0 for database researchers. In Proceedings of the 2016        Property Graphs. Master’s thesis. University of Waterloo.
     International Conference on Management of Data. ACM, 2193–2194.                  [26] Peixiang Zhao, Jeffrey Xu Yu, and Philip S Yu. 2007. Graph indexing: tree+
 [5] Ramazan Ali Bahrami, Jayati Gulati, and Muhammad Abulaish. 2017. Effi-                delta<= graph. In PVLDB.
     cient processing of SPARQL queries over graphframes. In Proceedings of the
     International Conference on Web Intelligence. ACM, 678–685.
 [6] Lorenzo Baldacci and Matteo Golfarelli. 2018. A Cost Model for SPARK SQL.
     IEEE Transactions on Knowledge and Data Engineering 31, 5 (2018), 819–832.