Context-Free Path Querying In Terms of Linear Algebra Rustam Azimov supervised by Semyon Grigorev Saint Petersburg State University St. Petersburg, Russia st013567@student.spbu.ru ABSTRACT to return the example of such path for all vertex pairs, and all-path Context-Free Path Querying (CFPQ) is an important problem with query semantics that requires to return all such paths for all vertex applications in many areas, for example, graph databases, bioin- pairs. formatics, static analysis, etc. Historically, regular languages are A number of CFPQ algorithms based on parsing techniques were used as constraints for navigational path queries. However, in some proposed: (G)LL and (G)LR-based algorithms by Ciro M. Medeiros important cases, regular languages are not expressive enough and et al. [3], Fred C. Santos et al. [19], Semyon Grigorev et al. [13], context-free languages are used instead. Many algorithms for CFPQ and Ekaterina Verbitskaia et al. [4]; CYK-based algorithm by Xi- were proposed but recently showed that the state-of-the-art CFPQ aowang Zhang et al. [15]; combinators-based approach to CFPQ algorithms are still not performant enough for practical use. One by Ekaterina Verbitskaia et al. [5]. Yet recent research by Jochem promising way to achieve high-performance solutions for graph Kuijpers et al. [8] shows that existing solutions are not applicable querying problems is to reduce them to linear algebra operations for real-world graph analysis because of significant running time (such as matrix multiplication). The active utilization of these oper- and memory consumption. ations in the process of context-free path query evaluation makes it Inspired by Valiant’s [20] matrix-based algorithm for context- possible to efficiently apply a wide class of optimizations and com- free language recognition, we explore the applicability of linear puting techniques, such as GPGPU (General-Purpose computing algebra methods to the CFPQ problem. The linear algebra methods on Graphics Processing Units), parallel computation, sparse matrix is widely used for various problems of finding paths in graphs, but representation, distributed-memory computation, etc. In this Ph.D. the CFPQ problem poses additional challenges originating from work, we aim at: (i) studying the applicability of linear algebra query-specific information that needs to be captured. Valiant pro- methods to the CFPQ problem, (ii) at devising the algorithms for posed a parsing algorithm, which computes a recognition table context-free path query evaluation formulated in terms of linear by computing matrix transitive closure. These algorithms take a algebra operations, and (iii) at achieving high-performance imple- string at the input and decide whether this string is generated from mentations of the devised algorithms using parallel computations. the input context-free grammar. Valiant’s algorithm has essentially the same complexity as Boolean matrix multiplication. This is a promising way to achieve high-performance solutions for graph querying problems. The active utilization of these operations in the 1 INTRODUCTION process of context-free path query evaluation makes it possible to Formal language-constrained path querying [2] is a graph analysis efficiently apply a wide class of optimizations and computing tech- problem in which formal languages are used as constraints for nav- niques, such as GPGPU (General-Purpose computing on Graphics igational path queries. In this problem, a path in an edge-labeled Processing Units), parallel computation, sparse matrix representa- graph is viewed as a word constructed by the concatenation of tion, distributed-memory computation, etc. edge labels. The formal languages are used to constrain the paths In this Ph.D. work, we make the following contributions. of interest: a query should find only paths labeled by words from the language. The most popular class of constraints used as naviga- tional queries in graph databases are the regular ones. However, in some important cases, regular languages are not expressive enough (1) We provide an approach to solving the CFPQ problem using and context-free languages are used instead. The context-free path linear algebra operations. The provided approach allows querying (CFPQ), can be used in many areas, for example, RDF anal- us to use a wide class of optimizations of linear algebra ysis [15], static code analysis [9, 14], biological data analysis [12], operations for efficient analysis of large graphs. graph segmentation [6]. (2) Using provided approach, we devise the CFPQ algorithms CFPQ has been studied a lot since the problem was first stated by based on linear algebra operations for relational, single- Mihalis Yannakakis in 1990 [21]. Jelle Hellings investigates various path, and all-path query semantics. aspects of CFPQ in [16–18] and formulates three possible querying (3) We provide the implementations of the devised algorithms semantics: relational that requires to find all vertex pairs reachable for context-free path query evaluation using different op- by some path of interest, single-path query semantics also requires timizations and computing techniques. Our preliminary results demonstrate that our best CPU and GPU-based Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Den- implementations that utilize sparse matrix representation mark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). and parallel computation outperform the state-of-the-art context-free path querying solutions. Reducing the CFPQ problem to the linear algebra operations Implementation Data Analysis Result interpretation Construction of adjacency matrices of a Using the graph with sets of nonterminals of the Inheritance of appropriate Extracting input CF-grammar as elements theoretical libraries that information about Input graph properties by the support linear the required paths constructed algebraic from matrices algorithm operations depending on the Using the apparatus of linear algebra query semantics (matrix theory, ring theory, etc.) Query Using of suitable Efficient formats for storing parallelization on sparse adjacency CPU or GPU Construction of a Traversing the Input Context-Free matrices (CSR, semiring over the input graph using grammar CSC, COO sets of linear algebraic formats) nonterminals of the operations on input CF-grammar adjacency matrix Converting Query semantics Operations must Using matrix adjacency matrices comply with the multiplication, to matrices with application of the tensor product, information about production rules of transitive closure, the desired paths the input etc. CF-grammar Figure 1: An overview of our proposed approach to solving the CFPQ problem using linear algebra operations 2 APPROACH of a certain algebraic structure, namely, in the form of a semiring In this section, we describe our approach to solving the CFPQ prob- with non-associative multiplication. This approach takes advantage lem using linear algebra operations. The idea of using a sparse adja- of the fact that in the production rules of the CF-grammars there cency matrix as a graph representation in graph analysis problems are two operations β€” concatenation and union, which are trans- is well-known. Recently, became very popular the GraphBLAS [7] formed into the product and the sum of a semiring. The elements API specification that defines standard building blocks for graph of the adjacency matrices must be sets of nonterminals since each algorithms in the language of linear algebra. Various libraries that of them describe the set of words corresponding to the paths of implement it provide data structures and functions to compute lin- interest. Further, the resulting semiring is used to define matrix ear transformations and other linear algebra operations on sparse multiplication (or other linear algebra operation), which simulates matrices. Using these libraries or other efficient libraries for such the step of the input graph traversing. For a complete traversal of operations is a good recipe for making a high-performance CFPQ the input graph, a transitive closure of adjacency matrices can be solution if we can reduce the CFPQ problem to linear algebra oper- defined, which allows us to obtain the information about all paths ations. Although such reduction was found for a number of graph corresponding to the query. The type of information retrieved de- algorithms, there are many graph algorithms for which it has not pends on the query semantics, which must be taken into account been done. To the best of our knowledge, the reduction of CFPQ when constructing the adjacency matrices and semirings. problem to linear algebra operation is an open question. Analysis. To analyze the theoretical properties of the constructed An overview of our proposed approach is shown in Figure 1. The algorithm, the existing theoretical results of linear algebra, graph purpose of this approach is to solving CFPQ problem using linear theory and the theory of formal languages can be used. The proper- algebra operations. In order to do this, it is necessary to devise the ties of the entire CFPQ algorithm largely depend on the properties CFPQ algorithms that have the input in the form of a directed edge- of linear algebra operations since it evaluates the context-free path labeled graph as a data, a context-free grammar as a query, and the queries by offloading the most intensive computations into calls to query semantics that determines the type of requested information procedures for these operations. Therefore, the most effective for about paths in the graph. Further, the approach can be divided into the algorithm will be optimizations that use the existing results of the following stages. linear algebra to efficiently compute such operations, for example, sparse matrix and vector operations. Reducing the CFPQ problem to the linear algebra operations. The query is formulated by a context-free grammar 𝐺 which is a tuple Implementation. From a practical point of view, the algorithms (𝑁 , Ξ£, 𝑃, 𝑆), where 𝑁 is a finite set of nonterminals; Ξ£ is a finite set built in this approach are easy to implement, since the most time of terminals, 𝑁 ∩ Ξ£ = βˆ…; 𝑃 is a finite set of productions of the form consuming is the implementation of the necessary linear algebra 𝐴 β†’ 𝛼, where 𝐴 ∈ 𝑁 , 𝛼 ∈ (𝑁 βˆͺ Ξ£) βˆ— ; and 𝑆 is the start nonterminal. operations, which have already been implemented in many effi- To formulate the obtained problem in terms of linear algebra, the in- cient libraries that support linear algebra operations. For example, put graphs are considered in the form of adjacency matrices, and the can be used the SuiteSparse:GraphBLAS [1] library β€” is an imple- input context-free grammar (CF-grammar) is encoded in the form mentation of the GraphBLAS API. In such libraries, various matrix 2 optimizations are used, which will significantly speed up the com- putation of the context-free queries for large graphs. One of such optimizations is the use of sparse matrix formats (CSR, CSC, COO), the use of which gives a significant performance gain since real data is often sparse. In addition, many linear algebra operations can be efficiently computed in parallel, for example on a GPU. As a result, the CFPQ implementation will allow to obtain matrices that will store information about the desired paths in the graph. The type of stored information is determined by the query semantics, for example, it can be an answer to the question of the existence of Figure 2: Execution time in seconds of the path extraction paths of a certain form or their enumeration. algorithm depending on the path length for π‘”π‘’π‘œπ‘ π‘π‘’π‘π‘–π‘’π‘  Result interpretation. The final step is to interpret the query re- sult. Depending on the query semantics, it is possible to extract For single-path and all-path query semantics, we store additional certain information about the paths between the vertices of in- information in matrices to be able to restore found paths. For single- terest from the resulting matrix. In addition, in the case when the path query semantics, we store the intermediate vertex π‘˜ in the query semantics involves the enumeration of paths between certain element 𝑇𝑖,𝑗 only if there is a path from 𝑖 to π‘˜ corresponding to the vertices in the graph, it also becomes necessary to implement an nonterminal 𝐡, there is a path from π‘˜ to 𝑗 corresponding to the algorithm for constructing these paths from the resulting matrices. nonterminal 𝐢, and there is a rule (𝐴 β†’ 𝐡𝐢) ∈ 𝑃. For all-path query semantics, we store the sets of the intermediate vertices, since we 3 ALGORITHMS must store the information about all paths between each vertex In this section, we briefly describe our devised CFPQ algorithms pair. In that case, we cannot reduce computations to operations based on the linear algebra operations. For all our algorithms we on Boolean matrices and we use custom matrix multiplication for formally prove the correctness, termination, and time complexity matrices with more complex elements (tuples or arrays of integers). bounds. Our algorithms can be divided into two groups. There are still libraries that support linear algebra operations for our algorithms for single-path and all-path query semantics, for Matrix-based algorithms. The algorithms in the first group utilize example, the GraphBLAS implementations that support the creation the Boolean matrix multiplication for relational query semantics of custom semirings for the matrix operations. and operate with more complex matrices for single-path and all- Kronecker product-based algorithm. On the contrary, the algo- path query semantics. Also, these algorithms, like many existing rithm in the second group is based on the Kronecker product opera- ones, require the CF-grammar transformation to some normal form tion and does not require the transformation of the input grammar. that allows us to encode one step of the input graph traversing into The transformation leads to at least a quadratic blow-up in grammar exactly one matrix multiplication since in this form we have only size, thus by avoiding the transformation, this algorithm achieves two nonterminals in the right-hand side of productions rules. better time complexity in terms of the grammar size. While regu- We define a binary operation ( Β· ) for arbitrary subsets 𝑁 1, 𝑁 2 of lar languages can be expressed as a Finite-State Machine (FSM), a 𝑁 with respect to a CF-grammar 𝐺 = (𝑁 , Ξ£, 𝑃) as CF-grammar can be expressed as a Recursive State Machine (RSM) in a similar fashion. In these algorithms, we use RSM to represent 𝑁 1 Β· 𝑁 2 = {𝐴 | βˆƒπ΅ ∈ 𝑁 1, βˆƒπΆ ∈ 𝑁 2 such that (𝐴 β†’ 𝐡𝐢) ∈ 𝑃 }. the context-free path query and evaluate this query using the Kro- necker product of the corresponding adjacency matrices of the Using this binary operation as subset multiplication, and union input graph and the RSM for the input CF-grammar. Although the as an addition, we can define a matrix multiplication, π‘Ž Γ— 𝑏 = 𝑐, Kronecker-based algorithm for constructing the matrices with in- where π‘Ž and 𝑏 are matrices of a suitable size, that have subsets of 𝑁 formation about required paths (or so-called index) is the same for Ð as elements, as 𝑐𝑖,𝑗 = π‘›π‘˜=1 π‘Žπ‘–,π‘˜ Β· π‘π‘˜,𝑗 . Also, we use the element-wise all three query semantics, the paths extraction algorithm for each union operation on matrices π‘Ž and 𝑏 with the same size: π‘Ž βˆͺ 𝑏 = 𝑐, semantics is different. where 𝑐𝑖,𝑗 = π‘Žπ‘–,𝑗 βˆͺ 𝑏𝑖,𝑗 . Finally, we define the transitive closure (1) (2) (1) of a square matrix π‘Ž as π‘Ž + = π‘Ž + βˆͺ π‘Ž + βˆͺ Β· Β· Β· , where π‘Ž + = π‘Ž and 4 PRELIMINARY RESULTS (𝑖) Ð ( 𝑗) (π‘–βˆ’π‘—) We evaluate implementations of devised algorithms on real-world π‘Ž + = π‘–βˆ’1 𝑗=1 π‘Ž + Γ— π‘Ž + , 𝑖 β‰₯ 2. We can evaluate the context-free path queries with relational RDFs. We provide results only for our CPU and GPU-based imple- semantics by computing this transitive closure of an adjacency ma- mentations of matrix-based algorithms for relational and single- trix 𝑇 of input labeled graph with sets of nonterminals as elements path query semantics because of the page limit. We use Redis- where 𝐴 ∈ 𝑇𝑖,𝑗 only if there is π‘₯ ∈ Ξ£ such that there is edge from Graph [11] graph database as storage and measure the full time of vertex 𝑖 to 𝑗 labeled by π‘₯ and (𝐴 β†’ π‘₯) ∈ 𝑃. However, described query execution including all overhead on data preparation. This operations can be computed using several Boolean matrix multipli- way we can estimate the applicability of the matrix-based algorithm cations and additions if we encode the matrix 𝑇 by the |𝑁 | Boolean to real-world problems. The source code is available on GitHub1 matrices (one Boolean matrix for each nonterminal how it is done 1 Sources of matrix-based CFPQ algorithm for the RedisGraph database: https://github. in the work of Valiant [20]). com/YaccConstructor/RedisGraph. Access date: 27.02.2021. 3 Table 1: Index creation time for RDFs (time is measured in seconds and memory is measured in megabytes) Relational semantics index Single-path semantics index Name V E RG_CPUrel RG_GPUrel RG_CPUpath RG_GPUpath Time (s) Mem (MB) Time (s) Mem (MB) Time (s) Mem (MB) Time (s) Mem (MB) go-hierarchy 45 007 1 960 436 0.091 16.3 0.108 121.2 0.976 92.0 0.336 125.0 enzyme 48 815 219 390 0.018 5.9 0.018 4.0 0.029 8.1 0.043 6.0 eclass_514en 239 111 1 047 454 0.067 13.8 0.166 16.0 0.195 31.2 0.496 26.0 go 272 770 1 068 622 0.604 28.8 0.365 30.2 1.286 75.7 0.739 45.4 geospecies 450 609 2 311 461 7.146 16 934 0.856 5 274 15.134 35 803 1.935 5 282 The results of the CFPQ evaluation are presented in Table 1. [3] Ciro M. Medeiros et al. 2018. Efficient Evaluation of Context-free Path Queries As we can see, our matrix-based algorithm for relational query for Graph Databases. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (Pau, France) (SAC ’18). ACM, New York, NY, USA, 1230– semantics implemented for RedisGraph is more than 1000 times 1237. https://doi.org/10.1145/3167132.3167265 faster than the one based on annotated grammar implemented for [4] Ekaterina Verbitskaia et al. 2016. Relaxed Parsing of Regular Approximations of String-Embedded Languages. In Perspectives of System Informatics, Manuel Neo4j [8] and uses more than 4 times less memory. We can conclude Mazzara and Andrei Voronkov (Eds.). Springer International Publishing, Cham, that the matrix-based algorithm is more performant than the state- 291–302. of-the-art CFPQ algorithms for query evaluation under a relational [5] Ekaterina Verbitskaia et al. 2018. Parser Combinators for Context-free Path Querying. In Proceedings of the 9th ACM SIGPLAN International Symposium semantics for real-world data processing. on Scala (St. Louis, MO, USA) (Scala 2018). ACM, New York, NY, USA, 13–23. We can conclude, that the cost of computing matrices for single- https://doi.org/10.1145/3241653.3241655 path query semantics is not high. On average, it is about 2 times [6] H. Miao et al. 2019. Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization. In 2019 IEEE 35th International Conference on slower than for the relational query semantics. The additional run- Data Engineering (ICDE). 1710–1713. ning time of the path extraction is presented in Figure 2 (boxplots [7] J. Kepner et al. 2016. Mathematical foundations of the GraphBLAS. In 2016 IEEE High Performance Extreme Computing Conference (HPEC). 1–9. https: are standard, outliers are omitted). As we can see, this time is small //doi.org/10.1109/HPEC.2016.7761646 and linear in the length of the path. [8] Jochem Kuijpers et al. 2019. An Experimental Study of Context-Free Path Query Finally, we conclude that the matrix-based algorithm paired with Evaluation Methods. In Proceedings of the 31st International Conference on Scien- tific and Statistical Database Management (Santa Cruz, CA, USA) (SSDBM ’19). a suitable database and employing appropriate libraries for linear ACM, New York, NY, USA, 121–132. https://doi.org/10.1145/3335783.3335791 algebra operations is a promising way to make CFPQ with different [9] Jakob Rehof et al. 2001. Type-Base Flow Analysis: From Polymorphic Subtyping query semantics applicable for real-world data analysis. to CFL-Reachability. SIGPLAN Not. 36, 3 (Jan. 2001), 54–66. https://doi.org/10. 1145/373243.360208 [10] Jyothi Vedurada et al. 2019. Batch Alias Analysis. In Proceedings of the 34th 5 CONCLUSION AND NEXT STEPS IEEE/ACM International Conference on Automated Software Engineering (San Diego, California) (ASE ’19). IEEE Press, 936–948. https://doi.org/10.1109/ASE. In this Ph.D. work, we provide an approach to solving the CFPQ 2019.00091 problem using linear algebra operations. Using provided approach [11] P. Cailliau et al. 2019. RedisGraph GraphBLAS Enabled Graph Database. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops we devise the CFPQ algorithms for all three query semantics and (IPDPSW). 285–286. https://doi.org/10.1109/IPDPSW.2019.00054 implement them using appropriate libraries for linear algebra op- [12] Petteri Sevon et al. 2008. Subgraph Queries by Context-free Grammars. Journal erations. Preliminary results show that our CPU and GPU-based of Integrative Bioinformatics 5, 2 (2008), 157 – 172. https://doi.org/10.1515/jib- 2008-100 implementations that utilize sparse matrix representation and par- [13] Semyon Grigorev et al. 2017. Context-free Path Querying with Structural Repre- allel computation outperform the state-of-the-art solutions. sentation of Result. In Proceedings of the 13th Central & Eastern European Software Engineering Conference in Russia (St. Petersburg, Russia) (CEE-SECR ’17). ACM, As a next step, we plan to provide a full comparison of our linear New York, NY, USA, Article 10, 7 pages. https://doi.org/10.1145/3166094.3166104 algebra-based implementations with all state-of-the-art solutions [14] Xin Zheng et al. 2008. Demand-driven Alias Analysis for C. In Proceedings of the for CFPQ on the same benchmark and experimental setup. More- 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (San Francisco, California, USA) (POPL ’08). ACM, New York, NY, over, our implementations are prototypes and we plan to provide USA, 197–208. https://doi.org/10.1145/1328438.1328464 full integration of CFPQ to RedisGraph. Also, we plan to improve [15] Xiaowang Zhang et al. 2016. Context-Free Path Queries on RDF Graphs. In The the dataset by including more real-world cases with larger graphs, Semantic Web – ISWC 2016. Springer International Publishing, Cham, 632–648. [16] Jelle Hellings. 2014. Conjunctive context-free path queries. In Proceedings of for example, from the area of static code analysis [10, 14]. ICDT’14. 119–130. [17] Jelle Hellings. 2015. Path Results for Context-free Grammar Queries on Graphs. CoRR abs/1502.02242 (2015). arXiv:1502.02242 http://arxiv.org/abs/1502.02242 ACKNOWLEDGMENTS [18] Jelle Hellings. 2015. Querying for Paths in Graphs using Context-Free Path This work was funded by RFBR, project number 19-37-90101, and Queries. arXiv preprint arXiv:1502.02242 (2015). [19] Fred C. et al. Santos. 2018. A Bottom-Up Algorithm for Answering Context-Free grant from JetBrains Research. Path Queries in Graph Databases. In Web Engineering, Tommi Mikkonen, Ralf Klamma, and Juan HernΓ‘ndez (Eds.). Springer International Publishing, Cham, 225–233. REFERENCES [20] Leslie G Valiant. 1975. General context-free recognition in less than cubic time. [1] Timothy A. Davis. 2019. Algorithm 1000: SuiteSparse: GraphBLAS: Graph Algo- Journal of computer and system sciences 10, 2 (1975), 308–315. rithms in the Language of Sparse Linear Algebra. ACM Trans. Math. Softw. 45, 4 [21] Mihalis Yannakakis. 1990. Graph-theoretic Methods in Database Theory. In (2019), 44:1–44:25. https://doi.org/10.1145/3322125 Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles [2] C. Barrett et al. 2000. Formal-Language-Constrained Path Problems. SIAM of Database Systems (Nashville, Tennessee, USA) (PODS ’90). ACM, New York, J. Comput. 30, 3 (2000), 809–837. https://doi.org/10.1137/S0097539798337716 NY, USA, 230–242. https://doi.org/10.1145/298514.298576 arXiv:https://doi.org/10.1137/S0097539798337716 4