Unified Algorithm to Solve several Graph Problems with Relational Queries Wellington Cabrera, Carlos Ordonez Department of Computer Science, University of Houston, USA Abstract. Several important graph algorithms can be solved as an iter- ation of vector-matrix multiplication over different semirings. On this ba- sis, we show that the Bellman-Ford (single source shortest paths), reacha- bility, PageRank, and topological sort algorithms can be expressed as re- lational queries, to solve analytic graph problems in relational databases. As a main contribution, we present a general algorithm that unifies all graph algorithms aforementioned. 1 Introduction Relational databases remain the most common technology to store transactional and analytical databases, due to optimized I/O, robustness and security con- trol. Although the common understanding is that relational database queries are not sufficient to express important graphs algorithms, some recent work has addressed the problem of solving graph algorithms with SQL queries. In our previous work [5], we proposed optimized recursive queries to compute matrix powers, to solve two important graph problems in the DBMS: Transitive clo- sure and all pairs shortest path. In [6], the authors use recursive SQL to find constrained paths in RDF data, stored in a relational DBMS. Jindal et all [3] present some graph algorithms using a columnar database, showing that DBMS are competitive with specialized graph systems, namely Giraph and GraphLab. In this work we demonstrate that several graph algorithms can be expressed as a succession of efficient SPJA queries that are defined over the foundation of graph algorithms with vector-matrix operations. We believe that efficient graph algo- rithms for relational databases are a contribution that will support in-database graphs analytics in large data sets stored in databases. 2 Definitions Graph Dataset Let G = (V, E) be a directed graph, where V is a set of vertices and E is a set of edges, considered as an ordered pairs of vertices. Let n = |V | vertices and m = |E| edges. An edge (i, j) in E links two vertices in V and has a direction. We use E to name both the set of edges of G as well as the adjacency matrix, depending on the context. In general, real-life graphs are sparse. Therefore, it is reasonable to store the adjacency matrix E in a sparse representation, which saves both computing, memory and storage resources. Several methods to represent sparse matrices are well known, and the interested reader may check [1]. In our work, the adjacency matrix of a graph G is stored in a table E(i, j, v), with primary key (i, j). The sparse matrix representation avoids storing zeroes. Thus, the space required for table E is O(m), much smaller than the space necessary in the dense representation, O(n2 ). We consider a matrix sparse when the number of non-zero cells is O(n). Likewise, we consider a graph as sparse when the number of edges is O(n). Certain classes of graph are clearly sparse. For instance, both trees and spanning trees are graphs where m = n − 1. Many graph problems are solved computing powers of E. For large graphs computing such powers is challenging; even if E is a sparse graph, it is possible that after k multiplications by itself, the number of non-zero entries increases to O(n2 ). Thus, E k is not necessarily sparse. Matrix Multiplication and Semirings Semirings are algebraic structures consisting of a set R, an additive operator ⊕ with identity element 0, a product operator ⊗ with identity element 1, and commutative, associative and distribu- tive properties holding for the two operators in the usual manner, succinctly represented as (R, ⊕, ⊗, 0, 1). For instance, the regular matrix multiplication is defined under (R, +, ×, 0, 1). A general definition of matrix multiplication ex- pands it to any semiring. On the min-plus semiring, also known as ”tropical semiring”, min is the additive operator ⊕, and + is the product operator ⊗. The min-plus semiring is used to solve shortest path problems, as in [2]. The boolean semiring, with ∨ (logical OR) as ⊕ operator and ∨ (logical OR) as ⊗ operator, is used frequently in linear algebra to represent some graph algorithms. 3 Graphs Algorithms computed with Relational Queries We use as starting point known linear algebra approaches for Bellman-Ford, reachability and PageRank, as well as a novel way to express Topological Sort with matrix operations. On this foundation, we will show how all these algo- rithms can be expressed as relational queries. As is already known, Bellman- Ford and reachability can be solved by powers of the adjacency matrix and a vector-matrix multiplication. Topological sort also can be solved in the same manner, as we will show shortly. Likewise, PageRank can be solved by powers of the transition matrix. These four algorithms can be solved in a general way by Equation 1: Sk = S0 · E k = S0 · E · E · ... (1) where S0 stores the initial state and Sk stores the final result. The last iteration is k. Instead of solving E k , it is more efficient to compute Sk with iterative vector-matrix products: Sd ← Sd−1 · E for d = 1 . . . k. Algorithms are below. Bellman-Ford: Shortest paths from a source s. Initialization: S0 [i] = 0 if i = s, ∞ otherwise. After k successive vector matrix multiplications (under the min + semiring), the vector Sk contains the minimun distance from s to every vertex. Reachability: Initialization: S0 [i] = 0 if i = s, 0 otherwise. Vector matrix multiplications are computed under the natural numbers semiring. At iteration d, Sd [i] contains 1 if there exists a path of length d from the s to the vertex i. PageRank: It is well known that PageRank can be computed as powers of a modified transition matrix [4]. Since PageRank is conceived as a Markov pro- cess, it can be computed as an iterative process that stops when the Markov chain stabilizes. Let S0 [i] = 1/n. The transition matrix T is defined as Ti,j = Ej,i /outdeg(j) when Ej,i = 1; otherwise Ti,j = 0. Let A be a n×n matrix, whose 0 cells contain always 1/n, and 0 < p < 1 the damping factor; let T = T + D , where D is a matrix such that Di,j = 1/n if the column j is a 0 column. The 00 00 0 power method can be applied on T defined as: T = (1 − p)T + pA, computing Sd ← Sd−1 · T ” iteratively, from d = 1 until stabilization of the Markov chain. Topological Sort: Sd holds the relative ordering for the vertices, and Sd [j] = 0 when the relative order of j is not yet determined. Initialization: S0 [j] = 1 if outdeg(j) = 0, otherwise S0 [j] = 1. The algorithm can be outlined in this manner: For each vertex j ∈ Sd−1 , find every father vertex π such that ∀i successor of π, i ∈ S, and let Sd [π] = 1. Repeat the process until ∀i ∈ V , S[i] 6= 0. The topological order is given by S0 + S1 + ...Sk . This algorithm can be expressed as in Equation 1, where Sk is solved with iterative vector-matrix multiplication. In this case, the ⊕ operation is the logical AND: ∧, and the ⊗ is the logical implication ⇒. As result of the vector-matrix multiplication, Sd [j] = 1 if for all j 0 successors of j , Sd [j 0 ] = 1. Although the operators ∧ and ⇒ cannot comprise a semiring (the commutative property for ⇒ does not hold), it is possible to express Sd ·E with SPJA queries, like in the previous algorithms. Table 1. Comparison of four graphs algorithms ... Bellman-Ford Reachability PageRank Topological Sort Computed as Sd ← Sd−1 · E Sd ← Sd−1 · E Sd ← Sd−1 · T Sd ← Sd−1 · E Semiring op.(⊕, ⊗) min, + +, × +, × ∧, ⇒ Value S[i] distance s to i 1 iif ∃ a path s to i probability order S0 defined as S0 [s] = 0 S0 [s] = 1 S0 [i] = 1/n S0 [i]=1 iff outdeg(i)=1 |S0 | 1 1 n | {i ∈ V | outdeg(i) = 0} | Output Sk S0 + S1 + ..Sk Sk S0 + S1 + ..Sk Time complexity O(kn log n) O(kn) or O(kn log n) O(kn log n) O(kn log n) Scope from source s from source s ∀i ∈ V ∀i ∈ V 4 Unified Algorithm Considering that S and E are stored in relational tables S(j, v) and E(i, j, v), the vector-matrix product can be clearly computed with a relational query as: πS.i,E.j:sum(S.v∗E.v) (S ./S.j=E.i E). Table 1 shows a comparison of the properties of the four algorithms of our concern. All of them are computed by a combined Algorithm 1: Unified Algorithm Input: Table E, S0 , R0 , f (), g(), ⊗, , unionF lag . Optional input: s (source) Output: Table Rd(j, v) d ← 0; ∆ ← 1; while ∆ >  do d←d+1 ; Sd ← πj:g()(Sd−1 .v⊗E.v) (Sd−1 ./j=i E) if unionFlag then Rd ← πj:sum(v) (Sd ∪ Rd−1 ) else Rd ← Sd end ∆ ← f (Rd−1 , Sd−1 , Rd , Sd ) end join/aggregation between S and E (or T ” ) as the main operation of the iter- ation. This SPJA query computes a vector-matrix multiplication on different semirings, depending on the specific algorithm, with time complexity O(nlogn). The meaning of the value in S[i] is different in every case. While Bellman-Ford and reachability return results concerning an unique source vertex, topological sort and PageRank return results with respect to the entire graph. The execution time of these algorithms will depend on n and the number of iterations k. For instance, reachability may finish in less iterations than Bellman-Ford, for the same graph. It is possible to conceptualize a Unified Algorithm to solve these four graphs problems, as presented in Algorithm 5. The main operation remains the matrix multiplication, with parameters ⊕, ⊗. Other input parameters are the initial vector S0 . and unionFlag, wich controls if the output is Sd or the cumulative sum of S0 + S1 ... + Sd . F 5 Conclusions We demonstrate that Bellman-Ford, reachability, PageRank and Topological Sort can be expressed as an iteration of relational queries, to solve graph prob- lems in data sets stored in databases. Based on the conceptual foundation of vector-matrix products under different semirings, these algorithms run as SPJA queries; they are secondary memory algorithms, not limited by the system RAM. The efficiency of our algorithms is supported by: sparse storage, which avoids wasting both space and computing time; conditions for early termination, which avoids useless iterations; and light weight relational queries. Moreover, we com- pared these algorithms and proposed a Unified Algorithm to solve them. Further research will identify and integrate into the Unified Algorithm more algorithms that follow the vector-matrix multiplication pattern. Besides, we will study these algorithms in large and complex graphs, especially in parallel clusters. References 1. Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the solution of algebraic eigenvalue problems: a practical guide, volume 11. Siam, 2000. 2. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algo- rithms, Third Edition. The MIT Press, 3rd edition, 2009. 3. A. Jindal, P. Rawlani, E. Wu, S. Madden, A. Deshpande, and M. Stonebraker. Vertexica: your relational friend for graph analytics! Proceedings of the VLDB En- dowment, 7(13):1669–1672, 2014. 4. S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrapolation methods for accelerating pagerank computations. In Proceedings of the 12th Int. Conf. on World Wide Web, pages 261–270. ACM, 2003. 5. C. Ordonez, W. Cabrera, and A. Gurram. Comparing columnar, row and array DBMSs to process recursive queries on graphs. Information Systems, pages –, 2016. 6. N. Yakovets, P. Godfrey, and J. Gryz. Evaluation of SPARQL property paths via recursive SQL. In Proc. AMW, 2013.