ceur-ws.org/Vol-2399/paper08.pdf Provenance-Based Routing in Probabilistic Graph Databases Yann Ramusat DI ENS, ENS, CNRS, PSL University & Inria Paris, France yann.ramusat@ens.fr Supervised by Pierre Senellart and Silviu Maniu ABSTRACT using a framework based on provenance annotations over Optimizing routing queries over graphs is a rich research area graph databases. with important applications, e.g., to road and transportation Graph databases are a common way to manage graph-like networks. Thanks to progress made during past decades, data, with applications to social network analysis, trans- current-day systems are able to compute paths across cities portation networks, or the Semantic Web. A notable graph in continent-sized areas, paths that are optimal in terms of DBMS is Neo4j1 and its Cypher query language2 . These distance or expected travel time. Nevertheless, the problem databases are typically queried using navigational queries, an considered is quite limited, personal preferences cannot be abstraction for which are Regular Path Queries (RPQs) [3], handled e↵ectively, and similar queries need to be computed which select pairs of vertices joined by a path whose label separately. We explore a provenance-based framework as belongs to the regular language defined by the query. a way to extend the expressive power of routing queries, In this PhD project, in order to incorporate additional in- based on the idea of keeping track of meta-information about formation within a graph database, we enrich the graph with query results. This framework, useful to deal with such provenance annotations. These annotations are propagated aspects as uncertainty or preferences, cannot always benefit to query results, and can be used to determine how the result of optimizations used for computing optimal routes, leading has been computed and how it reacts to slight changes in to impractical algorithms. The aim of our PhD is to improve the initial database. A mathematically rich way to do this is on routing techniques based on provenance to apply them to to choose provenance annotations to be elements of a semir- real transportation networks. ing [12]. Semirings are well-suited to model computations (e.g., choices and sequences) carried along in computations PVLDB Reference Format: such as the shortest-path problem, which can be rephrased as Yann Ramusat. Provenance-Based Routing in Probabilistic Graph solving equational systems over semirings. This framework Databases. VLDB 2019 PhD Workshop, 2019. is expressive enough to deal with uncertainty and security clearances, among many other applications. We thus use as a basis of our framework to compute en- 1. CONTEXT riched routing queries: graph databases, navigational queries, Progress made during past decades on optimization of and semiring-based provenance. routing (i.e., computation of optimal paths) in road and Our past work [18] suggests that computing the semiring- transportation networks led to current competing algorithms based provenance of navigational queries over graph databases answering queries in milliseconds over continent-sized ar- results in high complexity. It is strongly believed we cannot eas [4]. These algorithms exploit very specific properties of avoid a cubic (data)-complexity [22] for the general problem. routing operations. Thus they tend to be very constrained To overcome this issue, we consider four main approaches to and not able to handle the needs of real-life concerns, such as speed computations up. uncertainty about traffic congestion or personal preferences of the users. • Considering the shape of the network. It has To extend the expressive power of routing queries and to been shown in [15] that transportation networks ex- take into account this additional information, we propose hibit a relatively low treewidth. Exploiting this low treewidth may help improve the running time of pre- processing or query evaluation. It is worth noting these ideas have already been applied to the routing problem [10] without provenance. • Restricting the expressive power of provenance annotations. Considering subclasses of semirings can lead to more optimized algorithms, thus yielding a Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los 1 Angeles, California. Copyright c 2019 for this paper by its authors. https://neo4j.com/ 2 Copying permitted for private and academic purposes. http://www.opencypher.org/ 1 trade-o↵ between expressive power and cost of compu- Provenance-aware routing algorithms. In [18] we gen- tation. eralized three existing graph algorithms to compute the provenance of regular path queries over graph databases. • Linking to other models. Previous work has con- Each algorithm yields a di↵erent trade-o↵ between time com- sidered optimizing the computation of provenance for plexity and generality, as each requires di↵erent properties recursive query languages such as Datalog [12, 9]; re- over the semiring. lating RPQs over graphs to these models may allow Together, these algorithms already cover a large class of reusing these optimization techniques. semirings used for provenance (top-k, security, etc.). Experimental results suggest these approaches are com- • Adapting state-of-the-art routing algorithms. plementary and practical for various kinds of provenance Standard routing algorithms are orders of magnitude indications, even on a relatively large transport network. faster than provenance-aware routing algorithms be- In the following we do a brief review of them, their com- cause they rely on very specific properties of routing plexity and the situations were they can be applied. operations. It is worth determining whether they can • Dijkstra’s algorithm can be successfully applied for be generalized for our purpose. computing single-source provenance. This algorithm can be applied whenever we have 0-closed totally or- Ultimately our PhD aims to o↵er a strong theoretical dered semirings (also known as bounded or absorptive foundation for future systems supporting non-trivial real-life semirings) [18]. Under these restrictions we can still routing applications. These four main directions need to compute security clearances. be combined in an e↵ective way to allow designers of such systems to apply the fastest algorithm possible without losing • Mohri’s algorithm [16] for the case when annotations the capacity to handle the needs of their users. belong to k-closed semirings. This algorithm is ex- The document is organized as follows: Section 2 discusses ponential in theory but experimental studies [16, 18] in more detail the state of the art in routing and provenance- showed this algorithm is in fact practical in a real aware routing algorithms. We then present in Section 3 context. Under these restrictions we can compute for some preliminary results based on some improvements of example top-k shortest-paths and top-k distinct already known algorithms for specific classes of semirings shortest-paths. Note these restrictions are weaker (bounded semirings, distributive lattices, etc.). This leads us as for Dijkstra so we can still apply this algorithm for to Section 4 where we present a roadmap for the next two security clearances. years and a half of our PhD research. • The last algorithm is inspired by the node elimination method to obtain the language recognized by a finite 2. STATE OF THE ART state automaton [8]. Notice that this problem can be expressed inside our framework using the semiring We will focus on two di↵erent areas of the research litera- of formal languages. The provenance for one pair of ture: one for the algorithms for the (simple) routing problem nodes (x, y) is precisely the language recognized by the and one dedicated to algorithms for provenance-based routing automaton with x as initial state and y as final state. in the context of our framework. We can then compute single-pair provenance without any restriction on the underlying semiring. Routing algorithms. We provide an overview of the current techniques in routing we think ready to be generalized to In introduction we were referring to semirings as being our settings. Mostly, current competing algorithms rely well-suited structures to model several problem classes in either on the inherent hierarchy of the network (hierarchical computer sciences. These problems can be unified using techniques) or on the precomputation of distances between matrices over ⇤-semirings (also known as closed semirings), well-chosen pairs of vertices (bounded-hop techniques) [4]. semirings with a star operation. The theory of matrices over Hierarchical techniques are based on the observation a ⇤-semirings [14, 1] exhibit number of similarities with linear subnetwork of important roads (such as highways) concen- algebra. All-pairs graph provenance is then equivalent to the trate the traffic between sufficiently far away cities. This computation of the asteration of the matrix corresponding to allows to scan few vertices for long distance queries. Two the graph representation with provenance tags as cell-values. major algorithms belong to this framework: Contraction Notice the graph provenance is in fact an (infinite) sum over Hierarchies [11] and Reach [13]. the provenances of all paths from the source node leading to On the other side, bounded-hop techniques such as Label- the target node. Previous algorithms work over ⇤-semirings. ing algorithms [17] or Transit Nodes Routing [5, 6] permit Infinite sums can only appear because of circles in our graphs, to answer queries based on a virtual network composed of so we only need to be able to sum all the powers of a given precomputed shortcuts and limiting to few-hop paths. values. To have a result semantically correct we need to P 1 These two kinds of techniques can rely on each other, for ensure that a⇤ = an and that associativity and distribu- example the choice for hops in the Labeling algorithm can n=0 be done based on the most important nodes discovered by tivity extends to these infinite sums. This class of semirings the first step of a contraction hierarchy [4]. is commonly known as countably complete star semirings, or In [2] a new measure of the network (the Highway dimen- c-complete star semirings. sion) has been introduced. This measure provides a way to justify from a theoretical point of view the (sublinear) 3. PRELIMINARY RESULTS complexity of most of these techniques, dramatically helping We now briefly present our preliminary results, achieved our understanding of these techniques. during the first months of our PhD research. We emphasize 2 semirings Theorem 1. Let L be a fixed distributive lattice, with ⇤-semirings a chain decomposition of its join-irreducible elements of matrix asteration width w. Single-source provenance of an RPQ over a graph database of n nodes and m edges with annotations in L can e et C be computed using w applications of Dijkstra’s algorithm. pl on m w co ay This results in a complexity for the whole computation of c- c-complete star semirings graph provenance O(w ⇥ (m + n log n)), assuming semiring operations in L k-closed take constant time. Mohri’s alg. 0-closed absorptivity Links with Datalog queries. Recently, it has been shown we can make use of circuits to represent provenance for 0-closed total order Datalog queries in a much more efficient way [9]. It is worth Dijkstra alg. noting that, in this framework, distributive lattices and 0- closed semirings are also distinguished classes of semirings for optimizing queries; PosBool(X) and Sorp(X) being Figure 1: Taxonomy of the semirings we can use for respectively the free distributive lattice and the free 0-closed graph provenance along with their key properties semiring. and/or associated algorithms. In order to permit investigations for the links between the two instances of the provenance concept, we describe how to translate our database and our query into a Datalog query. Graph database. Given a graph database G = (V, E, w) how they allowed us to answer our research questions and how we can encode it into an edb. Let ⌃ be the alphabet for edge they can be classified according to the four main approaches labels, create |⌃| binary relations {R | 2 ⌃} and populate discussed in the introduction. them with facts corresponding to existing labeled edges: for each e = (u, v) 2 E, fact Rw(e) (u, v) holds. We tag these A taxonomy of semirings. As a first task, to obtain a facts with the same provenance indication as for the edges. better understanding of the domain of semiring-based prove- RPQ. We recursively convert an RPQ L into a (linear) nance, we classified all classes of semirings of interest, along Datalog query: with their key properties or the best-known algorithm for provenance-based routing in these semirings. We then pro- • If L = a, RL (x, y) Ra (x, y), duced a taxonomy of these semirings, that is graphically • If L = L1 [ L2 , RL (x, y) RL1 (x, y) and RL (x, y) represented as Figure 1. RL2 (x, y), 0-closed semirings. Our first concern was to overcome the • If L = L1 · L2 , RL (x, z) RL1 (x, y), RL2 (y, z), need of a natural total order to ensure correctness of Dijk- • If L = L⇤1 , RL (x, z) RL (x, y), RL1 (y, z) and RL (x, x) . stra’s algorithm. In [18], we gave an example where Dijkstra’s The size of the resulting (idb) program is linear in the algorithm fails when the semiring is 0-closed but not totally size of the RPQ and linear (for the edb predicates) in the ordered: the semiring of natural numbers with greatest- database instance. common-divisor and least-common-multiple as semiring op- erators. For this class of 0-closed semirings, already a strong restriction, we do not know of any algorithm more efficient 4. ROADMAP for the single-pair problem than either node elimination Our PhD research started at the beginning of September (cubic time) or Mohri’s algorithm (exponential in theory, 2018 and is expected to last three years. As such, we are relatively efficient in practice). This is a huge gap with the still at a very preliminary stage of our research. situation of 0-closed totally ordered semirings for which Di- Our initial observations opened new research directions jkstra’s algorithm is polynomial-time. Our investigations led we would like to investigate. us to consider the case of 0-closed multiplicatively idempo- tent semirings (0-closed semirings in which multiplication is Comparison with Datalog queries. After observing we idempotent). It turns out these are equivalent to bounded can translate RPQs against graph databases into Datalog distributive lattices [7]. A prominent example of such semir- programs over relational encodings of these graphs, we would ing in the database context is the PosBool(X) provenance now like to compare the expressive power and efficiency of semiring [12]. algorithms for computing the provenance of RPQs on graphs Relying on the rich mathematical theory behind lattices, to that for computing provenance of Datalog queries over we were able to design a parameterized algorithm based relational databases. We are also interested in investigating a on Dijkstra’s algorithm answering single-source provenance reverse translation, to determine which fragment of Datalog with a parameterized number of calls of Dijkstra’s algorithm. can be translated back into our model, in order to derive We consider as a parameter the width of the chain de- complexity bounds and obtain insights on the expressive composition of the join-irreducible elements of the lattice power of our framework. [20]. Intuitively, each element of the lattice can be uniquely Another to way to benefit from this observation is to represented as a combination of join-irreducible elements. compare actual optimizations for Datalog queries based on Considering a chain decomposition of these elements allow the analysis of derivation trees [9] with ours: Dijkstra’s for applying Dijkstra independently for each component of algorithm and extended Dijkstra’s algorithm for distributive the product as they all are totally ordered. lattices. 3 Adapting current state-of-the-art routing algorithms. [2] I. Abraham, A. Fiat, A. V. Goldberg, and R. F. We want to investigate how we can adapt routing algorithms Werneck. Highway dimension, shortest paths, and introduced in Section 2 within our framework. One major provably efficient algorithms. In SODA, pages 782–793. challenge is that these routing algorithms rely on the assump- 2010. tion that a small number of nodes or junctions concentrate [3] P. Barceló Baeza. Querying graph databases. In PODS, most of the long-distance traffic. This cannot be applied to pages 175–188, 2013. the computation of provenance in arbitrary semirings: for [4] H. Bast, D. Delling, A. V. Goldberg, example, in the counting semiring, we need to be able to M. Müller-Hannemann, T. Pajor, P. Sanders, count all paths between two nodes, which seems to require D. Wagner, and R. F. Werneck. Route planning in to explore all these paths. A natural question is then: does transportation networks. CoRR, abs/1504.05140, 2015. there exist specific semirings for which the assumption holds, [5] H. Bast, S. Funke, and D. Matijević. Ultrafast and for which we can apply these algorithms? shortest-path queries via transit nodes. In The shortest path problem: Ninth DIMACS implementation Lower bounds. Finally, we would like to address the prob- challenge, pages 175–192. 2006. lem of finding lower complexity bounds for our general frame- [6] H. Bast, S. Funke, P. Sanders, and D. Schultes. Fast work, especially involving c-complete star semirings (resp., routing in road networks with transit nodes. Science, ⇤-semirings). For now, the complexities of the algorithms we 316:566, 2007. can use for all-pairs, single-source, and single-pair provenance [7] S. Bistarelli, U. Montanari, and F. Rossi. are the same. Thus we would like to know if computing the Semiring-based constraint satisfaction and optimization. provenance for one pair is indeed as difficult as computing JACM, 44(2):201–236, 1997. the full matrix asteration. These bounds are commonly [8] J. A. Brzozowski and E. J. McCluskey. Signal flow hard to prove but we could rely on already known (or sus- graph techniques for sequential circuit state diagrams. pected) bounds for the APP problem [21] or on some results IEEE Trans. Electronic Computers, EC-12(2):67–76, based on circuit complexity (such as Theorem 1 of [9]. As 1963. our background is not in (circuit)-complexity theory but in [9] D. Deutch, T. Milo, S. Roy, and V. Tannen. Circuits routing algorithms and database theory, this may require a for Datalog Provenance. In ICDT, pages 201–212, 2014. collaboration with specialists in this area. [10] R. Geisberger, P. Sanders, D. Schultes, and D. Delling. Contraction hierarchies: Faster and simpler hierarchical 5. CONCLUSION routing in road networks. In Experimental Algorithms, We gave an overview of a provenance-based framework pages 319–333, 2008. for routing queries, and of di↵erent approaches considered [11] R. Geisberger, P. Sanders, D. Schultes, and C. Vetter. to lower the complexity of query evaluation: considering Exact routing in large road networks using contraction the shape of the network, restricting the expressive power hierarchies. Transportation Science, 46(3):388–404, of provenance annotations, linking to other models, and 2012. adapting state-of-the-art routing algorithms. [12] T. J. Green, G. Karvounarakis, and V. Tannen. Our research has so far involved considering a new class Provenance semirings. In PODS, pages 31–40, 2007. of semirings together with an e↵ective algorithm. We have [13] R. Gutman. Reach-based routing: A new approach to observed similarities with semirings for which optimizations shortest path algorithms optimized for road networks. of Datalog provenance computation have been proposed, In ALENEX, pages 100–111, 2004. leading us to consider translation of our queries into Datalog [14] D. J. Lehmann. Algebraic structures for transitive programs for further investigations. closure. TCS, 4(1):59–76, 1977. We then have exposed intended directions for our PhD [15] S. Maniu, P. Senellart, and S. Jog. An experimental work, pursuing our current research, and finally combining study of the treewidth of real-world graph data. In them to obtain theoretical and practical results applicable ICDT, pages 12:1–12:18, 2019. to real-world transportation networks. [16] M. Mohri. Semiring frameworks and algorithms for As the aim of our PhD is to o↵er a strong theoretical shortest-distance problems. J. Autom. Lang. Comb., foundation for an eventual implementation of a provenance 7(3):321–350, 2002. aware query optimizer/processor in a graph database system, we conclude with a final note concerning how to adapt such an [17] D. Peleg. Proximity-preserving labeling schemes. J. existing system to support provenance. One way to proceed is Graph Theory, 33(3):167–176, 2000. to dynamically rewrite queries to make them process auxiliary [18] Y. Ramusat, S. Maniu, and P. Senellart. Semiring data carrying provenance information; this idea has been provenance over graph databases. In TaPP, 2018. successfully applied in the context of a relational database [19] P. Senellart, L. Jachiet, S. Maniu, and Y. Ramusat. system [19]. The major drawback of this approach is that Provsql: Provenance and probability management in most of the optimization strategies are no longer applicable, postgresql. In Proc. VLDB, pages 2034–2037, Rio de thus leading to a non-negligible computation time overhead; Janeiro, Brazil, Aug. 2018. Demonstration. this would be the bulk of our system-focused research. [20] M. Siggers. On the representation of finite distributive lattices. arXiv [math], abs/1412.0011, 2014. 6. REFERENCES [21] R. Williams. Faster all-pairs shortest paths via circuit complexity. CoRR, abs/1312.6680, 2013. [1] S. K. Abdali. Parallel computations in *-semirings. [22] R. Williams. Faster all-pairs shortest paths via circuit Computational Algebra, pages 1–16, 1994. complexity. In STOC, pages 664–673, 2014. 4