=Paper= {{Paper |id=Vol-2399/paper08 |storemode=property |title=Provenance-Based Routing in Probabilistic Graph Databases |pdfUrl=https://ceur-ws.org/Vol-2399/paper08.pdf |volume=Vol-2399 |authors=Yann Ramusat |dblpUrl=https://dblp.org/rec/conf/vldb/Ramusat19 }} ==Provenance-Based Routing in Probabilistic Graph Databases== https://ceur-ws.org/Vol-2399/paper08.pdf
                                                                                                            ceur-ws.org/Vol-2399/paper08.pdf




                                  Provenance-Based Routing
                               in Probabilistic Graph Databases

                                                            Yann Ramusat
                                                DI ENS, ENS, CNRS, PSL University
                                                             & Inria
                                                          Paris, France
                                                       yann.ramusat@ens.fr
                                         Supervised by Pierre Senellart and Silviu Maniu


ABSTRACT                                                                   using a framework based on provenance annotations over
Optimizing routing queries over graphs is a rich research area             graph databases.
with important applications, e.g., to road and transportation                 Graph databases are a common way to manage graph-like
networks. Thanks to progress made during past decades,                     data, with applications to social network analysis, trans-
current-day systems are able to compute paths across cities                portation networks, or the Semantic Web. A notable graph
in continent-sized areas, paths that are optimal in terms of               DBMS is Neo4j1 and its Cypher query language2 . These
distance or expected travel time. Nevertheless, the problem                databases are typically queried using navigational queries, an
considered is quite limited, personal preferences cannot be                abstraction for which are Regular Path Queries (RPQs) [3],
handled e↵ectively, and similar queries need to be computed                which select pairs of vertices joined by a path whose label
separately. We explore a provenance-based framework as                     belongs to the regular language defined by the query.
a way to extend the expressive power of routing queries,                      In this PhD project, in order to incorporate additional in-
based on the idea of keeping track of meta-information about               formation within a graph database, we enrich the graph with
query results. This framework, useful to deal with such                    provenance annotations. These annotations are propagated
aspects as uncertainty or preferences, cannot always benefit               to query results, and can be used to determine how the result
of optimizations used for computing optimal routes, leading                has been computed and how it reacts to slight changes in
to impractical algorithms. The aim of our PhD is to improve                the initial database. A mathematically rich way to do this is
on routing techniques based on provenance to apply them to                 to choose provenance annotations to be elements of a semir-
real transportation networks.                                              ing [12]. Semirings are well-suited to model computations
                                                                           (e.g., choices and sequences) carried along in computations
PVLDB Reference Format:                                                    such as the shortest-path problem, which can be rephrased as
Yann Ramusat. Provenance-Based Routing in Probabilistic Graph              solving equational systems over semirings. This framework
Databases. VLDB 2019 PhD Workshop, 2019.                                   is expressive enough to deal with uncertainty and security
                                                                           clearances, among many other applications.
                                                                              We thus use as a basis of our framework to compute en-
1.   CONTEXT                                                               riched routing queries: graph databases, navigational queries,
   Progress made during past decades on optimization of                    and semiring-based provenance.
routing (i.e., computation of optimal paths) in road and                      Our past work [18] suggests that computing the semiring-
transportation networks led to current competing algorithms                based provenance of navigational queries over graph databases
answering queries in milliseconds over continent-sized ar-                 results in high complexity. It is strongly believed we cannot
eas [4]. These algorithms exploit very specific properties of              avoid a cubic (data)-complexity [22] for the general problem.
routing operations. Thus they tend to be very constrained                  To overcome this issue, we consider four main approaches to
and not able to handle the needs of real-life concerns, such as            speed computations up.
uncertainty about traffic congestion or personal preferences
of the users.                                                                   • Considering the shape of the network. It has
   To extend the expressive power of routing queries and to                       been shown in [15] that transportation networks ex-
take into account this additional information, we propose                         hibit a relatively low treewidth. Exploiting this low
                                                                                  treewidth may help improve the running time of pre-
                                                                                  processing or query evaluation. It is worth noting these
                                                                                  ideas have already been applied to the routing problem
                                                                                  [10] without provenance.
                                                                                • Restricting the expressive power of provenance
                                                                                  annotations. Considering subclasses of semirings can
                                                                                  lead to more optimized algorithms, thus yielding a
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los          1
Angeles, California. Copyright c 2019 for this paper by its authors.           https://neo4j.com/
                                                                           2
Copying permitted for private and academic purposes.                           http://www.opencypher.org/


                                                                       1
       trade-o↵ between expressive power and cost of compu-           Provenance-aware routing algorithms. In [18] we gen-
       tation.                                                        eralized three existing graph algorithms to compute the
                                                                      provenance of regular path queries over graph databases.
     • Linking to other models. Previous work has con-                Each algorithm yields a di↵erent trade-o↵ between time com-
       sidered optimizing the computation of provenance for           plexity and generality, as each requires di↵erent properties
       recursive query languages such as Datalog [12, 9]; re-         over the semiring.
       lating RPQs over graphs to these models may allow                Together, these algorithms already cover a large class of
       reusing these optimization techniques.                         semirings used for provenance (top-k, security, etc.).
                                                                        Experimental results suggest these approaches are com-
     • Adapting state-of-the-art routing algorithms.                  plementary and practical for various kinds of provenance
       Standard routing algorithms are orders of magnitude            indications, even on a relatively large transport network.
       faster than provenance-aware routing algorithms be-              In the following we do a brief review of them, their com-
       cause they rely on very specific properties of routing         plexity and the situations were they can be applied.
       operations. It is worth determining whether they can                • Dijkstra’s algorithm can be successfully applied for
       be generalized for our purpose.                                       computing single-source provenance. This algorithm
                                                                             can be applied whenever we have 0-closed totally or-
   Ultimately our PhD aims to o↵er a strong theoretical                      dered semirings (also known as bounded or absorptive
foundation for future systems supporting non-trivial real-life               semirings) [18]. Under these restrictions we can still
routing applications. These four main directions need to                     compute security clearances.
be combined in an e↵ective way to allow designers of such
systems to apply the fastest algorithm possible without losing             • Mohri’s algorithm [16] for the case when annotations
the capacity to handle the needs of their users.                             belong to k-closed semirings. This algorithm is ex-
   The document is organized as follows: Section 2 discusses                 ponential in theory but experimental studies [16, 18]
in more detail the state of the art in routing and provenance-               showed this algorithm is in fact practical in a real
aware routing algorithms. We then present in Section 3                       context. Under these restrictions we can compute for
some preliminary results based on some improvements of                       example top-k shortest-paths and top-k distinct
already known algorithms for specific classes of semirings                   shortest-paths. Note these restrictions are weaker
(bounded semirings, distributive lattices, etc.). This leads us              as for Dijkstra so we can still apply this algorithm for
to Section 4 where we present a roadmap for the next two                     security clearances.
years and a half of our PhD research.                                      • The last algorithm is inspired by the node elimination
                                                                             method to obtain the language recognized by a finite
2.    STATE OF THE ART                                                       state automaton [8]. Notice that this problem can
                                                                             be expressed inside our framework using the semiring
  We will focus on two di↵erent areas of the research litera-                of formal languages. The provenance for one pair of
ture: one for the algorithms for the (simple) routing problem                nodes (x, y) is precisely the language recognized by the
and one dedicated to algorithms for provenance-based routing                 automaton with x as initial state and y as final state.
in the context of our framework.                                             We can then compute single-pair provenance without
                                                                             any restriction on the underlying semiring.
Routing algorithms. We provide an overview of the current
techniques in routing we think ready to be generalized to               In introduction we were referring to semirings as being
our settings. Mostly, current competing algorithms rely               well-suited structures to model several problem classes in
either on the inherent hierarchy of the network (hierarchical         computer sciences. These problems can be unified using
techniques) or on the precomputation of distances between             matrices over ⇤-semirings (also known as closed semirings),
well-chosen pairs of vertices (bounded-hop techniques) [4].           semirings with a star operation. The theory of matrices over
   Hierarchical techniques are based on the observation a             ⇤-semirings [14, 1] exhibit number of similarities with linear
subnetwork of important roads (such as highways) concen-              algebra. All-pairs graph provenance is then equivalent to the
trate the traffic between sufficiently far away cities. This          computation of the asteration of the matrix corresponding to
allows to scan few vertices for long distance queries. Two            the graph representation with provenance tags as cell-values.
major algorithms belong to this framework: Contraction                  Notice the graph provenance is in fact an (infinite) sum over
Hierarchies [11] and Reach [13].                                      the provenances of all paths from the source node leading to
   On the other side, bounded-hop techniques such as Label-           the target node. Previous algorithms work over ⇤-semirings.
ing algorithms [17] or Transit Nodes Routing [5, 6] permit            Infinite sums can only appear because of circles in our graphs,
to answer queries based on a virtual network composed of              so we only need to be able to sum all the powers of a given
precomputed shortcuts and limiting to few-hop paths.                  values. To have a result semantically correct we need to
                                                                                         P
                                                                                         1
   These two kinds of techniques can rely on each other, for          ensure that a⇤ =      an and that associativity and distribu-
example the choice for hops in the Labeling algorithm can                                n=0

be done based on the most important nodes discovered by               tivity extends to these infinite sums. This class of semirings
the first step of a contraction hierarchy [4].                        is commonly known as countably complete star semirings, or
   In [2] a new measure of the network (the Highway dimen-            c-complete star semirings.
sion) has been introduced. This measure provides a way
to justify from a theoretical point of view the (sublinear)           3.    PRELIMINARY RESULTS
complexity of most of these techniques, dramatically helping            We now briefly present our preliminary results, achieved
our understanding of these techniques.                                during the first months of our PhD research. We emphasize


                                                                  2
                                   semirings                              Theorem 1. Let L be a fixed distributive lattice, with
                                  ⇤-semirings                          a chain decomposition of its join-irreducible elements of
                                matrix asteration                      width w. Single-source provenance of an RPQ over a graph
                                                                       database of n nodes and m edges with annotations in L can




                            e
                         et




                                                    C
                                                                       be computed using w applications of Dijkstra’s algorithm.


                        pl




                                                    on
                     m




                                                      w
                    co




                                                       ay
                                                                       This results in a complexity for the whole computation of
                   c-
                         c-complete star semirings
                             graph provenance                          O(w ⇥ (m + n log n)), assuming semiring operations in L
                                   k-closed                            take constant time.
                                  Mohri’s alg.

                                    0-closed
                                  absorptivity                         Links with Datalog queries. Recently, it has been shown
                                                                       we can make use of circuits to represent provenance for
                                    0-closed
                                  total order
                                                                       Datalog queries in a much more efficient way [9]. It is worth
                                  Dijkstra alg.                        noting that, in this framework, distributive lattices and 0-
                                                                       closed semirings are also distinguished classes of semirings
                                                                       for optimizing queries; PosBool(X) and Sorp(X) being
Figure 1: Taxonomy of the semirings we can use for                     respectively the free distributive lattice and the free 0-closed
graph provenance along with their key properties                       semiring.
and/or associated algorithms.                                             In order to permit investigations for the links between the
                                                                       two instances of the provenance concept, we describe how to
                                                                       translate our database and our query into a Datalog query.
                                                                          Graph database. Given a graph database G = (V, E, w)
how they allowed us to answer our research questions and how
                                                                       we can encode it into an edb. Let ⌃ be the alphabet for edge
they can be classified according to the four main approaches
                                                                       labels, create |⌃| binary relations {R | 2 ⌃} and populate
discussed in the introduction.
                                                                       them with facts corresponding to existing labeled edges: for
                                                                       each e = (u, v) 2 E, fact Rw(e) (u, v) holds. We tag these
A taxonomy of semirings. As a first task, to obtain a                  facts with the same provenance indication as for the edges.
better understanding of the domain of semiring-based prove-               RPQ. We recursively convert an RPQ L into a (linear)
nance, we classified all classes of semirings of interest, along       Datalog query:
with their key properties or the best-known algorithm for
provenance-based routing in these semirings. We then pro-                   • If L = a, RL (x, y)        Ra (x, y),
duced a taxonomy of these semirings, that is graphically                    • If L = L1 [ L2 , RL (x, y)        RL1 (x, y) and RL (x, y)
represented as Figure 1.                                                      RL2 (x, y),

0-closed semirings. Our first concern was to overcome the                   • If L = L1 · L2 , RL (x, z)       RL1 (x, y), RL2 (y, z),
need of a natural total order to ensure correctness of Dijk-                •   If L = L⇤1 , RL (x, z)   RL (x, y), RL1 (y, z) and RL (x, x)   .
stra’s algorithm. In [18], we gave an example where Dijkstra’s
                                                                          The size of the resulting (idb) program is linear in the
algorithm fails when the semiring is 0-closed but not totally
                                                                       size of the RPQ and linear (for the edb predicates) in the
ordered: the semiring of natural numbers with greatest-
                                                                       database instance.
common-divisor and least-common-multiple as semiring op-
erators. For this class of 0-closed semirings, already a strong
restriction, we do not know of any algorithm more efficient            4.       ROADMAP
for the single-pair problem than either node elimination                  Our PhD research started at the beginning of September
(cubic time) or Mohri’s algorithm (exponential in theory,              2018 and is expected to last three years. As such, we are
relatively efficient in practice). This is a huge gap with the         still at a very preliminary stage of our research.
situation of 0-closed totally ordered semirings for which Di-             Our initial observations opened new research directions
jkstra’s algorithm is polynomial-time. Our investigations led          we would like to investigate.
us to consider the case of 0-closed multiplicatively idempo-
tent semirings (0-closed semirings in which multiplication is          Comparison with Datalog queries. After observing we
idempotent). It turns out these are equivalent to bounded              can translate RPQs against graph databases into Datalog
distributive lattices [7]. A prominent example of such semir-          programs over relational encodings of these graphs, we would
ing in the database context is the PosBool(X) provenance               now like to compare the expressive power and efficiency of
semiring [12].                                                         algorithms for computing the provenance of RPQs on graphs
   Relying on the rich mathematical theory behind lattices,            to that for computing provenance of Datalog queries over
we were able to design a parameterized algorithm based                 relational databases. We are also interested in investigating a
on Dijkstra’s algorithm answering single-source provenance             reverse translation, to determine which fragment of Datalog
with a parameterized number of calls of Dijkstra’s algorithm.          can be translated back into our model, in order to derive
We consider as a parameter the width of the chain de-                  complexity bounds and obtain insights on the expressive
composition of the join-irreducible elements of the lattice            power of our framework.
[20]. Intuitively, each element of the lattice can be uniquely            Another to way to benefit from this observation is to
represented as a combination of join-irreducible elements.             compare actual optimizations for Datalog queries based on
Considering a chain decomposition of these elements allow              the analysis of derivation trees [9] with ours: Dijkstra’s
for applying Dijkstra independently for each component of              algorithm and extended Dijkstra’s algorithm for distributive
the product as they all are totally ordered.                           lattices.


                                                                   3
Adapting current state-of-the-art routing algorithms.                   [2] I. Abraham, A. Fiat, A. V. Goldberg, and R. F.
We want to investigate how we can adapt routing algorithms                  Werneck. Highway dimension, shortest paths, and
introduced in Section 2 within our framework. One major                     provably efficient algorithms. In SODA, pages 782–793.
challenge is that these routing algorithms rely on the assump-              2010.
tion that a small number of nodes or junctions concentrate              [3] P. Barceló Baeza. Querying graph databases. In PODS,
most of the long-distance traffic. This cannot be applied to                pages 175–188, 2013.
the computation of provenance in arbitrary semirings: for               [4] H. Bast, D. Delling, A. V. Goldberg,
example, in the counting semiring, we need to be able to                    M. Müller-Hannemann, T. Pajor, P. Sanders,
count all paths between two nodes, which seems to require                   D. Wagner, and R. F. Werneck. Route planning in
to explore all these paths. A natural question is then: does                transportation networks. CoRR, abs/1504.05140, 2015.
there exist specific semirings for which the assumption holds,          [5] H. Bast, S. Funke, and D. Matijević. Ultrafast
and for which we can apply these algorithms?                                shortest-path queries via transit nodes. In The shortest
                                                                            path problem: Ninth DIMACS implementation
Lower bounds. Finally, we would like to address the prob-                   challenge, pages 175–192. 2006.
lem of finding lower complexity bounds for our general frame-           [6] H. Bast, S. Funke, P. Sanders, and D. Schultes. Fast
work, especially involving c-complete star semirings (resp.,                routing in road networks with transit nodes. Science,
⇤-semirings). For now, the complexities of the algorithms we                316:566, 2007.
can use for all-pairs, single-source, and single-pair provenance        [7] S. Bistarelli, U. Montanari, and F. Rossi.
are the same. Thus we would like to know if computing the                   Semiring-based constraint satisfaction and optimization.
provenance for one pair is indeed as difficult as computing                 JACM, 44(2):201–236, 1997.
the full matrix asteration. These bounds are commonly                   [8] J. A. Brzozowski and E. J. McCluskey. Signal flow
hard to prove but we could rely on already known (or sus-                   graph techniques for sequential circuit state diagrams.
pected) bounds for the APP problem [21] or on some results                  IEEE Trans. Electronic Computers, EC-12(2):67–76,
based on circuit complexity (such as Theorem 1 of [9]. As                   1963.
our background is not in (circuit)-complexity theory but in             [9] D. Deutch, T. Milo, S. Roy, and V. Tannen. Circuits
routing algorithms and database theory, this may require a                  for Datalog Provenance. In ICDT, pages 201–212, 2014.
collaboration with specialists in this area.
                                                                       [10] R. Geisberger, P. Sanders, D. Schultes, and D. Delling.
                                                                            Contraction hierarchies: Faster and simpler hierarchical
5.   CONCLUSION                                                             routing in road networks. In Experimental Algorithms,
   We gave an overview of a provenance-based framework                      pages 319–333, 2008.
for routing queries, and of di↵erent approaches considered             [11] R. Geisberger, P. Sanders, D. Schultes, and C. Vetter.
to lower the complexity of query evaluation: considering                    Exact routing in large road networks using contraction
the shape of the network, restricting the expressive power                  hierarchies. Transportation Science, 46(3):388–404,
of provenance annotations, linking to other models, and                     2012.
adapting state-of-the-art routing algorithms.                          [12] T. J. Green, G. Karvounarakis, and V. Tannen.
   Our research has so far involved considering a new class                 Provenance semirings. In PODS, pages 31–40, 2007.
of semirings together with an e↵ective algorithm. We have              [13] R. Gutman. Reach-based routing: A new approach to
observed similarities with semirings for which optimizations                shortest path algorithms optimized for road networks.
of Datalog provenance computation have been proposed,                       In ALENEX, pages 100–111, 2004.
leading us to consider translation of our queries into Datalog         [14] D. J. Lehmann. Algebraic structures for transitive
programs for further investigations.                                        closure. TCS, 4(1):59–76, 1977.
   We then have exposed intended directions for our PhD                [15] S. Maniu, P. Senellart, and S. Jog. An experimental
work, pursuing our current research, and finally combining                  study of the treewidth of real-world graph data. In
them to obtain theoretical and practical results applicable                 ICDT, pages 12:1–12:18, 2019.
to real-world transportation networks.
                                                                       [16] M. Mohri. Semiring frameworks and algorithms for
   As the aim of our PhD is to o↵er a strong theoretical
                                                                            shortest-distance problems. J. Autom. Lang. Comb.,
foundation for an eventual implementation of a provenance
                                                                            7(3):321–350, 2002.
aware query optimizer/processor in a graph database system,
we conclude with a final note concerning how to adapt such an          [17] D. Peleg. Proximity-preserving labeling schemes. J.
existing system to support provenance. One way to proceed is                Graph Theory, 33(3):167–176, 2000.
to dynamically rewrite queries to make them process auxiliary          [18] Y. Ramusat, S. Maniu, and P. Senellart. Semiring
data carrying provenance information; this idea has been                    provenance over graph databases. In TaPP, 2018.
successfully applied in the context of a relational database           [19] P. Senellart, L. Jachiet, S. Maniu, and Y. Ramusat.
system [19]. The major drawback of this approach is that                    Provsql: Provenance and probability management in
most of the optimization strategies are no longer applicable,               postgresql. In Proc. VLDB, pages 2034–2037, Rio de
thus leading to a non-negligible computation time overhead;                 Janeiro, Brazil, Aug. 2018. Demonstration.
this would be the bulk of our system-focused research.                 [20] M. Siggers. On the representation of finite distributive
                                                                            lattices. arXiv [math], abs/1412.0011, 2014.
6.   REFERENCES                                                        [21] R. Williams. Faster all-pairs shortest paths via circuit
                                                                            complexity. CoRR, abs/1312.6680, 2013.
 [1] S. K. Abdali. Parallel computations in *-semirings.
                                                                       [22] R. Williams. Faster all-pairs shortest paths via circuit
     Computational Algebra, pages 1–16, 1994.
                                                                            complexity. In STOC, pages 664–673, 2014.



                                                                   4