=Paper=
{{Paper
|id=Vol-2399/paper08
|storemode=property
|title=Provenance-Based Routing in Probabilistic Graph Databases
|pdfUrl=https://ceur-ws.org/Vol-2399/paper08.pdf
|volume=Vol-2399
|authors=Yann Ramusat
|dblpUrl=https://dblp.org/rec/conf/vldb/Ramusat19
}}
==Provenance-Based Routing in Probabilistic Graph Databases==
ceur-ws.org/Vol-2399/paper08.pdf
Provenance-Based Routing
in Probabilistic Graph Databases
Yann Ramusat
DI ENS, ENS, CNRS, PSL University
& Inria
Paris, France
yann.ramusat@ens.fr
Supervised by Pierre Senellart and Silviu Maniu
ABSTRACT using a framework based on provenance annotations over
Optimizing routing queries over graphs is a rich research area graph databases.
with important applications, e.g., to road and transportation Graph databases are a common way to manage graph-like
networks. Thanks to progress made during past decades, data, with applications to social network analysis, trans-
current-day systems are able to compute paths across cities portation networks, or the Semantic Web. A notable graph
in continent-sized areas, paths that are optimal in terms of DBMS is Neo4j1 and its Cypher query language2 . These
distance or expected travel time. Nevertheless, the problem databases are typically queried using navigational queries, an
considered is quite limited, personal preferences cannot be abstraction for which are Regular Path Queries (RPQs) [3],
handled e↵ectively, and similar queries need to be computed which select pairs of vertices joined by a path whose label
separately. We explore a provenance-based framework as belongs to the regular language defined by the query.
a way to extend the expressive power of routing queries, In this PhD project, in order to incorporate additional in-
based on the idea of keeping track of meta-information about formation within a graph database, we enrich the graph with
query results. This framework, useful to deal with such provenance annotations. These annotations are propagated
aspects as uncertainty or preferences, cannot always benefit to query results, and can be used to determine how the result
of optimizations used for computing optimal routes, leading has been computed and how it reacts to slight changes in
to impractical algorithms. The aim of our PhD is to improve the initial database. A mathematically rich way to do this is
on routing techniques based on provenance to apply them to to choose provenance annotations to be elements of a semir-
real transportation networks. ing [12]. Semirings are well-suited to model computations
(e.g., choices and sequences) carried along in computations
PVLDB Reference Format: such as the shortest-path problem, which can be rephrased as
Yann Ramusat. Provenance-Based Routing in Probabilistic Graph solving equational systems over semirings. This framework
Databases. VLDB 2019 PhD Workshop, 2019. is expressive enough to deal with uncertainty and security
clearances, among many other applications.
We thus use as a basis of our framework to compute en-
1. CONTEXT riched routing queries: graph databases, navigational queries,
Progress made during past decades on optimization of and semiring-based provenance.
routing (i.e., computation of optimal paths) in road and Our past work [18] suggests that computing the semiring-
transportation networks led to current competing algorithms based provenance of navigational queries over graph databases
answering queries in milliseconds over continent-sized ar- results in high complexity. It is strongly believed we cannot
eas [4]. These algorithms exploit very specific properties of avoid a cubic (data)-complexity [22] for the general problem.
routing operations. Thus they tend to be very constrained To overcome this issue, we consider four main approaches to
and not able to handle the needs of real-life concerns, such as speed computations up.
uncertainty about traffic congestion or personal preferences
of the users. • Considering the shape of the network. It has
To extend the expressive power of routing queries and to been shown in [15] that transportation networks ex-
take into account this additional information, we propose hibit a relatively low treewidth. Exploiting this low
treewidth may help improve the running time of pre-
processing or query evaluation. It is worth noting these
ideas have already been applied to the routing problem
[10] without provenance.
• Restricting the expressive power of provenance
annotations. Considering subclasses of semirings can
lead to more optimized algorithms, thus yielding a
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los 1
Angeles, California. Copyright c 2019 for this paper by its authors. https://neo4j.com/
2
Copying permitted for private and academic purposes. http://www.opencypher.org/
1
trade-o↵ between expressive power and cost of compu- Provenance-aware routing algorithms. In [18] we gen-
tation. eralized three existing graph algorithms to compute the
provenance of regular path queries over graph databases.
• Linking to other models. Previous work has con- Each algorithm yields a di↵erent trade-o↵ between time com-
sidered optimizing the computation of provenance for plexity and generality, as each requires di↵erent properties
recursive query languages such as Datalog [12, 9]; re- over the semiring.
lating RPQs over graphs to these models may allow Together, these algorithms already cover a large class of
reusing these optimization techniques. semirings used for provenance (top-k, security, etc.).
Experimental results suggest these approaches are com-
• Adapting state-of-the-art routing algorithms. plementary and practical for various kinds of provenance
Standard routing algorithms are orders of magnitude indications, even on a relatively large transport network.
faster than provenance-aware routing algorithms be- In the following we do a brief review of them, their com-
cause they rely on very specific properties of routing plexity and the situations were they can be applied.
operations. It is worth determining whether they can • Dijkstra’s algorithm can be successfully applied for
be generalized for our purpose. computing single-source provenance. This algorithm
can be applied whenever we have 0-closed totally or-
Ultimately our PhD aims to o↵er a strong theoretical dered semirings (also known as bounded or absorptive
foundation for future systems supporting non-trivial real-life semirings) [18]. Under these restrictions we can still
routing applications. These four main directions need to compute security clearances.
be combined in an e↵ective way to allow designers of such
systems to apply the fastest algorithm possible without losing • Mohri’s algorithm [16] for the case when annotations
the capacity to handle the needs of their users. belong to k-closed semirings. This algorithm is ex-
The document is organized as follows: Section 2 discusses ponential in theory but experimental studies [16, 18]
in more detail the state of the art in routing and provenance- showed this algorithm is in fact practical in a real
aware routing algorithms. We then present in Section 3 context. Under these restrictions we can compute for
some preliminary results based on some improvements of example top-k shortest-paths and top-k distinct
already known algorithms for specific classes of semirings shortest-paths. Note these restrictions are weaker
(bounded semirings, distributive lattices, etc.). This leads us as for Dijkstra so we can still apply this algorithm for
to Section 4 where we present a roadmap for the next two security clearances.
years and a half of our PhD research. • The last algorithm is inspired by the node elimination
method to obtain the language recognized by a finite
2. STATE OF THE ART state automaton [8]. Notice that this problem can
be expressed inside our framework using the semiring
We will focus on two di↵erent areas of the research litera- of formal languages. The provenance for one pair of
ture: one for the algorithms for the (simple) routing problem nodes (x, y) is precisely the language recognized by the
and one dedicated to algorithms for provenance-based routing automaton with x as initial state and y as final state.
in the context of our framework. We can then compute single-pair provenance without
any restriction on the underlying semiring.
Routing algorithms. We provide an overview of the current
techniques in routing we think ready to be generalized to In introduction we were referring to semirings as being
our settings. Mostly, current competing algorithms rely well-suited structures to model several problem classes in
either on the inherent hierarchy of the network (hierarchical computer sciences. These problems can be unified using
techniques) or on the precomputation of distances between matrices over ⇤-semirings (also known as closed semirings),
well-chosen pairs of vertices (bounded-hop techniques) [4]. semirings with a star operation. The theory of matrices over
Hierarchical techniques are based on the observation a ⇤-semirings [14, 1] exhibit number of similarities with linear
subnetwork of important roads (such as highways) concen- algebra. All-pairs graph provenance is then equivalent to the
trate the traffic between sufficiently far away cities. This computation of the asteration of the matrix corresponding to
allows to scan few vertices for long distance queries. Two the graph representation with provenance tags as cell-values.
major algorithms belong to this framework: Contraction Notice the graph provenance is in fact an (infinite) sum over
Hierarchies [11] and Reach [13]. the provenances of all paths from the source node leading to
On the other side, bounded-hop techniques such as Label- the target node. Previous algorithms work over ⇤-semirings.
ing algorithms [17] or Transit Nodes Routing [5, 6] permit Infinite sums can only appear because of circles in our graphs,
to answer queries based on a virtual network composed of so we only need to be able to sum all the powers of a given
precomputed shortcuts and limiting to few-hop paths. values. To have a result semantically correct we need to
P
1
These two kinds of techniques can rely on each other, for ensure that a⇤ = an and that associativity and distribu-
example the choice for hops in the Labeling algorithm can n=0
be done based on the most important nodes discovered by tivity extends to these infinite sums. This class of semirings
the first step of a contraction hierarchy [4]. is commonly known as countably complete star semirings, or
In [2] a new measure of the network (the Highway dimen- c-complete star semirings.
sion) has been introduced. This measure provides a way
to justify from a theoretical point of view the (sublinear) 3. PRELIMINARY RESULTS
complexity of most of these techniques, dramatically helping We now briefly present our preliminary results, achieved
our understanding of these techniques. during the first months of our PhD research. We emphasize
2
semirings Theorem 1. Let L be a fixed distributive lattice, with
⇤-semirings a chain decomposition of its join-irreducible elements of
matrix asteration width w. Single-source provenance of an RPQ over a graph
database of n nodes and m edges with annotations in L can
e
et
C
be computed using w applications of Dijkstra’s algorithm.
pl
on
m
w
co
ay
This results in a complexity for the whole computation of
c-
c-complete star semirings
graph provenance O(w ⇥ (m + n log n)), assuming semiring operations in L
k-closed take constant time.
Mohri’s alg.
0-closed
absorptivity Links with Datalog queries. Recently, it has been shown
we can make use of circuits to represent provenance for
0-closed
total order
Datalog queries in a much more efficient way [9]. It is worth
Dijkstra alg. noting that, in this framework, distributive lattices and 0-
closed semirings are also distinguished classes of semirings
for optimizing queries; PosBool(X) and Sorp(X) being
Figure 1: Taxonomy of the semirings we can use for respectively the free distributive lattice and the free 0-closed
graph provenance along with their key properties semiring.
and/or associated algorithms. In order to permit investigations for the links between the
two instances of the provenance concept, we describe how to
translate our database and our query into a Datalog query.
Graph database. Given a graph database G = (V, E, w)
how they allowed us to answer our research questions and how
we can encode it into an edb. Let ⌃ be the alphabet for edge
they can be classified according to the four main approaches
labels, create |⌃| binary relations {R | 2 ⌃} and populate
discussed in the introduction.
them with facts corresponding to existing labeled edges: for
each e = (u, v) 2 E, fact Rw(e) (u, v) holds. We tag these
A taxonomy of semirings. As a first task, to obtain a facts with the same provenance indication as for the edges.
better understanding of the domain of semiring-based prove- RPQ. We recursively convert an RPQ L into a (linear)
nance, we classified all classes of semirings of interest, along Datalog query:
with their key properties or the best-known algorithm for
provenance-based routing in these semirings. We then pro- • If L = a, RL (x, y) Ra (x, y),
duced a taxonomy of these semirings, that is graphically • If L = L1 [ L2 , RL (x, y) RL1 (x, y) and RL (x, y)
represented as Figure 1. RL2 (x, y),
0-closed semirings. Our first concern was to overcome the • If L = L1 · L2 , RL (x, z) RL1 (x, y), RL2 (y, z),
need of a natural total order to ensure correctness of Dijk- • If L = L⇤1 , RL (x, z) RL (x, y), RL1 (y, z) and RL (x, x) .
stra’s algorithm. In [18], we gave an example where Dijkstra’s
The size of the resulting (idb) program is linear in the
algorithm fails when the semiring is 0-closed but not totally
size of the RPQ and linear (for the edb predicates) in the
ordered: the semiring of natural numbers with greatest-
database instance.
common-divisor and least-common-multiple as semiring op-
erators. For this class of 0-closed semirings, already a strong
restriction, we do not know of any algorithm more efficient 4. ROADMAP
for the single-pair problem than either node elimination Our PhD research started at the beginning of September
(cubic time) or Mohri’s algorithm (exponential in theory, 2018 and is expected to last three years. As such, we are
relatively efficient in practice). This is a huge gap with the still at a very preliminary stage of our research.
situation of 0-closed totally ordered semirings for which Di- Our initial observations opened new research directions
jkstra’s algorithm is polynomial-time. Our investigations led we would like to investigate.
us to consider the case of 0-closed multiplicatively idempo-
tent semirings (0-closed semirings in which multiplication is Comparison with Datalog queries. After observing we
idempotent). It turns out these are equivalent to bounded can translate RPQs against graph databases into Datalog
distributive lattices [7]. A prominent example of such semir- programs over relational encodings of these graphs, we would
ing in the database context is the PosBool(X) provenance now like to compare the expressive power and efficiency of
semiring [12]. algorithms for computing the provenance of RPQs on graphs
Relying on the rich mathematical theory behind lattices, to that for computing provenance of Datalog queries over
we were able to design a parameterized algorithm based relational databases. We are also interested in investigating a
on Dijkstra’s algorithm answering single-source provenance reverse translation, to determine which fragment of Datalog
with a parameterized number of calls of Dijkstra’s algorithm. can be translated back into our model, in order to derive
We consider as a parameter the width of the chain de- complexity bounds and obtain insights on the expressive
composition of the join-irreducible elements of the lattice power of our framework.
[20]. Intuitively, each element of the lattice can be uniquely Another to way to benefit from this observation is to
represented as a combination of join-irreducible elements. compare actual optimizations for Datalog queries based on
Considering a chain decomposition of these elements allow the analysis of derivation trees [9] with ours: Dijkstra’s
for applying Dijkstra independently for each component of algorithm and extended Dijkstra’s algorithm for distributive
the product as they all are totally ordered. lattices.
3
Adapting current state-of-the-art routing algorithms. [2] I. Abraham, A. Fiat, A. V. Goldberg, and R. F.
We want to investigate how we can adapt routing algorithms Werneck. Highway dimension, shortest paths, and
introduced in Section 2 within our framework. One major provably efficient algorithms. In SODA, pages 782–793.
challenge is that these routing algorithms rely on the assump- 2010.
tion that a small number of nodes or junctions concentrate [3] P. Barceló Baeza. Querying graph databases. In PODS,
most of the long-distance traffic. This cannot be applied to pages 175–188, 2013.
the computation of provenance in arbitrary semirings: for [4] H. Bast, D. Delling, A. V. Goldberg,
example, in the counting semiring, we need to be able to M. Müller-Hannemann, T. Pajor, P. Sanders,
count all paths between two nodes, which seems to require D. Wagner, and R. F. Werneck. Route planning in
to explore all these paths. A natural question is then: does transportation networks. CoRR, abs/1504.05140, 2015.
there exist specific semirings for which the assumption holds, [5] H. Bast, S. Funke, and D. Matijević. Ultrafast
and for which we can apply these algorithms? shortest-path queries via transit nodes. In The shortest
path problem: Ninth DIMACS implementation
Lower bounds. Finally, we would like to address the prob- challenge, pages 175–192. 2006.
lem of finding lower complexity bounds for our general frame- [6] H. Bast, S. Funke, P. Sanders, and D. Schultes. Fast
work, especially involving c-complete star semirings (resp., routing in road networks with transit nodes. Science,
⇤-semirings). For now, the complexities of the algorithms we 316:566, 2007.
can use for all-pairs, single-source, and single-pair provenance [7] S. Bistarelli, U. Montanari, and F. Rossi.
are the same. Thus we would like to know if computing the Semiring-based constraint satisfaction and optimization.
provenance for one pair is indeed as difficult as computing JACM, 44(2):201–236, 1997.
the full matrix asteration. These bounds are commonly [8] J. A. Brzozowski and E. J. McCluskey. Signal flow
hard to prove but we could rely on already known (or sus- graph techniques for sequential circuit state diagrams.
pected) bounds for the APP problem [21] or on some results IEEE Trans. Electronic Computers, EC-12(2):67–76,
based on circuit complexity (such as Theorem 1 of [9]. As 1963.
our background is not in (circuit)-complexity theory but in [9] D. Deutch, T. Milo, S. Roy, and V. Tannen. Circuits
routing algorithms and database theory, this may require a for Datalog Provenance. In ICDT, pages 201–212, 2014.
collaboration with specialists in this area.
[10] R. Geisberger, P. Sanders, D. Schultes, and D. Delling.
Contraction hierarchies: Faster and simpler hierarchical
5. CONCLUSION routing in road networks. In Experimental Algorithms,
We gave an overview of a provenance-based framework pages 319–333, 2008.
for routing queries, and of di↵erent approaches considered [11] R. Geisberger, P. Sanders, D. Schultes, and C. Vetter.
to lower the complexity of query evaluation: considering Exact routing in large road networks using contraction
the shape of the network, restricting the expressive power hierarchies. Transportation Science, 46(3):388–404,
of provenance annotations, linking to other models, and 2012.
adapting state-of-the-art routing algorithms. [12] T. J. Green, G. Karvounarakis, and V. Tannen.
Our research has so far involved considering a new class Provenance semirings. In PODS, pages 31–40, 2007.
of semirings together with an e↵ective algorithm. We have [13] R. Gutman. Reach-based routing: A new approach to
observed similarities with semirings for which optimizations shortest path algorithms optimized for road networks.
of Datalog provenance computation have been proposed, In ALENEX, pages 100–111, 2004.
leading us to consider translation of our queries into Datalog [14] D. J. Lehmann. Algebraic structures for transitive
programs for further investigations. closure. TCS, 4(1):59–76, 1977.
We then have exposed intended directions for our PhD [15] S. Maniu, P. Senellart, and S. Jog. An experimental
work, pursuing our current research, and finally combining study of the treewidth of real-world graph data. In
them to obtain theoretical and practical results applicable ICDT, pages 12:1–12:18, 2019.
to real-world transportation networks.
[16] M. Mohri. Semiring frameworks and algorithms for
As the aim of our PhD is to o↵er a strong theoretical
shortest-distance problems. J. Autom. Lang. Comb.,
foundation for an eventual implementation of a provenance
7(3):321–350, 2002.
aware query optimizer/processor in a graph database system,
we conclude with a final note concerning how to adapt such an [17] D. Peleg. Proximity-preserving labeling schemes. J.
existing system to support provenance. One way to proceed is Graph Theory, 33(3):167–176, 2000.
to dynamically rewrite queries to make them process auxiliary [18] Y. Ramusat, S. Maniu, and P. Senellart. Semiring
data carrying provenance information; this idea has been provenance over graph databases. In TaPP, 2018.
successfully applied in the context of a relational database [19] P. Senellart, L. Jachiet, S. Maniu, and Y. Ramusat.
system [19]. The major drawback of this approach is that Provsql: Provenance and probability management in
most of the optimization strategies are no longer applicable, postgresql. In Proc. VLDB, pages 2034–2037, Rio de
thus leading to a non-negligible computation time overhead; Janeiro, Brazil, Aug. 2018. Demonstration.
this would be the bulk of our system-focused research. [20] M. Siggers. On the representation of finite distributive
lattices. arXiv [math], abs/1412.0011, 2014.
6. REFERENCES [21] R. Williams. Faster all-pairs shortest paths via circuit
complexity. CoRR, abs/1312.6680, 2013.
[1] S. K. Abdali. Parallel computations in *-semirings.
[22] R. Williams. Faster all-pairs shortest paths via circuit
Computational Algebra, pages 1–16, 1994.
complexity. In STOC, pages 664–673, 2014.
4