Introduction

Semantics for querying paths in graph databases: No-repeated-node or no-repeated-edge?

Daniel Hernandez

Claudio Gutierrez

0 0 Computer Science Dept & CIWS, Universidad de Chile

Patterns for walks in graphs usually include regular expressions, that may result in loops, thus producing in nite solutions. To overcome this issue, practical query languages bound result sets to ensure niteness. On this paper we compare two bounding methods: edges are visited at most once; nodes are visited at most once. We show that despite these methods produce languages with di erent expressiveness, there is a duality, in the sense that the queries in one can be evaluated using the machinery used to evaluate the other.

Introduction

Most graph query languages support querying for walks (or paths) between nodes by using some fragment of regular path queries. Consider the expression ue v, where u; v are node and e is an edge; it can be considered a pattern that represents some type of query involving paths from u to v over edges with label e. The precise semantics for such query, though, has many possibilities. Some of them (see [ 1 ]) are the following: true if v is reachable from u through a path of edges with label e; the pairs of nodes (u; v) with the same previous condition; some of the walks from u to v ful lling the above condition; the graphs that are induced by such walks, etc. In this paper we study the problem of returning some walks.

An issue when returning walks is that there are possibly in nitely many when allowing loops. Thus, for practical concerns current implementations limit the number of walks returned by choosing the shortest ones; or choosing those that do not repeat nodes; or those that do not repeat edges; or choosing the k-top \best" walks.

In this paper we will study two strategies: not repeating nodes and not repeating edges. The rst is preferred in graph matching applications (see, e.g., [ 2 ]) where no nodes on the graph should be collapsed and was recently used in the context of property graph for pattern matching in social networks [3{5]. The second semantics is currently used by the Cypher query language [ 6 ]. Organization of the paper. In Sec. 2 we present the graph and query model. In Sec. 3 we compare the expressiveness of query languages induced by the two strategies and show how a machinery designed to evaluate queries in one language can be used for the other and viceversa. This is relevant to see when some languages can be used to simulate patterns of another.

Graph and query model

For our goals, it will be su cient to work over a simple graph data model without direction on edges, and without properties in nodes and edges.

We assume the existence of three in nite disjoint sets V, E, and X denoting respectively nodes, edges, and variables. A graph database G (a graph in that follows) is a set (V; E; ), where V V, E E, and : E ! [V ]2 is a function that maps each edge e 2 E to a subset fu; vg 2 [V ]2. Observe that according this model a graph database is a multigraph.

A join in a graph G, denoted as [aub], is an unordered pair of edges a; b with a common endpoint u. Think of it as the pair (u; (u)). Observe that a join is unordered (i.e., [aub] = [bua]).

A walk w in a graph G is an alternating sequence of nodes and edges v1e1 : : : en 1vn (i.e. beginning and ending in a node) such that (ej) = fvj; vj+1g. A connection c between two nodes u1; u2 in G is an alternating sequence of edges and nodes e1vn : : : vn 1en (i.e. beginning and ending with an edge) such that u1cu2 is a walk in G. A walk pattern P (a pattern in that follows) is a walk where some edges are replaced by variables. We will denote nodes and edges with lower case letters, and variables with upper case letters. For instance, uXveu is a pattern where u; v are nodes, e is an edge, and X is a variable.

A solution of the evaluation of a pattern P on a graph G is a mapping from the variables occurring in P to connections in G such that the result, denoted as (P ), is a walk obtained from replacing every variable X by h(X) in P . We denote the set of all solutions of P on G as JP KG. Formally, a solution in JP KG is an homomorphism from P to connections on G such that h(P ) is a walk in G. For example, consider the pattern P = uXv , and the graph G depicted by Fig. 1. Then, 1 : X 7! c, and 2 : X 7! awb are solutions in JP KG, because 1(P ) = uawbv and 2(P ) = ucv are a walks in G. 3

No-repeated-node versus no-repeated-edge semantics The semantics of patterns has problems when resulting sets may contain innitely many solutions. For instance, evaluating the pattern uXv over the graph G depicted in Fig. 1 produces in nitely many solutions because there are innitely many walks between u and v.

Most graph databases bound solution sets in some form [ 1 ]). We study two of the most common, based respectively in restricting solutions such that each node (respectively each edge) occurs at most once in (P ). We will denote them as JP jV KG (respectively JP jEKG), and we call them non-repeating-node and non-repeating-edge semantics [ 1 ].

It is easy to see that JP jV KG JP jEKG for every pattern P and a graph G. In fact, a walk w repeating and edge e with endpoints fu; vg has to repeat at least v or u. If w does not repeat u, then w contains veuev, thus w repeats v. The converse is not true: a counterexample is the walk ucvduev of the graph depicted in Fig. 1 that visits twice the nodes u and v without repeating edges. A corollary of this is that given an arbitrary pattern P , in general there does not exist a pattern P 0 such that JP jEKG = JP 0jV KG for every graph G. Furthermore, it is easy to see that this result is also true for the equality JP jV KG = JP 0jEKG. Hence, these two semantics induce languages with di erent expressiveness. Moreover, they are incomparable, that is, no one is more expressive than the other.

Despite both semantics induce incomparable languages, we will show that the result of one semantics can be obtained using the machinery that compute the results of the other.

We assume the existence of an injective function ( )d that maps every edge to a node, and every join to an edge. Also, given a graph G, we denote by G0 its dual graph: every node u0 in G0 correspond to an edge e in G such that u0 = ed, and every edge e0 in G0 correspond to a join [aub] in G such that e0 = [aub]d, and the endpoints of e0 in G0 are ad and bd. Then, we will write Gd to denote the result of adding two fresh nodes and ! to G0 and connecting them two every node in G0. Abusing notation, we will call Gd the dual of G. For example, the graph Gd depicted by Fig. 1 is the dual of G1.

The dual of a walk w = u1e1 : : : en 1un, denoted as wd, will be the walk [ u1e1]de1d : : : edn 1[en 1un!]d! in Gd.

Extending the notion of duality for patterns is more involving than for walks. Indeed, given the pattern uXv, then it is natural to assume that (uXv)d is the pattern [ uY1]dX[Y2v!]d!, where Y1 and Y2 are two variables that represent 1 For the sake of the readability, in Gd we denote ad and (awb)d simply as u and w, respectively, because the actual symbols can be inferred from the gure. edges holding that solutions of this dual pattern satisfy (X) = (Y1) : : : (Y2). This translation introduces the expressions [ ; u; Y1]d and [Y2; v; !]d whose semantics are not already formalized.

We will de ne the expression [ ; u; Y ]d as the set of all possible replacements of Y by an edge (i.e., f[ ue]d j e 2 Eg). Similarly, given a pattern P , then P d will represent the set, denoted as repr(P d), including every possible result of replacing each edge variable Y occurring in P d by an edge e 2 E.

Now we are ready to present our main result.

Theorem 1 (main). Let P be a pattern and G be a graph database. Then: JP jE KG = 2 JP 0 jV KGd : Proof sketch. Each walk w in G has a corresponding walk wd in Gd. Thus, for every solution 2 JP jE KG there is a walk w such that (w )d is in Gd. It su ces proving that there is a pattern P 0 in repr(P d) such that (w )d is a solution of evaluating JP 0 jV KG. Determining P 0 can be done by choosing the representation of P d that maps the edge variables to the values according to what maps around its respective sequence variables. For instance, let P be the pattern uXv and consider the graphs G and Gd depicted by Fig. 1. Then, a solution of JP jE KG is the mapping : X 7! awbvcue because (P ) = uawbvcuev is a walk in Gd. The pattern P d is [ uY1]dX[Y2; v; !]d!. A pattern P 0 2 repr(P d) matching the ends of (X) is [ ua]dX[ev!]d!. The mapping d is a solution of JP 0 jV K. Hence, P 0 is the pattern needed. tu

Computing SP 02repr(P d)JP 0 j V KG does not require to evaluate the in nitely many patterns in repr(P d). In fact, it su ces to check substitutions producing not trivially empty results. For instance, [ uY ]d is not trivial only for edges whose endpoints include u in G. As a consequence of this optimization, for data complexity, the time required for evaluating JP jE KG with the method of Theo. 1 is in the same complexity class than the one required to compute JP jE KG directly, and the space grows to at most O(jEj2)

Finally, let us say that the dual result (i.e. evaluate J jV KG using J jE KG) uses similar techniques and we do not show it here for space reasons.

Angles ,

Arenas ,

Barcelo ,

Hogan ,

Reutter ,

Vrgoc . Foundations of Modern Graph Query Languages. CoRR abs/1610.06264 , 2016 .

Horst

Bunke . Graph matching: Theoretical foundations, algorithms, and applications . In Proceedings of Vision Interface , 2000 .

Wenfei

Fan ,

Jianzhong

Li , Shuai Ma, Nan Tang, Yinghui Wu , and Yunpeng Wu . Graph Pattern Matching: From Intractable to Polynomial Time . PVLDB, 2010 .

Wenfei

Fan . Graph pattern matching revised for social network analysis . In 15th International Conference on Database Theory (ICDT) , 2012 .

Wenfei

Fan ,

Xin

Wang , and

Yinghui

Wu . Incremental graph pattern matching . ACM Trans. Database Systems , 2013 .

6. The Neo4j Team. The Neo4j Manual v3.0 . http://neo4j.com/docs/stable/. 2016 .