Reconstructing Graph Pattern Matches Using SPARQL Stephan Mennicke1( ) , Denis Nagel2 , Jan-Christoph Kalo2 , Niklas Aumann2 , and Wolf-Tilo Balke2 1 Institut für Programmierung und Reaktive Systeme, TU Braunschweig, Germany mennicke@ips.cs.tu-bs.de 2 Institut für Informationssysteme, TU Braunschweig, Germany {denis.nagel,n.aumann}@tu-bs.de,{kalo,balke}@ifis.cs.tu-bs.de Abstract. Pattern matching is the foundation for handling complex queries to graph databases. Commonly used algorithms stem from the realm of graph isomorphism and simulations, being well understood the- oretical frameworks. On the practical side, there are established graph query languages that often allow for a wide variety of query tasks, often even beyond pattern matching. However, very little is known how graph queries from common query languages relate to graph pattern matching relations. In this paper, we propose a study in this respect for SPARQL, the W3C recommendation for querying RDF data. The homomorphic na- ture of the SPARQL semantics allows for a straight-forward formulation of graph-isomorphic matching. However, the somewhat artificial nature of these queries motivates the study of sole basic graph patterns, the foundational concept of SPARQL. For basic graph patterns, we show a correspondence to strong simulation, an efficient graph pattern matching relation appreciated for its polynomial bound matches. In consequence, graph query languages are capable of serving as generating frameworks for established graph pattern matching relations. 1 Introduction Graph databases have gained lots of attention due to their popularity in emerging applications like the Semantic Web, social network analysis, or bio-technology. These graphs usually provide entity-centric data, in which nodes represent en- tities, while the edges model relations between entities. Several graph database query languages were developed, enabling users to query graph-structured data in an SQL-like fashion. Most notably, SPARQL is the W3C standard for query- ing Semantic Web data, and is also used for a wide range of applications [13]. On the foundational side of graph querying, graph pattern matching in terms of special homomorphisms forms the main influence. However, to the best of our knowledge, only little is known on the relationship between commonly used graph query languages — SPARQL in this paper — and other graph pattern matching relations that have been researched for decades in conceptual frame- works. Research in this area led to several complexity results, the development of Copyright © 2017 by the paper’s authors. Copying permitted only for private and academic purposes. In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB. Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org v1 v2 S.Spielberg M.Crichton Q.Tarantino directed wrote directed wrote directed wrote v3 Jurassic Park Pulp Fiction awarded awarded awarded v4 Oscar Golden Globe (a) (b) (c) Fig. 1: (a) An example graph pattern P and (b) an isomorphic match of P and (c) a (dual-)simulating match of P . fast algorithms, and insights on the semantics of the respective relations, whose potential is, in our view, not yet fully exploited in the area of graph database querying. Lately, some effort were expended to use graph pattern matching al- gorithms for matching relations different from classical subgraph isomorphism, often appreciated due to their advantages in performance over traditional graph querying languages [9, 11]. In this paper, we study the relationship between graph pattern relations and graph query languages, here exemplified by SPARQL. Our first result is best ex- plained by an example. Imagine a user writing a query to search for the directors and writers of movies that won an award, see Fig. 1(a). An isomorphic match to this query is, for instance, the movie Jurassic Park which won an Academy Award, was written by M. Crichton, and directed by S. Spielberg (cf. Fig. 1(b)). The same result is achieved by the SPARQL query, depicted in Fig. 2. The first part, also called basic graph pattern, comprises variables for the graph pattern nodes and is arranged as triples representing the edges of the graph pattern. The filter condition at the end of the pattern ensures that each assignment to the variables is a bijective one, a necessary condition for graph-isomorphic match- ing. Being able to formulate such a query is not a coincidence. We prove, in Sect. 4, for every graph pattern there is a query returning every subgraph from a database that is isomorphic to the given pattern. A closer look at the filter condition raises the question, whether a user would be using such an artificial formulation. Removing the filter allows for (possibly unintended) variable as- signments, and may produce answers as depicted in Fig. 1(c). Solely relying on basic graph patterns yields answers showing a (dual-)simulating character upon the original graph pattern, being the result of Sect. 5. We provide partial and complete characterizations of graph pattern matching by SPARQL, based on basic graph patterns. We observe that dual simulation cannot be fully characterized, since the matching relation allows for arbitrary additions to matches, being also matches of the pattern. Strong simulation, an extension of dual simulation, removes this arbitrariness and is a matching re- lation renown for its efficient evaluation and its polynomial bound number of matches [9]. We find that SPARQL results may be used as building blocks to obtain all strong simulating matches. In return, strong simulation may also serve as a pruning method for SPARQL query engines. Sect. 3 provides basic notions. SELECT ∗ WHERE { ?vv1 d i r e c t e d ?vv3 . ?vv2 wrote ?vv3 . ?vv3 awarded ?vv4 . FILTER ( ?vv1 != ?vv2 && ?vv1 != ?vv3 && ?vv1 != ?vv4 && ?vv2 != ?vv3 && ?vv2 != ?vv4 && ?vv3 != ?vv4 ) } Fig. 2: The query for graph isomorphism of the graph pattern Fig. 1(a) Related work and conclusions are given in Sect. 2 and Sect. 6. Due to space lim- itations, full proofs of the theorems are included in the appendix of this paper. 2 Related Work Graph Pattern Matching is an extensively studied topic in various domains of computer science [5]. Its applications range from social network analysis, over structural analysis of chemical entities to various applications in the database do- main, particularly in graph databases. Recently, emerging applications led to the trend of graph pattern matching relations, different from the canonical though costly candidate of graph isomorphism, with the goal of reducing structural re- quirements of the answer graphs. For example, the idea of simulation for graph pattern matching has been implemented for different graph database tasks [1, 3, 2]. Indeed, experiments have shown advantages of simulation-based matching relations when analyzing social network patterns, as they offer the possibility to collapse several nodes into one node and vice versa. Another recent case study in this respect are so called Exemplar Queries [11], representing an attempt to enable an easy access to databases without the need of knowing the formal re- quirements of a query language. Based on an example graph pattern from the database (the exemplar), the query process of Exemplar Queries checks for sim- ilarity, e. g., up to strong simulation, between the exemplar and other database structures, retrieving and ranking them for presentation to the user. Graph Query Languages basically all are founded on the idea of graph pat- tern matching (with suitable substitutions) [14]. A matching mechanism com- mon to most of these query languages is (sub-)graph isomorphism [4, 8]. Mainly due to the advances in the field of Semantic Web, SPARQL has become the W3C recommendation for querying Semantic Web data, i. e., RDF. A more de- tailed introduction to SPARQL follows in the next section. In general, graph query languages differ greatly with respect to their area of operation. Therefore, many different graph database operations, e. g., subgraph matching or adding new nodes, are considered, particularly when comparing the expressive power of different graph query languages [14]. Most languages solely rely on homomor- phic, more specifically, isomorphic pattern matching, but their connection to other matching relations is not yet studied extensively. Therefore, here we give insights into this aspect of graph query languages with respect to SPARQL. The basic idea however, could also be applied to other graph querying languages, also relying on the idea of basic graph patterns (e. g., Cypher or Gremlin). Regard- ing expressiveness of graph querying languages, in [7], the authors describe the graph query language GraphQL. It is based on a modified relational algebra that uses graph pattern matching for querying. They prove that their graph query algebra is relationally complete and therefore as expressive as relational algebra. While we focus on graph isomorphism and simulations, the key question behind this work is not restricted to those relations. Many more comparison relations are discussed in the literature which may correlate very well with the se- mantics of graph query languages. For instance, the linear-time branching-time spectrum [6] provides several matching relations, under the term comparative semantics, used w. r. t. different aspects of system correctness. It contains com- parative system relations for processes modeled as labeled transition systems, i. e., edge-labeled directed graphs with a distinct initial state. Most of the seman- tics come with a logical characterization in terms of a modal logic equipped with explicit quantification over edges to be traversed. Finding such a characteriza- tion is a common task, as it allows for expressing distinguishing characteristics of system behaviors in a precise manner. Similar to our subject, these character- izations express that two systems are equivalent whenever they satisfy the same logical formulas. In this paper, we try to adjust to the circumstances as imposed by SPARQL semantics. Variants of graph homomorphism are also studied in the context of graph databases [4]. While the relations in our work always match an edge of a graph to other edges, as of preserving structure of a graph pattern to a certain extent, p-homomorphism takes each edge and maps it to paths in the match graph. This way, a p-homomorphic match graph may show a very different structure, being only loosely coupled with the graph pattern. Instead, p-homomorphic matching relies on a metric of node similarity. 3 Preliminaries In this section, we define graphs, graph databases complemented by the general concept of graph pattern matching. Furthermore, we introduce an algebra of SPARQL. A (Σ-)labeled directed graph is a triple G = (V, Σ, E), where V is a finite set of nodes, Σ a finite alphabet, and E ⊆ V × Σ × V a labeled edge relation. a We represent an edge (v, a, v 0 ) ∈ E by v −→ v 0 . Labeled directed graphs range over by G, G1 , G2 , P with node sets V, V1 , V2 , VP , a fixed alphabet Σ common to all graphs, and edge relations E, E1 , E2 , EP with respective notations −→, −→1 , −→2 , −→P . A graph G1 is a subgraph of a graph G2 , denoted G1 v G2 , iff V1 ⊆ V2 and E1 ⊆ E2 ∩ (V1 × Σ × V1 ). Two graphs G1 and G2 are isomorphic, a written G1 ∼ = G2 , iff there is a bijective function κ : V1 → V2 such that v −→1 v 0 a 0 if and only if κ(v) −→2 κ(v ). κ is called an isomorphism between G1 and G2 . For two nodes v, v 0 of a graph G, we define the distance dist G (v, v 0 ) to be the length of the shortest undirected path from v to v 0 . If there is no path between v and v 0 , dist G (v, v 0 ) = ∞. However, we are interested in connected graphs throughout the rest of the paper, i. e., for every two nodes v, v 0 , dist G (v, v 0 ) 6= ∞. The diameter of a graph G with node set V , denoted dia(G), is the greatest distance between nodes in this graph, i. e., dia(G) := max{dist G (v, v 0 ) | v, v 0 ∈ V }. Graph databases store objects from a countable universe O together with attributes over the objects as either relations between objects or properties of objects. As an example, consider an isFriendOf relation between objects re- ferring to persons who are friends in social networks. Properties of objects are expressed as assignments of concrete data values, also called literals, from a usually infinite domain L, to an object, e. g., the age of a person as a positive integer. Inspired by the treatment of literals in RDF, objects as well as literals are represented as nodes in a graph database. Relation symbols and property symbols stem from a finite set, here Σ. A graph database is a directed labeled graph DB = (V, Σ, E) with a finite set of database objects V ⊆ O ∪ L. A graph pattern is a connected graph P = (VP , Σ, EP ). A subgraph G of a ∼ graph database DB is an isomorphic match of P (in DB) iff P ∼ = = G. By JP KDB we denote the set of all isomorphic matches of P . Since SPARQL aims at querying RDF-stored data, the basic building blocks of the query language are triples of the form (s, p, o). Subjects (s) refer to objects in O or variables being assigned by actual database objects during the querying process. Objects (o) may further be associated with literals from L. Predicates (p) are thought of as the relation and property symbols in Σ gluing together subjects with objects. Variables are place-holders for actual objects or literals as present in concrete databases. The result of a SPARQL query process is an assignment of objects and literals to the variables mentioned in a query expression. We denote the set of all variables by V. Notation and semantics are based on [12]. Sets of such (s, p, o)-triples are called basic graph patterns (BGP), which we will assume to be graphs. For every BGP B = {(s1 , p1 , o1 ), . . . , (sk , pk , ok )} (k ≥ 0) where si ∈ O ∪ V, pi ∈ Σ, and oi ∈ O ∪ L ∪ V (i = 1, . . . , k), the associated graph is (VB , Σ, B) such that VB = {s1 , . . . , sk , o1 , . . . , ok }. Notice that the nodes in the graph may also be variables. In fact, from the next section on, we employ BGPs in which all nodes are variables. In this paper, BGP B and its graph representation are used interchangeably. By vars(B) we denote the set of all variables occurring in B. The semantics of SPARQL BGPs B, and henceforth of SPARQL queries Q, is given in terms of assignments of objects and literals to variables in B (Q, respectively). An assignment is a partial function µ : V → (O ∪ L). By µ(B) we reference the graph where each variable node v ∈ vars(B) is replaced by µ(v). We define dom(µ) := {v ∈ V | µ(v) is defined}. An assignment µ is valid w. r. t. B and a graph database DB iff (a) dom(µ) = vars(B) and (b) µ(B) is a subgraph of DB. Thus, µ is a graph homomorphism. The set of all valid assignments w. r. t. B and a graph database DB forms the foundation of the SPARQL query semantics. We denote this set by JBKDB . The second concept of SPARQL we use is that of filter conditions, also called built-in conditions. Filters are used to further restrict the set of (valid) assign- ments of a SPARQL query. Thereby, we may check for equality (=) or inequality (<, ≤, ≥, >) of variable assignments, objects, and literals. The usual proposi- tional connectives (∧, ∨, ¬) are used to build complex constraints. For a full list of features we refer to the W3C recommendation report [13]. We de- note by µ |= ϕ that assignment µ satisfies filter condition ϕ. Let Q be any SPARQL query, e. g., Q = B for a BGP B, and ϕ a filter condition. Then Q filter ϕ is a SPARQL query. The semantics is given recursively in terms of the assignments from JQKDB such that every assignment conforms to ϕ. Thus, JQ filter ϕKDB := {µ ∈ JQKDB | µ |= ϕ}. Throughout the paper, we make use of BGPs adjoint with filter condi- tions. The last SPARQL concept we need throughout Sect. 5 is that of a join of two queries. Q1 and Q2 represents the join of Q1 and Q2 . Given two assignments µi ∈ JQi KDB (i = 1, 2). Then they are compatible if for every v ∈ dom(µ1 ) ∩ dom(µ2 ), µ1 (v) = µ2 (v). Compatible assignments may be joined, thus, JQ1 and Q2 KDB := {µ1 ∪µ2 | µi ∈ JQi KDB are compatible }. The remaining operations of union and optional queries are not needed in this paper. Next, we show that for every graph pattern P , there is a SPARQL query QP that uses a BGP and a specific filter condition to obtain all graph-isomorphic matches of P from a database DB. Please note that we assume graphs to be a loop-free, throughout the paper, i. e., for each edge v1 −→ v2 , v1 6= v2 . 4 Querying like Graph Isomorphism A graph pattern P gives rise to a canonical BGP. In order to characterize iso- morphic matches of P from some database DB, we need to adapt the nodes of P to obtain the possibility of arbitrary assignments of database objects/literals to the nodes of P . In SPARQL terms, this adaptation is performed by exchanging each node by a variable. Fig. 2 gives an example conversion in the where-clause, excluding the filter condition. Definition 1. Let P = (VP , Σ, EP ) be a graph pattern. Define ν : VP → V such that ν(v) := vv . The BGP of P is defined as the graph P̂ := (V̂P , Σ, ÊP ) such that V̂P = {ν(v) | v ∈ VP } and (ν(v), a, ν(v 0 )) ∈ ÊP iff (v, a, v 0 ) ∈ EP . From a graph-theoretic perspective, it directly follows that each graph pattern P is isomorphic to its BGP P̂ by isomorphism ν. Every assignment µ ∈ JP̂ KDB is a homomorphism, but it is not guaranteed that each graph µ(P̂ ) is isomorphic to P . We enforce bijectivity of µ by adding a filter condition checking that for each two distinct nodes v, v 0 ∈ V̂P , µ assigns different objects from the database. Definition 2. Let P be a graph pattern with set of nodes VPV= {v1 , v2 , . . . , vn }. Define a filter condition for P alongside ν (Def. 1) as ϕP = i