Introduction

On the Complexity of Enumerating the Answers to Well-Designed Pattern Trees (Ext. Abstract)?

Markus Kroll

Reinhard Pichler

Sebastian Skritek

0 0 Institute for Information Systems , TU Wien

With the steadily increasing amount of incomplete data on the web, the need for partial matching as an extension of Conjunctive Queries (CQs) is gaining more and more importance. Therefore, the OPTIONAL operator is a crucial feature in the semantic web query language SPARQL, and has also been studied for arbitrary relational vocabulary recently [3]. Intuitively, this operator (which corresponds to the left outer join in the Relational Algebra) allows the user to extend CQs by optional parts for which answers are retrieved from the data if available, but which do not cause the query to return no answer at all otherwise. In [9], based on a query fragment with desirable properties presented in [10], so-called well-designed pattern trees (wdPTs) were introduced as a convenient graphical representation of CQs extended by this optional matching feature. Intuitively, the nodes in a wdPT correspond to CQs while the tree structure represents the optional extensions. Several computational problems of wdPTs have been studied in recent years, such as the evaluation problem, the counting problem, as well as static analysis tasks including the containment and equivalence problems. In contrast, despite recent interest in the enumeration problem (i.e., computing one solution after the other without outputting duplicates) for FO queries and CQs [5, 2, 11], this natural problem has been hardly considered for wdPTs so far. A notable exception is [9]. Similar ideas were pursued in [4] for the enumeration of the answers to full disjunctions. However, being associative and commutative, full disjunctions di er signi cantly from wdPTs. Hence, the results in [4] do not apply to wdPTs. In this work, we embark on a systematic study of the complexity of the enumeration problem of wdPTs. We identify several tractable and intractable cases of this problem both from a classical complexity point of view and from a parameterized complexity point of view.

Introduction

A well-designed pattern tree (wdPT) p over a schema is a triple (T; ; x) s.t.: 1. T is a rooted tree;

maps each node in T to a set of relational atoms over . ? This is an extended abstract of a paper published at ICDT 2016 [ 8 ]. 2. For every variable y from T , the set of nodes where y occurs is connected. 3. The tuple x of distinct variables from T denotes the free variables of p. Assume p = (T; ; x) is a wdPT over . We write r to denote the root of T . For a subtree T 0 of T rooted in r, we de ne qT 0 to be the CQ Ans(y) R1(v1); : : : ; Rm(vm), where fR1(v1); : : : ; Rm(vm)g = St2T 0 (t), and y are all the variables that are mentioned in T 0. That is, all variables in qT 0 appear free.

The intuition behind the semantics of a wdPT p = (T; ; x) is as follows. A mapping h is an answer to (T; ) over a database D, if h is a solution to qT 0 and there is no way to \extend" h to a solution of some qT 00 , where T 0, T 00 are subtrees of T rooted in r. The evaluation of p over D corresponds then to the projection of the answers to (T; ) over D to x. Formally: Let D be a database and p = (T; ; x) a wdPT over schema . Let dom(D) be the set of elements in the active domain of D and X the variables mentioned in p. Then: { A homomorphism from p to D is a partial mapping h : X ! dom(D), for which it is the case that there is a subtree T 0 of T rooted in r such that h is a homomorphism from the CQ qT 0 to D. { The homomorphism h is maximal if there is no homomorphism h0 from p to D such that h0 extends h.

If h is a homomorphism from p to D, let hx denote the restriction of h to the variables in x. The evaluation of p over D, denoted p(D), corresponds to all mappings of the form hx, such that h is a maximal homomorphism from p to D. 3

Enumeration of wdPTs

Let C be a class of wdPTs. Given a wdPT p 2 C and a database D, we denote with Enum(C) the problem to enumerate p(D). Because of projection, it may happen that both, a mapping h and a proper extension of h are in p(D). However, in some settings only the maximal solutions are of interest, e.g. [ 1 ]. We thus also study the problem Max-Enum(C) of enumerating pm(D), the set of maximal solutions in p(D).

We aim at identifying the boundary between tractability and intractability for these problems. In [ 6 ], various notions of tractable enumeration are de ned. We concentrate on two such notions, namely polynomial delay (the time until the next output or termination is bounded by a polynomial in the size of the input) and output polynomial time (the total runtime of the algorithm is bounded by a polynomial in the size of the input plus the output). We denote with DelayP and OutputP the corresponding enumeration complexity classes.

The following decision problem ExtSol(C) (where C is a class of wdPTs) is closely related to the enumeration problems de ned above: Given a wdPT p = (T; ; x) 2 C, a database D, a partial mapping h and a set x0 x, does there exist some mapping h0 2 p(D) s.t. h h0 and dom(h0) \ x0 = ;? The relationship with enumeration is given by the following property. `-TW(k)\ BI(c) g-TW(k)\ SBI(c) `-TW(k)\ SBI(c) in P [ 3 ] in FPT W[ 1 ]-complete in P [ 3 ] in P [ 3 ] W[ 1 ]-hard

DelayP not OutputP1

DelayP DelayFPT

open not OutputP1

DelayFPT DelayFPT not OutputP2 not OutputP1 not OutputFPT2 not OutputFPT2 not OutputFPT2 DelayFPT Lemma 1. Let C be a class of wdPTs, p = (T; ; x) 2 C, and D a database. If there is a computable function f s.t. for every partial mapping and every subset x0 of x, the problem ExtSol(C) can be decided in O(f (jpj; jDj)), then there exists an algorithm enumerating p(D) with delay O(f (jpj; jDj) jDj jpj). Also ExtSol(C) is a generalization of the evaluation problems (Max-)Eval(C): Given p 2 C, a database D, and a mapping h, is h 2 p(D) (or pm(D))? Since these problems are known to be intractable in general, so is ExtSol(C). As a result, the evaluation problem for fragments of wdPTs has been studied in [ 3 ]. We use the same classes of wdPTs, and also introduce further restrictions to get a better understanding of the sources of complexity of enumeration.

Fragments of wdPTs have been de ned in two di erent ways: On the one hand by restrictions on some associated CQs, and on the other hand by restricting the number of variables shared between nodes. Let TW(k) be the class of CQs with bounded treewidth. A wdPT p = (T; ; x) is in the class `-TW(k) if for every node n in T the CQ Ans() (n) is in TW(k), and in the class g-TW(k) if for every subtree T 0 of T rooted in r the CQ qT 0 is in TW(k). Furthermore, we say p is in BI(c) if every node n in T contains at most c variables that also occur elsewhere in T , while p is in SBI(c) if no two nodes in T share more than c variables.

Table 1 summarizes our results. Observe that beside the classical complexity, we also study the parameterized complexity of the enumeration problems. I.e., p-Enum(C) and p-Max-Enum(C) denote the parameterized variants (with jpj as the parameter) of Enum(C) and Max-Enum(C), respectively. Also, DelayFPT and OutputFPT are de ned analogously to DelayP and OutputP. Since the parameterized complexity of wdPTs has not yet been studied, by p-Eval(C) and p-Max-Eval(C) we also consider the parameterized variant of the common evaluation problem for wdPTs, i.e., given p; D; h, is h 2 p(D) (resp. pm(D))?

We would like to point out that by introducing the class SBI(c) and performing a parameterized complexity analysis, we are able to draw a much more ne grained picture of the complexity of the decision problem. So far, only Eval(`-TW(k) \ BI(c)) and Eval(g-TW(k)) have been known to be tractable and NP-complete, respectively [ 3 ].

For the enumeration problems, we observe a surprising discrepancy between enumerating all answers and enumerating only the maximal ones. For the former setting, in most of the cases the corresponding decision problem is at least as hard as for the second setting (this is also the case for the non-parameterized problem [ 3 ]). A similar behaviour is observed for p-Enum(C) and p-Max-Enum(C). This, however, completely changes for Enum(C) and Max-Enum(C), where MaxEnum(C) is not even tractable in the most restricted case.

In the remainder, we would like to sketch some of the tools used to show the results in Table 1. Lemma 1 established a relationship between ExtSol(C) and enumeration. This relationship is put to use by the following result. Proposition 1. The following complexity results hold for ExtSol(C): 1. Let k; c 1 and C be a class of CQs for which the evaluation problem is in

P. Then ExtSol(`-C \ BI(c)) is in P. 2. ExtSol(g-TW(k)) is NP-complete for every k 1. 3. For k; c 1, ExtSol(g-TW(k) \ SBI(c)) parameterized by jpj is in FPT.

A useful tool to show negative results is Theorem 1. We call a class C robust if it is closed under some simple transformations (like replacing all occurrences of a variable by the same constant). All classes considered here are robust. Theorem 1. Let C be a robust class of wdPTs. If p-Enum(C) is in OutputFPT, then p-Eval(C) is in FPT.

Another way of providing intractability results for enumeration problems is by showing that, given a set of solutions, deciding if there exists another one is not possible in polynomial time (cf. [ 7 ]). We used this relationship to show the intractability of Max-Enum(C) (assuming P 6= NP) by proving the decision problem to be NP-complete for C = `-TW(k) \ BI(c). 4

Further Results and Future Work

In addition to the results presented above, following an active line of research [ 2, 11 ], we also had a brief look at the data complexity of the enumeration problems. In this case, the goal is to nd an enumeration algorithm that works with constant delay after some linear time preprocessing. We show that even for very simple settings such algorithms are very unlikely to exist, as for certain wdPTs without projection consisting of only two nodes, containing one respectively two binary atoms. In the presence of projection, this already holds for wdPTs consisting of two nodes with one atom each.

Besides closing the open case in Table 1, future work also includes the search for further tractable fragments. E.g., we do not know any interesting tractable classes for Max-Enum(C) yet.

Acknowledgments. This work was supported by the Vienna Science and Technology Fund (WWTF) through project ICT12-015 and by the Austrian Science Fund (FWF): P25207-N23. Markus Kroll was also supported by FWF project W1255-N23.

Ahmetaj ,

Fischl ,

Pichler ,

Simkus , and

Skritek . Towards reconciling SPARQL and certain answers . In Proc. WWW 2015 , pages 23 { 33 . ACM, 2015 .

Bagan ,

Durand , and

Grandjean . On acyclic conjunctive queries and constant delay enumeration . In Computer Science Logic , pages 208 { 222 . Springer, 2007 .

Barcelo ,

Pichler , and

Skritek . E cient evaluation and approximation of well-designed pattern trees . In Proc. PODS 2015 , pages 131 { 144 . ACM, 2015 .

Cohen ,

Fadida ,

Kanza ,

Kimelfeld , and

Sagiv . Full disjunctions: Polynomial-delay iterators in action . In Proc. VLDB 2006 , pages 739 { 750 . ACM, 2006 .

Durand ,

Schweikardt , and L. Segou n. Enumerating answers to FO queries over databases of low degree . In Proc. PODS 2014 , pages 121 { 131 . ACM, 2014 .

D. S.

Johnson ,

C. H.

Papadimitriou , and

Yannakakis . On generating all maximal independent sets . Inf . Process. Lett., 27 ( 3 ): 119 { 123 , 1988 .

Kimelfeld and P. G. Kolaitis. The complexity of mining maximal frequent subgraphs . ACM Trans. Database Syst ., 39 ( 4 ): 32 :1{ 32 : 33 , 2014 .

8. M. Kroll, R. Pichler, and

Skritek . On the complexity of enumerating the answers to well-designed pattern trees . In Proc. ICDT 2016 , volume 48 of LIPIcs , pages 22 :1{ 22 : 18 . Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016 .

Letelier ,

Perez ,

Pichler , and

Skritek . Static analysis and optimization of semantic web queries . ACM Trans. Database Syst ., 38 ( 4 ): 25 , 2013 .

10. J. Perez , M.

Arenas , and C.

Gutierrez . Semantics and complexity of SPARQL . ACM Trans. Database Syst ., 34 ( 3 ), 2009 .

11. L. Segou n. Enumerating with constant delay the answers to a query . In Proc. ICDT 2013 , pages 10 { 20 . ACM, 2013 .