
Towards Approximating Incomplete Queries over Partially Complete Databases (Extended Abstract)

Ognjen Savkovic

Evgeny Kharlamov

Werner Nutt

Pierre Senellart

École Normale Supérieure, France; Free University of Bozen-Bolzano, Italy; University of Oxford, United Kingdom

Motivation. Building reliable systems over partially complete data poses significant challenges because the queries they send to the available data retrieve answers that may differ significantly from the real answers. This may lead to a wrong understanding of the data and of the events and processes it describes. The problem is especially critical for analytical systems that aggregate retrieved data, since missing answers may significantly change the results of analytical computations; e.g., the computation of minimal or average values is sensitive to missing values [2,7]. One way to ensure the reliability of (analytical) systems over partially complete data is to guarantee that whatever data they touch is complete w.r.t. the real data.

A possible way to model partial data completeness is with tuple generating dependencies (TGDs) [1] that specify what parts of a relation are complete [8–10]:

$$R^a(\bar t) \leftarrow R^i(\bar t),\ \psi^i, \qquad (1)$$

where $R^a$ is the available (or complete) part and $R^i$ the ideal (or real) part of $R$, $\bar t$ is a tuple of terms of size $\mathit{ary}(R)$, and $\psi^i$ is a conjunction of atoms over $i$-annotated predicates. Let $\Gamma$ be a finite set of completeness statements as in Equation (1). With $D^a$ (resp. $D^i$) we denote a database instance over $a$-annotated (resp. $i$-annotated) predicates. The semantics of $\Gamma$ is defined using pairs $(D^i, D^a)$ where $D^a \subseteq D^i$, i.e., where the available (incomplete) data is a subset of the real data, as follows: $(D^i, D^a) \models \Gamma$ iff $D^a \cup D^i \models \Gamma$.

The setting can be extended by considering constraints on the ideal part of relations, also modelled with TGDs:

$$\exists \bar X\, \psi^i(\bar X, \bar Y) \leftarrow \varphi^i(\bar Y, \bar Z), \qquad (2)$$

where $\psi$ and $\varphi$ are conjunctive queries over $i$-annotated predicates. If $\Sigma$ is a finite set of constraints as in Equation (2), then the semantics of $\Gamma$ can be extended to account for the constraints $\Sigma$, that is, $(D^i, D^a) \models (\Gamma, \Sigma)$, by considering only those $D^i$ that satisfy $\Sigma$.
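The semantics of a single completeness statement of the form (1) can be illustrated with a minimal Python sketch. It assumes no constraints, and the encoding is a choice made here, not the paper's: atoms are `(predicate, args)` pairs, variables are strings starting with `?`, and the annotation is implied by which database a fact belongs to. The `pupil`/`class` schema and all function names are hypothetical.

```python
def is_var(term):
    """Variables are strings starting with '?'; everything else is a constant."""
    return isinstance(term, str) and term.startswith("?")


def matches(atoms, facts, subst=None):
    """Yield every substitution that maps all atoms into the set of facts."""
    subst = dict(subst or {})
    if not atoms:
        yield subst
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        candidate = dict(subst)
        for term, value in zip(args, fact_args):
            if is_var(term):
                if candidate.setdefault(term, value) != value:
                    break  # variable already bound to a different value
            elif term != value:
                break  # constant mismatch
        else:
            yield from matches(rest, facts, candidate)


def satisfies(d_ideal, d_avail, statement):
    """Check (D^i, D^a) |= statement, where statement = (head atom, condition atoms)."""
    head, condition = statement
    if not d_avail <= d_ideal:  # the available data must be part of the ideal data
        return False
    pred, args = head
    # Every ideal fact R(t) that meets the condition must also be available.
    for s in matches([head, *condition], d_ideal):
        required = (pred, tuple(s[t] if is_var(t) else t for t in args))
        if required not in d_avail:
            return False
    return True


# Hypothetical statement: pupils of classes in the science branch are complete.
STATEMENT = (("pupil", ("?N", "?C")), [("class", ("?C", "science"))])
D_IDEAL = {
    ("pupil", ("ann", "1a")),
    ("pupil", ("bob", "2b")),
    ("class", ("1a", "science")),
    ("class", ("2b", "arts")),
}
D_AVAIL_OK = {("pupil", ("ann", "1a")), ("class", ("1a", "science"))}
D_AVAIL_BAD = {("pupil", ("bob", "2b"))}
```

In this toy instance `D_AVAIL_OK` satisfies the statement, while `D_AVAIL_BAD` does not, because the science-class pupil `ann` is missing from the available data.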

Given completeness statements $\Gamma$ and constraints $\Sigma$, a query $Q$ is $(\Gamma,\Sigma)$-complete if for every pair $(D^i, D^a)$ such that $(D^i, D^a) \models (\Gamma,\Sigma)$ we have $Q^i(D^i) = Q^a(D^a)$, where $Q^i$ (resp. $Q^a$) is obtained from $Q$ by annotating each predicate with $i$ (resp. $a$). While query completeness is a desirable property, in practice many queries are incomplete. In these cases we would like to approximate the original query with alternative queries that are as close as possible to the original one, but whose answers can be verified to be complete. Natural approximations are those from below, called query specialisations, and those from above, called query generalisations. Formally, given queries $Q$ and $Q'$ and a setting $(\Gamma,\Sigma)$, $Q'$ is a $(\Gamma,\Sigma)$-specialisation of $Q$ if $Q'^i \sqsubseteq Q^i$, that is, if $Q'$ is contained in $Q$ over every ideal database that satisfies the constraints $\Sigma$. We are interested in complete specialisations, and among them in maximal ones. Formally, a query $Q'$ is a $(\Gamma,\Sigma)$-maximal complete specialisation (MCS) of a query $Q$, or just an MCS when $(\Gamma,\Sigma)$ is clear, if (i) $Q'^i \sqsubseteq Q^i$, (ii) $Q'$ is $(\Gamma,\Sigma)$-complete, and (iii) $Q'$ is maximal in the sense that there is no other $(\Gamma,\Sigma)$-complete query $Q''$ such that $Q'^i \sqsubset Q''^i \sqsubseteq Q^i$. Generalisations and maximal complete generalisations can be defined analogously.
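As a small illustration of these definitions, consider the following worked example. The schema, the statements and the queries are hypothetical and not taken from the paper; $\Gamma$ stands for the set of completeness statements, and the set of constraints is assumed to be empty.

```latex
% Hypothetical completeness statements:
% pupils of science classes are complete, and the class relation is complete.
\begin{align*}
\mathit{pupil}^a(N, C) &\leftarrow \mathit{pupil}^i(N, C),\ \mathit{class}^i(C, \mathit{science}) \\
\mathit{class}^a(C, B) &\leftarrow \mathit{class}^i(C, B)
\end{align*}

% A query and one of its specialisations:
\begin{align*}
Q(N)  &\leftarrow \mathit{pupil}(N, C) \\
Q'(N) &\leftarrow \mathit{pupil}(N, C),\ \mathit{class}(C, \mathit{science})
\end{align*}

% Q is not complete: for D^i = { pupil(bob, 2b) } and D^a = {} the pair (D^i, D^a)
% satisfies both statements, yet Q^i(D^i) = { bob } while Q^a(D^a) = {}.
% Q' is complete and Q'^i \sqsubseteq Q^i, so Q' is a complete specialisation of Q.
```

In this toy setting one can check that $Q'$ is also maximal: for the first statement to make a pupil atom available, any complete specialisation of $Q$ must pair that atom with a matching $\mathit{class}(C, \mathit{science})$ atom, and is therefore contained in $Q'$.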

The problem of completeness has been studied in [8–10] for settings with a weaker form of constraints. In particular, it is known that the complexity of checking query completeness ranges from NP for the setting without constraints to $\Pi^P_2$ for the setting with finite domains. Moreover, approximation has not been studied in the context of partially complete data. In this work we investigate the problem of completeness of conjunctive queries under expressive constraints, and the problem of approximation. We now give an overview of our results.

Query Completeness. We characterise query completeness in terms of the well-studied problem of query containment under TGDs. That is, $Q$ is $(\Gamma,\Sigma)$-complete iff $\Gamma \cup \Sigma \models (Q^i \subseteq Q^a)$. Interestingly, the converse also holds: containment under TGDs can be represented as completeness. Query containment under TGDs is known to be undecidable, and thus checking completeness is undecidable. Since the undecidability comes from the constraints, we turn our attention to settings with practically motivated types of constraints: (cyclic) foreign keys, acyclic TGDs [6], sticky TGDs [5], and guarded TGDs [4]. For all these constraints we show that the combined complexity of completeness is high, at least PSpace-hard.
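For the special case without constraints, the containment-based characterisation can be sketched in a few lines of Python. This is a minimal sketch in the spirit of the canonical-database test known from the completeness literature [10], not the procedure of this paper: the query body is frozen into an ideal database, the completeness statements are applied once to derive the available facts, and the query is evaluated over those facts. The encoding (atoms as `(predicate, args)` pairs, variables as strings starting with `?`), the function names and the `pupil`/`class` example are assumptions made for illustration.

```python
def is_var(term):
    """Variables are strings starting with '?'; everything else is a constant."""
    return isinstance(term, str) and term.startswith("?")


def matches(atoms, facts, subst=None):
    """Yield every substitution that maps all atoms into the set of facts."""
    subst = dict(subst or {})
    if not atoms:
        yield subst
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        candidate = dict(subst)
        for term, value in zip(args, fact_args):
            if is_var(term):
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from matches(rest, facts, candidate)


def freeze(term):
    """Replace a variable by a fresh constant; constants are kept as they are."""
    return ("frozen", term) if is_var(term) else term


def is_complete(head, body, statements):
    """Completeness test for a conjunctive query, assuming there are no constraints."""
    # Canonical ideal database: the body of Q^i with its variables frozen.
    d_ideal = {(p, tuple(freeze(t) for t in args)) for p, args in body}
    # Statement bodies mention only ideal predicates, so one round derives
    # all available facts.
    d_avail = set()
    for (pred, args), condition in statements:
        for s in matches([(pred, args), *condition], d_ideal):
            d_avail.add((pred, tuple(s[t] if is_var(t) else t for t in args)))
    # Q is complete iff Q^a returns the frozen head over the available facts.
    target = tuple(freeze(t) for t in head)
    answers = {
        tuple(s[t] if is_var(t) else t for t in head)
        for s in matches(list(body), d_avail)
    }
    return target in answers


# Hypothetical statements: science-class pupils are complete; classes are complete.
STATEMENTS = [
    (("pupil", ("?N", "?C")), [("class", ("?C", "science"))]),
    (("class", ("?C", "?B")), []),
]
Q = (("?N",), [("pupil", ("?N", "?C"))])
Q_SPEC = (("?N",), [("pupil", ("?N", "?C")), ("class", ("?C", "science"))])
```

With these hypothetical statements, `is_complete(*Q, STATEMENTS)` is false, whereas `is_complete(*Q_SPEC, STATEMENTS)` is true. With constraints present, the ideal database would additionally have to be chased, which need not terminate in general.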

Query Specialisation. Intuitively, one can specialise a conjunctive query by instantiating the query variables or by joining new atoms. One can do this by following the TGDs of the completeness statements $\Gamma$ and the constraints $\Sigma$ backwards, and thus instantiate variables and add atoms only as much as needed.

More formally, one can find specialisations by a procedure that is similar to the resolution proof scheme [5] or backward chaining [3]. More precisely, for each query atom one has to find in $\Gamma \cup \Sigma$ a TGD that can transfer the atom or an instantiation of it. The atom may then need to be instantiated according to the TGD, and new atoms from the body of the TGD may need to be added. For the newly introduced atoms one again has to find a TGD, and so on. The difference with the backward chase is that the query being specialised plays the role of the database and of the query at the same time. Thus, with each backward application of a TGD one may instantiate the atom but also change the database. This produces a (potentially infinite) set of (potentially infinite) queries that contains all MCSs but may also contain non-maximal specialisations, because some combinations of rules lead to more general specialisations than others. We also observe that a query may have more than one MCS, both among infinite and among finite conjunctive queries.

Checking whether a query $Q'$ is a $(\Gamma,\Sigma)$-specialisation of a query $Q$ is undecidable for unrestricted $\Sigma$, which corresponds to the case where the procedure above does not terminate. Weak acyclicity of the inverted TGDs from $\Sigma$ (that is, where the direction of the arrow is reversed) yields termination, and in such settings each conjunctive query has a finite number of finite-size MCSs. It remains an open question whether there is a terminating procedure for sticky or guarded inverted $\Sigma$.
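A single backward step of the kind described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it handles only completeness statements (no constraints), it performs one step on one chosen atom, and it makes no attempt to decide completeness or maximality. The encoding, the function names and the `pupil`/`class` example are hypothetical. The only guarantee is that the result is a specialisation, since the step merely instantiates variables and adds atoms.

```python
def is_var(term):
    """Variables are strings starting with '?'; everything else is a constant."""
    return isinstance(term, str) and term.startswith("?")


def walk(term, subst):
    """Follow variable bindings until an unbound variable or a constant is reached."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(query_args, head_args):
    """Most general unifier of two function-free argument tuples, or None."""
    if len(query_args) != len(head_args):
        return None
    subst = {}
    for a, b in zip(query_args, head_args):
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_var(b):      # prefer binding statement variables, keeping query names
            subst[b] = a
        elif is_var(a):    # the query variable is instantiated with a constant
            subst[a] = b
        else:              # two different constants cannot be unified
            return None
    return subst


def rename(atom, tag):
    """Rename statement variables apart (assumes no query variable ends in _tag)."""
    pred, args = atom
    return (pred, tuple(f"{t}_{tag}" if is_var(t) else t for t in args))


def backward_step(head, body, index, statement, tag="s"):
    """Apply one completeness statement backwards to the atom body[index]."""
    stmt_head, condition = statement
    stmt_head = rename(stmt_head, tag)
    condition = [rename(atom, tag) for atom in condition]
    pred, args = body[index]
    if pred != stmt_head[0]:
        return None
    subst = unify(args, stmt_head[1])
    if subst is None:
        return None
    new_body = []
    for atom_pred, atom_args in [*body, *condition]:
        atom = (atom_pred, tuple(walk(t, subst) for t in atom_args))
        if atom not in new_body:   # instantiate, then add the condition atoms once
            new_body.append(atom)
    new_head = tuple(walk(t, subst) for t in head)
    return new_head, new_body


# Hypothetical statements: one adds a condition atom, one instantiates a variable.
SCIENCE = (("pupil", ("?X", "?Y")), [("class", ("?Y", "science"))])
CLASS_1A = (("pupil", ("?X", "1a")), [])
HEAD, BODY = ("?N",), [("pupil", ("?N", "?C"))]
```

Applying `SCIENCE` backwards joins the atom `class(?C, science)` to the query, while applying `CLASS_1A` instantiates `?C` with the constant `1a`. A full procedure would repeat such steps on the newly added atoms and would also have to follow the constraints backwards.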

We are still working on the problem of generalisation for incomplete queries.

Acknowledgements. This work was partially supported by the EPSRC projects MaSI3, DBOnto, and ED3, and by the projects MAGIC and PARCIS, funded by the Free University of Bozen-Bolzano.

1. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.

2. Marcelo Arenas, Leopoldo Bertossi, Jan Chomicki, Xin He, Vijay Raghavan, and Jeremy Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3), 2003.

3. Jean-François Baget, Michel Leclère, and Marie-Laure Mugnier. Walking the decidability line for rules with existential variables. KR'10, pages 466–476. AAAI Press, 2010.

4. Andrea Calì, Georg Gottlob, and Thomas Lukasiewicz. A general datalog-based framework for tractable query answering over ontologies. J. Web Sem., 14:57–83, 2012.

5. Andrea Calì, Georg Gottlob, and Andreas Pieris. Towards more expressive ontology languages: The query answering problem. Artif. Intell., 193:87–128, 2012.

6. Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. Data exchange: Semantics and query answering. In Proc. ICDT, pages 207–224, 2002.

7. Paolo Guagliardo and Leonid Libkin. Making SQL queries correct on incomplete databases: A feasibility study. PODS '16, 2016.

8. A.Y. Levy. Obtaining complete answers from incomplete databases. In Proc. VLDB, pages 402–412, 1996.

9. Werner Nutt, Sergey Paramonov, and Ognjen Savkovic. Implementing query completeness reasoning. In CIKM, pages 733–742, 2015.

10. Simon Razniewski and Werner Nutt. Completeness of queries over incomplete databases. PVLDB, 4(11):749–760, 2011.