1. Introduction

Querying Data Exchange Settings Beyond Positive Queries

Marco Calautti

Sergio Greco

Cristian Molinaro

Irina Trubitsyna

0 0 Department of Computer Science , Modeling , Electronics and Systems Engineering, University of Calabria , Italy 1 Department of Information Engineering and Computer Science, University of Trento , Italy

27 41

Data exchange, the problem of transferring data from a source schema to a target schema, has been studied for several years. The semantics of answering positive queries over the target schema has been defined in early works, but little attention has been paid to more general queries. A few semantics proposals for more general queries exist but they either do not properly extend the standard semantics under positive queries, giving rise to counterintuitive answers, or they make query answering undecidable even for the most important data exchange settings, e.g., with weakly-acyclic dependencies. The goal of this paper is to provide a new semantics for data exchange that is able to deal with general queries. At the same time, we want our semantics to coincide with the classical one when focusing on positive queries, and to not trade-of too much in terms of complexity of query answering. We show that query answering is undecidable in general under the new semantics, but it is coNP-complete when the dependencies are weakly-acyclic. Moreover, in the latter case, we show that our semantics allow for the construction of a representative target instance, similar in spirit to a universal solution, that can be exploited for computing approximate answers, instead of exact ones.

eol>Data Exchange Semantics Closed Word Assumption Approximations

1. Introduction

Data exchange is the problem of transferring data from a source schema to a target schema, where the transfer process is usually described via so-called schema mappings: a set of logical assertions specifying how the data should be moved and restructured. Furthermore, the target schema may have its own constraints to be satisfied. Schema mappings and target constraints are usually encoded via standard database dependencies: tuple-generating dependencies (TGDs) and equality-generating dependencies (EGDs). Thus, given an instance over the source schema S, the goal is to materialize an instance over the target schema T, called solution, in such a way that and together satisfy the dependencies.

Since multiple solutions might exist, a precise semantics for answering queries is needed. By now, the certain answers semantics is the most accepted one. The certain answers to a query is the set of all tuples that are answers to the query in every solution of the data exchange setting [ 1 ]. Although it has been formally shown that for positive queries (e.g., conjunctive queries) the notion of solution of [ 1 ] is the right one to use, for more general queries such solutions become inappropriate, as they easily lead to counterintuitive results. Example 1. Consider a data exchange setting denoted by = ⟨S, T, Σ , Σ ⟩, where S is the source schema, storing orders about products in a binary relation Ord, where the first argument is the id of the order, and the second one specifies whether the order has been paid. Moreover, T is the target schema having unary relations AllOrd and Paid, storing all orders and paid orders, respectively. The schema mapping is described by the source-to-target TGDs Σ : 1 = ∀, Ord(, ) → AllOrd(), 2 = ∀ Ord(, yes) → Paid().

In this example, we assume that the set of target dependencies Σ is empty. The above schema mapping states that all orders in the source schema must be copied to the AllOrd relation, and all the paid orders must be copied to the Paid relation. Assume the source instance is as follows: = {Ord(1, yes), Ord(2, no)}, and assume we want to pose the query over the target schema asking for all the unpaid orders. This can be written as the following FO query:

() = AllOrd() ∧ ¬Paid().

One would expect the answer to be {2}, since the schema mapping above is simply copying to the target schema, and hence = {AllOrd(1), AllOrd(2), Paid(1)} should be the only candidate solution. However, under the classical notion of solution of [ 1 ], also the instance ′ = {AllOrd(1), AllOrd(2), Paid(1), Paid(2)} is a solution (since ∪ ′ satisfies the TGDs), and every order in ′ is paid. Hence, the certain answers to , which are computed as the intersection of the answers over all solutions, are empty. □

The issue above arises because the classical notion of solution is too permissive, in that it allows the existence of facts in a solution that have no support from the source (e.g., Paid(2) in the solution ′ of Example 1 above).

Some eforts exist in the literature that provide alternative notions of solutions for which certain answers to general queries become more meaningful. Prime examples are the works of [ 2 ] and [ 3 ]. In both approaches, the certain answers in the example above are {2}. However, also the works above have their own drawbacks. In [ 2 ], so-called CWA-solutions are introduced, which are a subset of the classical solutions with some restrictions. However, these restrictions are so severe that certain answers over such solutions fail to capture certain answers over classical solutions, when focusing on positive queries. Moreover, even when focusing on more general queries, answers can still be counterintuitive, as shown in the following example. Example 2. Consider the data exchange setting = ⟨S, T, Σ , Σ ⟩, where S stores employees of a company in the unary relation Emp. For some employees, the city they live in is known, and it is stored in the binary relation KnownC. The target schema T contains the binary relation EmpC, storing employees and the cities they live in, and the binary relation SameC, storing pairs of employees living in the same city. The sets Σ = { 1, 2} and Σ = { 3, } are as follows (for simplicity, we omit the universal quantifiers): 1 = 2 =

Emp() → ∃ EmpC(, ), 3 KnownC(, ) → EmpC(, ), = EmpC(, ), EmpC(′, ) → SameC(, ′), = EmpC(, ), EmpC(, ) → = .

The above setting copies employees from the source to the target. The TGD 1 states that every copied employee must have some city associated, whereas 2 states that when the city of an employee is known, this should be copied as well. Moreover, the target schema requires that employees living in the same city should be stored in relation SameC ( 3), and each employee must live in only one city ( ). Assume the source instance is

= {Emp(john), Emp(mary), KnownC(john, miami)}, and assume our query asks for all pairs of employees living in diferent cities. This can be written as:

(, ′) = ∃∃′ EmpC(, ) ∧ EmpC(′, ′) ∧ ¬SameC(, ′).

One would expect that the set of certain answers to is empty, since it is not certain that john and mary live in diferent cities. However, no CWA-solution admits mary and john to live in the same city, and thus (john, mary) is a certain answer under the CWA-solution-based semantics. □

The approach of [ 3 ], where the notion of GCWA* -solution is presented, seems to be the most promising one. For positive queries, certain answers w.r.t. GCWA* -solutions coincide with certain answers w.r.t. classical solutions. Moreover, GCWA* -solutions solve some other limitations of CWA-solutions, like the one discussed in Example 2. However, the practical applicability of this semantics is somehow limited, since the (rather involved) construction of GCWA* -solutions easily makes certain query answering undecidable, even for very simple settings with only two source-to-target TGDs, and no target dependencies.

Other semantics have been proposed in [ 4 ], but they are only defined for data exchange settings without target dependencies. Hence, one needs to assume that the target schema has no dependencies at all.

As a final remark, in a data exchange setting, it is usually assumed that the source is not always available, and thus the materialization of a single solution, over which certain answers can be computed, is a crucial requirement. This is especially true when using weakly-acyclic dependencies, which form the standard language for data exchange [ 1 ]. However, none of the semantics above allow for the materialization of such a special solution, for weakly-acyclic settings.

In this paper, we propose a new notion of data exchange solution, dubbed supported solution, which allows us to deal with general queries, but at the same time is suitable for practical applications. That is, we show that certain answers under supported solutions naturally generalize certain answers under classical solutions, when focusing on positive queries. Moreover, such solutions do not make any assumption on how values associated to existential variables compare to other values, hence solving issues like the ones of Example 2.

As expected, there is a price to pay to get meaningful answers over general queries: we show that certain answering is undecidable for general settings, but becomes coNP-complete when we focus on weakly-acyclic dependencies.

Morever, we also show that for weakly-acyclic settings, we can construct a target instance in polynomial time, which is similar in spirit to a universal solution of [ 1 ], that can be exploited for computing exact answers, for positive queries, and approximate answers, for general FO queries, in polynomial time. The latter is achieved by adapting existing approximation algorithms originally defined for querying incomplete databases.

2. Preliminaries

Basics. We consider pairwise disjoint countably infinite sets Const, Var, Null of constants, variables, and labeled nulls. Nulls are denoted by the symbol ⊥, possibly subscripted. A term is a constant, a variable, or a null. We additionally assume the existence of countably infinite set Rel of relations, disjoint from the previous ones. A relation has an arity, denoted (), which is a non-negative integer. We also use / to say that is a relation of arity . A schema is a set of relations. A position is an expression of the form [], where is a relation and ∈ {1, . . . , ()}.

An atom (over a schema S) is of the form (t), where is an -ary relation (of S) and t is a tuple of terms of length . We use t[] to denote the -th term in t, for ∈ {1, . . . , }. An atom without variables is a fact. An instance (over a schema S) is a finite set of facts (over S). A database is an instance without nulls. For a set of atoms , dom() is the set of all terms in , whereas var() is the set dom() ∩ Var. A homomorphism from a set of atoms to a set of atoms is a function ℎ : dom() → dom() that is the identity on Const, and such that for each atom (t) = (1, . . . , ) ∈ , (ℎ(t)) = (ℎ(1), . . . , ℎ()) ∈ . Dependencies. A tuple-generating dependency (TGD) (over a schema S) is a first-order formula of the form ∀x, y (x, y) → ∃z (y, z), where x, y, z are disjoint tuples of variables, and and are conjunctions of atoms (over S) without nulls, and over the variables in x, y and y, z respectively. The body of , denoted body( ), is (x, y), whereas the head of , denoted head( ), is (y, z). We use exvar( ) to denote the tuple z and fr( ) to denote the tuple y, also called the frontier of . An equality-generating dependency (EGD) (over a schema S) is a first-order formula of the form ∀x (x) → = , where x is a tuple of variables, a conjunction of atoms (over S) without nulls, and over x, and , ∈ x. The body of , denoted body( ), is (x), and the head of , denoted head( ), is the equality = . For clarity, we will omit the universal quantifiers in front of dependencies and replace the conjunction symbol ∧ with a comma. Moreover, with a slight abuse of notation, we sometimes treat a conjunction of atoms as the set of its atoms. Consider an instance . We say that satisfies a TGD if for every homomorphism ℎ from body( ) to , there is an extension ℎ′ of ℎ such that ℎ′ is a homomorphism from head( ) to . We say that satisfies an EGD = (x) → = , if for every homomorphism ℎ from body( ) to , ℎ() = ℎ(). satisfies a set of TGDs and EGDs Σ if satisfies every TGD and EGD in Σ .

Queries. A query (x), with free variables x, is a first-order (FO) formula (x) with free variables x. The arity of (x), denoted (), is the number |x|. The output of (x) over an instance , denoted (), is the set {t ∈ dom()|x| | |= (t)}, where |= is FO entailment.1 A 1We assume active domain semantics, i.e., quantifiers range over the terms in the given instance. query is Boolean if it has arity 0, in which case its output over an instance is either the empty set or the empty tuple ⟨⟩. A conjunctive query (CQ) is a query of the form (x) = ∃y (x, y), where (x, y) is a conjunction of atoms over x and y. A union of conjunctive queries (UCQ) is a query of the form (x) = ⋁︀

=1 (x), where each (x) is a CQ. We also refer to UCQs as positive queries.

Data Exchange Settings. A data exchange setting (or simply setting) is a tuple of the form = ⟨S, T, Σ , Σ ⟩, where S, T are disjoint schemas, called source and target schema, respectively; Σ is a finite set of TGDs, called the source-to-target TGDs of , such that for each TGD ∈ Σ , body( ) is over S and head( ) is over T; Σ is a finite set of TGDs and EGDs over T, called the target dependencies of . We say is TGD-only if Σ contains only TGDs.

A source (resp., target) instance of is an instance over S (resp., T). We assume that source instances are databases, i.e., they do not contain nulls. Given a source instance of , a solution of w.r.t. is a target instance of such that ∪ satisfies Σ and satisfies Σ [ 1 ]. We use sol(, ) to denote the set of all solutions of w.r.t. .

Given a data exchange setting = ⟨S, T, Σ , Σ ⟩, a source instance of and a query over T, the certain answers to over w.r.t. is the set cert (, ) = ⋂︀∈sol(,) ( ).

To distinguish between the notion of solution (resp., certain answers) above and the one defined in Section 3, we will refer to the former as classical.

A universal solution of w.r.t. is a solution ∈ sol(, ) such that, for every ′ ∈ sol(, ), there is a homomorphism from to ′ [ 1 ]. Letting ( )↓ = ( ) ∩ Const|x|, for any instance and query (x), the following is well-known: Theorem 1 ([ 1 ]). Consider a data exchange setting , a source instance of and a positive query . If is a universal solution of w.r.t. , then cert (, ) = ( )↓.

3. Semantics for General Queries

The goal of this section is to introduce a new notion of solution for data exchange that we call supported. As already discussed, the main issue we want to solve w.r.t. classical solutions is that such solutions are too permissive, i.e., they allow for the presence of facts that are not a certain consequence of the source instance and the dependencies. Consider again Example 1. The (classical) solution ′ in Example 1 is not supported, since from the source instance and the dependencies, we cannot conclude that the fact Paid(2) should occur in the target. On the other hand, the solution = {AllOrd(1), AllOrd(2), Paid(1)} is supported: it contains precisely the facts supported by and the dependencies, and no more than that. Similarly, considering Example 2, the instance = {EmpC(john, miami), EmpC(mary, chicago), SameC(john, mary)} is a solution, but it is not supported, since from the source and the dependencies we cannot certainly conclude that john and mary live in the same city. We now formalize the above intuitions.

Consider a TGD and a mapping ℎ from the variables of to Const. We say that a TGD ′ is a ground version of (via ℎ) if ′ = ℎ(body( )) → ℎ(head( )).

Definition 1 (ex-choice). An ex-choice is a function , that given as input a TGD = (x, y) → ∃z (y, z) and a tuple t ∈ Const|y|, returns a set (, t) of pairs of the form (, ), one for each existential variable ∈ exvar( ), where is a constant of Const. Note that if does not contain existential variables, (, t) is the empty set.

Intuitively, given a TGD, an ex-choice specifies a valuation for the existential variables of the TGD which depends on a given valuation of its frontier variables.

We now define when a ground version of a TGD indeed assigns existential variables according to an ex-choice.

Definition 2 (Coherence). Consider a TGD = (x, y) → ∃z (y, z), an ex-choice and a ground version ′ of via some mapping ℎ. We say that ′ is coherent with if for each existential variable ∈ exvar( ), (, ℎ()) ∈ (, ℎ (y)).

For a set Σ of TGDs and EGDs, and an ex-choice , Σ denotes the set of dependencies obtained from Σ , where each TGD in Σ is replaced with all ground versions of that are coherent with . Note that the set Σ can be infinite. We are now ready to present our notion tuples scert (, ) = ⋂︀

∈ssol(,) ( ).

Definition 4 (Supported Certain Answers). Consider a data exchange setting , a source instance of and a query over T. The supported certain answers to over w.r.t. is the set of supported solutions of w.r.t. . such that ∪ satisfies Σ and satisfies Σ such that ∪ ′ satisfies Σ and ′ satisfies Σ

Definition 3 (Supported Solution). Consider a setting = ⟨S, T, Σ , Σ ⟩ and a source instance of . A target instance of is a supported solution of w.r.t. if there exists an ex-choice , and there is no other target instance ′ ⊊ of

Note that a supported solution contains no nulls. We use ssol(, ) to denote the set of all {(, chicago)}. Then, Σ is Example 3. Consider the data exchange setting and the source instance of Example 2. The target instance = {EmpC(john, miami), EmpC(mary, chicago)} is a supported solution of w.r.t. . Indeed, consider the ex-choice such that ( 1, john) = {(, miami)}, and ( 1, mary) = {Emp( ) → EmpC(, {KnownC(, ) → EmpC(, ) | ,

∈ Const}∪ ) | ∈ Const ∧ (, ) ∈ ( 1, )}, whereas Σ is the set containing the EGD of Example 2, and the set of TGDs {EmpC(, ), EmpC( ′, ) → SameC(, ′) | , ′, ∈ Const}.

Clearly, ∪ satisfies Σ , and satisfies Σ , and any other strict subset ′ of is such that ∪ ′ does not satisfy Σ . Another supported solution is {EmpC(john, miami), EmpC(mary, miami),

SameC(john, mary)}.

supported certain answers.

With the notion of supported solution in place, it is now straightforward to define the □ □ Example 4. Consider the data exchange setting , the source instance , and the query of Example 1. It is not dificult to see that the only supported solution of w.r.t. is the instance = {AllOrd(1), AllOrd(2), Paid(1)}. Thus, the supported certain answers to over w.r.t. are scert (, ) = ( ) = {2}. Consider now the data exchange setting , the source instance , and the query of Example 2. Then, one can verify that scert (, ) = ∅. □

We now start establishing some important results regarding supported solutions and supported certain answers. The following theorem states that supported solutions are a refined subset of the classical ones, but whether a supported solution exists is still tightly related to the existence of a classical one.

Theorem 2. Consider a data exchange setting . For every source instance of , its holds that (1) ssol(, ) ⊆ sol(, ), and (2) ssol(, ) = ∅ if sol(, ) = ∅.

Regarding certain answers, we show that supported solutions indeed enjoy an important property: supported certain answers and classical certain answers coincide, when focusing on positive queries. Note that this does not necessarily follow from Theorem 2. Theorem 3. Consider a setting = ⟨S, T, Σ , Σ ⟩ and a positive query over T. For every source instance of , scert (, ) = cert (, ).

From the above, we conclude that for positive queries, certain query answering can be performed as done in the classical setting, and thus all important results from that setting, like query answering via universal solutions, carry over.

Corollary 1. Consider a setting = ⟨S, T, Σ , Σ ⟩ and a positive query over T. If is a (classical) universal solution of w.r.t. , then scert (, ) = ( )↓.

Proof. It follows from Theorem 1 and Theorem 3. □

We now move to the complexity analysis of the two most important data exchange tasks: deciding whether a supported solution exists, and computing the supported certain answers to a query.

4. Complexity

In data exchange, it is usually assumed that a setting does not change over time, and a given query is much smaller than a given source instance. Thus, for understanding the complexity of a data exchange problem, it is customary to assume that and are fixed, and only is considered in the complexity analysis, i.e., we consider the data complexity of the problem. Hence, the problems we are going to discuss will always be parametrized via a setting , and a query (for query answering tasks). The first problem we consider is deciding whether a supported solution exists; is a fixed data exchange setting.

PROBLEM : EXISTS-SSOL()

INPUT : A source instance of .

QUESTION : Is ssol(, ) ̸= ∅? The above problem is very important in data exchange, as one of the main goals is to actually construct a target instance that can be exploited for query answering purposes. Hence, knowing in advance whether at least a supported solution exists is of paramount importance.

Thanks to Item 2 of Theorem 2, all the complexity results for checking the existence of a classical solution can be directly transfered to our problem.

Theorem 4. There exists a data exchange setting such that EXISTS-SSOL() is undecidable. Proof. It follows from Theorem 2 and from the fact that there exists a data exchange setting such that checking whether a classical solution exists is undecidable [ 5 ]. □

Despite the negative result above, we also inherit positive results from the literature, when focusing on some of the most important data exchange scenarios, known as weakly-acyclic. Such settings only allow target TGDs to belong to the language of weakly-acyclic TGDs, which have been first introduced in the seminal paper [ 1 ], and is now well-established as the main language for data exchange purposes.

We start by introducing the notion of weak-acyclicity. We recall that for a schema S, pos(S) denotes the set of all positions [], where / ∈ S and ∈ {1, . . . , }, and for a TGD = (x, y) → ∃z (y, z), fr( ) denotes the tuple y.

Definition 5 (Dependency Graph [ 1 ]). Consider a set Σ of TGDs over a schema S. The dependency graph of Σ is a directed graph dg(Σ) = ( , ), where = pos(S) and contains only the following edges. For each ∈ Σ , for each ∈ fr( ), and for each position in body( ) where occurs: • there is a normal edge (, ′) ∈ , for each position ′ in head( ) where occurs, and • there is a special edge (, ′) ∈ , for each position ′ in head( ) where an existentially quantified variable ∈ exvar( ) occurs. □ Definition 6. A set of TGDs Σ is weakly-acyclic if no cycle in dg(Σ) contains a special edge. A data exchange setting ⟨S, T, Σ , Σ ⟩ is weakly-acyclic if the set of TGDs in Σ is weakly-acyclic.□ Example 5. The settings of Examples 1 and 2 are weakly-acyclic, whereas the data exchange setting = ⟨S, T, Σ , Σ ⟩, where S = {/2}, T = { /2}, Σ = {(, ) → (, )}, and Σ = { (, ) → ∃ (, )} is not, since ( [ 2 ], [ 2 ]) is a special edge in dg(Σ ). □

The following result follows.

Theorem 5. For every weakly-acyclic data exchange setting , EXISTS-SSOL() is in PTIME. Proof. It follows from Theorem 2 and [1, Corollary 3.10]. □

We now move to the second crucial task: computing supported certain answers. Since this problem outputs a set, it is standard to focus on its decision version. For a fixed data exchange setting and a fixed query , we consider the following decision problem:

PROBLEM : SCERT(, ) INPUT : A source instance of and a tuple t ∈ Const().

QUESTION : Is t ∈ scert (, )?

One can easily show that the above problem is logspace equivalent to the one of computing the supported certain answers.

We start by studying the problem in its full generality, and show that there is a price to pay for query answering with general queries.

Theorem 6. There exists a data exchange setting = ⟨S, T, Σ , Σ ⟩, with Σ having only TGDs, and a query over T, such that SCERT(, ) is undecidable.

Although the complexity result above tells us that computing supported certain answers might be infeasible in some settings, we can show that for weakly-acyclic settings, the complexity is more manageable.

Theorem 7. For every weakly-acyclic setting and every query , SCERT(, ) is in coNP, and there exists a weakly-acyclic setting that is TGD-only and a query such that SCERT(, ) is coNP-hard.

We point out that the above result is in contrast with all the data exchange semantics discussed in the introduction, for which computing certain answers is undecidable, even for weakly-acyclic settings [ 2, 3 ].

We conclude this section by recalling that for positive queries, supported certain answers coincide with the classical ones (Theorem 3), and computing (classical) certain answers for weakly-acyclic settings, under positive queries, is tractable [ 1 ]. Hence, the result below follows. Corollary 2. For every weakly-acyclic setting and every positive query , SCERT(, ) is in PTIME.

5. Query Answering via Materialization

As already discussed in the introduction, it is crucial in data exchange, whenever it is possible, to materialize a target instance starting from the source instance and the schema mapping, in such a way that supported certain query answers can be computed by considering the target instance alone. The goal of this section is thus to study the problem of materializing such an instance, when focusing on our notion of supported solutions.

It would be very useful if such a special target instance could be computed in polynomial-time, already for weakly-acyclic settings. However, due to Theorem 7, this would imply PTIME = coNP. Hence, we need something diferent.

We introduce a special instance that enjoys the following properties: the answers over this instance are an approximation (i.e., a subset) of the supported certain answers, for general queries, but they coincide with supported certain answers, for positive queries. We also show that we can compute such an instance in polynomial time for weakly-acyclic settings.

Our approach relies on conditional instances [ 6 ], which we introduce in the following. Conditional instances. A valuation is a mapping from Const ∪ Null to Const that is the identity on Const. A condition is an expression that can be built using the standard logical connectives ∧, ∨, ¬, ⇒, and expressions of the form = , where , ∈ Const ∪ Null. We will also use ̸= as a shorthand for ¬( = ). We write |= to state that satisfies , and |= if all valuations satisfying satisfy the condition . A conditional fact is a pair ⟨, ⟩, where is a fact and is a condition. A conditional instance ℐ is a finite set of conditional facts. We also denote ℐ[ 1 ] = { | ⟨, ⟩ ∈ ℐ}. A possible world of a conditional instance ℐ is an instance such that there exists a valuation with = { ( ) | ⟨, ⟩ ∈ ℐ and |= }. We use pw(ℐ) to denote the set of all possible worlds of ℐ.

Definition 7. Consider a conditional instance ℐ and a query . The conditional certain answers of over ℐ is the set con-cert(ℐ, ) = ⋂︀∈pw(ℐ) ( ). □

We are now ready to introduce our main tool.

Definition 8 (Approximate Conditional Solution). Consider a data exchange setting and a source instance of . A conditional instance is an approximate conditional solution of w.r.t. , if for every query : 1. ssol(, ) ⊆ pw( ), and thus con-cert( , ) ⊆ scert (, ), and 2. if is positive, con-cert( , ) = scert (, ). □

That is, an approximate conditional solution is a conditional instance that allows to compute approximate answers for general queries, and exact answers for positive queries.

It is easy to observe that there are settings = ⟨S, T, Σ , Σ ⟩ and source instances for which an approximate conditional solution might not exist, even if is weakly-acyclic. This is due to the presence of EGDs in Σ .

However, for weakly-acyclic settings without EGDs, an approximate conditional solution always exists, and we present a polynomial-time algorithm that is able to construct one. We plan to deal with general weakly-acyclic settings with EGDs in a future work.

The algorithm is a variation of the well-known chase algorithm, which iteratively introduces new facts, starting from a source instance, whenever a TGD is not satisfied, i.e., it triggers the TGD (e.g., see [ 7 ]). This variation also allows for a conditional triggering of TGDs, where new atoms are introduced, under the condition that some terms in the body coincide.

Normal TGDs. To simplify the discussion, we consider an extension of TGDs that allow for equality predicates in the body. We will use these TGDs to rewrite standard TGDs in the following normal form. A normal form TGD is an expression of the form (x, y) ∧ (x, y) → ∃z (y, z), where and are conjunctions of atoms, uses only variables and each variable in occurs once in . The formula is a conjunction of equalities of the form = , where is a variable in x or y, and is either a variable in x or y, or a constant. The above equalities denote which variables should be considered to be the same and which positions should contain a constant. A (set of) standard TGDs Σ can be converted in normal form in the obvious way. We denote norm(Σ) as the (set of) TGDs in normal form obtained from Σ .

In the following, fix a conditional instance ℐ, a TGD with norm( ) = (x, y) ∧ (x, y) → ∃z (y, z), and a homomorphism ℎ from (x, y) to ℐ[ 1 ]. We use ℎ( (x, y)) to denote the condition obtained from (x, y) by replacing each variable therein with ℎ(). Letting ℎ( (x, y)) = { 1, . . . , }, we use Φ ,ℐℎ to denote the set of all conditions of the form ℎ( (x, y)) ∧ 1 ∧ · · · ∧ , such that ⟨ , ⟩ ∈ ℐ, for each ∈ {1, . . . , }. Example 6. Consider the TGD 3 of Example 2. The normal form TGD norm( 3) is EmpC(, ), EmpC(′, ′), = ′ → SameC(, ′). Consider now the conditional instance ℐ = {⟨EmpC(john, miami), ⊥1 = ⟩, ⟨EmpC(mary, ⊥2), true⟩}, where is a constant. Then, the mapping ℎ = {/john, /miami, ′/mary, ′/⊥2} is a homomorphism from {EmpC(, ), EmpC(′, ′)} to ℐ[ 1 ]. Moreover, Φ ℐ3,ℎ = {⊥2 = miami ∧ ⊥1 = }. □

We are now ready to define the notion of conditional chase step. In what follows, for a conditional instance ℐ, a TGD with norm( ) = (x, y) ∧ (x, y) → ∃z (y, z) and a homomorphism ℎ from (x, y) to ℐ[ 1 ], we use result(ℐ, , ℎ ) to denote the set of atoms obtained from head(norm( )), where each frontier variable in fr(norm( )) is replaced with ℎ(), and each existential variable in exvar(norm( )) is replaced with a fresh null not occurring in ℐ. Definition 9 (Conditional Chase Step). Consider a conditional instance ℐ, a TGD , and let norm( ) = (x, y)∧ (x, y) → ∃z (y, z). A conditional chase step of ℐ w.r.t. is an expression ,ℎ, of the form ℐ →− , where (i) ℎ is a homomorphism from (x, y) to ℐ[ 1 ], (ii) ∈ Φ ,ℐℎ is such that ̸|= false, and (iii) = ℐ ∪ {⟨, ⟩ | ∈ result(ℐ, , ℎ )}. □ Example 7. Consider the conditional instance ℐ, the homomorphism ℎ and the TGD 3 of Example 6. Then, ℐ →3,ℎ, − is a conditional chase step, where is the condition ⊥2 = miami ∧ ⊥1 = , and = ℐ ∪ {⟨SameC(john, mary), ⟩}. □

With the notion of conditional chase step at hand, we can define conditional chase sequences, which are sequences of conditional chase steps. For this we need one additional notion. A conditional tuple is a pair ⟨t, ⟩, where t is a tuple of constants and nulls, and a condition. For two conditional tuples ⟨t, ⟩, ⟨u, ⟩, with |t| = |u| = , we write ⟨t, ⟩ ⊑ ⟨u, ⟩ if |= and |= t = u, where t = u is a shortand for the condition ⋀︀ =1 t[] = u[]. We write ⟨t, ⟩ ̸⊑ ⟨u, ⟩, if ⟨t, ⟩ ⊑ ⟨u, ⟩ does not hold.

Intuitively, ⟨t, ⟩, ⟨u, ⟩ should be understood to be two tuples, each of them belonging to a set of “worlds”, described by the valuations that satisfy their conditions. Moreover, ⟨t, ⟩ ⊑ ⟨u, ⟩ means that every world in which t occurs, is also a world in which u occurs ( |= ), and in each such world, t and u are the same tuples.

Definition 10 (Conditional Chase Sequence). Consider a TGD-only data exchange setting = ⟨S, T, Σ , Σ ⟩ and a source instance of . A conditional chase sequence of w.r.t. is a (possibly infinite) sequence of conditional instances ()≥ 0, where for each ≥ 0, →−,ℎ, +1, and (i) 0 = {⟨, true⟩ | ∈ }, (ii) ∈ Σ ∪ Σ , for ≥ 0, and (iii) for every < , if = = , then ⟨ℎ(fr( )), ⟩ ̸⊑ ⟨ℎ (fr( )), ⟩. □

Intuitively, condition (iii) of the definition above is required to prevent the chase sequence to apply superfluous steps. An example follows.

Example 8. Consider the data exchange setting = ⟨S, T, Σ , Σ ⟩, with S = {/1, /1}, T = {/2, /1, /1}, where the sets Σ = { 1, 2} and Σ = { 3} are such that 1 = () → ∃ (, ), 2 = () → (), and 3 = (, ), () → (). Given = {(), (1), (2)}, the following is a conditional chase sequence of w.r.t. : 0 = {⟨(), true⟩, ⟨(1), true⟩, ⟨(2), true⟩}, 1 = 0 ∪ {⟨(, ⊥), true⟩},

For a TGD-only setting = ⟨S, T, Σ , Σ ⟩ and a source instance of , a finite conditional chase sequence ()0≤ ≤ of w.r.t. is maximal if there is no conditional instance +1, such that ()0≤ ≤ +1 is a conditional chase sequence of w.r.t. . We call the result of the maximal sequence.

Example 9. Consider the conditional chase sequence 0, . . . , 5 of Example 8. The sequence is ,ℎ, maximal, since any conditional chase step of the form 5 →− , for some conditional instance , cannot satisfy condition (iii) of Definition 10. The sequence 0, . . . , 4 is not maximal because although a conditional atom of the form ⟨ (), ⟩ is already present in 4, an additional conditional atom of the same form needs to be introduced in 5. This is needed to allow the fact () to be present for two diferent reasons (either because ⊥ = 1 or ⊥ = 2), and both reasons should occur in the result of the sequence. □

We are now ready to present the main result of this section. In what follows, given a schema S and a conditional instance ℐ, ℐ|S denotes the restriction of ℐ to its conditional facts with relations in S.

Theorem 8. Consider a TGD-only setting = ⟨S, T, Σ , Σ ⟩ and a source instance of . If is the result of a maximal conditional chase sequence of w.r.t. , then |T is an approximate conditional solution of w.r.t. .

We can further show that for TGD-only weakly-acyclic settings, a maximal conditional chase sequence always exists, and its length is polynomial. Moreover, its result can be computed in polynomial time.

Theorem 9. Consider a data exchange setting that is TGD-only and weakly-acyclic, and a source instance of . Every conditional chase sequence = ()0≤ ≤ of w.r.t. is such that is a polynomial of ||, and the result of can be computed in polynomial time w.r.t. ||. Querying Approximate Conditional Solutions. What now remains to show is how we can compute the conditional certain answers over an approximate conditional solution, e.g., obtained via the conditional chase. It is known that the problem of computing the conditional certain answers of a query is coNP-hard in general, even when we assume all the conditions in the given conditional instance are true [ 6 ]. Hence, given a data exchange setting and a source instance of , if an approximate conditional instance of w.r.t. can be computed in polynomial time w.r.t. ||, one cannot always compute con-cert( , ), in polynomial time. Hence, we require an additional step of approximation.

To this end, we exploit an existing algorithm presented in [ 8 ] to compute approximate certain answers over incomplete databases. Here we only recall the main properties of the algorithm. For more details, we refer the reder to [ 8 ].

For a query , the function ̇ from conditional instances to sets of tuples is defined in [ 8 ], and it is such that the following holds. Theorem 10 ([ 8 ]). Given a conditional instance over some schema S and a query over S, then 1) ̇( ) ⊆ con-cert( , ); 2) if is positive, ̇( ) = con-cert( , ); 3) if every condition in is a conjunction of equalities, then ̇( ) is computable in polynomial time w.r.t. | |.

From the result above, Theorem 9, and Definition 10, we obtain the following crucial result. Corollary 3. Consider a TGD-only weakly-acyclic setting . For every source instance of , an approximate conditional solution of w.r.t. can be constructed in polynomial time, and for every query , ̇ is such that 1) ̇( ) ⊆ con-cert( , ) ⊆ scert (, ); 2) if is positive, ̇( ) = con-cert( , ) = scert (, ); 3) ̇( ) is computable in polynomial time w.r.t. | |.

6. Connections with Other Work and Next Steps

Conditional instances and, more in general, incomplete databases, have already been employed in the context of data exchange. However, in most of previous work, incomplete databases are used to encode source and target instances with incomplete information. In [ 9 ], the authors extend the standard data exchange framework by allowing source and target instances to be incomplete databases, encoded via some representation system, such as conditional instances. There, the main goal is to study the semantics of data exchange under the assumption that the source and target instances can be incomplete. In contrast, in our work, we focus on the classical data exchange setting, where source and target instances are standard (complete) databases. Here we employ incomplete databases, in particular conditional instances, only as a tool to compute the (approximate) certain answers of a query over our set of supported solutions, which are standard databases as well.

In Section 5 we have seen how a conditional extension of the chase procedure can be employed to compute in polynomial time, for weakly-acyclic settings, an approximate conditional solution. The idea of extending the chase procedure with conditional TGD applications is not new and has been explored in previous work. In particular, the work of [ 10 ] introduces a conditional version of the chase procedure which is similar to ours. The main diference is that the conditional chase of [ 10 ] is much simpler, since it is an extension of the simplest variant of the chase algorithm, called oblivious chase, while ours can be seen as an extension of the more refined semi-oblivious (a.k.a. skolem) chase (see., e.g., [ 11, 12, 13, 14, 15 ] for more details). For this reason, it is not dificult to show that when considering weakly-acyclic settings, the conditional chase of [ 10 ] is not guaranteed to terminate, while termination for weakly-acyclic settings is a crucial property for our purposes, since we need to be able to construct a finite conditional instance in this case.

The problem of dealing with non-monotonic queries has been investigated beyond data exchange, as for example for Ontology-Mediated Query Answering (OMQA). In this setting, we are given an instance (database) , an ontology Σ encoded in some logical formalism (e.g., via TGDs), and a query (x), and the goal is to compute all the certain answers of (x) w.r.t. and Σ , i.e., the tuples that are answers to in every model of the logical theory ∪ Σ . A relevant work in this scenario is the one in [ 16 ], where the authors define the query language EQL-Lite(), parametrized with a standard (positive) query language (e.g., UCQs), and supports a limited form of negation. In particular, an expression in EQL-Lite() is of the form := K | 1 ∧ 2 | ¬ 1 | ∃ 1, where ∈ , and 1, 2 are EQL-Lite() expressions.

Here, the epistemic operator K is applied to expressions ∈ and returns the certain answers of w.r.t. the input database and the ontology Σ . The main instantiation of EQL-Lite that the authors study is EQL-Lite(UCQ), i.e., where coincides with the set of all UCQs.

From the above definition, we observe that negation is applied only to (a combination of) the certain answers of positive queries. This gives a semantics to negation that fundamentally difers from ours, as illustrated in the following example.

Consider the data exchange setting = ⟨S, T, Σ , Σ ⟩, where S stores employees of a company in the unary relation Emp. The target schema T contains a unary relation Emp′ storing employees, the ternary relation Addr assigning to each employee her work and home address, and the unary relation WorkFromHome, storing employees working from home. Assume we have Σ = { 1 = Emp() → Emp′(), 2 = Emp() → ∃ ∃ Addr(, , )} and Σ = { 3 = Addr(, , ) → WorkFromHome()}.

The above setting copies employees from the source to the target via the TGD 1, while the TGD 2 states that each employee must have a work and home address, denoted via the existential variables and , respectively. Finally, the TGD 3 states that if the work and home address of an employee coincide, then this employee works from home.

Assume the source instance is = {Emp(john)}, and let be the query asking for all employees who do not work from home, i.e., () = Emp′() ∧ ¬WorkFromHome().

According to [ 16 ], the query corresponds to the EQL-Lite(UCQ) expression ′() = K Emp′() ∧ ¬K WorkFromHome(). Letting = , and Σ = Σ ∪ Σ , roughly, the above means that an empolyee is an answer to the query ′ if she is present in all models of ∪ Σ and such that there is at least one model in which the employee does not work from home. Under this interpretation, the answer to ′ is john. However, under our semantics, the answer to is empty. Hence, the fundamental diference is that negation, under EQL-Lite, is interpreted as negating classical certain answering, and thus an expression ¬K is “satisfied” when at least one model/solution does not entail , while in our case, we consider the given query as a whole, and require it to be satisfied in every valid solution.

We conclude by discussing avenues for further research. First, we would like to extend the conditional chase to weakly-acyclic settings with EGDs, and identify relevant data exchange settings for which computing the supported certain answers is tractable. Moreover, we plan to experimentally evaluate the current approximation approaches both in terms of quality and of running time via a dedicated benchmark, as did e.g., in the context of approximate consistent query answering [ 17 ]. Since explaining query answering has recently drawn considerably attention under existential rule languages (e.g., see [ 18, 19, 20, 21, 22 ]), and knowledge representation in general (e.g., in the context of argumentation [ 23 ]) an interesting direction for future work is to address such issue in our setting. Also, it would be interesting to account for user preferences when answering queries, as recently done in [ 24 ].

[1]

Fagin ,

P. G.

Kolaitis ,

R. J.

Miller ,

Popa , Data exchange: semantics and query answering , TCS 336 ( 2005 ) 89 - 124 .

[2]

Hernich ,

Libkin ,

Schweikardt , Closed world data exchange, TODS 36 ( 2011 ) 14 : 1 - 14 : 40 .

[3]

Hernich , Answering Non-Monotonic Queries in Relational Data Exchange , LMCS Volume 7, Issue 3 ( 2011 ).

[4]

Libkin ,

Sirangelo , Data exchange and schema mappings in open and closed worlds , JCSS 77 ( 2011 ) 542 - 571 .

[5]

P. G.

Kolaitis ,

Panttaja ,

W. C.

Tan , The complexity of data exchange , in: PODS , 2006 , pp. 30 - 39 .

[6]

Imielinski , W. Lipski, Incomplete information in relational databases, J. ACM 31 ( 1984 ) 761 - 791 .

[7]

Tsamoura ,

Carral ,

Malizia ,

Urbani , Materializing knowledge bases via trigger graphs , Proc. VLDB Endow . 14 ( 2021 ) 943 - 956 .

[8]

Greco ,

Molinaro , I. Trubitsyna , Approximation algorithms for querying incomplete databases , Inf. Syst . 86 ( 2019 ) 28 - 45 .

[9]

Arenas ,

Pérez ,

J. L.

Reutter , Data exchange beyond complete data , J. ACM 60 ( 2013 ) 28 : 1 - 28 : 59 .

[10]

Grahne ,

Onet , On conditional chase termination , in: AMW , 2011 .

[11]

Calautti ,

Gottlob ,

Pieris , Chase termination for guarded existential rules , in: PODS , 2015 , pp. 91 - 103 .

[12]

Grahne ,

Onet , Anatomy of the chase , Fundam. Inform . 157 ( 2018 ) 221 - 270 .

[13]

Calautti ,

Pieris , Oblivious chase termination: The sticky case , in: ICDT , 2019 , pp. 17 : 1 - 17 : 18 .

[14]

Calautti ,

Pieris , Semi-oblivious chase termination: The sticky case , Theory Comput. Syst . 65 ( 2021 ) 84 - 121 .

[15]

Calautti ,

Gottlob ,

Pieris , Non-uniformly terminating chase: Size and complexity , in: PODS, 2022 , pp. 369 - 378 .

[16]

Calvanese , G. De Giacomo,

Lembo ,

Lenzerini ,

Rosati , Eql-lite: Efective ifrst-order query processing in description logics ., in: IJCAI , 2007 , pp. 274 - 279 .

[17]

Calautti ,

Console ,

Pieris , Benchmarking approximate consistent query answering , in: L. Libkin , R. Pichler , P. Guagliardo (Eds.), PODS, 2021 , pp. 233 - 246 .

[18]

Lukasiewicz ,

Malizia ,

Molinaro , Explanations for negative query answers under inconsistency-tolerant semantics , in: Proc. IJCAI , 2022 , pp. 2705 - 2711 .

[19]

İ.

İ. Ceylan ,

Lukasiewicz ,

Malizia ,

Molinaro ,

Vaicenavicius , Preferred explanations for ontology-mediated queries under existential rules , in: Proc. AAAI , 2021 , pp. 6262 - 6270 .

[20]

İ.

İ. Ceylan ,

Lukasiewicz ,

Malizia ,

Molinaro ,

Vaicenavicius , Explanations for negative query answers under existential rules , in: Proc. KR , 2020 , pp. 223 - 232 .

[21]

Lukasiewicz ,

Malizia ,

Molinaro , Explanations for inconsistency-tolerant query answering under existential rules , in: Proc. AAAI , 2020 , pp. 2909 - 2916 .

[22]

İ.

İ. Ceylan ,

Lukasiewicz ,

Malizia ,

Vaicenavicius , Explanations for query answers under existential rules , in: Proc. IJCAI , 2019 , pp. 1639 - 1646 .

[23]

Alfano ,

Calautti ,

Greco ,

Parisi , I. Trubitsyna , Explainable acceptance in probabilistic abstract argumentation: Complexity and approximation , in: KR, 2020 , pp. 33 - 43 .

[24]

Calautti ,

Greco ,

Molinaro , I. Trubitsyna , Preference-based inconsistency-tolerant query answering under existential rules , Artif. Intell . 312 ( 2022 ) 103772 .