Approximation Algorithms for Schema-Mapping
       Discovery from Data Examples

Balder ten Cate2,1 , Phokion G. Kolaitis1,3 , Kun Qian1 , and Wang-Chiew Tan1

                        University of California Santa Cruz1
                                  LogicBlox, Inc.2
                            IBM Research - Almaden3
                 {btencate,kolaitis,kunqian,tan}@soe.ucsc.edu



       Abstract. In recent years, data examples have been at the core of several
       different approaches to schema-mapping design. In particular, Gottlob
       and Senellart introduced a framework for schema-mapping discovery from
       a single data example, in which the derivation of a schema mapping is cast
       as an optimization problem. Our goal is to refine and study this framework
       in more depth. Among other results, we design a polynomial-time log(n)-
       approximation algorithm for computing optimal schema mappings from a
       given data example, for a restricted class of schema mappings; moreover,
       we show that this approximation ratio cannot be improved.

       Keywords: Schema Mappings, Data Examples, Approximation


1    Introduction
Schema mappings between a source schema and a target schema constitute the
essential building blocks for the specification of data exchange and data integration
tasks [6,15,16]. However, deriving a schema mapping between two schemas can be
an involved and time-consuming process. Most commercial systems (e.g., Altova
MapForce and Stylus Studio) and research prototypes derive schema mappings
automatically using correspondences between the elements of two schemas, as
specified by a user through a graphical interface [7, 13, 18]. A shortcoming of this
approach is that multiple non-equivalent schema mappings may be compatible
with the same set of correspondences. More recently, data examples have been
used as the basis of alternative approaches to schema-mapping design.
    Data examples have been used in several different areas of data management
(e.g., see [11,17,19,21]). In the context of schema-mapping design, a data example
is a pair (I, J) consisting of a source instance and a target instance. In [2, 22],
the goal is, given a schema mapping, to produce meaningful data examples that
illustrate the schema mapping at hand. In [1], the goal is to investigate whether a
given schema mapping can be uniquely characterized by a finite set of examples.
In the reverse direction, there has been work on deriving a schema mapping that
fits one or more given data examples [3, 8].
    Gottlob and Senellart [12] introduced and studied a cost model for deriving a
schema mapping from a ground data example, i.e., a data example consisting of
instances without nulls. Ideally, given a ground data example (I, J), one would
like to find a GLAV schema mapping M that is valid (meaning that (I, J)
satisfies the constraints of M) and fully explaining for (I, J) (meaning that
every fact in J belongs to every target instance K such that (I, K) satisfies
the constraints of M). However, such a schema mapping need not exist. For
this reason, Gottlob and Senellart developed a framework that uses two schema-
mapping languages: the standard language of GLAV constraints and an extended
language GLAV=,≠ of repairs that augments, in a precise way, GLAV constraints
with equalities, inequalities, and ground facts. The cost of a GLAV schema
mapping M w.r.t. a data example (I, J) is the minimum of the sizes of valid
and fully explaining schema mappings M′ obtained from M via a sequence of
operations that transform M to a schema mapping in the extended language.
Given a ground data example (I, J), the goal then is to find an optimal GLAV
schema mapping for (I, J), that is to say, a GLAV schema mapping whose cost
w.r.t. (I, J) is as small as possible. Gottlob and Senellart [12] investigated several
different decision problems arising naturally in the context of the above framework,
including the Existence-Cost problem: given a ground data example (I, J)
and a positive integer k, is there a GLAV schema mapping M whose cost w.r.t.
(I, J) is at most k? In [12], it is shown that the Existence-Cost problem is
NP-hard, and that it belongs to the third level Σ3p of the polynomial hierarchy.
    Here, we contribute to the study of schema-mapping discovery from data
examples by refining and investigating the Gottlob-Senellart framework in more
depth. We refine the framework by considering several different schema-mapping
languages. At the base level, we consider sublanguages L of the standard language
of GLAV constraints, such as the languages of GAV constraints, LAV constraints,
and SH-LAV (single-head LAV) constraints. For each of these schema-mapping
languages L, we consider two corresponding repair languages, namely, L=,≠ and
L=; the former extends L with equalities, inequalities, and ground facts (as
in [12]), while the latter extends L with equalities and ground facts only.
    The main algorithmic problem in the Gottlob-Senellart framework is to
compute an optimal schema mapping for a given ground data example. The
tractability of this optimization problem was left open in [12] (the hardness
of the Existence-Cost problem does not imply hardness of the optimization
problem, because computing the cost of a given schema mapping for a given
ground data example is itself a hard problem). We show that this optimization
problem is indeed hard for GLAV mappings, w.r.t. both GLAV=-repairs and
GLAV=,≠-repairs. More precisely, unless RP = NP, there is no polynomial-time
algorithm that, given a ground data example, computes a GLAV mapping whose
cost is bounded by some fixed polynomial in the cost of the optimal GLAV
mapping. Moreover, an analogous result holds for GAV mappings.
   After this, we design an approximation algorithm for computing near optimal
schema mappings for the case of SH-LAV schema mappings. Specifically, we
present a polynomial-time O(log n)-approximation algorithm that, given a data
example (I, J), produces a SH-LAV mapping M together with a SH-LAV=-repair
M′ of M for (I, J), whose cost is within a logarithmic factor of the cost of the
optimal SH-LAV mapping for (I, J). Finally, we establish that this is a best
possible approximation result.


2    Preliminaries

Schemas and Instances An instance I over a schema R = {R1 , . . . , Rk } can be
identified with the finite set of all facts Ri (a1 , . . . , am ) such that Ri is a relation
symbol of R and (a1 , . . . , am ) is a tuple that belongs to the relation Ri^I
interpreting Ri in I. Database instances contain values that are either constants or
nulls. A ground instance is an instance such that all of its facts are ground, i.e., they
consist entirely of constants. In this paper, we will primarily consider ground
instances, and we will, at times, drop the adjective “ground”. We write adom(I)
to denote the active domain of an instance I.
Schema Mappings Let S, T be two relational schemas, called the source schema
and the target schema. A schema mapping is a triple M = (S, T, Σ) consisting
of a source schema S, a target schema T, and a set Σ of constraints that are
typically expressed in some fragment of first-order logic.
    A GLAV (Global-and-Local-As-View) constraint, also known as a tuple-
generating dependency (tgd), is a FO-formula of the form ∀x(ϕ(x) → ∃yψ(x,y)),
where ϕ(x) is a conjunction of atoms over S and ψ(x,y) is a conjunction of atoms
over T. We will often drop the universal quantifiers when writing constraints.
    The following are two important special cases of GLAV constraints: (1) A GAV
(Global-As-View) constraint is a GLAV constraint whose right-hand side is a
single atom without existential quantifiers, i.e., it is of the form ∀x(ϕ(x) → T (x)).
(2) A LAV (Local-As-View) constraint is a GLAV constraint whose left-hand side
is a single atom, i.e., it is of the form ∀x(S(x) → ∃yψ(x,y)). There is another
special case of LAV constraints, called SH-LAV (Single-Head LAV) constraints.
A SH-LAV constraint is a GLAV constraint in which both the left-hand side and
the right-hand side are single atoms, i.e., it is of the form ∀x(S(x) → ∃yT (x,y)).
In fact, every DL-Lite concept subsumption axiom is equivalent to a SH-LAV
constraint [5].

Example 1. Geo(x, y) → City(y) is a LAV constraint that is also a GAV con-
straint, and hence, in particular, is a SH-LAV constraint; Geo(x, y) → ∃zCity(z)
is a SH-LAV constraint that is not a GAV constraint. Finally, the constraint
Geo(x, y) ∧ Geo(x, z) → City(y) is a GAV constraint that is not a LAV constraint.

    A GLAV schema mapping is a schema mapping M = (S, T, Σ), where Σ is a
finite set of GLAV constraints. The notions of GAV, LAV, and SH-LAV schema
mappings are defined in an analogous way.
Data Examples Let S be a source schema and T be a target schema. A data
example is a pair (I, J) such that I is a source instance and J is a target instance.
A data example (I, J) is ground if both I and J are ground instances. The size
||(I, J)|| of a data example (I, J) is the total number of facts in I and J.
                            Table 1. Data Example (I, J)
Source Instance                                                  Target Instance
Geo(US, San Francisco)   Geo(Calif, San Jose)                    City(San Francisco)
Geo(US, Los Angeles )    Geo(Calif, San Francisco)               City(San Jose)
Geo(US, Miami)           Geo(Calif, Los Angeles)                 City(San Diego)
Geo(US, Boston)          Geo(Calif, San Diego)                   City(Los Angeles)
Geo(US, New York)        Geo(Canada, Vancouver)                  City(Boston)
Geo(NorthAm, Boston)     Geo(Germany, Berlin)                    City(Toronto)
Geo(NorthAm, Toronto)    Geo(Japan, Tokyo)                       City(New York)
Geo(NorthAm, New York)   Geo(China, Beijing)                     City(Miami)
Geo(NorthAm, Miami)      Geo(France, Paris)
                         Geo(UK, London)


    Let M = (S, T, Σ) be a schema mapping, and let (I, J) be a ground data
example. We say that M is valid for (I, J) if (I, J) |= Σ, that is, if (I, J)
satisfies all constraints in Σ. We say that M explains a fact f of J with respect
to I if, for all target instances K such that (I, K) |= Σ, we have that f ∈ K.
Finally, we say that M is fully explaining for a data example (I, J) if M explains
each fact of J with respect to I. We will use the expression vfe as a shorthand
for “valid and fully explaining”.
    In [1, 3], a schema mapping M was said to fit a data example (I, J) if J is
a universal solution of I w.r.t. M, i.e., J is a solution for I w.r.t. M such that
for every solution J′ for I, there is a homomorphism from J to J′ that maps
constants to themselves (see [10] for more on universal solutions). It is not hard
to verify that if (I, J) is a ground data example and M is a schema mapping,
then M fits (I, J) if and only if M is vfe for (I, J).
Example 2. Let M = (S, T, Σ), where S = {Geo(area, city)}, T = {City(cityName)},
Σ = {Geo(x, y) → City(y)}. Consider the data examples (Ii , Ji ), i = 1, 2, 3, where
I1 : Geo(CA, San Francisco), Geo(CA,San Jose), Geo(US,Boston), Geo(US, Los Angeles)
J1 : City(San Francisco), City(San Jose)
I2 : Geo(CA, San Francisco)
J2 : City(San Francisco), City(New York)
I3 : Geo(CA, San Francisco), Geo(US,New York)
J3 : City(San Francisco), City(New York)

 In these data examples, since I1 contains Geo(US, Boston) but City(Boston)
is not in J1 , we have that M is not valid for (I1 , J1 ); moreover, M is valid for
(I2 , J2 ), but it fails to explain the target fact City(New York); however, M is
valid and fully explaining for (I3 , J3 ). Using a simple automorphism argument, it
is easy to see that there is no valid and fully explaining GLAV schema mapping for
(I1 , J1 ). Moreover, since J2 contains a constant that does not appear in I2 ,
there is no valid and fully explaining GLAV schema mapping for (I2 , J2 ).
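To make the validity and full-explanation checks concrete, the following is a minimal
sketch (ours, not part of the paper) for the specific constraint Geo(x, y) → City(y) of
Example 2, with instances represented as plain sets of tuples; all function names are
illustrative.

    # Sketch (illustrative only): validity and explained facts for the single
    # constraint Geo(x, y) -> City(y) of Example 2, over instances given as
    # sets of tuples.

    def explained_facts(geo):
        """Target City facts forced by Geo(x, y) -> City(y): one per value of y."""
        return {(y,) for (_, y) in geo}

    def is_valid(geo, city):
        """Valid: every fact required by the constraint occurs in the target."""
        return explained_facts(geo) <= city

    def is_fully_explaining(geo, city):
        """Fully explaining: every target fact is forced by the constraint."""
        return city <= explained_facts(geo)

    # (I3, J3) from Example 2: valid and fully explaining (vfe).
    I3 = {("CA", "San Francisco"), ("US", "New York")}
    J3 = {("San Francisco",), ("New York",)}
    assert is_valid(I3, J3) and is_fully_explaining(I3, J3)

    # (I1, J1): not valid, since Geo(US, Boston) forces City(Boston), absent from J1.
    I1 = {("CA", "San Francisco"), ("CA", "San Jose"),
          ("US", "Boston"), ("US", "Los Angeles")}
    J1 = {("San Francisco",), ("San Jose",)}
    assert not is_valid(I1, J1)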


3    Repair Framework and Cost Model
In [12], the language of GLAV constraints was used as the “base language”,
and an extended “repair language” was introduced; the repair language includes
equalities, inequalities, and ground facts, and is used for “repairing” GLAV
schema mappings, so that they are valid and fully explaining for a given ground
data example. We refine this framework by considering different sublanguages of
the language of GLAV constraints, as well different repair languages.

Definition 1. Fix a schema-mapping language L (e.g., GLAV). An L=,≠-constraint
is a formula θ of the form ∀x(ϕ(x) ∧ α(x) → ∃y(ψ(x,y) ∧ β(x,y))), where
1. ∀x(ϕ(x) → ∃yψ(x,y)) is an L-constraint;
2. α(x) is a possibly empty conjunction of formulas of the form x ∼ c, where
   ∼ ∈ {=, ≠}, x ∈ x, and c is a constant;
3. β(x,y) is a possibly empty conjunction of formulas of the form (x1 = c1 ∧ . . . ∧ xn = cn ) →
   y = c, where each xi ∈ x, y ∈ y, and c1 , . . . , cn , c are constants.
An L=-constraint is an L=,≠-constraint with no conjuncts of the form x ≠ c in α.

    We will write L to denote a schema-mapping language (e.g., GLAV), and we
will write L=,≠ and L= to denote the corresponding repair languages (with and
without inequalities, respectively). We will use L∗ to refer to either L=,≠ or L= .
We say that the formula θ as above is an L∗-repair of the constraint ∀x(ϕ(x) → ∃yψ(x,y)).

Definition 2. An L∗-repair of an L-schema mapping M = (S, T, Σ) is an L∗-
schema mapping M′ = (S, T, Σ′) such that each ψ ∈ Σ′ is either a ground fact
or an L∗-repair of some φ ∈ Σ, and, for each φ ∈ Σ, at least one L∗-repair of
φ belongs to Σ′. We write repairL∗ (M) to denote the set of all L∗-repairs of M.

    The above definition differs from that of repairs in [12] in that we allow M′ to
contain multiple repairs of the same L-constraint from M, whereas, in [12], the
L∗-constraints of M′ stand in a one-to-one correspondence with the L-constraints
of M. There are cases in which an L-constraint may need to be repaired more
than once with, say, different combinations of equalities and inequalities. In such
cases, the optimal schema mapping in [12] may contain multiple copies of the
same GLAV constraint or multiple GLAV constraints that are identical up to a
renaming of variables (see Example 4). Our definition addresses this shortcoming.

Example 3. Recall that the data example (I1 , J1 ) in Example 2 has no vfe GLAV
schema mapping. Consider the following GLAV=,≠ schema mappings:
 • M1 ={Geo(x, y) ∧ (x = CA) → City(y)}
 • M2 ={Geo(x, y) → ∃z City(z) ∧ (y = San Jose → z = San Jose)
   ∧ (y = San Francisco → z = San Francisco)}
 • M3 ={City(San Francisco), City(San Jose)}
 • M4 ={Geo(x, y) ∧ (x ≠ US) → City(y)}
 • M5 ={Geo(x, y) ∧ (x = moon) → City(x)}
M1 , M2 , M3 , and M4 are vfe for (I1 , J1 ), but in different ways: M1 consists of
a repair of Geo(x, y) → City(y) that uses an equality to restrict the constraint
to source facts whose first attribute has value CA; M2 consists of a repair of
Geo(x, y) → ∃zCity(z) that uses two conditional equalities to explicitly specify
a value of z depending on the value of y; M3 is a repair of the empty schema
mapping, and it lists all the ground facts of J1 ; finally, M4 consists of a repair
of the same constraint as M1 that, instead, uses an inequality.
    M5 is valid, but is not fully explaining for (I1 , J1 ). It uses the equality (x = moon),
where moon is outside the active domain of (I1 , J1 ). We informally say that this
equality cancels M5 (with respect to the data example (I1 , J1 )).
Example 4. We illustrate the need for multiple repairs of the same constraint.
For the data example (I, J) in Table 1, consider the GLAV schema mappings
 – M1 = {Geo(x, y) → City(y)},
 – M2 = {Geo(x1 , y1 ) → City(y1 ), Geo(x2 , y2 ) → City(y2 )}
and consider the GLAV=,≠-schema mapping Mr specified by the constraints
      Geo(x, y) ∧ (x = Calif) → City(y) and Geo(x, y) ∧ (x = NorthAm) → City(y).
    Note that Mr is vfe for (I, J). Note also that M2 consists of two renamings
of the GLAV constraint of M1 . By our definition of repair, Mr is a repair of M1
and also of M2 . However, according to the definition of repair in [12], Mr does
not qualify as a repair of M1 , and, indeed, one of the smallest repairs of M1 is
obtained by adding an equality (x = US) to the left-hand side of the constraint
in M1 together with adding three ground facts: City(San Jose), City(Toronto),
and City(San Diego). Since the cost of a schema mapping w.r.t. a data example
will be measured by the size of the smallest repair, we have that, under the cost
model of [12], M2 has a lower cost than M1 , which is counterintuitive.
    As pointed out in [12], if (I, J) is a ground data example, then there is always
a vfe GLAV=,≠ schema mapping for (I, J). In fact, trivially, the collection of
all the ground facts in J is a vfe schema mapping for (I, J). This shows that,
for all schema-mapping languages L considered here, every ground data example
admits a vfe L=-schema mapping. Indeed, every L-schema mapping has an L=-repair
that is vfe (obtained by cancelling all constraints and adding all ground target facts).
    The cost model introduced in [12] focuses on finding a schema mapping in
the base language, such that the cost of transforming it to a vfe schema mapping
in the repair language is as small as possible.
Definition 3. [12] The size of a first-order formula ϕ, denoted size(ϕ), is the
number of occurrences of variables and constants in ϕ (each variable and constant
is counted as many times as it occurs in ϕ); occurrences of variables as arguments
of quantifiers do not count. A ground fact R(a1 , . . . , an ), for present purposes,
is viewed as shorthand for ∃x1 , . . . , xn (R(x1 , . . . , xn ) ∧ x1 = a1 ∧ · · · ∧ xn = an );
therefore, we consider its size to be 3n. The size of a repair of a schema mapping
is the sum of the sizes of the constraints and the ground facts of the repair.
Note that the motivation for the above treatment of ground facts is to discourage
the use of ground facts in repairing schema mappings.
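For concreteness, a small worked computation (our own illustration, consistent with
Examples 3 and 5): the constraint Geo(x, y) ∧ (x = CA) → City(y) of M1 has size 5,
since the counted occurrences are x, y, x, CA, y; a ground fact such as City(San Francisco)
has one argument (n = 1) and thus size 3 · 1 = 3, so the schema mapping
M3 = {City(San Francisco), City(San Jose)} has size 3 + 3 = 6.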
Definition 4. The cost of an L-schema mapping M for a data example (I, J)
and a repair language L∗ , denoted by cost(M, (I, J), L∗ ), is the smallest size of
a vfe L∗ -repair of M for (I, J). An L-schema mapping M is L∗ -optimal for
(I, J) if cost(M, (I, J), L∗ ) is the minimum of the quantities cost(M′ , (I, J), L∗ ),
where M′ ranges over all L-schema mappings.
Example 5. We continue from Example 3 by considering the schema mappings
Ma = {Geo(x, y) → City(y)}, Mb = {Geo(x, y) → ∃z City(z)}, and Mc = ∅.
    Both M1 and M4 are vfe repairs of Ma for (I1 , J1 ) and both have size 5;
it is easy to verify that no other vfe repair of Ma has size less than 5, hence
cost(Ma , (I1 , J1 ), GLAV=,≠) = 5. The repair M3 is the only vfe repair of
Mc , and cost(Mc , (I1 , J1 ), GLAV=,≠) = 6. Moreover, M2 is a vfe repair of
Mb and size(M2 ) = 11, but the cost of Mb is not 11. To see this, consider the
schema mapping M′2 specified by the constraint and ground fact
        M′2 = {Geo(x, y) → ∃z (City(z) ∧ z = San Jose), City(San Francisco)}.
Clearly, M′2 is also a vfe repair of Mb for (I1 , J1 ) and has size 8.
    The notion of an optimal schema mapping in Definition 4 differs from the
corresponding definition in [12] in that we consider a slightly different notion
of repair, as explained in the remarks following Definition 2. A consequence of
this is that an optimal schema mapping in our sense has cost less than or equal
to the cost of an optimal schema mapping in the sense of [12]. However, the
complexity-theoretic results concerning the various algorithmic problems do not
depend on this, and, indeed, the proofs are not affected by this change.

4     Hardness of Computing Optimal Schema Mappings
As we have seen, the main characteristic of the framework introduced in [12] and
refined here is to cast schema-mapping discovery as an optimization problem:
given a finite set of ground data examples, produce a schema mapping of minimum
cost. Two different decision problems naturally arise in this setting.
Definition 5. The decision problem CostL∗ asks: given a ground data example
(I, J), an L-schema mapping M, and an integer k ≥ 0, is cost(M, (I, J), L∗ ) ≤ k?
Definition 6. The decision problem Existence-CostL∗ asks: given a ground
data example (I, J) and an integer k ≥ 0, does there exist an L-schema mapping
M such that cost(M, (I, J), L∗ ) ≤ k?
    In these two problems, the source schema and the target schema are part
of the input. One can also consider the variants of these problems, such as
Existence-CostL∗ (S, T), obtained by fixing the source and target schemas.
    The preceding problems were introduced and studied in [12] when the base
language L is GLAV or GAV, and the repair language is L=,≠. It was shown there
that CostGLAV=,≠ belongs to Σ3p (the third level of the polynomial hierarchy) and
is Π2p-hard, while CostGAV=,≠ belongs to Σ2p and is DP-hard. It was also shown
that Existence-CostGLAV=,≠ belongs to Σ3p and that Existence-CostGAV=,≠
belongs to Σ2p ; moreover, both these problems were shown to be NP-hard.1
    We show that the problems Cost and Existence-Cost are NP-complete for
the languages of LAV schema mappings and SH-LAV schema mappings. Neither
of these schema-mapping languages was considered in [12].
1
    The NP-hardness proof given in [12] is flawed. The authors have shared with us, in
    private communication, a correct NP-hardness proof.
Theorem 1. Assume that L∗ ∈ {LAV=,≠, LAV=, SH-LAV=,≠, SH-LAV=}. Then
the problems CostL∗ and Existence-CostL∗ are NP-complete. In fact, there
are fixed schemas S and T for which these problems are NP-complete.
    Theorem 1 is established via a reduction from the Set-Cover problem. Next,
we show that the problem of computing optimal GLAV schema mappings is
not solvable in polynomial time, unless RP = NP. Note that this does not follow
directly from the NP-hardness of Existence-CostGLAV=,≠ established in [12],
because computing the cost of a given schema mapping is itself a hard problem
(CostGLAV=,≠ is Π2p-hard). The same holds true for GAV mappings.
Actually, we show a stronger result: unless RP = NP, there is no polynomial-time
algorithm that, given a ground data example (I, J), computes a GLAV schema
mapping whose cost is bounded by a polynomial in the cost of the optimal GLAV
schema mapping; moreover, the same holds true for GAV schema mappings.
Theorem 2. Assume that L ∈ {GLAV, GAV} and let p(x) be a polynomial. Un-
less RP = NP, there is no polynomial-time algorithm that, given a ground data ex-
ample (I, J), computes an L-schema mapping M such that cost(M, (I, J), L=,≠) ≤
p(cost(Mopt , (I, J), L=,≠)), where Mopt is an L=,≠-optimal schema mapping for
(I, J). Moreover, the same holds true when L=,≠ is replaced by L=. In fact, there
are fixed source and target schemas for which these results hold.
       Theorem 2 also shows that the computational hardness is, to some extent,
robust under changes to the precise cost function. The proof is based on a procedure
that transforms a ground data example (I, J) into another ground data example
(I′, J′), which, intuitively, contains many isomorphic copies of the original data
example (I, J), such that every near-optimal GLAV (GAV) schema mapping
for (I′, J′) corresponds to a fitting GLAV (GAV) schema mapping for (I, J) of
near-minimal size. This is combined with a hardness result for computing fitting
GAV schema mappings of near-minimal size, established in [8].


5    Approximation of Optimal Single-Head LAV Mappings
We now study the approximation properties of the following optimization problem:
Definition 7. The Optimal-RepairL∗ (S, T) problem asks: given a ground data
example (I, J) with source schema S and target schema T, compute a minimal-size
valid and fully explaining L∗-schema mapping for (I, J).
    This problem is equivalent to the problem that asks to compute an optimal
L-schema mapping M together with a minimal-size L∗ -repair of M; in particular,
it contains, as a special case, the problem of computing an optimal L-schema
mapping for a given ground data example. This is so because, from a minimal-size
valid and fully explaining L∗ -schema mapping, we can immediately extract an
optimal L-schema mapping, simply by dropping all equalities, inequalities,
and ground facts. Therefore, it follows from Theorem 2 that (assuming RP ≠ NP)
there is no polynomial-time algorithm that solves Optimal-RepairL∗ , when
L∗ ∈ {GLAV=, GLAV=,≠, GAV=, GAV=,≠}.
    We establish a positive approximability result for SH-LAV= mappings.
Theorem 3. Fix a pair of schemas S, T. There is a polynomial-time H(n)-
approximation algorithm for Optimal-RepairSH-LAV= (S, T), where
H(n) = 1 + 1/2 + · · · + 1/n is the n-th harmonic number.
    Our approximation algorithm is obtained by establishing a close connection
between Set-Cover and Optimal-RepairL∗ . It is known that, although the
Set-Cover problem is not constant-approximable, it is H(n)-approximable,
where n is the size of the universe. Moreover, this approximation ratio is known to
be asymptotically optimal: unless P = NP, Set-Cover cannot be approximated
in polynomial time within a factor of c ln(n), where c is a constant (specified in
[4, 14]) and n is the size of the universe (recall that H(n) = O(ln n)). Our
approximation algorithm is largely based on the greedy approximation algorithm
for Weighted Set-Cover [9].
Here, we can think of each constraint as describing a set of facts, namely, the set
of target facts that it explains, and the size of the constraint is the weight of the
corresponding set.
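To illustrate this connection, here is a minimal sketch (ours, not the paper's
implementation) of the greedy rule for Weighted Set-Cover on which the algorithm is
modeled: each candidate (intuitively, a repaired constraint or a ground fact) is weighted
by its size and covers the target facts it explains, and the candidate with the smallest
weight per newly covered fact is chosen repeatedly.

    # Sketch (illustrative only): greedy weighted set cover. A candidate is a
    # pair (weight, covered) -- e.g. the size of a repaired SH-LAV= constraint
    # together with the set of target facts it explains.

    def greedy_weighted_set_cover(universe, candidates):
        """Greedily pick candidates by weight per newly covered element.

        Achieves an H(n) approximation of the optimal total weight, where n is
        the size of the universe.
        """
        uncovered = set(universe)
        chosen = []
        while uncovered:
            best = min(
                (c for c in candidates if c[1] & uncovered),
                key=lambda c: c[0] / len(c[1] & uncovered),
                default=None,
            )
            if best is None:   # nothing covers the remaining facts; in our setting,
                break          # ground facts are always available as candidates
            chosen.append(best)
            uncovered -= best[1]
        return chosen

    # Toy usage: one "constraint" of weight 5 covering {f1, f2} competes with
    # three "ground facts" of weight 3 each.
    cands = [(5, frozenset({"f1", "f2"})), (3, frozenset({"f1"})),
             (3, frozenset({"f2"})), (3, frozenset({"f3"}))]
    print(greedy_weighted_set_cover({"f1", "f2", "f3"}, cands))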
    The above approximation algorithm can be extended in a straightforward
manner to the case of SH-LAV=,≠-constraints with a bounded number of inequal-
ities per variable, as well as to LAV=,≠-constraints with a bounded number of
inequalities per variable and a bounded number of atoms in the right-hand side.
We leave it as an open problem whether an analog of Theorem 3 holds for the
general case of SH-LAV=,≠ and LAV=,≠.
    We conclude with a matching lower bound for the approximability of Optimal-
RepairSH-LAV= (S, T). This result is established via an approximation-preserving
reduction (more precisely, an L-reduction [20]) from Minimum Set Cover.

Theorem 4. Let L ∈ {LAV,SH-LAV}. There are fixed schemas S and T, and a
constant c such that there is no polynomial-time c ln(n)-approximation algorithm
for Optimal-RepairL= (S, T), where n is the total number of target facts in the
input data example. The same holds true for Optimal-RepairL=,≠ (S, T).


6    Concluding Remarks
The cost function in Definition 4 naturally extends to sets E of data examples,
where cost(M, E, L∗ ) is defined as the smallest size of an L∗ -repair of M that is
vfe for each (I, J) ∈ E (provided that such a repair exists). Similarly, the decision
problems and optimization problems we studied naturally extend to the case
where the input is a finite set of data examples. All upper bounds presented in
this paper hold true also in the more general setting for multiple data examples.
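In symbols (our notation, mirroring Definition 4): cost(M, E, L∗) = min{ size(M′) :
M′ ∈ repairL∗ (M) and M′ is vfe for every (I, J) ∈ E }.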
    Two important questions that remain open concern the existence of polynomial-time
H(n)-approximation algorithms for computing near-optimal LAV=,≠ and SH-LAV=,≠
schema mappings.
    We have focused on obtaining a complexity-theoretic understanding of the
algorithmic aspects of schema-mapping discovery from data examples. Our results
pave the way for leveraging further the rich area of approximation algorithms
and applying it to schema-mapping discovery. In parallel, we have embarked
on a prototype implementation of our approximation algorithm enhanced with
heuristic rules. While much remains to be done, there is preliminary evidence
that this approach may lead to a reasonably efficient system for schema-mapping
discovery from data examples.


7   Acknowledgments

Research partially supported by NSF Grant IIS-1217869; Tan is partially sup-
ported by NSF grant IIS-1450560.


References
 1. B. Alexe, B. ten Cate, P. G. Kolaitis, and W.-C. Tan. Characterizing schema
    mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, 2011.
 2. B. Alexe, L. Chiticariu, R. J. Miller, and W. C. Tan. Muse: Mapping Understanding
    and deSign by Example. In ICDE, pages 10–19, 2008.
 3. B. Alexe, B. ten Cate, P. G. Kolaitis, and W. C. Tan. Designing and refining
    schema mappings via data examples. In SIGMOD, pages 133–144, 2011.
 4. N. Alon, D. Moshkovitz, and S. Safra. Algorithmic construction of sets for k-
    restrictions. ACM Trans. Algorithms, 2(2):153–177, Apr. 2006.
 5. A. Artale, D. Calvanese, R. Kontchakov, and M. Zakharyaschev. The DL-Lite family
    and relations. JAIR, 36:1–69, 2009.
 6. P. Barceló. Logical foundations of relational data exchange. SIGMOD Record,
    38(1):49–58, 2009.
 7. A. Bonifati, E. Chang, T. Ho, V. Lakshmanan, and R. Pottinger. HePToX: Marrying
    XML and Heterogeneity in Your P2P Databases. In VLDB, pages 1267–1270, 2005.
 8. B. ten Cate, V. Dalmau, and P. G. Kolaitis. Learning schema mappings. ACM
    Trans. Database Syst., 38(4):28, 2013.
 9. V. Chvátal. A greedy heuristic for the set covering problem. Math. Oper. Res.,
    4:233–235, 1979.
10. R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange: Semantics and
    Query Answering. TCS, 336(1):89–124, 2005.
11. G. H. Fletcher and C. M. Wyss. Towards a general framework for effective solutions
    to the data mapping problem. Journal on Data Semantics, XIV, 2009.
12. G. Gottlob and P. Senellart. Schema mapping discovery from data instances. JACM,
    57(2), 2010.
13. L. M. Haas, M. A. Hernández, H. Ho, L. Popa, and M. Roth. Clio Grows Up: From
    Research Prototype to Industrial Tool. In ACM SIGMOD, pages 805–810, 2005.
14. D. S. Johnson. Approximation algorithms for combinatorial problems. In Proceedings
    of STOC, pages 38–49, New York, NY, USA, 1973. ACM.
15. P. G. Kolaitis. Schema Mappings, Data Exchange, and Metadata Management. In
    ACM PODS, pages 61–75, 2005.
16. M. Lenzerini. Data Integration: A Theoretical Perspective. In ACM PODS, pages
    233–246, 2002.
17. H. Mannila and K.-J. Räihä. Automatic generation of test data for relational
    queries. JCSS, 38(2):240–258, 1989.
18. R. J. Miller, L. M. Haas, and M. A. Hernández. Schema Mapping as Query
    Discovery. In VLDB, pages 77–88, 2000.
19. C. Olston, S. Chopra, and U. Srivastava. Generating example data for dataflow
    programs. In ACM SIGMOD, pages 245–256, 2009.
20. C. Papadimitriou and M. Yannakakis. Optimization, approximation, and complexity
    classes. In Proceedings of STOC, pages 229–234, New York, NY, USA, 1988. ACM.
21. A. D. Sarma, A. G. Parameswaran, H. Garcia-Molina, and J. Widom. Synthesizing
    view definitions from data. In ICDT, pages 89–103, 2010.
22. L. Yan, R. J. Miller, L. M. Haas, and R. Fagin. Data-Driven Understanding and
    Refinement of Schema Mappings. In ACM SIGMOD, pages 485–496, 2001.