=Paper=
{{Paper
|id=Vol-2994/paper21
|storemode=property
|title=Benchmarking Approximate Consistent Query Answering (Discussion Paper)
|pdfUrl=https://ceur-ws.org/Vol-2994/paper21.pdf
|volume=Vol-2994
|authors=Marco Calautti,Marco Console,Andreas Pieris
|dblpUrl=https://dblp.org/rec/conf/sebd/CalauttiCP21
}}
==Benchmarking Approximate Consistent Query Answering (Discussion Paper)==
Marco Calautti¹, Marco Console², Andreas Pieris³
¹ DISI, University of Trento, Italy
² Sapienza, University of Rome, Italy
³ University of Edinburgh, UK
Abstract
Consistent query answering (CQA) aims to deliver meaningful answers when queries are evaluated
over inconsistent databases. Such answers must be certainly true in all repairs, which are consistent
databases whose difference from the inconsistent one is somehow minimal. A more informative notion is
the percentage of repairs in which a candidate answer is true, called relative frequency. Computing this
percentage is intractable in general, but for the relevant setting of conjunctive queries and primary keys,
data-efficient randomized approximation schemes exist. Our goal is to perform a thorough experimental
evaluation and comparison of those approximation schemes, and to provide new insights on which
technique is preferable depending on key characteristics of the input.
1. Introduction
A database is inconsistent if it does not conform to its specifications given in the form of
constraints. There is a consensus that inconsistency is a real-life phenomenon that arises
due to many reasons such as integration of conflicting sources. With the aim of obtaining
conceptually meaningful answers to queries posed over inconsistent databases, Arenas, Bertossi,
and Chomicki introduced in the late 1990s the notion of Consistent Query Answering (CQA) [1].
The key elements underlying CQA are (1) the notion of repair of an inconsistent database D,
that is, a consistent database that differs somehow minimally from the original database D, and
(2) the notion of query answering based on certain answers, i.e., answers that are entailed by
every repair. Here is a simple example taken from [2]:
Example 1. Consider the single relation Employee(id, name, dept) where the attribute id is the
key of the relation Employee. Consider also the database D consisting of the tuples:
(1, Bob, HR) (1, Bob, IT) (2, Alice, IT) (2, Tim, IT).
Observe that the above database is inconsistent w.r.t. the key constraint since we are uncertain
about Bob's department, and the name of the employee with id 2. In this case, to devise a repair,
we only need to keep one of the two atoms that are in conflict. In this way, we obtain a ⊆-maximal
subset of D that is consistent. Consider now the Boolean query that asks whether employees 1 and
2 work in the same department. This query is true only in two repairs, and thus, according to
certain answers, it is not entailed.
SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
marco.calautti@unitn (M. Calautti); console@diag.uniroma1.it (M. Console); apieris@inf.ed.ac.uk (A. Pieris)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CQA has been extensively studied both in theory (see, e.g., [3, 4, 5]), and in practice (see,
e.g., [6, 7]). Nevertheless, the CQA approach comes with a conceptual limitation. The notion of
certain answers only says that a candidate answer is entailed by all repairs, or is not entailed by
at least one repair. But, as has been frequently observed (e.g., see [2, 8, 9]), the former is too
strict, while the latter is not very useful in a practical context.
A Refined Approach. A simple step in obtaining more information for a candidate answer
is to consider its relative frequency, i.e., the percentage of repairs that entail the answer [2].
In Example 1, the relative frequency of the empty tuple (i.e., the only candidate answer for
Boolean queries) is 50% since, out of four repairs in total, only two satisfy the query. However,
computing this simple measure is #P-hard, even in data complexity, i.e., when the query and
the constraints are fixed, and even if we focus on conjunctive queries and primary keys [10].
Hence, the way to proceed is to give up exact solutions, and target approximations.
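To make the measure concrete, the relative frequency of Example 1 can be computed by brute-force repair enumeration (feasible only on toy instances); the sketch below is our own illustration, not the paper's implementation:

```python
from itertools import product

# Toy instance of Example 1: Employee(id, name, dept) with key {id}.
employee = [(1, "Bob", "HR"), (1, "Bob", "IT"),
            (2, "Alice", "IT"), (2, "Tim", "IT")]

# Facts sharing a key value form a block of mutually conflicting facts.
blocks = {}
for fact in employee:
    blocks.setdefault(fact[0], []).append(fact)

# A repair keeps exactly one fact per block; enumerate all of them.
repairs = [set(choice) for choice in product(*blocks.values())]

def same_dept(repair):
    # Boolean query: do employees 1 and 2 work in the same department?
    dept = {emp_id: d for (emp_id, _name, d) in repair}
    return dept[1] == dept[2]

freq = sum(same_dept(r) for r in repairs) / len(repairs)
print(len(repairs), freq)  # 4 0.5
```

On the four repairs of Example 1, only the two containing (1, Bob, IT) satisfy the query, giving the 50% reported above.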
Data-efficient Approximations. Approximation algorithms have been crucial for tackling
intractable problems in different areas of Data Management. Examples are in the context of
approximating certain answers over incomplete databases, where different approximation algo-
rithms have been devised and experimentally evaluated [11, 12, 13]. In the case of inconsistent
databases, data-efficient randomized algorithms exist, in the relevant setting of conjunctive
queries and primary keys, that approximate the relative frequency of a candidate answer within
a desired error, with high probability [2]. These rely on techniques that have been originally
proposed to approximate the number of satisfying assignments of DNF Boolean formulas [14].
However, no corresponding infrastructure for experimentally evaluating such techniques exists.
Main Objective. The main objective of this work is to provide a benchmark for these randomized
approximation schemes, when considering primary keys and conjunctive queries, covering
a wide range of scenarios, and to exploit this benchmark to assess how the performance of the
approximation schemes is affected by key parameters of the input.¹
2. Preliminaries
Relational Databases. A schema S is a finite set of relation symbols (or predicates) with
associated arity. A database D over a schema S is a set of facts of the form R(t̄), where R ∈ S,
and t̄ = t₁, ..., tₙ is a tuple of terms that are drawn from a countably infinite set C of constants.
We write dom(D) for the active domain of D, that is, the set of constants occurring in D.
Primary Key Constraints. A key constraint (or key) κ over a schema S is an expression
key(R) = {1, ..., k}, where R ∈ S has arity n, and k ≤ n; we also call it an R-key. A
database D satisfies κ if, for every R(t̄), R(s̄) ∈ D, t̄[i] = s̄[i] for each i ∈ {1, ..., k} implies
t̄ = s̄. We say that D is consistent w.r.t. a set Σ of keys, written D |= Σ, if D satisfies each
key in Σ; otherwise, it is inconsistent. A set of primary keys Σ is a set of keys such that,
for each predicate R, Σ has at most one R-key. For a fact α = R(c₁, ..., cₙ), the key value of
α w.r.t. Σ, denoted key_Σ(α), is defined as ⟨R, ⟨c₁, ..., c_k⟩⟩ if key(R) = {1, ..., k} ∈ Σ,
and ⟨R, ⟨c₁, ..., cₙ⟩⟩ otherwise. Let block_Σ(α, D) = {β ∈ D | key_Σ(β) = key_Σ(α)}, and
block_Σ(D) = {block_Σ(α, D) | α ∈ D}. A repair of D w.r.t. Σ is a database obtained by choosing
one fact from each B ∈ block_Σ(D). We denote the set of repairs of D w.r.t. Σ as rep(D, Σ).

¹ An extended version of this paper has been accepted to PODS 2021, and can be found at https://tinyurl.com/
f8sev5pj. Source code and test scenarios can be found at https://gitlab.com/mcalautti/cqabench.
Conjunctive Queries. A conjunctive query q(x̄) over a schema S is a first-order expression of
the form ∃ȳ (R₁(z̄₁) ∧ ··· ∧ Rₙ(z̄ₙ)) that mentions only atoms over S, and has free variables x̄.
For a tuple t̄ ∈ C^|x̄|, q(t̄) denotes q after replacing the i-th variable in x̄ with the i-th constant
in t̄. A homomorphism from q to a database D is a function h from the set of variables and
constants in q to dom(D) that is the identity over C such that Rᵢ(h(z̄ᵢ)) ∈ D for every i ∈ [n].
We write q →ₕ D to say that h is a homomorphism from q to D. The answer to q(x̄) over a
database D, denoted q(D), is the set of tuples {h(x̄) ∈ dom(D)^|x̄| | q(x̄) →ₕ D}.
Consistent Query Answering. Consider a database D, a set Σ of primary keys, and a
CQ q(x̄). The relative frequency of a tuple t̄ ∈ dom(D)^|x̄| w.r.t. D, Σ and q is the ratio
R_D,Σ,q(t̄) = |{D′ ∈ rep(D, Σ) | t̄ ∈ q(D′)}| / |rep(D, Σ)|. Our main problem is:

PROBLEM : CQA
INPUT : A database D, a set Σ of primary keys, and a CQ q(x̄).
OUTPUT : The set {(t̄, R_D,Σ,q(t̄)) | t̄ ∈ dom(D)^|x̄| and R_D,Σ,q(t̄) > 0}.

That is, output, for each candidate answer tuple t̄, its relative frequency, when this is not zero.
We dub the problem of computing R_D,Σ,q(t̄), given D, Σ, q, t̄, RelativeFreq. We know that
RelativeFreq is #P-hard in data complexity. Clearly, an efficient way of approximating
the relative frequency of a tuple immediately provides an algorithm for approximately
solving CQA. Hence, in the rest, we focus on RelativeFreq.

Approximate CQA. A (data-efficient) randomized approximation scheme for the problem
RelativeFreq is a randomized algorithm A that takes a database D, a set Σ of primary keys,
a CQ q(x̄), a tuple t̄ ∈ dom(D)^|x̄|, and numbers ε > 0 and 0 < δ < 1, runs in polynomial
time in ||D|| + ||t̄||,² 1/ε and log(1/δ), and produces a random variable X such that
Pr(|X − R_D,Σ,q(t̄)| ≤ ε · R_D,Σ,q(t̄)) ≥ 1 − δ.
3. Approximation Schemes
In principle, an approximation scheme for RelativeFreq would require access to the input
database D. However, doing this naively is prohibitive in practice since, in general, D is very
large. Fortunately, an approximation scheme for RelativeFreq only needs a small part of the
database, which we call a synopsis. In what follows, let D, Σ, q(x̄), t̄ be a database, a set of
primary keys, a CQ, and a tuple in dom(D)^|x̄|, respectively.

The (Σ, q)-synopsis of D for t̄ is the pair (ℋ, ℬ), where ℋ = {h(q) | q(t̄) →ₕ D and h(q) |= Σ}
and ℬ = {block_Σ(α, D) | α ∈ ∪_{H∈ℋ} H}. That is, the (Σ, q)-synopsis
of D for t̄ collects all the consistent homomorphic images of q(t̄) in D (the set ℋ), and
the blocks of the atoms occurring in a consistent homomorphic image of q(t̄) in D (the
set ℬ). We also let db(ℬ) = {{α₁, ..., αₙ} | ⟨α₁, ..., αₙ⟩ ∈ ×_{B∈ℬ} B}, i.e., the set of
databases that can be formed by keeping exactly one atom from each member B of ℬ. Finally,
R_(ℋ,ℬ) = |{I ∈ db(ℬ) | H ⊆ I for some H ∈ ℋ}| / |db(ℬ)|.³

We show that the (Σ, q)-synopsis of a database D for a tuple t̄ can be efficiently constructed,
and that it contains enough information to compute the relative frequency of t̄.

Lemma 2. If (ℋ, ℬ) is the (Σ, q)-synopsis of D for t̄, then: 1) (ℋ, ℬ) can be computed in
polynomial time in ||D|| + ||t̄||; 2) R_D,Σ,q(t̄) = R_(ℋ,ℬ); 3) R_D,Σ,q(t̄) = 0 iff ℋ = ∅.

² As usual, for a syntactic object o, we write ||o|| for its size.
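As a sanity check of the definition, the synopsis of Example 1 for its Boolean query can be built by naive homomorphism enumeration; the encoding below (strings in query atoms denote variables) is our own toy convention, not the paper's SQL-based construction:

```python
from itertools import product

# Database of Example 1 and its primary key (position 0 of Employee).
D = [("Employee", (1, "Bob", "HR")), ("Employee", (1, "Bob", "IT")),
     ("Employee", (2, "Alice", "IT")), ("Employee", (2, "Tim", "IT"))]
key = {"Employee": [0]}

def key_value(fact):
    rel, args = fact
    return (rel, tuple(args[i] for i in key[rel]))

def block(fact):
    # block_Sigma(fact, D): all facts of D sharing fact's key value.
    return frozenset(f for f in D if key_value(f) == key_value(fact))

def consistent(facts):
    # Satisfies the keys iff no two distinct facts share a key value.
    return len({key_value(f) for f in facts}) == len(facts)

# Boolean CQ: Employee(1, n1, d) AND Employee(2, n2, d); strings are variables.
q = [("Employee", (1, "n1", "d")), ("Employee", (2, "n2", "d"))]

def images(query):
    """Enumerate homomorphic images of the query in D (naive join)."""
    for choice in product(D, repeat=len(query)):
        assignment, ok = {}, True
        for (rel, pattern), (frel, fargs) in zip(query, choice):
            if rel != frel:
                ok = False
                break
            for p, a in zip(pattern, fargs):
                if isinstance(p, str):                 # variable: unify
                    ok = assignment.setdefault(p, a) == a
                else:                                  # constant: match
                    ok = p == a
                if not ok:
                    break
            if not ok:
                break
        if ok:
            yield frozenset(choice)

H = {img for img in images(q) if consistent(img)}      # the set of images
B = {block(f) for img in H for f in img}               # their blocks

# Exact relative frequency from the synopsis (Lemma 2, point 2).
dbB = [set(choice) for choice in product(*B)]
R = sum(any(h <= I for h in H) for I in dbB) / len(dbB)
print(len(H), len(B), R)  # 2 2 0.5
```

Here ℋ contains the two consistent images pairing Employee(1, Bob, IT) with either id-2 fact, ℬ contains two blocks, and R_(ℋ,ℬ) = 2/4 = 0.5, matching the 50% of Example 1.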
3.1. Monte Carlo Approach
Let (ℋ, ℬ) be the (Σ, q)-synopsis of D for some tuple t̄, and let Sample be a randomized procedure
that, with input (ℋ, ℬ), computes a random number in [0, 1]. Being randomized, its output is
determined by the underlying sampling space obtained from (ℋ, ℬ). Hence, Sample(ℋ, ℬ) produces a
random variable, and a crucial problem for us is computing its expected value E[Sample(ℋ, ℬ)],
which can be approximated by performing the following: (1) C := 0; (2) for N
times do the following: C := C + Sample(ℋ, ℬ); and finally (3) return C/N.
It is clear that as we increase the number N of iterations in the above procedure,
the ratio C/N becomes a better approximation of E[Sample(ℋ, ℬ)]. From [15], we know how to
compute the minimum number (up to constant factors) of iterations N such that the above
algorithm approximates E[Sample(ℋ, ℬ)] within a given error ε, and with probability at least
1 − δ, for a given 0 < δ < 1. By exploiting this approach in the algorithm discussed
above, we obtain the algorithm MonteCarlo[Sample], which is parametrized by a randomized
procedure Sample. Let ||ℋ, ℬ|| = |ℋ| + max_{H∈ℋ} {||H||} + ||ℬ||. From [15], we know that:

(*) If Sample runs in polynomial time in ||ℋ, ℬ|| and E[Sample(ℋ, ℬ)] > 1/pol(||ℋ, ℬ||), when
E[Sample(ℋ, ℬ)] > 0, for some polynomial pol, then MonteCarlo[Sample] is a data-efficient
randomized approximation scheme for computing E[Sample(ℋ, ℬ)].⁴

By combining the above property with Lemma 2, an approximation scheme for RelativeFreq
can be obtained by devising the right sampling procedure Sample.
Natural Sampling Space. The first sampler, which we call SampleNatural, takes as input
(ℋ, ℬ), and satisfies E[SampleNatural(ℋ, ℬ)] = R_(ℋ,ℬ), i.e., its expected value precisely coincides
with the relative frequency. This sampler simply performs the following: 1) sample a database
I ∈ db(ℬ) uniformly at random (u.a.r.); 2) if H ⊆ I, for some H ∈ ℋ, output 1, otherwise
output 0. We can prove that SampleNatural enjoys (*). By Lemma 2, the algorithm that
constructs the (Σ, q)-synopsis (ℋ, ℬ) of D for t̄, and runs
MonteCarlo[SampleNatural] with input (ℋ, ℬ), is an approximation scheme for RelativeFreq.
This algorithm is called Natural.
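A minimal sketch of MonteCarlo[SampleNatural] on a hand-built synopsis of Example 1; the fixed Chernoff-style sample bound below is a simple stand-in for the optimal stopping rule of [15], so this is not the paper's implementation:

```python
import math
import random

# Hand-built synopsis of Example 1: two consistent homomorphic images H
# and two blocks B (see the running example).
bob_it, bob_hr = (1, "Bob", "IT"), (1, "Bob", "HR")
alice, tim = (2, "Alice", "IT"), (2, "Tim", "IT")
H = [{bob_it, alice}, {bob_it, tim}]
B = [[bob_hr, bob_it], [alice, tim]]

def sample_natural(H, B):
    # 1) pick one fact per block u.a.r., i.e., a uniform element of db(B) ...
    I = {random.choice(blk) for blk in B}
    # 2) ... and output 1 iff some homomorphic image survives in it.
    return 1 if any(h <= I for h in H) else 0

def monte_carlo(sample, H, B, eps=0.1, delta=0.25):
    # Fixed sample count from a standard Chernoff-style analysis; the
    # optimal stopping rule of Dagum et al. [15] used in the paper is
    # more refined and adapts to the observed samples.
    n = math.ceil(4 * math.log(2 / delta) / eps ** 2)
    return sum(sample(H, B) for _ in range(n)) / n

random.seed(0)
est = monte_carlo(sample_natural, H, B)
print(round(est, 2))  # close to the true relative frequency 0.5
```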
Symbolic Sampling Space. An alternative sampler is one whose expected value, on input
(ℋ, ℬ), is a number from which R_(ℋ,ℬ) can be derived. We devise a slightly more complex
sampling space, called symbolic, by exploiting ideas from [14]. Let H₁, ..., Hₙ be an arbitrary
ordering (e.g., lexicographic) among the databases of ℋ. For each i ∈ [n], A_i^(ℋ,ℬ) is the set of
all I ∈ db(ℬ) such that Hᵢ ⊆ I. Our sampling space is defined as
S^(ℋ,ℬ) = {(i, I) | i ∈ [n] and I ∈ A_i^(ℋ,ℬ)}.
Our second sampler, called SampleKL, performs the following: 1) sample (i, I) ∈ S^(ℋ,ℬ) u.a.r.;
2) if there is no j < i s.t. (j, I) ∈ S^(ℋ,ℬ), then output 1, otherwise output 0. We can prove
that SampleKL enjoys (*) and that its expected value is m/|S^(ℋ,ℬ)|, where m is the numerator of
R_(ℋ,ℬ). Hence, we consider the algorithm KL, which constructs the (Σ, q)-synopsis (ℋ, ℬ) of D
for t̄, runs the algorithm MonteCarlo[SampleKL] with input (ℋ, ℬ), and multiplies its output by
|S^(ℋ,ℬ)|/|db(ℬ)|. By Lemma 2, KL is an approximation scheme for RelativeFreq.

One can devise a slightly different sampler, called SampleKLM, that in general has a lower
variance than SampleKL. Using SampleKLM in place of SampleKL, as discussed above, we
obtain the approximation scheme for RelativeFreq called KLM.

³ In case ℋ = ∅, we let R_(ℋ,ℬ) = 0.
⁴ In a data-efficient randomized approximation scheme for E[Sample], the output to be approximated is
E[Sample(ℋ, ℬ)], and the running time must be polynomial in ||ℋ, ℬ|| (and 1/ε and log(1/δ)), as these are the
parameters of a synopsis that depend on the database.
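SampleKL can be sketched on the same toy synopsis: sampling u.a.r. from the symbolic space amounts to picking i with probability proportional to |A_i^(ℋ,ℬ)| and then completing Hᵢ with uniform choices from the untouched blocks. The encoding below is our own illustration, not the paper's code:

```python
import math
import random

# Hand-built synopsis of Example 1: images H1, H2 and their two blocks.
bob_it, bob_hr = (1, "Bob", "IT"), (1, "Bob", "HR")
alice, tim = (2, "Alice", "IT"), (2, "Tim", "IT")
H = [{bob_it, alice}, {bob_it, tim}]           # fixed ordering H1, H2
B = [[bob_hr, bob_it], [alice, tim]]

def a_size(h):
    # |A_i|: blocks hit by H_i are forced, the others remain free choices.
    return math.prod(1 if set(blk) & h else len(blk) for blk in B)

sizes = [a_size(h) for h in H]
S_size = sum(sizes)                            # |S^(H,B)|
db_size = math.prod(len(blk) for blk in B)     # |db(B)|

def sample_kl():
    # Sample (i, I) u.a.r. from the symbolic space: i proportional to |A_i|,
    # then I uniform among the databases of db(B) that contain H_i.
    i = random.choices(range(len(H)), weights=sizes)[0]
    I = {next(iter(set(blk) & H[i])) if set(blk) & H[i] else random.choice(blk)
         for blk in B}
    # Output 1 iff i is the first image contained in I (the Karp-Luby trick,
    # so each I in the union is counted exactly once in expectation).
    return 1 if all(not (H[j] <= I) for j in range(i)) else 0

random.seed(0)
n = 2000
est = sum(sample_kl() for _ in range(n)) / n   # approximates m / |S^(H,B)|
R = est * S_size / db_size                     # rescale to the frequency
print(R)  # 0.5
```

On this instance every sampled pair outputs 1 (each completed database contains exactly one image), so the rescaled estimate is exactly 2/4 = 0.5.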
3.2. Union of Sets Approach
An approximation scheme, called the self-adjusting coverage algorithm, was presented in [14] for
the union of sets problem, which takes a description of n ≥ 1 sets S₁, ..., Sₙ as input, and outputs
|∪_{i∈[n]} Sᵢ|. For space reasons, we do not present this algorithm, but only show how RelativeFreq
reduces to this problem. We first construct the (Σ, q)-synopsis (ℋ, ℬ) of D for t̄. Then, since
|∪ᵢ A_i^(ℋ,ℬ)| is the numerator of the ratio R_(ℋ,ℬ), we approximate the value |∪ᵢ A_i^(ℋ,ℬ)| via
the self-adjusting coverage algorithm of [14], by viewing (ℋ, ℬ) as the description of the sets
A_i^(ℋ,ℬ), for each i. Dividing the result by |db(ℬ)| then gives an approximation of R_(ℋ,ℬ).
This, together with Lemma 2, provides an approximation scheme for RelativeFreq, which we dub Cover.
4. The Benchmark
Implementation. Note that to compute the approximate relative frequency of each candidate
tuple t̄, the corresponding synopsis must be computed first, regardless of the chosen algorithm.
We show that the set syn_Σ,q(D) of all pairs (t̄, s), with s the (Σ, q)-synopsis of D for
t̄ ∈ dom(D)^|x̄| with R_D,Σ,q(t̄) > 0, can be constructed via a single SQL query, obtained by a
careful rewriting of the original query q. We call this initial precomputation of syn_Σ,q(D) the
preprocessing step. For implementing the approximation schemes, we extended the framework of [16]
of approximation algorithms for the number of satisfying assignments of DNF formulas.
Experimental Setting. To effectively evaluate our algorithms, our test scenarios must expose
how the running time of the approximation schemes is affected by key input parameters like
the inconsistency of the database, and the number of joins in the query.
Data Generator. We generate databases using the dbgen tool of the TPC-H benchmark.
Databases of varied size can be generated by providing a scale factor as input. Since the
generated databases are consistent w.r.t. the underlying constraints, we will later inject
inconsistency via a noise generator that we developed.
Query-aware Noise Generator. General-purpose noise generators exist in the literature; see,
e.g., [17]. However, all such tools are query-oblivious, i.e., they do not take the query into account:
if we generate noise by randomly adding facts to the database, considering only the primary
keys, it is likely that we will not affect the evaluation of the query (and hence the synopses), as,
in general, databases are much larger than the actual synopses.
Given D, Σ, q(x̄), a percentage 0 < p < 1 and an integer interval [l, u], our noise generator
constructs the set syn_Σ,q(D), and for each (t̄, (ℋ, ℬ)), randomly chooses ℓ = ⌈p · |ℬ|⌉ blocks
B₁, ..., B_ℓ, and for each Bᵢ, adds to D a random number t ∈ [l, u] of tuples having the same
key value as that of Bᵢ. These tuples are constructed by reusing existing tuples from D, and
by changing their key value. This ensures that the new tuples preserve the join patterns of D.
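The noise-generation step can be sketched as follows, under the simplifying assumption (ours) that keys consist of the single first attribute of each tuple:

```python
import math
import random

def add_noise(D, blocks, p, lo, hi):
    """Query-aware noise sketch: pollute a fraction p of the synopsis blocks
    by adding between lo and hi conflicting tuples to each chosen block."""
    donors = list(D)                               # reuse original tuples only
    chosen = random.sample(blocks, math.ceil(p * len(blocks)))
    for blk in chosen:
        target_key = blk[0][0]                     # key value shared by the block
        for _ in range(random.randint(lo, hi)):
            donor = random.choice(donors)          # existing tuple from D ...
            D.append((target_key,) + donor[1:])    # ... with its key overridden,
                                                   # preserving join patterns
    return D

random.seed(1)
D = [(1, "Bob", "HR"), (2, "Alice", "IT"), (3, "Tim", "IT")]
blocks = [[(1, "Bob", "HR")], [(2, "Alice", "IT")]]
add_noise(D, blocks, p=0.5, lo=2, hi=5)
print(len(D))  # grew by 2 to 5 conflicting tuples
```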
Query Generator. To generate our stress-test queries we exploit a recent query generator [18],
which we call the static query generator (SQG), since it can control static query parameters, such as
the number of joins and the number of free variables (i.e., output variables). As we also want
to precisely control the size of the synopsis, we additionally consider some dynamic query parameters.
Letting syn_Σ,q(D) = {(t̄ᵢ, (ℋᵢ, ℬᵢ))}_{i∈[n]} for some n ≥ 1, we consider the homomorphic size
of q w.r.t. D, defined as |∪_{i=1}^n ℋᵢ|, which measures how large the portion of D is that can
affect any (Σ, q)-synopsis in syn_Σ,q(D). Since synopses in syn_Σ,q(D) can have different sizes, it
is natural to consider the average size of a (Σ, q)-synopsis in syn_Σ,q(D) as a key parameter,
which is given by |∪_{i=1}^n ℋᵢ| / |syn_Σ,q(D)|. We normalize the above to a number in the
interval (0, 1) by considering its inverse. We call this number the balance of q w.r.t. D. The
closer the balance is to 1 (resp., 0), the smaller (resp., larger) the average size of a
(Σ, q)-synopsis is.
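The balance itself is simple arithmetic over the precomputed synopses; the integer image identifiers below are made up purely for illustration:

```python
# Balance of q w.r.t. D from the precomputed synopses: the inverse of the
# average number of homomorphic images per synopsis.
def balance(synopses):
    all_images = set().union(*(H for (_t, (H, _B)) in synopses))
    avg_size = len(all_images) / len(synopses)
    return 1 / avg_size

# Three candidate tuples whose synopses cover five images overall.
syn = [("t1", (frozenset({1, 2}), None)),
       ("t2", (frozenset({2, 3, 4}), None)),
       ("t3", (frozenset({5}), None))]
print(balance(syn))  # 1 / (5/3), i.e., about 0.6
```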
We developed a query generator, called dynamic, that modifies a query generated by the SQG
so that it has a desired balance w.r.t. the given database ๐ท.
Test Scenarios. For our test scenarios we consider the TPC-H schema, denoted S_H, and the
set Σ_H of primary keys over S_H coming with the TPC-H benchmark. We generated a large set
of database-query pairs P_H as follows: (1) First, we generated a (consistent) database D_H over
S_H, using scale factor 1 (i.e., 1GB). The database contains roughly 9M tuples. (2) For each number
of joins i ∈ [5], we generated 5 base CQs q_{i,j}, with j ∈ [5], having i joins, using SQG, producing
a set Q of 25 CQs. (3) Using our query-aware noise generator, for each q ∈ Q, we generated
10 inconsistent databases D_q[n], using noise levels n ∈ {0.1, 0.2, ..., 1}, and the same block
interval [2, 5]. (4) Finally, for each q ∈ Q, and for each D_q[n], we generated 11 variations of q,
denoted q_b[n], having balance b ∈ {0, 0.1, 0.2, ..., 1} w.r.t. D_q[n], where balance b = 0 means
the query q_b[n] is Boolean. P_H is then the set of pairs {(D_q[n], q_b[n]) | q ∈ {q_{i,j}}_{i,j∈[5]},
n ∈ {0.1, ..., 1} and b ∈ {0, 0.1, ..., 1}}, containing 2750 database-query pairs.
We considered different scenarios (subsets of P_H) by fixing all parameters but one. Noise
scenarios: the sets Noise[b, j] of all pairs of P_H, where the balance level (b) and the number of
joins (j) are fixed. Join scenarios: the sets Join[n, b] of all pairs of P_H, where the noise level (n)
and the balance level (b) are fixed. Balance scenarios: the sets Balance[n, j] of all pairs of P_H,
where the noise level (n) and the number of joins (j) are fixed.
Hardware and execution. For the experiments we used two Desktop PCs each with an Intel(R)
Core(TM) i5-8500 CPU@3.00GHz processor, 16GB RAM, 500GB mechanical drive, running
Xubuntu 19.04 64-bit. All databases are stored in a PostgreSQL 11.5 instance. The approximation
schemes and the preprocessing step are implemented in C++ and SQL, respectively.
We fixed δ = 0.25 and ε = 0.1, i.e., 75% confidence and 10% error. We report the time
required to execute each algorithm over the synopses of all output tuples, excluding the time of
the preprocessing step, since it is the same for all the approximation schemes. Each data point
in a plot of Fig. 1 is the average runtime over all the CQs for that X-axis value; due to space
constraints, here we report only some representative noise and balance scenarios (see Fig. 1).

Figure 1: Noise and balance scenarios - Noise[balance, joins] and Balance[noise, joins]. Panels:
(a) Noise[0, 3]; (b) Noise[0.5, 3]; (c) Balance[0.2, 3]; (d) Balance[0.4, 3].
5. Take-home Messages
Our analysis reveals a striking difference between Boolean and non-Boolean CQs.
The Boolean Case. Here, Natural is the best performer, no matter the noise and the number
of joins in the query, whereas Cover is the worst. Only in the case of CQs with many joins is
Cover comparable to KL(M), but in any case, Natural is the way to go. This is mainly because,
for Boolean queries, the (only) synopsis is very large in general, and thus the
relative frequency is very high. Hence, since Natural directly estimates the value of the relative
frequency via the natural sampling space, it requires fewer samples to obtain a good estimate.
The Non-Boolean Case. In this case, KLM is the way to go in almost all the scenarios, i.e.,
for any level of noise and for any level of (non-zero) balance of the query. Only for CQs with
many joins, and high noise, KL is comparable to KLM. Nevertheless, KL is never going to
outperform KLM in practice. The worst algorithms for non-Boolean CQs are Natural and Cover.
They perform similarly for low levels of noise, balance and joins, but, in general, Natural is the
slowest. The reason for this outcome is that for higher values of balance, the average size of a
synopsis is low, which implies that the relative frequency of the corresponding tuple is
usually close to 0; hence, sampling from the symbolic space avoids the large number of
samples that Natural needs. Moreover, the low-variance sampler of KLM lets the
algorithm find the optimal number of samples earlier than KL, on average.
Practical Applicability. In most of our experiments, the preprocessing step took less than 30
seconds. Furthermore, for modest scenarios, which is what we expect to face in practice, the
running time of the best performing approximation scheme is in the order of seconds. We also
considered larger databases, up to 120M tuples (15GB), for which the overall running time of
the preprocessing step is encouraging (considering the modest machines used): it is in the order
of minutes, while the best approximation algorithm for each case always runs in the order
of seconds. Thus, we believe that applying approximate CQA in practice is not an unrealistic goal.
Acknowledgments
Part of Calautti's and Console's work was done while they were research associates at the
University of Edinburgh. Console has been partially supported by MIUR under the PRIN 2017
project "HOPE" (prot. 2017MMJJRE), and by the EU under the H2020-EU.2.1.1 project TAILOR,
grant id. 952215. Pieris was supported by the EPSRC grant EP/S003800/1 "EQUID".
References
[1] M. Arenas, L. E. Bertossi, J. Chomicki, Consistent query answers in inconsistent databases,
in: PODS, 1999, pp. 68–79.
[2] M. Calautti, M. Console, A. Pieris, Counting database repairs under primary keys revisited,
in: PODS, 2019, pp. 104–118.
[3] P. Koutris, J. Wijsen, Consistent query answering for primary keys in logspace, in: ICDT,
2019, pp. 23:1–23:19.
[4] B. ten Cate, G. Fontaine, P. G. Kolaitis, On the data complexity of consistent query
answering, Theory Comput. Syst. 57 (2015) 843–891.
[5] S. Greco, C. Molinaro, I. Trubitsyna, Computing approximate query answers over incon-
sistent knowledge bases, in: IJCAI, 2018, pp. 1838–1846.
[6] A. A. Dixit, P. G. Kolaitis, A SAT-based system for consistent query answering, in: SAT,
2019, pp. 117–135.
[7] A. Fuxman, E. Fazli, R. J. Miller, ConQuer: Efficient management of inconsistent databases,
in: SIGMOD, 2005, pp. 155–166.
[8] S. Greco, C. Molinaro, Probabilistic query answering over inconsistent databases, Ann.
Math. Artif. Intell. 64 (2012) 185–207.
[9] M. Calautti, L. Libkin, A. Pieris, An operational approach to consistent query answering,
in: PODS, 2018, pp. 239–251.
[10] D. Maslowski, J. Wijsen, A dichotomy in the complexity of counting database repairs, J.
Comput. Syst. Sci. 79 (2013) 958–983.
[11] N. Fiorentino, S. Greco, C. Molinaro, I. Trubitsyna, ACID: A system for computing
approximate certain query answers over incomplete databases, in: SIGMOD, 2018, pp.
1685–1688.
[12] S. Greco, C. Molinaro, I. Trubitsyna, Approximation algorithms for querying incomplete
databases, Inf. Syst. 86 (2019) 28–45.
[13] M. Console, P. Guagliardo, L. Libkin, E. Toussaint, Coping with incomplete data: Recent
advances, in: PODS, 2020, pp. 33–47.
[14] R. M. Karp, M. Luby, N. Madras, Monte-Carlo approximation algorithms for enumeration
problems, J. Algorithms 10 (1989) 429–448.
[15] P. Dagum, R. M. Karp, M. Luby, S. M. Ross, An optimal algorithm for Monte Carlo estimation,
SIAM J. Comput. 29 (2000) 1484–1496.
[16] K. S. Meel, A. A. Shrotri, M. Y. Vardi, Not all FPRASs are equal: Demystifying FPRASs for
DNF-counting, Constraints 24 (2019) 211–233.
[17] P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro, Messing up with
BART: Error generation for evaluating data-cleaning algorithms, PVLDB 9 (2015) 36–47.
[18] M. Benedikt, G. Konstantinidis, G. Mecca, B. Motik, P. Papotti, D. Santoro, E. Tsamoura,
Benchmarking the chase, in: PODS, 2017, pp. 37–52.