<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Benchmarking Approximate Consistent Query Answering</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marco</forename><surname>Calautti</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">DISI</orgName>
								<orgName type="institution">University of Trento</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Console</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Sapienza</orgName>
								<orgName type="institution">University of Rome</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Pieris</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">University of Edinburgh</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Benchmarking Approximate Consistent Query Answering</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">DEF001031F848D6F096B30B357D189F9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Consistent query answering (CQA) aims to deliver meaningful answers when queries are evaluated over inconsistent databases. Such answers must be certainly true in all repairs, which are consistent databases whose difference from the inconsistent one is somehow minimal. A more informative notion is the percentage of repairs in which a candidate answer is true, called its relative frequency. Computing this percentage is intractable in general, but for the relevant setting of conjunctive queries and primary keys, data-efficient randomized approximation schemes exist. Our goal is to perform a thorough experimental evaluation and comparison of those approximation schemes, and to provide new insights on which technique is preferable depending on key characteristics of the input.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><note place="foot">SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy. marco.calautti@unitn (M. Calautti); console@diag.uniroma1.it (M. Console); apieris@inf.ed.ac.uk (A. Pieris)</note><p>A database is inconsistent if it does not conform to its specifications, given in the form of constraints. There is a consensus that inconsistency is a real-life phenomenon that arises for many reasons, such as the integration of conflicting sources. With the aim of obtaining conceptually meaningful answers to queries posed over inconsistent databases, Arenas, Bertossi, and Chomicki introduced in the late 1990s the notion of Consistent Query Answering (CQA) <ref type="bibr" target="#b0">[1]</ref>.</p><p>The key elements underlying CQA are (1) the notion of repair of an inconsistent database 𝐷, that is, a consistent database that differs from 𝐷 in some minimal way, and (2) the notion of query answering based on certain answers, i.e., answers that are entailed by every repair. Here is a simple example taken from <ref type="bibr" target="#b1">[2]</ref>: Example 1. Consider the single relation Employee(id, name, dept), where the attribute id is the key of the relation Employee. Consider also the database 𝐷 consisting of the tuples:</p><p>(1, Bob, HR) (1, Bob, IT) (2, Alice, IT) (2, Tim, IT).</p><p>Observe that the above database is inconsistent w.r.t. the key constraint, since we are uncertain about Bob's department, and about the name of the employee with id 2. In this case, to devise a repair, we only need to keep one of the two atoms that are in conflict. In this way, we obtain a ⊆-maximal subset of 𝐷 that is consistent. Consider now the Boolean query that asks whether employees 1 and 2 work in the same department. 
This query is true in only two of the four repairs; thus, according to certain answers, it is not entailed.</p><p>CQA has been extensively studied both in theory (see, e.g., <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>) and in practice (see, e.g., <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>). Nevertheless, the CQA approach comes with a conceptual limitation. The notion of certain answers only says that a candidate answer is entailed by all repairs, or is not entailed by at least one repair. But, as has been frequently observed (e.g., see <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>), the former is too strict, while the latter is not very useful in a practical context.</p><p>A Refined Approach. A simple step towards obtaining more information for a candidate answer is to consider its relative frequency, i.e., the percentage of repairs that entail the answer <ref type="bibr" target="#b1">[2]</ref>. In Example 1, the relative frequency of the empty tuple (i.e., the only candidate answer for Boolean queries) is 50% since, out of four repairs in total, only two satisfy the query. However, computing this simple measure is #P-hard, even in data complexity, i.e., when the query and the constraints are fixed, and even if we focus on conjunctive queries and primary keys <ref type="bibr" target="#b9">[10]</ref>. Hence, the way to proceed is to give up on exact solutions and target approximations.</p><p>Data-efficient Approximations. Approximation algorithms have been crucial for tackling intractable problems in different areas of Data Management. 
Examples are in the context of approximating certain answers over incomplete databases, where different approximation algorithms have been devised and experimentally evaluated <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. In the case of inconsistent databases, data-efficient randomized algorithms exist, in the relevant setting of conjunctive queries and primary keys, that approximate the relative frequency of a candidate answer within a desired error, with high probability <ref type="bibr" target="#b1">[2]</ref>. These rely on techniques that were originally proposed to approximate the number of satisfying assignments of DNF Boolean formulas <ref type="bibr" target="#b13">[14]</ref>. However, no corresponding infrastructure for experimentally evaluating such techniques exists.</p><p>Main Objective. The main objective of this work is to provide a benchmark for these randomized approximation schemes, in the setting of primary keys and conjunctive queries, covering a wide range of scenarios, and to exploit this benchmark to assess how the performance of the approximation schemes is affected by key parameters of the input. <ref type="foot" target="#foot_0">1</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Preliminaries</head><p>Relational Databases. A schema S is a finite set of relation symbols (or predicates) with associated arity. A database 𝐷 over a schema S is a set of facts of the form 𝑅(𝑡 ̄), where 𝑅 ∈ S, and 𝑡 ̄= 𝑡 1 , . . . , 𝑡 𝑛 is a tuple of terms that are drawn from a countably infinite set C of constants. We write dom(𝐷) for the active domain of 𝐷, that is, the set of constants occurring in 𝐷. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Primary Key</head><formula xml:id="formula_0">R 𝐷,Σ,𝑄 (𝑡 ̄) = |{𝐷 ′ ∈ rep(𝐷, Σ) | 𝑡 ̄∈ 𝑄(𝐷 ′ )}|/|rep(𝐷, Σ)|.</formula><p>Our main problem is:</p><formula xml:id="formula_1">PROBLEM : CQA INPUT :</formula><p>A database 𝐷, a set Σ of primary keys, and a CQ 𝑄(𝑥 ̄). OUTPUT :</p><p>The set</p><formula xml:id="formula_2">{ (𝑡 ̄, R 𝐷,Σ,𝑄 (𝑡 ̄)) | 𝑡 ̄∈ dom(𝐷) |𝑥 ̄| and R 𝐷,Σ,𝑄 (𝑡 ̄) &gt; 0 }.</formula><p>That is, output, for each candidate answer tuple 𝑡 ̄, its relative frequency, when this is not zero. We dub the problem of computing R 𝐷,Σ,𝑄 (𝑡 ̄), given 𝐷, Σ, 𝑄, and 𝑡 ̄, RelativeFreq. We know that RelativeFreq is #P-hard in data complexity. Clearly, having an efficient way of approximating the relative frequency of a tuple immediately provides an algorithm for approximately solving CQA. Hence, in the rest of the paper, we focus on RelativeFreq.</p></div>
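The definitions above can be made concrete with a small executable illustration. The following Python sketch (our own helper names, not the paper's implementation) enumerates all repairs of a database with primary keys and computes the relative frequency of a Boolean query by brute force, reproducing Example 1:

```python
from itertools import product

def blocks(db, keys):
    """Group facts into blocks: facts of the same relation that agree
    on the key positions (keys maps relation name -> number of leading
    key attributes)."""
    groups = {}
    for rel, args in db:
        m = keys.get(rel, len(args))
        groups.setdefault((rel, args[:m]), []).append((rel, args))
    return list(groups.values())

def repairs(db, keys):
    """All repairs: keep exactly one fact from each block."""
    for choice in product(*blocks(db, keys)):
        yield set(choice)

def relative_frequency(db, keys, query):
    """Fraction of repairs in which the Boolean query holds (#P-hard
    in general -- feasible here only because the database is tiny)."""
    reps = list(repairs(db, keys))
    return sum(1 for r in reps if query(r)) / len(reps)

# Example 1: the attribute id is the key of Employee.
db = [("Employee", (1, "Bob", "HR")), ("Employee", (1, "Bob", "IT")),
      ("Employee", (2, "Alice", "IT")), ("Employee", (2, "Tim", "IT"))]
keys = {"Employee": 1}

def same_dept(repair):
    """Do employees 1 and 2 work in the same department?"""
    dept = {args[0]: args[2] for _, args in repair}
    return dept[1] == dept[2]

print(relative_frequency(db, keys, same_dept))  # → 0.5
```

The four repairs arise from the two conflicting blocks, and the query holds exactly in the two repairs that place Bob in IT, matching the 50% relative frequency discussed above.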
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Approximate CQA. A (data-efficient) randomized approximation scheme for the problem</head><p>RelativeFreq is a randomized algorithm A that takes a database 𝐷, a set Σ of primary keys, a CQ 𝑄(𝑥 ̄), a tuple 𝑡 ̄∈ dom(𝐷) |𝑥 ̄|, and numbers 𝜖 &gt; 0 and 0 &lt; 𝛿 &lt; 1, runs in polynomial time in ||𝐷|| + ||𝑡 ̄||, <ref type="foot" target="#foot_2">2</ref> 1/𝜖 and log(1/𝛿), and produces a random variable 𝒜 such that</p><formula xml:id="formula_3">Pr (|𝒜 − R 𝐷,Σ,𝑄 (𝑡 ̄)| ≤ 𝜖 • R 𝐷,Σ,𝑄 (𝑡 ̄)) ≥ 1 − 𝛿.</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Approximation Schemes</head><p>In principle, an approximation scheme for RelativeFreq would require access to the input database 𝐷. However, doing this naively is prohibitive in practice since, in general, 𝐷 is very large. Fortunately, an approximation scheme for RelativeFreq only needs a small part of the database, which we call a synopsis. In what follows, let 𝐷, Σ, 𝑄(𝑥 ̄), 𝑡 ̄be a database, a set of primary keys, a CQ, and a tuple in dom(𝐷) |𝑥 ̄|, respectively. The (Σ, 𝑄)-synopsis of 𝐷 for 𝑡 ̄is the pair (ℋ, ℬ), where ℋ = {ℎ(𝑄) | 𝑄(𝑡 ̄) ℎ → 𝐷, and ℎ(𝑄) |= Σ} and ℬ = {block Σ (𝛼, 𝐷) | 𝛼 ∈ ∪ 𝐻∈ℋ 𝐻}. That is, the (Σ, 𝑄)-synopsis of 𝐷 for 𝑡 ̄collects all the consistent homomorphic images of 𝑄(𝑡 ̄) in 𝐷 (the set ℋ), and the blocks of the atoms occurring in a consistent homomorphic image of 𝑄(𝑡 ̄) in 𝐷 (the set ℬ). We also let db(ℬ) = {{𝛼 1 , . . . , 𝛼 𝑛 } | ⟨𝛼 1 , . . . , 𝛼 𝑛 ⟩ ∈ × 𝐵∈ℬ 𝐵}, i.e., the set of databases that can be formed by keeping exactly one atom from each member 𝐵 of ℬ. Finally, R (ℋ,ℬ) = |{𝐼 ∈ db(ℬ) | 𝐻 ⊆ 𝐼 for some 𝐻 ∈ ℋ}|/|db(ℬ)|. <ref type="foot" target="#foot_3">3</ref>We show that the (Σ, 𝑄)-synopsis of a database 𝐷 for a tuple 𝑡 ̄can be efficiently constructed, and that it contains enough information to compute the relative frequency of 𝑡 ̄.</p><formula xml:id="formula_4">Lemma 2. If (ℋ, ℬ) is the (Σ, 𝑄)-synopsis of 𝐷 for 𝑡 ̄, then: 1) (ℋ, ℬ) can be computed in polynomial time in ||𝐷|| + ||𝑡 ̄||; 2) R 𝐷,Σ,𝑄 (𝑡 ̄) = R (ℋ,ℬ) ; 3) R 𝐷,Σ,𝑄 (𝑡 ̄) = 0 iff ℋ = ∅.</formula></div>
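To see what Lemma 2 buys (and why it is not enough on its own), here is a hypothetical Python sketch that computes R_(ℋ,ℬ) exactly by enumerating db(ℬ); the enumeration is exponential in |ℬ|, which is precisely why the schemes below resort to sampling:

```python
from itertools import product

def synopsis_frequency(H, B):
    """Exact R_(H,B): the fraction of databases in db(B) -- one atom
    chosen per block of B -- that contain some homomorphic image in H.
    H is a list of sets of atoms, B a list of lists of atoms."""
    if not H:               # convention: R_(H,B) = 0 when H is empty
        return 0.0
    hits = total = 0
    for choice in product(*B):
        I = set(choice)     # one candidate database in db(B)
        total += 1
        if any(h <= I for h in H):
            hits += 1
    return hits / total

# Two blocks of two atoms each; a single image fixing atom "a1":
# exactly 2 of the 4 databases in db(B) contain it.
print(synopsis_frequency([{"a1"}], [["a1", "a2"], ["b1", "b2"]]))  # → 0.5
```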
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Monte Carlo Approach</head><p>Let (ℋ, ℬ) be the (Σ, 𝑄)-synopsis of 𝐷 for some tuple 𝑡 ̄, and let Sample be a randomized procedure that, with input (ℋ, ℬ), computes a random number in [0, 1]. Being randomized, its output is determined by the underlying sampling space obtained from (ℋ, ℬ). Sample(ℋ, ℬ) produces a random variable, and a crucial problem for us is computing its expected value E[Sample(ℋ, ℬ)].</p><p>E[Sample(ℋ, ℬ)] can be approximated as follows: (1) 𝑆 := 0, (2) for 𝑁 times do the following: 𝑆 := 𝑆 + Sample(ℋ, ℬ), and finally (3) return 𝑆/𝑁 .</p><p>Clearly, as we increase the number 𝑁 of iterations in the above procedure, the ratio 𝑆/𝑁 becomes a better approximation of E[Sample(ℋ, ℬ)]. From <ref type="bibr" target="#b14">[15]</ref>, we know that we can compute the minimum number (up to constant factors) of iterations 𝑁 such that the above algorithm approximates E[Sample(ℋ, ℬ)] within a given error 𝜖, and with probability at least 1 − 𝛿, for some given 0 &lt; 𝛿 &lt; 1. By exploiting this approach in the algorithm discussed above, we obtain the algorithm MonteCarlo[Sample], which is parametrized by a randomized procedure Sample. Let ||ℋ, ℬ|| = |ℋ| + max 𝐻∈ℋ {||𝐻||} + ||ℬ||. From <ref type="bibr" target="#b14">[15]</ref>, we know that:</p><p>(*) If Sample runs in polynomial time in ||ℋ, ℬ|| and E[Sample(ℋ, ℬ)] &gt; 1/pol(||ℋ, ℬ||), when E[Sample(ℋ, ℬ)] &gt; 0, for some polynomial pol, then MonteCarlo[Sample] is a data-efficient randomized approximation scheme for computing E[Sample(ℋ, ℬ)]. <ref type="foot" target="#foot_4">4</ref>By combining the above property and Lemma 2, an approximation scheme for RelativeFreq can be obtained by devising the right sampling procedure Sample.</p><p>Natural Sampling Space. 
The first sampler, which we call SampleNatural, takes as input (ℋ, ℬ), and has E[SampleNatural(ℋ, ℬ)] = R (ℋ,ℬ) , i.e., its expected value coincides precisely with the relative frequency. This sampler simply performs the following: 1) sample a database 𝐼 ∈ db(ℬ) uniformly at random (u.a.r.); 2) if 𝐻 ⊆ 𝐼, for some 𝐻 ∈ ℋ, output 1, otherwise output 0. We can prove that SampleNatural enjoys (*) and E[SampleNatural(ℋ, ℬ)] = R (ℋ,ℬ) . By Lemma 2, the algorithm that constructs the (Σ, 𝑄)-synopsis (ℋ, ℬ) of 𝐷 for 𝑡 ̄, and runs MonteCarlo[SampleNatural] with input (ℋ, ℬ), is an approximation scheme for RelativeFreq. This algorithm is called Natural.</p></div>
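As an illustration, the natural sampler and a plain Monte Carlo loop can be sketched in Python as follows. Note that the fixed sample size below comes from a textbook additive-error Chernoff–Hoeffding bound; the actual algorithm of [15] instead computes the minimum number of samples for a *relative* error guarantee, which depends on the unknown expectation:

```python
import math
import random

def sample_natural(H, B, rng):
    """One draw: pick I ∈ db(B) u.a.r. by choosing one atom per block
    uniformly and independently, and report whether I contains some
    homomorphic image in H."""
    I = {rng.choice(block) for block in B}
    return 1 if any(h <= I for h in H) else 0

def monte_carlo(sample, H, B, eps=0.1, delta=0.25, rng=None):
    """Average N independent samples, with N from the Hoeffding bound
    for additive error eps and failure probability delta."""
    rng = rng or random.Random(0)
    n = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    return sum(sample(H, B, rng) for _ in range(n)) / n

# Toy synopsis with true relative frequency 0.5.
est = monte_carlo(sample_natural, [{"a1"}], [["a1", "a2"], ["b1", "b2"]])
```

Since choosing one atom per block independently and uniformly yields the uniform distribution over db(ℬ), the estimator is unbiased for R_(ℋ,ℬ).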
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Symbolic Sampling Space.</head><p>An alternative sampler is one whose expected value, on input (ℋ, ℬ), is a number from which R (ℋ,ℬ) can be derived. We devise a slightly more complex sampling space, called symbolic, by exploiting ideas from <ref type="bibr" target="#b13">[14]</ref>. Let 𝐻 1 , . . . , 𝐻 𝑛 be an arbitrary ordering (e.g., lexicographical) among the databases of ℋ. For each 𝑖 ∈ [𝑛], ℐ 𝑖 ℋ,ℬ is the set of all 𝐼 ∈ db(ℬ) such that 𝐻 𝑖 ⊆ 𝐼. Our sampling space is defined as 𝒮 • ℋ,ℬ = {(𝑖, 𝐼) | 𝑖 ∈ [𝑛] and 𝐼 ∈ ℐ 𝑖 ℋ,ℬ }.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Union of Sets Approach</head><p>An approximation scheme, called the self-adjusting coverage algorithm, was presented in <ref type="bibr" target="#b13">[14]</ref> for the union of sets problem, which takes a description of 𝑛 ≥ 1 sets 𝑆 1 , . . . , 𝑆 𝑛 , and outputs</p><formula xml:id="formula_5">| ⋃︀ 𝑖∈[𝑛] 𝑆 𝑖 |.</formula><p>For space reasons, we do not present this algorithm, but only show how RelativeFreq reduces to this problem. We first construct the (Σ, 𝑄)-synopsis (ℋ, ℬ) of 𝐷 for 𝑡 ̄. Then, since | ⋃︀ 𝑖 ℐ 𝑖 ℋ,ℬ | is the numerator of the ratio R (ℋ,ℬ) , we approximate the value | ⋃︀ 𝑖 ℐ 𝑖 ℋ,ℬ | via the self-adjusting coverage algorithm of <ref type="bibr" target="#b13">[14]</ref>, by seeing (ℋ, ℬ) as the description of the sets ℐ 𝑖 ℋ,ℬ , for each 𝑖. Then, dividing the result by |db(ℬ)| gives an approximation of R (ℋ,ℬ) . This, together with Lemma 2, provides an approximation scheme for RelativeFreq, which we dub Cover.</p></div>
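For intuition, the Karp–Luby idea underlying both the symbolic space and the union-of-sets reduction can be sketched as follows. This is our own didactic sketch, not the paper's tuned implementation; it assumes each image H_i fixes at most one atom per block, which holds because the images in ℋ are consistent w.r.t. the primary keys:

```python
import math
import random

def kl_estimate(H, B, n_samples=4000, rng=None):
    """Estimate R_(H,B) via the symbolic space S = {(i, I) : H_i ⊆ I}:
    sample (i, I) u.a.r. from S, count I only at its first covering
    index (so each I in the union is counted once), and rescale by
    |S|/|db(B)|."""
    rng = rng or random.Random(0)
    H = [set(h) for h in H]
    # |{I ∈ db(B) : H_i ⊆ I}| = product of |block| over blocks
    # not touched by H_i (touched blocks are fixed to H_i's atom).
    weights = []
    for h in H:
        w = 1
        for block in B:
            w *= 1 if any(a in h for a in block) else len(block)
        weights.append(w)
    S = sum(weights)                         # |symbolic space|
    db_size = math.prod(len(b) for b in B)   # |db(B)|
    hits = 0
    for _ in range(n_samples):
        i = rng.choices(range(len(H)), weights=weights)[0]
        I = set()
        for block in B:
            fixed = [a for a in block if a in H[i]]
            I.add(fixed[0] if fixed else rng.choice(block))
        if not any(H[j] <= I for j in range(i)):  # first cover only
            hits += 1
    return (hits / n_samples) * S / db_size
```

On the toy synopsis ℋ = {{a1}, {b1}}, ℬ = {{a1, a2}, {b1, b2}}, the union covers 3 of the 4 databases in db(ℬ), so the estimate converges to 0.75.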
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">The Benchmark</head><p>Implementation. Note that, to compute the approximate relative frequency of each candidate tuple 𝑡 ̄, the corresponding synopsis must be computed first, regardless of the chosen algorithm. We show that we can construct the set syn Σ,𝑄 (𝐷) of all pairs (𝑡 ̄, 𝑆), with 𝑆 the (Σ, 𝑄)-synopsis of 𝐷 for 𝑡 ̄∈ dom(𝐷) |𝑥 ̄| with R 𝐷,Σ,𝑄 (𝑡 ̄) &gt; 0, via a single SQL query, obtained by a careful rewriting of the original query 𝑄. We call this initial precomputation of syn Σ,𝑄 (𝐷) the preprocessing step. For implementing the approximation schemes, we extended the framework of <ref type="bibr" target="#b15">[16]</ref> for approximation algorithms for the number of satisfying assignments of DNF formulas.</p><p>Experimental Setting. To effectively evaluate our algorithms, our test scenarios must expose how the running time of the approximation schemes is affected by key input parameters, such as the inconsistency of the database and the number of joins in the query.</p><p>Data Generator. We generate databases using the dbgen tool of the TPC-H benchmark. Databases of varying size can be generated by providing a scale factor as input. Since the generated databases are consistent w.r.t. the underlying constraints, we later inject inconsistency via a noise generator that we developed.</p><p>Query-aware Noise Generator. General-purpose noise generators exist in the literature; see, e.g., <ref type="bibr" target="#b16">[17]</ref>. 
However, all such tools are query-oblivious, i.e., they do not take any query into account: if we generate noise by randomly adding facts to the database, considering only the primary keys, it is likely that we will not affect the evaluation of the query (and hence the synopses) since, in general, databases are much larger than the actual synopses.</p><p>Given 𝐷, Σ, 𝑄(𝑥 ̄), a percentage 0 &lt; 𝑝 &lt; 1, and an integer interval [𝑙, 𝑢], our noise generator constructs the set syn Σ,𝑄 (𝐷), and for each (𝑡 ̄, (ℋ, ℬ)), randomly chooses ℓ = ⌈𝑝 • |ℬ|⌉ blocks 𝐵 1 , . . . , 𝐵 ℓ , and for each 𝐵 𝑖 , adds to 𝐷 a random number 𝑡 ∈ [𝑙, 𝑢] of tuples having the same key value as that of 𝐵 𝑖 . These tuples are constructed by reusing existing tuples from 𝐷 and changing their key value. This ensures that the new tuples preserve the join patterns of 𝐷.</p><p>Query Generator. To generate our stress test queries we exploit a recent query generator <ref type="bibr" target="#b17">[18]</ref>, which we call the static query generator (SQG), since it can control static query parameters, such as the number of joins and the number of free variables (i.e., output variables). As we also want to precisely control the size of the synopsis, we also consider some dynamic query parameters.</p><p>Letting syn Σ,𝑄 (𝐷) = {(𝑡 ̄𝑖, (ℋ 𝑖 , ℬ 𝑖 ))} 𝑖∈[𝑛] for some 𝑛 ≥ 1, we consider the homomorphic size of 𝑄 w.r.t. 𝐷, defined as | ⋃︀ 𝑛 𝑖=1 ℋ 𝑖 |, which measures how large the portion of 𝐷 that can affect any (Σ, 𝑄)-synopsis in syn Σ,𝑄 (𝐷) is. Since synopses in syn Σ,𝑄 (𝐷) can have different sizes, it is natural to consider the average size of a (Σ, 𝑄)-synopsis in syn Σ,𝑄 (𝐷) as a key parameter, which is given by</p><formula xml:id="formula_6">| ⋃︀ 𝑛 𝑖=1 ℋ 𝑖 | / |syn Σ,𝑄 (𝐷)|</formula><p>. We normalize the above to a number in the interval (0, 1) by considering its inverse. We call this number the balance of 𝑄 w.r.t. 𝐷. 
The closer the balance is to 1 (resp., 0), the smaller (resp., larger) the average size of a (Σ, 𝑄)-synopsis is.</p><p>We developed a query generator, called dynamic, that modifies a query generated by the SQG so that it has a desired balance w.r.t. the given database 𝐷.</p><p>Test Scenarios. For our test scenarios we consider the TPC-H schema, denoted S H , and the set Σ H of primary keys over S H coming with the TPC-H benchmark. We generated a large set of database-query pairs 𝑃 H as follows: (1) First, we generated a (consistent) database 𝐷 H over S H , using a scale factor corresponding to 1GB. The database contains roughly 9M tuples. ( <ref type="formula">2</ref> </p></div>
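The noise-injection step described above can be sketched as follows. This is a simplified illustration under an assumed data layout (a fact as a (relation, key, payload) triple), with helper names of our own, not the benchmark's actual code:

```python
import math
import random

def inject_noise(db, target_blocks, p, lo, hi, rng=None):
    """For ⌈p·|blocks|⌉ of the given blocks, add between lo and hi
    facts that clash on the block's key value. Payloads are reused
    from existing facts of the same relation, so the join patterns
    of db are preserved."""
    rng = rng or random.Random(0)
    db = list(db)
    chosen = rng.sample(target_blocks, math.ceil(p * len(target_blocks)))
    for block in chosen:
        rel, key, _ = block[0]
        donors = [f for f in db if f[0] == rel]
        for _ in range(rng.randint(lo, hi)):
            _, _, payload = rng.choice(donors)
            db.append((rel, key, payload))   # same key ⇒ a new conflict
    return db

# Tiny example: every block receives exactly one conflicting fact.
db = [("R", 1, "x"), ("R", 2, "y")]
blocks = [[("R", 1, "x")], [("R", 2, "y")]]
noisy = inject_noise(db, blocks, p=1.0, lo=1, hi=1)
```

Because the added facts reuse non-key attribute values already present in the relation, they join with the rest of the database exactly as existing facts do, which is what makes the noise query-aware rather than inert.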
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hardware and execution.</head><p>For the experiments we used two desktop PCs, each with an Intel(R) Core(TM) i5-8500 CPU@3.00GHz processor, 16GB RAM, and a 500GB mechanical drive, running Xubuntu 19.04 64-bit. All databases are stored in a PostgreSQL 11.5 instance. The approximation schemes and the preprocessing step are implemented in C++ and SQL, respectively.</p><p>We fixed 𝛿 = 0.25 and 𝜖 = 0.1, i.e., 75% confidence and 10% error. We report the time required to execute each algorithm over the synopses of all output tuples, excluding the time of the preprocessing step, since it is the same for all the approximation schemes. Each data point in a plot of Fig. <ref type="figure" target="#fig_2">1</ref> is the average runtime over all the CQs for that X-axis value; due to space constraints, here we report only some representative noise and balance scenarios (see Fig. <ref type="figure" target="#fig_2">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Take-home Messages</head><p>Our analysis reveals a striking difference between Boolean and non-Boolean CQs.</p><p>The Boolean Case. Here, Natural is the best performer, no matter the noise and the number of joins in the query, whereas Cover is the worst. Only in the case of CQs with many joins is Cover comparable to KL(M), but in any case, Natural is the way to go. This is mainly due to the fact that, for Boolean queries, the (only) synopsis is in general very large, and thus the relative frequency is very high. Hence, since Natural directly estimates the value of the relative frequency via the natural sampling space, it requires fewer samples to obtain a good estimate.</p><p>The Non-Boolean Case. In this case, KLM is the way to go in almost all the scenarios, i.e., for any level of noise and for any level of (non-zero) balance of the query. Only for CQs with many joins, and high noise, is KL comparable to KLM. Nevertheless, KL is never going to outperform KLM in practice. The worst algorithms for non-Boolean CQs are Natural and Cover. They perform similarly for low levels of noise, balance, and joins, but, in general, Natural is the slowest. The reason for this outcome is that, for higher values of balance, the average size of a synopsis is low, which then implies that the relative frequency of the corresponding tuple is usually close to 0; hence, sampling from the symbolic space helps avoid performing the large number of samples that Natural requires. Moreover, the low-variance sampler of KLM lets the algorithm find the optimal number of samples earlier than KL, on average.</p><p>Practical Applicability. In most of our experiments, the preprocessing step took less than 30 seconds. Furthermore, for modest scenarios, which is what we expect to face in practice, the running time of the best performing approximation scheme is in the order of seconds. 
We also considered larger databases, of up to 120M tuples (15GB), for which the overall running time of the preprocessing step is encouraging (considering the modest machines used): it is in the order of minutes, while the best approximation algorithm for each case always performs in the order of seconds. Thus, we believe that applying approximate CQA in practice is not an unrealistic goal.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>) For each number of joins 𝑗 ∈<ref type="bibr" target="#b4">[5]</ref>, we generated 5 base CQs 𝑄 𝑖 𝑗 , with 𝑖 ∈<ref type="bibr" target="#b4">[5]</ref>, having 𝑗 joins, using SQG, producing a set 𝒬 of 25 CQs. (3) Using our query-aware noise generator, for each 𝑄 ∈ 𝒬, we generated 10 inconsistent databases 𝐷 𝑄 [𝑝], using noise levels 𝑝 ∈ {0.1, 0.2, . . . , 1}, and the same block interval <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b4">5]</ref>. (4) Finally, for each 𝑄 ∈ 𝒬, and for each 𝐷 𝑄 [𝑝], we generated 11 variations of 𝑄, denoted 𝑄 𝑝 [𝑞], having balance 𝑞 ∈ {0, 0.1, 0.2, . . . , 1} w.r.t. 𝐷 𝑄 [𝑝], where balance 𝑞 = 0 means the query 𝑄 𝑝 [𝑞] is Boolean. 𝑃 H is then the set of pairs {(𝐷 𝑄 [𝑝], 𝑄 𝑝 [𝑞]) | 𝑄 ∈ {𝑄 𝑖 𝑗 } 𝑖,𝑗∈[5] , 𝑝 ∈ {0.1, . . . , 1} and 𝑞 ∈ {0, 0.1, . . . , 1}}, containing 2750 database-query pairs. We considered different scenarios (subsets of 𝑃 H ) by fixing all parameters but one. Noise scenarios: the sets Noise[𝑞, 𝑗] of all pairs of 𝑃 H where the balance level (𝑞) and the number of joins (𝑗) are fixed. Join scenarios: the sets Join[𝑝, 𝑞] of all pairs of 𝑃 H where the noise level (𝑝) and the balance level (𝑞) are fixed. Balance scenarios: the sets Balance[𝑝, 𝑗] of all pairs of 𝑃 H where the noise level (𝑝) and the number of joins (𝑗) are fixed.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) Noise[0, 3] (b) Noise[0.5, 3] (c) Balance[0.2, 3] (d) Balance[0.4, 3]</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Noise and balance scenarios -Noise[balance, joins], and Balance[noise, joins]</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Constraints. A key constraint (or key) 𝜅 over a schema S is an expression key(𝑅) = {1, . . . , 𝑚}, where 𝑅 ∈ S has arity 𝑛, and 𝑚 ≤ 𝑛; we also call it an 𝑅-key. A database 𝐷 satisfies 𝜅 if, for every 𝑅(𝑡 ̄), 𝑅(𝑠 ̄) ∈ 𝐷, 𝑡 ̄[𝑖] = 𝑠 ̄[𝑖] for each 𝑖 ∈ {1, . . . , 𝑚} implies 𝑡 ̄= 𝑠 ̄. We say that 𝐷 is consistent w.r.t. a set Σ of keys, written 𝐷 |= Σ, if 𝐷 satisfies each key in Σ; otherwise, it is inconsistent. A set of primary keys Σ is a set of keys such that, for each predicate 𝑅, Σ has at most one 𝑅-key. For 𝛼 = 𝑅(𝑐 1 , . . . , 𝑐 𝑛 ), the key value of 𝛼 w.r.t. Σ, denoted key Σ (𝛼), is defined as ⟨𝑅, ⟨𝑐 1 , . . . , 𝑐 𝑚 ⟩⟩, if key(𝑅) = {1, . . . , 𝑚} ∈ Σ, and ⟨𝑅, ⟨𝑐 1 , . . . , 𝑐 𝑛 ⟩⟩ otherwise. Let block Σ (𝛼, 𝐷) = {𝛽 ∈ 𝐷 | key Σ (𝛽) = key Σ (𝛼)}, and block Σ (𝐷) = {block Σ (𝛼, 𝐷) | 𝛼 ∈ 𝐷}. A repair of 𝐷 w.r.t. Σ is a database obtained by choosing one fact from each 𝐵 ∈ block Σ (𝐷). We denote the set of repairs of 𝐷 w.r.t. Σ as rep(𝐷, Σ). Conjunctive Queries. A conjunctive query (CQ) 𝑄(𝑥 ̄) over a schema S is a first-order expression of the form ∃𝑦 ̄(𝑅 1 (𝑧 ̄1) ∧ • • • ∧ 𝑅 𝑛 (𝑧 ̄𝑛)) that mentions only atoms over S, and has free variables 𝑥 ̄. For a tuple 𝑡 ̄∈ C |𝑥 ̄|, 𝑄(𝑡 ̄) denotes 𝑄 after replacing the 𝑖-th variable in 𝑥 ̄with the 𝑖-th constant in 𝑡 ̄. A homomorphism from 𝑄 to a database 𝐷 is a function ℎ from the set of variables and constants in 𝑄 to dom(𝐷) that is the identity over C such that 𝑅 𝑖 (ℎ(𝑧 ̄𝑖)) ∈ 𝐷 for every 𝑖 ∈ [𝑛]. We use 𝑄 ℎ → 𝐷 to say that ℎ is a homomorphism from 𝑄 to 𝐷. The answer to 𝑄(𝑥 ̄) over a database 𝐷, denoted 𝑄(𝐷), is the set of tuples {ℎ(𝑥 ̄) ∈ dom(𝐷) |𝑥 ̄| | 𝑄(𝑥 ̄) ℎ → 𝐷}. Consistent Query Answering. Consider a database 𝐷, a set Σ of primary keys, and a CQ 𝑄(𝑥 ̄). The relative frequency of a tuple 𝑡 ̄∈ dom(𝐷) |𝑥 ̄| w.r.t. 𝐷, Σ and 𝑄 is the ratio</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>𝒮 • ℋ,ℬ = {(𝑖, 𝐼) | 𝑖 ∈ [𝑛] and 𝐼 ∈ ℐ 𝑖 ℋ,ℬ }. Our second sampler, called SampleKL, performs the following: 1) sample (𝑖, 𝐼) ∈ 𝒮 • ℋ,ℬ u.a.r., 2) if there is no 𝑗 &lt; 𝑖 s.t. (𝑗, 𝐼) ∈ 𝒮 • ℋ,ℬ , then output 1, otherwise output 0. We can prove that SampleKL enjoys (*) and that its expected value is 𝑟/|𝒮 • ℋ,ℬ |, where 𝑟 is the numerator of R (ℋ,ℬ) . Hence, we consider the algorithm KL, which constructs the (Σ, 𝑄)-synopsis (ℋ, ℬ) of 𝐷 for 𝑡 ̄, runs the algorithm MonteCarlo[SampleKL] with input (ℋ, ℬ), and multiplies its output by |𝒮 • ℋ,ℬ |/|db(ℬ)|. By Lemma 2, KL is an approximation scheme for RelativeFreq. One can devise a slightly different sampler, called SampleKLM, that in general has a lower variance than SampleKL. Using SampleKLM in place of SampleKL, as discussed above, we obtain the approximation scheme for RelativeFreq called KLM.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">An extended version of this paper has been accepted to PODS 2021, and can be found at https://tinyurl.com/f8sev5pj. Source code and test scenarios can be found at https://gitlab.com/mcalautti/cqabench.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">As usual, for a syntactic object 𝑜, we write ||𝑜|| for its size.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3">In case ℋ = ∅, we let R (ℋ,ℬ) = 0.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4">In a data-efficient randomized approximation scheme for EV[Sample] the output to be approximated is E[Sample((ℋ, ℬ))], and the running time must be polynomial in ||ℋ, ℬ|| (and 1/𝜖 and log(1/𝛿)), as these are the parameters of a synopsis that depend on the database.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Part of Calautti's and Console's work was done while they were research associates at the University of Edinburgh. Console has been partially supported by MIUR under the PRIN 2017 project "HOPE" (prot. 2017MMJJRE), and by the EU under the H2020-EU.2.1.1 project TAILOR, grant id. 952215. Pieris was supported by the EPSRC grant EP/S003800/1 "EQUID".</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Consistent query answers in inconsistent databases</title>
		<author>
			<persName><forename type="first">M</forename><surname>Arenas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">E</forename><surname>Bertossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chomicki</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>PODS</publisher>
			<biblScope unit="page" from="68" to="79" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Counting database repairs under primary keys revisited</title>
		<author>
			<persName><forename type="first">M</forename><surname>Calautti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Console</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pieris</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>PODS</publisher>
			<biblScope unit="page" from="104" to="118" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Consistent query answering for primary keys in logspace</title>
		<author>
			<persName><forename type="first">P</forename><surname>Koutris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wijsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICDT</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page">19</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">On the data complexity of consistent query answering</title>
		<author>
			<persName><forename type="first">B</forename><surname>Cate</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fontaine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Kolaitis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Theory Comput. Syst</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="843" to="891" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Computing approximate query answers over inconsistent knowledge bases</title>
		<author>
			<persName><forename type="first">S</forename><surname>Greco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Molinaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Trubitsyna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
		<imprint>
			<biblScope unit="page" from="1838" to="1846" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A SAT-based system for consistent query answering</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Dixit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Kolaitis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SAT</title>
		<imprint>
			<biblScope unit="page" from="117" to="135" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">ConQuer: Efficient management of inconsistent databases</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fuxman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fazli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD</title>
		<imprint>
			<biblScope unit="page" from="155" to="166" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Probabilistic query answering over inconsistent databases</title>
		<author>
			<persName><forename type="first">S</forename><surname>Greco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Molinaro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Ann. Math. Artif. Intell</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="185" to="207" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An operational approach to consistent query answering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Calautti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Libkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pieris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PODS</title>
		<imprint>
			<biblScope unit="page" from="239" to="251" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A dichotomy in the complexity of counting database repairs</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maslowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wijsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Comput. Syst. Sci</title>
		<imprint>
			<biblScope unit="volume">79</biblScope>
			<biblScope unit="page" from="958" to="983" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">ACID: A system for computing approximate certain query answers over incomplete databases</title>
		<author>
			<persName><forename type="first">N</forename><surname>Fiorentino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Greco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Molinaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Trubitsyna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD</title>
		<imprint>
			<biblScope unit="page" from="1685" to="1688" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Approximation algorithms for querying incomplete databases</title>
		<author>
			<persName><forename type="first">S</forename><surname>Greco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Molinaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Trubitsyna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="page" from="28" to="45" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Coping with incomplete data: Recent advances</title>
		<author>
			<persName><forename type="first">M</forename><surname>Console</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Guagliardo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Libkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Toussaint</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PODS</title>
		<imprint>
			<biblScope unit="page" from="33" to="47" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Monte-Carlo approximation algorithms for enumeration problems</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Karp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Luby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Madras</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Algorithms</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="429" to="448" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">An optimal algorithm for Monte Carlo estimation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Dagum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Karp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Luby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Ross</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIAM J. Comput</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="1484" to="1496" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Not all FPRASs are equal: Demystifying FPRASs for DNF-counting</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Meel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Shrotri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Y</forename><surname>Vardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Constraints</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="211" to="233" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Messing up with BART: Error generation for evaluating data-cleaning algorithms</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">C</forename><surname>Arocena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Glavic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mecca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Papotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Santoro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="36" to="47" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Benchmarking the chase</title>
		<author>
			<persName><forename type="first">M</forename><surname>Benedikt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Konstantinidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mecca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Motik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Papotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Santoro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tsamoura</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PODS</title>
		<imprint>
			<biblScope unit="page" from="37" to="52" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
