A System Prototype for Approximate Query
          Answering over Incomplete Data
                              (DISCUSSION PAPER)


    Nicola Fiorentino, Sergio Greco, Cristian Molinaro, and Irina Trubitsyna

                            DIMES, University of Calabria
                            {lastname}@dimes.unical.it


        Abstract. Many database applications face the problem of querying
        incomplete data. In such scenarios, certain answers are a principled se-
        mantics of query answering. Unfortunately, the computation of certain
        query answers is a coNP-hard problem. To make query answering feasible
        in practice, recent research has focused on developing polynomial time
        algorithms computing a sound (but possibly incomplete) set of certain
        answers.
        In this paper we present a system prototype implementing a suite of
        algorithms to compute sound sets of certain answers. The central tools
        used by our system are conditional tables and the conditional evaluation
        of relation algebra. Different evaluation strategies can be applied, with
        more accurate ones having higher complexity, but returning more certain
        answers, thereby enabling users to choose the technique that best meets
        their needs in terms of balance between efficiency and quality of the
        results.


1     Introduction
Incomplete information arises in many database applications, such as ontological
reasoning [4, 5], inconsistency management [2, 3, 11, 15], data integration [7, 16],
and many others.
    A principled semantics of query answering over incomplete databases are
certain answers, which are query answers that are obtained from all the com-
plete databases represented by an incomplete database [17, 6, 18]. The following
example illustrates the notion of a certain answer.
Example 1. Consider the database D consisting of the three unary relations P
(Person), S (Student) and E (Employee) reported below, where ⊥ is a null value.

                                P           E         S
                              john        john       mary
                              mary         ⊥         bob

    Copyright c 2019 for the individual papers by the papers authors. Copying permit-
    ted for private and academic purposes. This volume is published and copyrighted by
    its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
Under the missing value interpretation of nulls (i.e., a value for ⊥ exists but
is unknown), D represents all the databases obtained by replacing ⊥ with an
actual value.
    A certain answer to a query is a tuple that is an answer to the query for
every database represented by D. For instance, consider the query asking for the
people who are not employees and students, which can be expressed in relational
algebra as P − (E ∩ S). The certain answers to the query are {hjohni}, because
no matter how ⊥ is replaced, hjohni is always a query answer.
    For databases containing (labeled) nulls, certain answers to positive queries
can be easily computed in polynomial time as follows: first a “standard” evalu-
ation (that is, treating nulls as standard constants) is applied; then tuples with
nulls in the result of the first step are discarded and the remaining tuples are the
certain answers to the query. However, for more general queries with negation
the problem of computing certain answers becomes coNP-hard.
    To make query answering feasible in practice, one might resort to SQL’s
evaluation, but unfortunately, the way SQL behaves in the presence of nulls
may result in wrong answers.
    Specifically, as evidenced in [18], there are two ways in which certain answers
and SQL’s evaluation may differ: (i) SQL can miss some of the tuples that belong
to certain answers, thus producing false negatives, or (ii) SQL can return some
tuples that do not belong to certain answers, that is, false positives. While the
first case can be seen as an under-approximation of certain answers (a sound but
possibly incomplete set of certain answers is returned), the second scenario must
be avoided, as the result might contain plain incorrect answers, that is, tuples
that are not certain.
    The experimental analysis in [13] showed that false positive are a real problem
for queries involving negation—they were always present and sometimes they
constitute almost 100% of the answers.
Example 2. Consider again the database D of Example 1. There are no certain
answers to the query P − E, as the query answers are the empty set when ⊥ is
replaced with mary.
   Assuming that P and E’s attribute is called name, the same query can be
expressed in SQL as follows:
     SELECT P.name
     FROM P
     WHERE NOT EXISTS (
         SELECT *
         FROM E
         WHERE P.name = E.name )
    The evaluation of the SQL query above returns hmaryi, which is not a certain
answer. The problem with the SQL semantics is that every comparison involving
at least one null evaluates to the truth value unknown, then 3-valued logic is used
to evaluate the classical logical connectives (AND, OR, NOT ), and eventually
only those tuples whose condition evaluates to true are kept.
    Going back to the query above, for the first tuple of P, namely john, the
nested subquery finds the same tuple in E, and thus john is not returned.
    For the second tuple of P, namely mary, the nested subquery gives the empty
set and thus hmaryi is returned by the overall query. The reason why the nested
subquery returns the empty set when mary is considered is that mary is compared
with john and the comparison evaluates to false, and mary is compared with ⊥
and the comparison evaluates to unknown (because a null is involved). Thus,
there is no tuple of E for which the comparison evaluates to true and the nested
subquery returns the empty set.
    Thus, on the one hand, SQL’s evaluation is efficient but flawed, on the other
hand, certain answers are a principled semantics but with high complexity. To
deal with this issue, there has been recent work on evaluation algorithms with
correctness guarantees, that is, techniques providing a sound but possibly in-
complete set of certain answers [13, 17, 18, 12, 8]. The problem of computing
sound (but possibly incomplete) sets of consistent query answers over incon-
sistent databases has been addressed in [9], but databases are assumed to be
complete, while in this paper we consider incomplete databases with no integrity
constraints.
    We have developed novel evaluation algorithms with correctness guarantees
leveraging conditional tables and the conditional evaluation of relational alge-
bra [12, 8]. In conditional tables each tuple is associated with a condition and
the conditional evaluation is a generalization of relational algebra that manipu-
late conditional tables. Conditions keep track of how tuples are derived and how
nulls are used in comparison operators.
    The basic idea is illustrated in the following example.
Example 3. Consider again the database and the query of Example 1. The con-
ditional evaluation of the query is carried out by applying the “conditional”
counterpart of each relational algebra operator. Rather than returning a set of
tuples, the conditional evaluation of a relational algebra operator returns “con-
ditional tuples”, that is, pairs of the form ht, ϕi, where t is a regular tuple and
ϕ is an expression stating under which conditions t can be derived.
    Regarding the query of Example 1, first the conditional evaluation of E ∩ S
is performed, which gives the conditional tuples h⊥, ϕ1 i and h⊥, ϕ2 i, where ϕ1
is the condition (⊥ = mary) and ϕ2 is the condition (⊥ = bob). This intuitively
means that the tuple h⊥i is derived when ⊥ is mary or bob.
    Then, the conditional evaluation of the difference operator is carried out,
yielding the conditional tuples hjohn, ϕ0 i and hmary, ϕ00 i where ϕ0 and ϕ00 are
the following conditions:
      ϕ0 = ¬((john = ⊥) ∧ (⊥ = mary)) ∧ ¬((john = ⊥) ∧ (⊥ = bob)),
      ϕ00 = ¬((mary = ⊥) ∧ (⊥ = mary)) ∧ ¬((mary = ⊥) ∧ (⊥ = bob)).
This is the result of the conditional evaluation of the whole query.
   Conditions are valuable information that can be exploited to determine which
tuples are certain answers. As already mentioned, for a conditional tuple ht, ϕi,
                            Query
                               DB                                        (Approximate)
                                                     GUI
                                                                     Certain Query Answers
               Evalua0on Algorithm


                                      Evalua0on Algorithms’ Engine

                              Naive     Semi-Naive         Lazy       Aware


                                                     DB


                                 Fig. 1. System Architecture.


the expression ϕ says under which condition t can be derived. By condition
evaluation we mean a way of associating ϕ with a truth value (true, false, or
unknown). The aim is to ensure that if ϕ evaluates to true, then t is a certain
answer. For instance, from an analysis of ϕ0 in Example 3 above, one can realize
that the condition is always true (i.e., it holds for every possible value ⊥ stands
for), and thus hjohni is a certain answer.
    Tuples’ conditions can be evaluated in different ways: for instance, an ea-
ger strategy consists in evaluating conditions right after each relational algebra
operator has been evaluated, while an opposite approach consists in evaluating
conditions at the very end, that is, after the entire relational algebra query has
been evaluated.
    We have developed four different strategies leading to different evaluation
algorithms, called naive, semi-naive, lazy, and aware evaluations. They have been
implemented in the ACID system [8], which enables users to query incomplete
databases and get under-approximations of the certain answers, choosing the
evaluation strategy that is most suitable for the application at hand.

2   System Overview
The ACID system has been implemented in Java. The system architecture is
depicted in Figure 1.
   There are three main components: a graphical user interface (GUI), the eval-
uation algorithms’ engine, and the database.
   The GUI allows user to specify the query to be evaluated, the database, and
the type of evaluation to be performed, that is, the approximation algorithm to
be applied. The GUI displays the result of evaluating the specified query over the
provided database according to the chosen evaluation algorithm. Different filters
can be applied to the result (more details are discussed in the next section).
   The system’s engine supports the four evaluation algorithms mentioned in
the previous section, with the naive algorithm being the most efficient but the
least accurate one, and the aware algorithm being the most accurate but the least
efficient one. The basic ideas of the approximation algorithms are as follows:
 – The naive evaluation evaluates tuples’ conditions right after each relational
   algebra operator has been applied, using three-valued logic.
 – The semi-naive evaluation behaves like the naive one, but it better exploits
   equalities in conditions (by propagating values into tuples and conditions)
   to provide more accurate results.
 – The lazy evaluation improves upon the semi-naive one by postponing con-
   ditions’ evaluation until the set difference operator is encountered in the
   query.
 – The aware evaluation provides even more accurate results and behaves as
   follows: it performs the conditional evaluation of the entire query, then it
   uses a set of axioms to “simplify” conditions, and eventually it evaluates
   (simplified) tuples’ conditions.
    The ACID system manages relational databases possibly containing labeled
nulls (in the literature, they have been called naive tables, V-tables, and e-
tables [14, 1, 10]). Thus, the same (labeled) null can occur multiple times—e.g.,
this can be used to express that there are two employees with the same unknown
salary.
    The GUI provides information on the query, the database, and the evalua-
tion strategy to the engine, which computes the approximate certain answers
accessing the database. After the evaluation has been carried out, the engine
returns the result to the GUI.
    The ACID system provides also an API which allow third party applications
to interact with the system.
    We now go into the details of how to interact with the ACID system (cf.
Figure 2).
    A typical interaction with the system involves the following steps:
1. The user specifies the input databases. Specifically, for each table in the
   database, its location in the file system is provided. Tables are supposed to
   be in csv format.
2. The user specifies the query to be evaluated using standard SQL syntax.
   Queries can be loaded from and saved to files.
3. The user specifies the evaluation strategy that has to be applied to evalu-
   ate the query (indeed, the system supports also the “standard” evaluation
   mentioned in the introduction and the conditional evaluation of a query).
4. After the evaluation has been launched and has finished, the result and
   statistics are displayed. Specifically, the result is a set of a tuples, where each
   tuple is associated with a condition that is either true or unknown. Tuples
   associated with true are guaranteed to be certain answers to the input query.
   The result can be filtered with respect to the truth value of the tuples, thus
   displaying only true or only unknown tuples. The total number of true (resp.
   unknown) tuples is displayed as well as the execution time. Results can be
   saved to files.
                           Fig. 2. ACID system’s GUI.


3   Trading-off Quality with Runtime
With the same query and database, moving to more accurate strategie, that is,
from the naive (resp. semi-naive, lazy) evaluation to the semi-naive (lazy, aware)
one, users can see better results, that is, more tuples with condition true (i.e.,
more certain answers), but running times might get higher.
    In general, using the system and analyzing the query syntax, users can figure
out the strategy that is best suited for their purposes.
    As an example, Table 1 reports the execution time, the number of true and
unknown answers for three sample queries over a database with the same schema
of the one in Example 1, with 1000 tuples per relation and 10% of nulls (randomly
generated).
    The queries are:

 – Qsn = E − σ$1=c (S),
 – Qlazy = P − (E ∩ (σ$16=c (S))), and
                        Qsn              Qlazy            Qaware
                 time #true #unk time #true #unk time #true #unk
        Naive     518 741    167   788 710      189 2163 100       731
      Semi-naive 615 763     145 1090 710       189 2347 100       731
         Lazy     661 763    145 1579 737       162 5542 100       731
        Aware    3580 763    145 4350 737       162 10376 231      600
Table 1. Runtime (msecs), number of true and unknown answers for three sample
queries — 10% of nulls.


 – Qaware = P − (E − S),
where c is a value randomly chosen from those in S.
    The purpose of the first scenario is to exhibit a query (namely, Qsn ) that
shows the benefits of going from the naive to the semi-naive evaluation—notice
that, in this case, there is no benefit in applying the lazy or aware evaluation, as
the structure of the query does not have features that can be exploited by them.
    Likewise, the purpose of the second and third scenarios is to show the advan-
tage of using the lazy (resp. aware) evaluation rather than the semi-naive (resp.
lazy) one.

4   Conclusion
Certain answers are a principled manner to answer queries on incomplete databases.
Since their computation is a coNP-hard problem, recent research has focused on
developing polynomial time algorithms providing under-approximations. Lever-
aging conditional tables, we have developed a suite of novel approximation al-
gorithms.
    We have implemented them in the ACID system, which allows users to
query incomplete information and get approximate answers with the flexibil-
ity of choosing the technique that best meets their needs in terms of balance
between efficiency and quality of the result’s approximation.

References
 1. Abiteboul, S., Grahne, G.: Update semantics for incomplete databases. In: Proc.
    Very Large Data Bases (VLDB) Conference. pp. 1–12 (1985)
 2. Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent
    databases. In: Proc. Symposium on Principles of Database Systems (PODS). pp.
    68–79 (1999)
 3. Bertossi, L.E.: Database Repairing and Consistent Query Answering. Synthesis
    Lectures on Data Management, Morgan & Claypool Publishers (2011)
 4. Bienvenu, M., Ortiz, M.: Ontology-mediated query answering with data-tractable
    description logics. In: Reasoning Web. pp. 218–307 (2015)
 5. Calı̀, A., Gottlob, G., Lukasiewicz, T.: A general datalog-based framework for
    tractable query answering over ontologies. Journal of Web Semantics 14, 57–83
    (2012)
 6. Console, M., Guagliardo, P., Libkin, L.: Approximations and refinements of certain
    answers via many-valued logics. In: Proc. International Conference on Principles
    of Knowledge Representation and Reasoning (KR). pp. 349–358 (2016)
 7. De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: On reconciling data ex-
    change, data integration, and peer data management. In: Proc. Symposium on
    Principles of Database Systems (PODS). pp. 133–142 (2007)
 8. Fiorentino, N., Greco, S., Molinaro, C., Trubitsyna, I.: ACID: A system for com-
    puting approximate certain query answers over incomplete databases. In: Proc. In-
    ternational Conference on Management of Data (SIGMOD). pp. 1685–1688 (2018)
 9. Furfaro, F., Greco, S., Molinaro, C.: A three-valued semantics for querying and
    repairing inconsistent databases. Annals of Mathematics and Artificial Intelligence
    51(2-4), 167–193 (2007)
10. Grahne, G.: The Problem of Incomplete Information in Relational Databases, Lec-
    ture Notes in Computer Science, vol. 554. Springer (1991)
11. Greco, S., Molinaro, C., Spezzano, F.: Incomplete Data and Data Dependencies in
    Relational Databases. Synthesis Lectures on Data Management, Morgan & Clay-
    pool Publishers (2012)
12. Greco, S., Molinaro, C., Trubitsyna, I.: Computing approximate certain answers
    over incomplete databases. In: Proc. Alberto Mendelzon International Workshop
    on Foundations of Data Management and the Web (AMW) (2017)
13. Guagliardo, P., Libkin, L.: Making SQL queries correct on incomplete databases: A
    feasibility study. In: Proc. Symposium on Principles of Database Systems (PODS).
    pp. 211–223 (2016)
14. Imielinski, T., Jr., W.L.: Incomplete information in relational databases. Journal
    of the ACM 31(4), 761–791 (1984)
15. Koutris, P., Wijsen, J.: The data complexity of consistent query answering for self-
    join-free conjunctive queries under primary key constraints. In: Proc. Symposium
    on Principles of Database Systems (PODS). pp. 17–29 (2015)
16. Lenzerini, M.: Data integration: A theoretical perspective. In: Proc. Symposium
    on Principles of Database Systems (PODS). pp. 233–246 (2002)
17. Libkin, L.: How to define certain answers. In: Proc. International Joint Conference
    on Artificial Intelligence (IJCAI). pp. 4282–4288 (2015)
18. Libkin, L.: Certain answers as objects and knowledge. Artificial Intelligence 232,
    1–19 (2016)