<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Codd Semantics of SQL Nulls</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Guagliardo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonid Libkin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Informatics, University of Edinburgh</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Theoretical models used in database research often have subtle di erences with those occurring in practice. One particular mismatch that is usually neglected concerns the use of marked nulls to represent missing values in theoretical models of incompleteness, while in an SQL database these are all denoted by the same syntactic NULL object. It is commonly argued that results obtained in the model with marked nulls carry over to SQL, because SQL nulls can be interpreted as Codd nulls, which are simply marked nulls that do not repeat. This argument, however, does not take into account that even simple queries may produce answers where distinct occurrences of NULL do in fact denote the same unknown value. For such queries, interpreting SQL nulls as Codd nulls would incorrectly change the semantics of query answers. To use results about Codd nulls for real-life SQL queries, we need to understand which queries preserve the Codd interpretation of SQL nulls. We show, however, that the class of relational algebra queries preserving Codd interpretation is not recursively enumerable, which necessitates looking for su cient conditions for such preservation. Those can be obtained by exploiting the information provided by NOT NULL constraints on the database schema. We devise mild syntactic restrictions on queries that guarantee preservation, do not limit the full expressiveness of queries on databases without nulls, and can be checked e ciently.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Query evaluation is a fundamental task in data management, and very often it
has to be performed on databases that have incomplete information. This is
especially true in applications such as data integration [
        <xref ref-type="bibr" rid="ref4 ref7">7, 4</xref>
        ], data exchange [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
ontology-based data access [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ] that rely on the standard tools o ered by
existing relational database technology, in particular SQL, in order to take advantage
of its e ciency. There is much theoretical research on query answering and its
applications that uses well established models of incompleteness. However, there
is an important and often neglected mismatch between theoretical models and
real-life SQL, which has an impact on the semantics of query answers.
      </p>
      <p>Theoretical work traditionally adopts a model of incompleteness where
missing values in a database are represented by marked (also called naive or labeled)
nulls. In SQL, however, missing values are all represented by the same syntactic
object: the infamous NULL. Marked nulls are more expressive than SQL nulls:
two unknown values can be asserted to be the same just by denoting them with
the same null. For example, in a relation R with attributes name and age, one
can put tuples (\Alice"; ?1) and (\Bob"; ?1) to express the fact that Alice and
Bob have the same age, even though their actual age, represented by the null ?1,
is unknown. However, there is a standard argument often found in the literature:
SQL nulls are modeled by Codd nulls, that is, non-repeating marked nulls. Then
each NULL is interpreted as a fresh marked null that does not appear anywhere
else in the database.</p>
      <p>
        But do Codd nulls properly model SQL nulls? It is sometimes argued (e.g.,
in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) that they are di erent, but this assumes a special type of evaluation {
called naive [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] { under which marked nulls are simply viewed as new constants,
e.g., ?1 = ?1 but ?1 6= ?2 etc. Under naive evaluation, the conjunctive query
q(x) :{ R(x; y); R(z; y0); y = y0 will produce f\Alice"; \Bob"g, while the same
query in SQL, SELECT R1.name FROM R R1, R R2 WHERE R1.age = R2.age, will
produce the empty set.
      </p>
      <p>This mismatch is due to the SQL's use of 3-valued logic (3VL): comparisons
involving nulls in SQL result in the truth value unknown, even if a null is
compared to itself, which happens in the above example. But this is easy to x by
using the same comparison semantics for Codd nulls: every condition x = y or
x 6= y evaluates to unknown if x or y is a null. So, is this su cient to reconcile
the mismatch between Codd nulls and SQL nulls?</p>
      <p>The answer is still negative, because we have not taken into account the role
of queries. If SQL nulls are to be interpreted as Codd nulls, this interpretation
should apply to input databases as well as query answers, which are incomplete
databases themselves. To explain this point, let codd(R) be the result of
replacing SQL nulls in R with distinct marked nulls, e.g., for R = f(\Alice"; NULL);
(\Bob"; NULL)g, we would have codd(R) = f(\Alice"; ?1); (\Bob"; ?2)g.
Technically, codd(R) is the set of such relations, because we can choose distinct marked
nulls arbitrarily (e.g., f?3; ?4g instead of f?1; ?2g), but they are all isomorphic.
To ensure that Codd nulls faithfully represent SQL nulls for a query Q, we need
to enforce the condition in the diagram below:</p>
      <p>D
codd</p>
      <p>D0</p>
      <p>Q
Q</p>
      <p>Q(D)
Q(D0)
codd</p>
      <p>Intuitively, it says the following: take an SQL database D, and compute the
answer to Q on it, i.e., Q(D). Now take some D0 2 codd(D) and compute Q(D0).
Then Q(D0) must be in codd Q(D) , i.e., there must be a way of assigning Codd
nulls to SQL nulls in Q(D) that will result in Q(D0).</p>
      <p>Does this condition hold for queries in some su ciently expressive language
like relational algebra? Unfortunately, the answer is negative, even for very
simple queries. Take for example a database D with T = fNULLg and S = f1; 2g, and
consider the query Q that computes the Cartesian product of T and S. The
answer to Q on D is Q(D) = f(NULL; 1); (NULL; 2)g. The Codd interpretation of D is
a database D0 where T = f?1g and S = f1; 2g, and Q(D0) = f(?1; 1); (?1; 2)g.
Since ?1 is repeated, Q(D0) cannot be obtained by replacing SQL nulls in Q(D)
with Codd nulls.</p>
      <p>The culprit in the above example is that the query, evaluated on the Codd
interpretation of the original SQL database, produces an answer that is not a
Codd table, as it contains repeated nulls. While in this particular example it is
easy to detect such an undesirable behavior, we show that in general the class of
relational algebra queries that transform SQL databases into Codd databases is
not recursively enumerable, and thus it is impossible to capture it by a syntactic
fragment of the language.</p>
      <p>Given the lack of an e ective syntax for queries that preserve Codd semantics
in query answers, we can only hope for su cient conditions, i.e., nding
reasonable syntactic fragments that guarantee this property. However, simply choosing
a subset of relational algebra operations will not give us a useful restriction: as
we saw before, one of the problematic constructs is Cartesian product, and we
obviously do not want a fragment that prohibits joins. Thus, we must look for a
more re ned solution.</p>
      <p>The idea is to consider database schemas with NOT NULL constraints, which
are very common in practice: base relations in real SQL databases will almost
always have a PRIMARY KEY declared on them. We then exploit these constraints
to come up with mild restrictions on queries, which have the following properties:
{ they guarantee the preservation of Codd semantics of SQL nulls in query
answers on all incomplete databases;
{ they can be checked e ciently; and
{ they do not restrict the full expressiveness of relational algebra queries on
databases without nulls, where all attributes are declared as NOT NULL.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>We assume three countably in nite and pairwise disjoint sets of elements: names
(Names), constants (Const), and nulls (Null). Elements of Null are denoted by ?,
possibly with sub- or superscripts, and we call a signature any nite subset of
Names. A naive record is a map from some signature to Const [ Null. To be as
close as possible to SQL's data model, we de ne a naive table as a bag of naive
records (which may occur multiple times) over the same signature.
Schemas and databases. A relational schema consists of a signature of
relation names and a function sign that maps each relation name R to a signature
sign(R), whose elements are called the attributes of R. A naive database D
associates each relation name R with a naive table RD over sign(R). We denote the
sets of constants and nulls occurring in D by Const(D) and Null(D), respectively.
A Codd database (respectively, table) is a naive one in which nulls do not repeat,
that is, there can be at most one occurrence of each element from Null. Note
that a Codd database is a set of Codd tables, but the converse need not hold.
SQL databases. We use the special symbol N 62 Const [ Null to denote SQL's
NULL value. SQL databases (as well as tables and records) are de ned in a similar
way as naive ones, but they are populated with elements of Const [ fNg rather
than Const [ Null.</p>
      <p>
        Query language. We use relational algebra on bags, with the usual operations
of projection , selection , Cartesian product , union [, di erence ,
intersection \, renaming , and duplicate elimination ". These are all de ned in the
standard way [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Selection conditions are Boolean combinations of comparisons
x = y and null(x), where each x and y is either a constant or an attribute. The
condition null(x), corresponding to IS NULL in SQL, tests whether the value of
x is null. We use x 6= y and const(x) for :(x = y) and : null(x), respectively.
      </p>
      <p>This language captures the basic fragment of SQL: SELECT
[DISTINCT]-FROMWHERE queries, with (correlated) subqueries preceded by possibly negated IN and
EXISTS, combined by UNION, INTERSECT and EXCEPT, possibly with ALL.</p>
      <p>The answer to a query Q on a (naive or SQL) database D is denoted by Q(D).
Selection conditions are evaluated using SQL's three-valued logic: (in)equality
comparisons result in unknown if at least one of the arguments is null, and truth
values are propagated through the connectives ^, _, : following the well known
truth tables of Kleene logic. Then, for a table T , the operation (T ) returns all
occurrences of each record r in T for which the condition evaluates to true.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Codd Interpretation of SQL Nulls</title>
      <p>Di erently from naive databases, where nulls are elements of Null, missing values
in SQL are all denoted by the special symbol N. To reconcile this mismatch, the
occurrences of N in an SQL database are typically interpreted as non-repeating
elements of Null. That is, an SQL database can be seen as a Codd database where
each occurrence of N is replaced by a fresh distinct marked null. Obviously, these
nulls can be chosen arbitrarily as long as they do not repeat, so an SQL database
may admit in nitely many Codd interpretations in general.</p>
      <p>To make this notion more precise, let us denote by sql(D) the SQL database
obtained from D by replacing each null (either N or an element of Null) with N.
Note that sql(D) is uniquely determined for any given (naive or SQL) database
D. Then, for an SQL database D, we de ne
codd(D) = fD0 j D0 is a Codd database such that sql(D0) = Dg
(1)
Even though this set may be in nite, its elements are all isomorphic. Thus, with
some abuse of terminology, we can speak of the Codd interpretation of an SQL
database, which is unique up to renaming of nulls.</p>
      <p>Answers to queries are tables, which may of course contain nulls. So, if SQL
nulls are interpreted as Codd nulls, this should apply also to query answers, not
just input databases. More precisely, when computing the answer to a query Q
on an SQL database D, there should always be a way of assigning Codd nulls to
SQL nulls in Q(D) so as to obtain Q(D0), where D0 is the Codd interpretation of
D. In other words, the Codd interpretation of Q(D) should be indistinguishable
from Q(D0), that is, isomorphic to it. This requirement was shown in the diagram
in the introduction and it is formally de ned below.</p>
      <p>De nition 1. A query Q preserves Codd semantics if, for every SQL database
D and for every D0 2 codd(D), it holds that</p>
      <p>The above de nition can also be equivalently formulated as follows: a query
Q preserves Codd semantics if, for every Codd database D,</p>
      <p>Q(D0) 2 codd Q(D) :
sql Q(D) = Q sql(D) and</p>
      <sec id="sec-3-1">
        <title>Q(D) is a Codd table. Intuitively, in order to preserve Codd semantics a query must (2a) preserve the occurrences of nulls, regardless of repetitions, and (2b) avoid repetitions of nulls in the answers.</title>
        <p>These two requirements are in fact independent of each other. For example,
the query Q = R S satis es (2a) but not (2b), because on the Codd database
D where R = f?1g and S = f?2; ?3g the answer Q(D) = f(?1; ?2); (?1; ?3)g
contains repeated nulls. On the other hand, the query Q0 = R \ S satis es (2b)
but not (2a), because sql Q0(D) = ? 6= fNg = Q0 sql(D) .</p>
        <p>Can we syntactically capture the class of queries preserving Codd semantics?
Unfortunately, the answer is no.</p>
        <p>Proposition 1. For every schema with at least one binary relation symbol, the
set of queries that preserve Codd semantics is not recursively enumerable.</p>
        <p>To prove this, we can show that the set of preserving queries is not recursive,
by reduction to the undecidability of emptiness of relational algebra expressions;
since the complement of the set is clearly r.e., the result follows. Thus, we should
look for syntactic restrictions that provide su cient conditions for the
preservation of Codd semantics. This is what we do in the next section.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Queries Preserving Codd Semantics</title>
      <p>Since the set of queries that preserve Codd semantics cannot be captured
syntactically, we can only hope for syntactic restrictions that are su cient to guarantee
this property. However, simply choosing a subset of relational algebra operations
would be too restrictive to be useful: as we have seen in the examples, Cartesian
product is one of the operations causing problems with the preservation of Codd
semantics, and we certainly do not want a fragment that forbids joins altogether.</p>
      <p>This suggests that, to come up with useful restrictions, we need some
additional information beyond the query itself. Real databases typically must satisfy
some integrity constraints, and when it comes to incomplete SQL databases the
most basic form of constraint amounts to declaring some of the attributes in the
(2)
(2a)
(2b)
schema as NOT NULL. The use of this kind of constraints is extremely common in
practice, because each base table in a real-life SQL database has almost always
a subset of its attributes declared as PRIMARY KEY, which implies NOT NULL.</p>
      <p>We model NOT NULL constraints as follows: the signature sign(R) of each base
relation R is partitioned into two sets signnull(R) and sign:null(R) of nullable and
non-nullable attributes, so that sign:null(R)(RD) contains only elements of Const.
That is, nulls are not allowed as values of the attributes in sign:null(R), while
there are no restrictions on the values of the attributes in signnull(R). We extend
these notions to relational algebra queries as follows:
signnull Q1 ? Q2 = signnull(Q1) [ signnull(Q2) for ? 2 f ; [g
signnull Q1 \ Q2 = signnull(Q1) \ signnull(Q2)
signnull Q1</p>
      <p>Q2 = signnull(Q1)
signnull
signnull
(Q) = signnull(Q) \
(Q) = signnull "(Q) = signnull(Q)
signnull A!B(Q) =
( signnull(Q)
signnull(Q)
fAg [ fBg if A 2 signnull(Q)
otherwise
sign:null(Q) = sign(Q)</p>
      <p>signnull(Q)</p>
      <sec id="sec-4-1">
        <title>From the above de nition we immediately obtain:</title>
        <p>Proposition 2. Let Q be a query and D be a database. Then, sign:null(Q) Q(D)
consists only of elements of Const.</p>
        <p>That is, independently of the data, the answer to a query Q can only have null
values for the attributes in signnull(Q). We say that a query Q is non-nullable if
signnull(Q) = ?.</p>
        <p>To explain the restrictions that ensure preservation of Codd semantics, we
need two de nitions.</p>
        <p>De nition 2. The parse tree of a query Q is constructed as follows:
{ Each relation symbol R is a single node labeled R.
{ For each unary operation op1, the parse tree of op1(Q) has root labeled op1
and the parse tree of Q rooted at its single child.
{ For each binary operation op2, the parse tree of Q op2 Q0 has root labeled
op2 and the parse trees of Q and Q0 rooted at its left child and right child,
respectively.</p>
        <p>Note that each node in the parse tree of Q de nes a subquery of Q, so we can
associate properties of such queries with properties of parse tree nodes.
De nition 3. The base of a query is the set of relation names appearing in it.
A node in the parse tree of a query satis es:</p>
        <p>NNC (non-nullable child) if it is not a leaf and at least one of its children is a
non-nullable query;
NNA (non-nullable ancestor) if either itself or one of its ancestors is a query
that is non-nullable;
DJB (disjoint base) if it corresponds to a binary operation and the base of its
left child is disjoint with the base of its right child.</p>
      </sec>
      <sec id="sec-4-2">
        <title>We now state the main result.</title>
        <p>Theorem 1. Let Q be a relational algebra query whose parse tree is such that:
(a) each ", \, and node satis es NNC;
(b) each node satis es NNA;
(c) each [ node satis es NNC or DJB or NNA.</p>
        <p>Then Q preserves Codd semantics.</p>
        <p>These conditions do not restrict the full expressiveness of relational algebra
on databases without nulls: when all attributes in the schema are non-nullable,
every query is such, hence the restrictions are trivially satis ed.
Corollary 1. On database schemas without nullable attributes, every relational
algebra query satis es the conditions of Theorem 1.
tu</p>
      </sec>
      <sec id="sec-4-3">
        <title>Also, the restrictions are easy to check:</title>
        <p>Proposition 3. Deciding whether a query satis es the conditions of Theorem 1
can be done in linear time w.r.t. the number of nodes in its parse tree.</p>
        <p>We will now outline the main ideas behind the restrictions. If a query is
nonnullable, then it trivially satis es (2b). But there exist non-nullable queries for
which (2a) does not hold: e.g., A "(R) when R has attributes A; B, of which
only A is non-nullable. Thus, non-nullability needs to be used more carefully, on
the subqueries of Q.</p>
        <p>Duplicate elimination, intersection and di erence cause problems with (2a).
What these operations have in common is that they match nulls syntactically,
i.e., as if they were constants. Now, SQL nulls are all syntactically the same, but
this is not the case for Codd nulls. So, for these operations, it makes a di erence
whether we are using Codd nulls or SQL nulls. On the other hand, the remaining
operations are not a ected by this: projection, union and Cartesian product do
not rely on syntactic matching of nulls, and nulls in selection conditions are not
compared syntactically (equality and inequality conditions involving at least one
null result in unknown). Thus, we only need to take care of ", and \, and here
is where the NNC condition comes into play. Requiring it for ", \ and nodes
is enough to ensure that the query satis es (2a). Intuitively, when at least one
of the input subqueries to each of these operations is non-nullable, no syntactic
matching of nulls can occur, simply because one of the operands (the only one
for ") will not have nulls at all.</p>
        <p>Proposition 4. Let Q be a query such that each ", \ and
tree satis es NNC. Then Q satis es property (2a).
node in its parse</p>
        <p>If in addition to the condition of Proposition 4 we also required the query
itself to be non-nullable, then preservation of Codd semantics would be
guaranteed. However, this is too restrictive, as it would forbid even the simple retrieval
of a base relation whenever some of its attributes are nullable. So, we need more
re ned ways of ensuring (2b) by restricting only the problematic operations.</p>
        <p>Duplicate elimination, di erence, selection and intersection always produce
a table that is contained in at least one of the input tables, so their output may
have repeated nulls only if these repetitions were already in the input. The same
holds for projection, for a similar reason. The problematic operations w.r.t. (2b)
are [ and . Both can create repetitions of nulls across records, and the latter
can also create repetitions of nulls within a record. To restrict these operations
appropriately, we use the NNA condition. Requiring this condition for each and
[ node is enough to ensure that the query satis es (2b). Intuitively, repetitions
of nulls that may be created by [ and are allowed in the intermediate results
of a query, but only as long as they will be eventually discarded, at the latest
when the nal output is produced.</p>
        <p>In fact, the NNA condition for [ can be relaxed. We want to restrict a union
operation when it may create \new" repetitions of nulls, but we need not do so
when the only repetitions it may produce are those inherited from the input,
which were introduced by the application of previous operations. There are two
cases in which this happens: when one of the operands is non-nullable, or when
the input tables are built from disjoint sets of base relations. The rst is captured
by the NNC condition, and the second by the DJB condition. We can then allow
for these additional possibilities by requiring each [ node to satisfy NNA or NNC
or DJB.</p>
        <p>This explains the need for the conditions of De nition 3. As an example, let
us consider the query B!C (R [ S) [ D!A "(T ) A;C A=1_B=C (R T ) ,
whose parse tree is shown below:
fA; Cg
fA; Cg [</p>
        <p>A;C fA; Cg
B!C fA; Cg</p>
        <p>D!A fA; Cg</p>
        <p>A=1_B=C fA; B; C; Dg
[ fA; Bg
" fC; Dg
fA; B; C; Dg
R fA; Bg</p>
        <p>S fA; Bg</p>
        <p>T fC; Dg</p>
        <p>R fA; Bg</p>
        <p>T fC; Dg
Next to each node we indicate the corresponding signature, where non-nullable
attributes are underlined (for the leaves, this information is given by the schema).
On the left side of the tree, we see that the innermost [ node (lower in the tree)
satis es DJB (but not NNC), the " node satis es NNC and so is non-nullable, and
in turn the outermost [ node satis es NNC too. On the right side, we see that the
node satis es NNA because its ancestor is non-nullable, which also ensures
that the root node \ " satis es NNC. Thus, the query preserves Codd semantics.</p>
        <p>We conclude this section with an outline of the proof that checking the
conditions of Theorem 1 can be done in linear time. Let n be the number of nodes in
the binary parse tree of the query. First, we compute the base and the nullable
attributes of each node by scanning the tree bottom up (n steps). We then mark
all ", \, [, nodes satisfying NNC, all [ nodes satisfying DJB, all , , nodes,
and all leaves; for each node, we visit up to two child nodes, so this requires
at most 2 n steps. Next, by traversing the tree breadth- rst from the root, we
remove the subtrees rooted in non-nullable nodes; in the worst case (when all
nodes are nullable) this takes n steps. Finally, the query satis es the conditions
of Theorem 1 i all nodes in the resulting tree are marked, which can be checked
in one more pass (n steps).
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Derived Operations and Further Re nements</title>
      <p>In our query language we have explicitly included intersection, even though this
operation is expressible in terms of di erence. While it may seem redundant to
have \ in the syntax of queries, this is not the case w.r.t. our restrictions for the
preservation of Codd semantics. To see why, suppose we want the intersection of
base relations R and S, but we do not have \ available as a primitive operation
in our query language. Of course, we can always write R (R S) or S (S R);
the problem is that, even though these queries are equivalent, the former satis es
the conditions of Theorem 1 i R is non-nullable, while the latter satis es them
i S is non-nullable. In either case, the requirement is stronger than the one for
R \ S, which is in fact given by their disjunction: R \ S satis es the conditions
of Theorem 1 i R is non-nullable or S is non-nullable.</p>
      <p>This suggests that adding some derived operations explicitly to the syntax of
queries may lead to milder restrictions, allowing us to capture more queries
preserving Codd semantics. We will show two such re nements: constant selections
and semi/antijoins.</p>
      <p>Selections. Looking at may seem counterintuitive as this operation did not
need to be restricted in any way. However, consider the query const(A)(R) \ S,
for unary relation symbols R and S over a nullable attribute A. This query does
not satisfy the conditions of Theorem 1, because neither const(A)(R) nor S are
non-nullable. But even though signnull const(A)(R) = fAg, we can rest assured
that in fact no nulls will appear in const(A)(RD). Here the idea is to include in
our query language the operation of constant selection const( ) which, for a table
T and sign(T ), returns all occurrences of each record r in T such that r(A)
is a constant for every attribute A 2 . Then, we de ne signnull const( )(Q) =
signnull(Q) .</p>
      <p>Semijoins and antijoins. The theta-join of two tables T1; T2 with disjoint sets
of attributes is T1 on T2 = T1 T2 . Adding on to the language does not
change anything, but two operations derived from it can be added: semijoin n
and antijoin n . These essentially correspond to EXISTS and NOT EXISTS in
SQL. For tables T1; T2 with disjoint sets of attributes, they are de ned as:
T1 n T2 = T1 \ sign(T1)(T1 on T2)
T1 n T2 = T1</p>
      <p>T1 n T2
That is, T1 n T2 returns all occurrences of each record r in T1 for which there
is a record s in T2 such that is true on r [ s. Similarly, T1 n T2 returns all
occurrences of each record r in T1 for which there is no record s in T2 s.t. is
true on r [ s. We de ne signnull(Q1 n Q2) = signnull(Q1 n Q2) = signnull(Q1).</p>
      <p>How do we restrict these operations to ensure (2a) and (2b)? Observe that
n is an intersection and n is a di erence, and syntactic matching of nulls must
be prevented in general for these operations. Here, however, one of the operands
to this intersection/di erence is always contained in the other, independently of
whether SQL nulls or Codd nulls are used, so syntactic matching of nulls is not
a problem in this case. In turn, we need no restriction on n and n to guarantee
(2a). As for (2b), observe that the output of both n and n is always contained
in their left input, so no new repetitions of nulls can be created and, in turn, we
do not need any restriction in order to ensure (2b) either. Repetitions of nulls
that might have been created in the right operand of n or n are discarded. Thus
we can use an alternative to NNA, called RDS: a node in a parse tree satis es this
condition if it is the right-descendant of a n or a n operation.</p>
      <p>Theorem 2. Let Q be a query (extended with n, n, const) such that: (a) each
", \, node satis es NNC; (b) each node satis es NNA or RDS; (c) each [ node
satis es NNC or DJB or NNA or RDS. Then, Q preserves Codd semantics.</p>
      <p>An analog of Corollary 1 follows, and simple modi cations to the algorithm
of Proposition 3 show that these conditions are still linear-time testable.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>The notion of Codd database we adopted in this paper only allows for constant
tuples to occur multiple times, as nulls cannot repeat. Relaxing this requirement
to also allow for duplicates of non-constant tuples raises several questions: How
do we interpret multiple occurrences of a tuple with nulls in an SQL database?
What is preservation of Codd semantics in this context? Do our restrictions still
guarantee it?</p>
      <p>Another interesting direction for future investigation concerns set semantics.
Observe that a query Q can always be rewritten into a query Q0 where duplicate
elimination is suitably applied in each subexpression so that, on set databases
(i.e., in which relations are sets), evaluating Q under set semantics is equivalent
to evaluating Q0 under bag semantics. If Q0 satis es our restrictions, preservation
of Codd semantics is guaranteed on all (bag and set) databases and, in turn, Q
preserves Codd semantics on set databases. However, this is too restrictive for
even simple queries (e.g., R [ S) and weaker conditions may be devised to deal
speci cally with set semantics.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>Work partially supported by EPSRC grants EP/N023056/1 and EP/M025268/1.
The authors would like to thank the anonymous reviewers for their useful
comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barcelo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Libkin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murlak</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Foundations of Data Exchange</article-title>
          . Cambridge University Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Giacomo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Tractable reasoning and e cient query answering in description logics: The DL-Lite family</article-title>
          .
          <source>J. Autom. Reasoning</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <volume>385</volume>
          {
          <fpage>429</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ullman</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Widom</surname>
          </string-name>
          , J.:
          <article-title>Database systems: The complete book</article-title>
          .
          <source>Pearson Education</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ordille</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Data integration: The teenage years</article-title>
          .
          <source>In: VLDB</source>
          . pp.
          <volume>9</volume>
          {
          <issue>16</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Imielinski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipski</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Incomplete information in relational databases</article-title>
          .
          <source>Journal of the ACM</source>
          <volume>31</volume>
          (
          <issue>4</issue>
          ),
          <volume>761</volume>
          {
          <fpage>791</fpage>
          (
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kontchakov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zakharyaschev</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The combined approach to ontology-based data access</article-title>
          .
          <source>In: IJCAI</source>
          . pp.
          <volume>2656</volume>
          {
          <issue>2661</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Data integration: a theoretical perspective</article-title>
          .
          <source>In: ACM Symposium on Principles of Database Systems (PODS)</source>
          . pp.
          <volume>233</volume>
          {
          <issue>246</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Libkin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>SQL's three-valued logic and certain answers</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .
          <volume>41</volume>
          (
          <issue>1</issue>
          ), 1:
          <issue>1</issue>
          {1:
          <issue>28</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>