-

0 Free University of Bozen-Bolzano , Italy lastname @inf.unibz.it

114 128

The logic of nulls in databases has been subject of investigation since their introduction in Codd's Relational Model, which is the foundation of the SQL standard. In the logic based approaches to modelling relational databases proposed so far, nulls are considered as representing unknown values. Such existential semantics fails to capture the behaviour of the SQL standard. We show that, according to Codd's Relational Model, a SQL null value represents a non-existing value; as a consequence no indeterminacy is introduced by SQL null values. In this paper we introduce an extension of rst-order logic accounting for predicates with missing arguments. We show that the domain independent fragment of this logic is equivalent to Codd's relational algebra with SQL nulls. Moreover, we prove a faithful encoding of the logic into standard rst-order logic, so that we can employ classical deduction machinery.

1 2 R : a b b N

SELECT * FROM R WHERE R.1 = R.1 AND R.2 = R.2 ;

In SQL, the query above returns the table R if and only if the table R does not have any null value, otherwise it returns just the tuples not containing a null value, i.e., in this case only the rst tuple ha; bi. Informally this is due to the fact that a SQL null value is never equal (or not equal) to anything, including itself. How can we formally capture this behaviour?

In this paper we introduce a formal semantics for SQL null values in order to capture exactly the behaviour of SQL queries and SQL constraints in presence of null values. We restrict our attention to the rst-order fragment of SQL { e.g., we do not consider aggregates, and both queries and constraints should be set based (as opposed to bag/multi-set based): this fragment can be expressed, in absence of null values, into the standard relational algebra RA. Note that the standard relational algebra RA is more expressive than the null-free rst-order fragment of SQL, since it can express relations of arity zero [ 1 ].

To the best of our knowledge, there has been no attempt so far to formalise in logic a relational algebra with proper SQL null values. It is well known that SQL null values require a special semantics. Indeed, if they were treated as standard database constants, the direct translation in the standard relational algebra RA of the above SQL query would be equivalent to the identity expression for R: 1=1 2=2 R, giving as an answer the table R itself, namely the tuples fha; bi; hb; Nig. However, we have seen above that the expected answer to this query is di erent.

The most popular semantics in the literature for null values is the one interpreting null values with an existential meaning, namely a null value denotes an object which exists but has an unknown identity: this is the semantics of naive tables (see [ 2 ]). It is easy to see that also with this semantics, the direct translation of the above SQL query in the standard relational algebra RA over naive tables would be equivalent to the identity expression for R: 1=1 2=2 R, giving as an answer the table R itself, namely the tuples fha; bi; hb; Nig. And again, we have seen above that the expected answer to this query is di erent.

This paper is organised as follows. We rst introduce three equivalent di erent representations of databases with null values: they will be used in di erent contexts to prove the equivalence of the di erent semantics given to SQL null values. Then, we introduce the relational algebra with null values RAN, as an obvious extension to the standard relational algebra RA and compatible with the documents de ning SQL. We introduce an extension to standard rst-order logic FOL, called FOL", which takes into account the possibility that some of the arguments of a relation might not exist. We then show that FOL" is equally expressive in a strong sense to the standard rst-order logic FOL, which allows us to devise simple implementations of FOL" by relying to standard rst-order provers. At the end, we prove our main result: RAN is equally expressive to the domain independent fragment of FOL" with the standard name assumption, a result which parallels in an elegant way the classical equivalence result by Codd for the standard relational algebra RA and the domain independent fragment of rst-order logic FOL with the standard name assumption. 2

Database Instances with Null Values

We introduce here the notions of tuple, of relation, and of database instance in presence of null values. We consider the unnamed (positional) perspective for the attributes of tuples: elements of a tuple are identi ed by their position within the tuple.

Given a set of domain values , an n-tuple is a total function from integers from 1 up to n (the position of the element within the tuple) into the set of domain values augmented by the special term N (the null value): if we denote the set fi 2 N j 1 i^i ng, possibly empty if n = 0, as [1 n], then an n-tuple is a total function [1 n] 7! [ fNg. Note that the zero-tuple is represented by a constant zero-ary function that we call fg. For example, the tuple hb; Ni is (a) instance IN (b) instance I"

Ref1;2g=2 : ff1 7! a; 2 7! bgg

Ref1g=1 : ff1 7! bgg Ref2g=1 : fg Refg=0 : fg (c) instance I} represented as f1 7! b; 2 7! Ng, while the zero-tuple hi is represented as fg. A relation of arity n is a set of n-tuples; if we want to specify that n is the arity of a given relation R, we write the relation as R=n. Note that a relation of arity zero is either empty or it includes only the zero-tuple fg. A relational schema R includes a set of relation symbols with their arities and a set of constant symbols C. A database instance IN associates to each relation symbol R of arity n from the relational schema R a set of n-tuples IN(R), and to each constant symbol in C a domain value in . Usually, in relational databases all constant symbols are among the domain values and are associated in the database instance to themselves { this is called the Standard Name Assumption.

As an example of a database instance IN consider Figure 2(a).

In our work we consider two alternative representations of a given a database instance IN, where null values do not appear explicitly: I" and I}. We will show that they are isomorphic to IN. The representations share the same constant symbols, domain values, and mappings from constant symbols to domain values. The di erences are in the way null values are encoded within a tuple.

Compared with a database instance IN, a corresponding database instance I" di ers only in the way it represents n-tuples: an n-tuple is a partial function from integers from 1 up to n into the set of domain values - the function is unde ned exactly for those positional arguments which are otherwise de ned and mapped to a null value in IN.

That is, if R 2 R is an n-ary relation, then

I"(R) = ft 2 ([1 n] 7! C) j 9t0 2 IN(R): t0(i) 6= N ! t(i) = t0(i) ^ t0(i) = N ! t(i) unde ned g

Obviously, also the inverse correspondence exists:

IN(R) = ft0 2 ([1 n] 7! C [ fNg) j 9t 2 I"(R): t(i) de ned ! t0(i) = t(i) ^ t(i) unde ned ! t0(i) = Ng As an example of a database instance I" consider Figure 2(b). From the gure is clear that this representation takes a di erent approach w.r.t. the standard relational model: i.e., relations may include tuples of dishomogeneous arity. We will introduce the third kind of representation in order to demonstrate that this approach can be embedded into a standard relational model by considering an extended vocabulary.

Compared with a database instance I", a corresponding database instance I} di ers in the way relation symbols are interpreted: a relation symbol of arity n is associated to a set of novel relation symbols having the arity less or equal to n. The main idea is that a tuple with null arguments ends up in the corresponding relation which excludes those arguments (see Figure 2(c)). The concept is analogous to the notion of lossless horizontal decomposition as described, e.g., in [ 1 ].

Given a database instance IN de ned over a relational schema R, the corresponding database instance I} is de ned over the decomposed relational schema Re: for each relation symbol R 2 R of arity n and for each (possibly empty) subset of its positional arguments A [1 n], the decomposed relational schema Re includes a predicate ReA with arity jAj. The correspondence between IN and I} is based on the fact that each jAj-tuple in the relation I}(ReA) corresponds exactly to the n-tuple in IN(R) having non-null values only in the positional arguments in A, with the same values and in the same relative positional order. That is, if R 2 R is an n-ary relation, then Abusing the notation for the sake of simplicity, we assume that if k = 0 then f1; : : : ; kg = fi1; : : : ; ikg = ;.

In absence of null values, clearly the IN and I" representations of a database instance coincide. The third representation cannot be directly compared because of the di erent signature, but it can be noted that, in absence of null values, for every n-ary relation R in R, I}(ReA) = ; if jAj < n. Therefore for each n-ary relation in R, IN(R) = I"(R) = I}(Re[1 n]).

Given the discussed isomorphisms, in the following { whenever the di erence in the representation of null values does not generate ambiguity { we will denote as I the database instance represented in any of the above three forms IN, I", or I}.

Constant singleton - hvi - (where v 2 C)

R(I) = IN(R): hvi(I) = f1 7! vg: Selection - i=v e, i=j e - (where v 2 C, ` is the arity of e, and i; j `) i=v e(I) = fs is a `-tuple j s 2 e(I) ^ s(i) = vg; i=j e(I) = fs is a `-tuple j s 2 e(I) ^ s(i) = s(j)g: Projection - i1;:::;ik e - (where ` is the arity of e, and fi1; : : : ; ikg [1 : : : `]) i1;:::;ik e (I) = fs is a k-tuple j exists s0 2 e(I) s.t. for all 1 j k: s(j) = s0(ij)g: Cartesian product - e

e0 - (where n; m are the arities of e; e0) (e e0)(I) = fs is a (n + m)-tuple j exists t 2 e(I); t0 2 e0(I) s.t. for all 1 j n: s(j) = t(j) ^ for all 1 + n j (n + m): s(j) = t0(j n)g:

Union/Di erence - e [ e0 , e

e0 - (where ` is the arity of e and e0) (e [ e0)(I) = fs is a `-tuple j s 2 e(I) _ s 2 e0(I)g, (e e0)(I) = fs is a `-tuple j s 2 e(I) ^ s 62 e0(I)g.

Derived operators - (where v 2 C, ` i; j; i1; : : : ; i` m, and k1; : : : ; k` n) min(m; n), m; n are the arities of e; e0, i<>v e We introduce in this Section the formal semantics of the relational algebra dealing with null values, corresponding (modulo the zero-ary relations) to the rstorder fragment of SQL.

Let's rst recall the notation of the standard relational algebra RA (see, e.g., [ 3 ] for details). Standard relational algebra expressions over a relational schema R are built according to the inductive formation rules in the boxed expressions of Figure 3. Also in Figure 3 the semantics of an algebra expression e is inductively de ned as the transformation of database instances I { with the Standard Name Assumption { to a set of tuples e(I).

Null singleton - hNi

Selection - i=j e - (where ` is the arity of e, and i; j `) hNi(I) = f1 7! Ng: i=j e(I) = fs is a `-tuple j s 2 e(I) ^ s(i) = s(j) ^ s(i) 6= N ^ s(j) 6= Ng: Derived operators - (where v 2 C, ` i; j; i1; : : : ; i` m, and k1; : : : ; k` n) min(m; n), m; n are the arities of e; e0,

We now extend the standard relational algebra to deal with SQL nulls; we refer to it as Null Relational Algebra (RAN).

Null values have been introduced in the relational model since its inception. In order to deal with null values, Codd [ 4 ] included the special null value in the domain and adopted a three-valued logic having the third truth value unknown together with true and false. The comparison expressions in Codd's algebra are evaluated to unknown if they involve a null value, while in set operations, tuples otherwise identical but containing null values are considered to be di erent. This view is exactly what we nd in SQL where three-valued logic is used for the evaluation of WHERE clauses but independent occurrences of the null value may be identi ed in set operations { see, e.g., [ 5 ].

In order to de ne RAN from the standard relational algebra, we adopt the IN representation of a database instance where null values are explicitly present as possible elements of tuples, and with the Standard Name Assumption. Then, it is enough to let equality (and inequality) fail whenever null values are involved, since its evaluation would be unknown.

All the RA expressions are valid RAN expressions, and maintain the same semantics, with the only change in the semantics of the selection expressions i=j e and i<>j e with equality or inequality among elements in a tuple: in these cases the semantic de nitions make sure that the elements to be tested for equality (or inequality) are both di erent from the null value in order for the equality (or inequality) to succeed. The syntax of RAN expressions extends the standard RA syntax only for the additional null singleton expression hNi (and for the derived operators involving null values). Figure 4 introduces the syntax and semantics of RAN expressed in terms of its di erence with RA, including the derived operators which are added or de ned di erently in RAN. It is easy to show that the de nition of the null relational algebra exactly matches the (informal) de nition given to SQL with null values, and it generates the same practical behaviour [ 5 ]. Given a tuple t and a RAN expression e, we call models of the expression e with the answer t all the database instances I such that t 2 e(I).

Sometimes an n-ary algebra expression is intended to express a boolean statement over a database instance in the form of a denial constraint : this is done by checking whether its evaluation over the database instance returns the empty n-ary relation or not. An n-ary denial constraint e can always be reduced into a zero-ary constraint, whose boolean value is meant to be true if its evaluation contains the zero-tuple fg and false if it is empty, by considering its zero-ary projection ( ;e).

Example 1. Consider the relational schema fR=2; S=2g with the following data and the following constraints expressed in RAN: { UNIQUE constraint for R:1: 1=3 26=4 (R R) = ;; { NOT-NULL constraint for R:1: isNull(1) R = ;; { UNIQUE constraint for S:1: 1=3 26=4 (S S) = ;, { FOREIGN KEY constraint from S:2 to R:1: 2 isNotNull(2) S 1 isNotNull(1) R= ;.

It is easy to see that these constraints are all satis ed: they behave in the same way as the corresponding SQL constraints with null values.

Similarly, the query considered in the previous Section ( 1=1 2=2 R), now behaves as in SQL and it returns correctly fha; big. 4

First-order Logic with Null Values

The Null Relational Calculus (FOL") is a rst-order logic language with an explicit symbol N representing the null value, and where predicates denote tuples over subsets of the arguments instead of just their whole set of arguments. It extends classical rst-order logic in order to take into account the possibility that some of the arguments of a relation might not exist.

Given a set of predicate symbols each one associated with an arity together with the special equality binary predicate =, and a set C of constants { together forming the relational schema (or signature) R { and a set of variable symbols, terms of FOL" are variables, constants, and the special null symbol N, and formulae of FOL" are de ned by the following rules: 1. if t1; : : : ; tn are terms and R is a predicate symbol in R (di erent from the equality) of arity n, R(t1; : : : ; tn) is an atomic formula; 2. if t1; t2 are terms di erent from N, =(t1; t2) is an atomic formula; 3. if ' and '0 are formulae, then :', ' ^ '0 and ' _ '0 are formulae; 4. if ' is a formula and x is a variable, then 9x' and 8x' are formulae. Note that the syntax of FOL" corresponds to a classical rst-order logic with equality, with the only addition of the special null symbol N which may appear in atomic formulae (but not in equalities). As usual, a formula is ground if it does not contain any variable symbol.

The semantics of FOL" formulae is given in terms of database instances of type I", also called interpretations. As usual, an interpretation I" includes an interpretation domain and it associates each relation symbol R of arity n in the signature to a set of n-tuples I"(R) { i.e., a set of partial functions with range in { and each constant symbol in C to a domain value in (we do not consider here the Standard Name Assumption). The equality predicate is interpreted as the classical equality over the domain . It is easy to see that if n-tuples were just total functions, then an interpretation I" would correspond exactly to a classical rst-order interpretation.

The de nition of satisfaction and entailment in FOL" is exactly the same as in classical rst-order logic with equality over the signature R, with the only di erence lying on the truth value of atomic formulae which involve partial functions instead of total functions because of the possible presence of null values in atomic formulae. As usual, an interpretation I" satisfying a formula is called a model of the formula.

An interpretation I" and a valuation function for variable symbols satisfy an atomic formula { written I"; j=FOL" R(t1; : : : ; tn) { i there is an n-tuple 2 I"(R) such that for each i 2 [1 n]: (i) = I"(c) if ti is a constant symbol c, (i) = (ti) if ti is a variable symbol, and (i) is unde ned if ti = N. It is easy to see that the satis ability of a FOL" formula without any occurrence of the null symbol N doesn't depend on partial tuples; so its models can be characterised by classical rst-order semantics: in each model the interpretation of predicates would include only tuples represented as total functions. Example 2. The models of the FOL" formula R(a; b) ^ R(b; N) are all the interpretations I" such that I"(R) includes at least the tuples f1 7! a; 2 7! bg and f1 7! bg. 4.1

Characterisation in classical First-order Logic Given a signature R, let's consider a classical rst-order logic language with equality (FOL) over the decomposed signature Re, as it has been de ned in Section 2; FOL has a classical semantics with models of type I} and it does not deal with null values directly. In this Section we show that FOL" over R and FOL over Re are equally expressive, namely that for every formula in FOL" over the signature R there is a corresponding formula in FOL over the decomposed signature Re, such that the two formulae have isomorphic models, and that for every formula in FOL" oovveerr tthhee sdiegcnoamtuproeseRd,sisguncahtutrheatReththeetrweois faorcmourrlaesephonavdeing formula in FOL isomorphic models.

As we discussed before, the isomorphism between the interpretations IN and I} is based on the fact that each jAj-tuple in the relation I}(ReA) corresponds exactly to the n-tuple in IN(R) having non-null values only in the positional arguments speci ed in A, with the same values and in the same relative positional order.

In order to relate the two logics, we de ne a bijective translation function f ( ) (and its inverse f 1( )) which maps FOL" formulae into FOL formulae (and vice-versa).

8 De nition 1 (Bijective translation f ).

f ( ) is a bijective function from FOL" formulae over the signature R to FOL formulae over the signature Re, de ned as follows.

{ f (R(t1; : : : ; tn)) = Refi1;:::;ikg(ti1 ; : : : ; tik ), where R 2 R an n-ary relation, fi1; : : : ; ikg = fj 2 [1 n] j ti is not Ng. Obviously: f 1(Refi1;:::;ikg(ti1 ; : : : ; tik )) = R(t01; : : : ; t0n), where fi1; : : : ; ikg [1 n] and t0ij = tj for j = 1; : : : ; k , and t0j is N for j 2 [1 n] n fi1; : : : ; ikg.

In both the direct and inverse cases we assume i1; : : : ; ik in ascending order. { The translation of equality atoms and non atomic formulae is the identity transformation inductively de ned on top of the above translation of atomic formulae.

It is easy to see that the size of a translated formula (via either the f ( ) or the f 1( ) translation) is linearly bounded by the size of the input formula, assuming the signatures R and Re xed; the signature Re can be exponentially larger than the signature R, the exponent being the maximum arity of the relations in R. Example 3. The FOL" formula 9x:R(a; x)^R(x; N)^R(N; N) over the signature R is translated as the FOL formula 9x:Ref1;2g(a; x) ^ Ref1g(x) ^ Refg over the decomposed signature Re, and vice-versa.

Let's show now that the above bijective translation preserves the models of the formulae, modulo the isomorphism among models presented in Section 2. Theorem 1. Let ' be a FOL" formula over the signature R, and 'e a FOL formula over the signature Re. Then for any (database) instance I: I"; j=FOL" ' the formulae ' and"'ea.nSdinFceOsLy,nwtaexsahnodwstehmaatntthicessotaftneomne-nattohmoilcd ffoorrmtuhleaeatiso mthiec same for both FOL cases. The case of equality can be easily proved by noticing that N is not allowed to appear among its arguments.

We focus on ground atomic formulae, since the valuation function for variable symbols is the same in the pre- and the post-conditions. 1. Let R(t1; : : : ; tn) be a FOL" ground atomic formula where A = fi1; : : : ; ikg [1 n] is the set of its positional arguments for which ti is not N. Then f (R(t1; : : : ; tn)) = Re(ti1 ; : : : ; tik ).

By the assumption I"; (j)j==FOLI""(Rtj()t1f;o:r: :j; tn2); Ath,earnefdore there is a tuple 2 I"(R) such that is unde ned for j 62 A. By de nition of I}, ( (i1); : : : ; (ik)) 2 I}(Re) therefore I}; j= ReA(ti1 ; : : : ; tik ).

9 2. Let Refi1;:::;ikg(t1; : : : ; tk) be a FOL" ground atomic formula with fi1; : : : ; ikg [1 n]; then f 1(Refi1;:::;ikg(t1; : : : ; tk)) = R(t01; : : : ; t0n), where fi1; : : : ; ikg [1 n] and t0ij = tj for j = 1; : : : ; k and t0j = N for j 2 [1 n] n fi1; : : : ; ikg. By assumption I}; j= Refi1;:::;ikg(t1; : : : ; tk), therefore (I}(t1); : : : ; I}(tk)) 2 I}(Refi1;:::;ikg). By de nition of I" there is a tuple 2 I"(R) s.t. (ij ) = I}(tj ) for j 2 fi1; : : : ; ikg and unde ned otherwise. Therefore I"; j=FOL" R(t01; : : : ; t0n). tu

Domain Independent fragment of FOL" with Standard Names In this Section we introduce the domain independent fragment of FOL" with the Standard Name Assumption, and we analyse its properties.

De nition 2 (Domain Independence). A FOL" closed formula ' is domain independent if for every two interpretations I = h I ; I( )i and J = h J ; J ( )i, which agree on the interpretation of relation symbols and constant symbols { i.e. I( ) = J ( ) { but disagree on the interpretation domains I and J : I j= ' The domain independent fragment of FOL" includes only the domain independent formulae of FOL".

It is easy to see that the domain independent fragment of FOL" can be characterised with the safe-range syntactic fragment of FOL": intuitively, a formula is safe-range if and only if its variables are bounded by positive predicates or equalities { for the exact syntactical de nition see, e.g., [ 3 ]. Due to the strong semantic equivalence expressed in Theorem 1 and due to the fact the the bijection f ( ) preserves the syntactic structure of the formulae, we can reuse the results about safe-range transformations and domain independence usually holding for classical FOL. To check whether a formula is safe-range, the formula is transformed into a logically equivalent safe-range normal form and its range restriction is computed according to a set of syntax based rules; the range restriction of a formula is a subset of its free variables, and if it coincides with the free variables then the formula is said to be safe-range [ 3 ]. In addition to that, the following theorem on domain independence holds.

Theorem 2. Any safe-range FOL" formula is domain independent, and any domain independent FOL" formula can be transformed into a logically equivalent safe-range FOL" formula.

We focus now on the domain independent fragment of FOL" with the Standard Name Assumption. We observe that an interpretation is a model of a formula in the domain independent fragment of FOL" with the Standard Name Assumption if and only if the interpretation which agrees on the interpretation of relation and constant symbols but with the interpretation domain equal to the set of standard names C is a model of the formula. Therefore, in the following when dealing with the domain independent fragment of FOL" with the Standard 10 Name Assumption we can just consider interpretations with the interpretation domain equal to C.

We now show that we can weaken the Standard Name Assumption by just assuming Unique Names, without changing the certain answers. An interpretation I satis es the Unique Name Assumption if I(a) 6= I(b) for any di erent a; b 2 C. We observe that an interpretation is a model of a FOL" formula with the Standard Name Assumption if and only if the interpretation obtained by homomorphically transform the standard names with arbitrary domain elements is a model of the formula; this latter interpretation clearly satis es the Unique Name Assumption. This observation allows us to freely interchange the Standard Name and the Unique Name Assumptions; this is of practical advantage, since we can easily encode the Unique Name Assumption in a classical rst-order logic reasoner, and most description logics reasoners do have a native Unique Name Assumption. 5

Equivalence of Algebra and Calculus

In this Section we prove the strong relation between RAN and FOL". Theorem 3. The RAN relational algebra with nulls and the domain independent fragment of FOL" with the Standard Name Assumption are equally expressive.

The notion of equal expressivity is captured by the following two propositions. Proposition 1. Let e be an arbitrary RAN expression of arity n, and t an arbitrary n-tuple as a total function with values taken from the set C [ fNg. There is a function (e; t) translating e with respect to t into a closed safe-range FOL" formula, such that for any instance I with the Standard Name Assumption: t 2 e(IN) if and only if

I" j=FOL" (e; t): Proposition 2. Let ' be an arbitrary safe-range FOL" closed formula. There is a RAN expression e, such that for any instance I with the Standard Name Assumption:

I" j=FOL" '

if and only if e(IN) 6= ;:

This means that there exists a reduction from the membership problem of a tuple in the answer of a RAN expression over a database instance with null values into the satis ability problem of a closed safe-range FOL" formula over the same database (modulo the isomorphism among database instances presented in Section 2); and there exists a reduction from the satis ability problem of a closed safe-range FOL" formula over a database instance with null values into the emptiness problem of the answer of a RAN expression over the same database (modulo the isomorphism among database instances).

Example 4. Given some database instance, checking whether the tuple ha; bi or the tuple hb; Ni are in the answer of the RAN query 1=1 2=2 R (discussed in Section 1) corresponds to check the satis ability over the database instance of 11 the FOL" closed safe-range formula R(a; b) in the former case, or of the formula false in the latter case. You can observe that, as expected, also when translated in rst-order logic, it turns out that the tuple hb; Ni is not in the answer of the query 1=1 2=2 R. A constructive way to get the FOL" formulae is discussed in Section 5.1.

Example 5. Let's consider the "UNIQUE constraint for R:1" from Example 1 expressed in RAN as 1=3 26=4 (R R) = ;. The RAN expression is translated as the closed safe-range FOL" formula: 8x; y; z: R(x; y) ^ R(x; z) ! y = z, which corresponds to the classical way to express a unique constraint in rst-order logic. A constructive way to get the FOL" formula is discussed in Section 5.1. Example 6. The safe-range closed FOL" formula 9x:R(x; N) is translated as the zero-ary RAN statement isNotNull(1) isNull(2) R 6= ;. A constructive way to get the RAN statement is discussed in Section 5.2.

Example 7. Let's consider what would be the classical way to express the \FOREIGN KEY from S:2 to R:1" constraint from Example 1 in rst-order logic: 8x; y: S(x; y) ! 9z: R(y; z) :9y: (9x: S(x; y)) ^ :(9z: R(y; z)). The formula is translated as the RAN statement: 2 isNotNull(2) S 1 isNotNull(1) R = ;, which is the RAN statement we considered in Example 1. A constructive way to get the RAN statement is discussed in Section 5.2. 5.1

From Algebra to Calculus To show that the safe-range fragment of FOL" with the Standard Name Assumption captures the expressivity of RAN queries, we rst de ne explicitly a translation function which maps RAN expressions into safe-range FOL" formulae. De nition 3 (From RAN to safe-range FOL"). Let e be an arbitrary RAN expression, and t an arbitrary tuple of the same arity as e, as a total function with values taken from the set C [ fNg [ V, where V is a countable set of variable symbols. The function (e; t) translates e with respect to t into a FOL" formula according to the following inductive de nition: { for any R=` 2 R,

(=(t(1); v) if t(1) 6= N { (hvi; t) ; false otherwise

(T; t) ; R(t(1); : : : ; t(`)) { { { (hNi; t) ; ( i = v e; t) ; ( i = j e; t) ; (false if t(1) 6= N true otherwise ( (e; t) ^ =(t(i); v) if t(i) 6= N

false otherwise ( (e; t) ^ =(t(i); t(j)) if t(i); t(j) 6= N false otherwise

12 { {

( i1 ik e; t) ; 9x1 xn WH f1 ngnfi1 ikg (e; tH ) where xi are fresh variable symbols and tH is a sequence of n terms de ned as:

: <>8t(i) if i 2 fi1; : : : ; ikg tH (i) = N if i 2 H

>:xi otherwise (e1 e2; t) ; (e1; t) ^ (e2; t0) where n1; n2 are the arity of e1; e2 respectively, and t0 is a n2-ary tuple function s.t. t0(i) = t(n1 + i) for 1 i n2 (e1 [ e2; t) ; (e1 e2; t) ; (e1; t) _ (e2; t) (e1; t) ^ : (e2; t)

We can now proceed with the proof of Proposition 1.

Proof (Proposition 1). We prove the proposition by induction on the expression e that t 2 e(IN) i I" j=FOL" (e; t). For the sake of brevity we demonstrate the arguments for two of the signi cant cases.

Let e be R 2 R of arity n and t a tuple s.t. t 2 R(IN) where fi1; : : : ; ikg are the indexes for which t(i) = N. By De nition 3, (R; t) ; R(t). Since t 2 R(IN), then t 2 IN(R), so there is a partial function t0 2 I"(R) s.t. t0(i) is unde ned for i 2 fi1; : : : ; ikg and t0(i) = t(i) otherwise. Therefore I" j=FOL" R(t). On the other hand, if I" j=FOL" R(t) then there is a partial function t0 2 I"(R) s.t. t0(i) is unde ned for i 2 fi1; : : : ; ikg and t0(i) = I"(t(i)) otherwise. Since I is an interpretation with the Standard Name Assumption, I"(t(i)) = t(i); therefore t 2 IN(R) and t 2 R(IN).

Let t 2 ( i1 ik e)(IN), then there is t0 of arity n s.t. t0 2 e(IN) and t(j) = t0(ij ) for i = 1 : : : k. By the recursive hypothesis, I" j=FOL" (e; t0). Let H0 [1 n] n fi1; ; ikg be the sets of indexes for which t0( ) is N; then I" j=FOL" 9x1 xn (e; tH0 ), so I" j=FOL" 9x1 xn WH f1 ngnfi1 ikg (e; tH ) because H0 is one of these disjuncts. For the other direction, let us assume that I" j=FOL" ( i1 ik e; t), then I" j=FOL" 9x1 xn WH [1 n]nfi1 ikg (e; tH ), so there is H0 [1 n] n fi1 ikg and assignment for variables in tH0 s.t. I" j=FOL" (e; (tH0 )). By the inductive hypothesis there is a tuple t0 corresponding to (tH0 ) s.t. t0 2 e(IN); by construction of tH0 , t(i) = (tH0 (i)) for i 2 fi1; : : : ; ikg therefore t 2 i1 ik e (IN). tu 5.2

From Calculus to Algebra To show that RAN queries capture the expressivity of the safe-range fragment of FOL" with the Standard Name Assumption, we rst establish two intermediate results supporting that FOL" queries can be expressed in (standard) relational algebra.

Lemma 1. Let e be an expression in the standard relational algebra RA over the e decomposed relational schema Re. There is a RAN expression e over the relational schema R, such that for any instance I: ee(I}) = e(IN)

13 Proof. The RAN expression e is the same expression ee where each basic relation Refi1;:::;ikg with R of arity n is substituted by the expression

Refi1;:::;ikg ;

i1;:::;ik ( isNotNull(fi1;:::;ikg) ( isNull([1 n]nfi1;:::;ikg) R)) where isNull(i1;:::;ik) e is a shorthand for isNull(i1) : : : isNull(ik) e.

The proof is by induction on the expression e where the basic cases are the unary singleton - which is the same as RAN since there are no nulls - and any basic relation Refi1;:::;ikg where R 2 R an n-ary relation. Compound expressions satisfy the equality because of the inductive assumption and the fact that the algebraic operators (in absence of null values) in RAN and RA have the same de nition.

Therefore we just need to show that

Refi1;:::;ikg(I}) = i1;:::;ik ( isNotNull(i1;:::;ik) ( isNull(j1;:::;j`) R))(IN) however, is true by the de nition of I} and IN as shown in Section 2: Lemma 2. Let be a safe-range FOL closed formula. There is an RA expression e of arity zero, such that for any instance I} with the Standard Name e Assumption:

I} j=FOL

if and only if ee(I}) 6= ; Proof. Since null values are not present neither in FOL nor RA, the lemma derives from the equivalence between safe-range relational calculus queries and relational algebra (see e.g. [ 3 ]). tu

Now we can prove Proposition 2 using the above lemmata.

Proof (Proposition 2). Given a safe-range FOL" closed formula ', by Theorem 1, there is a safe-range FOL formula 'e such that I" j=FOL" ' i I} j=FOL 'e. By restricting to instances with the Standard Name Assumption, we can use Lemma 2 to conclude that there is an RA expression ee of arity zero such that I} j=FOL 'e i ee(I}) 6= ;; therefore I" j=FOL" ' i ee(I}) 6= ; Finally we use Lemma 1 to show that there is an RAN expression e such that ee(I}) 6= ; i e(IN) 6= ;; which enables us to conclude that I" j=FOL" ' i e(IN) 6= ;. tu 6

Conclusions

Since their inception, SQL null values have been at the centre of long discussions about their real meaning and their formal semantics (see, e.g., the recent thread 14 in SIGMOD Record [ 6,7 ]). The vast majority of logic based approaches consider nulls as values with an unknown interpretation (i.e., a value exists but it is not known) and they model them as existential variables (e.g. naive tables [ 2 ] and all the works inspired by [ 8 ]). In spite of the fact that these works have their merits and provide a well founded characterisation of incomplete information in databases, they diverge from SQL standard so they are outside the scope of our approach. We have shown that null values { when de ned as in SQL with the three-valued logic { should be interpreted as nonexistent values, and that such null values do not introduce any incompleteness in the data.

To the best of our knowledge this is the rst proposal to characterise the SQL semantics of null values by means of relational calculus. Moreover, we do properly extend Codd's theorem, stating the equivalence of the relational algebra with the relational calculus, to the case in which null values with the original SQL semantics are added to the languages.

It is worth to notice that the use of sub-relations to model the absence of values (shown in Section 4.1) is similar in spirit to the horizontal decomposition summarised e.g. in [ 1 ]. However, classically horizontal decomposition is more a modelling paradigm and must be performed \by hand". In our case, the decomposition is transparent to the modeller/user; in fact it is rather a technical mean in order to reduce our novel logic FOL" to standard rst-order logic FOL.

The SQL semantics for null values is also re ected in the satis ability of integrity constraints (e.g. foreign keys), expressed usually as denial constraints using the query algebra itself. However, SQL:1999 introduced the possibility of specifying alternative semantics for the constraints (not based on the actual evaluation of SQL queries). In particular, foreign keys may be given optionally a semantic where null values would be interpreted as unknown values as opposed to nonexistent (see [ 9 ]). The formalisation of this alternative semantics of null values only within constraints (but not queries) will require some more work. Note that commercial relational database systems do not support this extension.

1. Date , C.J. , Darwen , H.: Database Explorations: Essays on the Third Manifesto and

Topics . Tra ord Publishing ( 2010 )

2. Imielinski , T. , Lipski , Jr., W.: Incomplete information in relational databases . J. ACM 31 ( 4 ) ( September 1984 ) 761 { 791

3. Abiteboul , S. , Hull , R. , Vianu , V. : Foundations of Databases. Addison-Wesley ( 1995 )

4. Codd , E.F. : Extending the database relational model to capture more meaning . ACM Trans. Database Syst . 4 ( 4 ) ( 1979 ) 397 { 434

5. Date , C. : An Introduction to Database Systems. 8 edn . Addison-Wesley ( 2003 )

6. Rubinson , C. : Nulls, three-valued logic, and ambiguity in SQL: critiquing Date's critique . SIGMOD Record 36 ( 4 ) ( December 2007 ) 13 { 17

7. Grant , J. : Null values in SQL . SIGMOD Rec . 37 ( 3 ) ( September 2008 ) 23 { 25

8. Zaniolo , C. : Database relations with null values . Journal of Computer and System Sciences 28 ( 1 ) ( 1984 ) 142 { 166

9. Turker, C. , Gertz , M. : Semantic integrity support in SQL:1999 and commercial (object-)relational database management systems . The VLDB Journal 10 ( 4 ) ( 2001 ) 241 { 269