=Paper= {{Paper |id=Vol-2373/paper-26 |storemode=property |title=Connecting Databases and Ontologies: A Data Quality Perspective |pdfUrl=https://ceur-ws.org/Vol-2373/paper-26.pdf |volume=Vol-2373 |authors=Horacio Tellez Perez,Jef Wijsen |dblpUrl=https://dblp.org/rec/conf/dlog/PerezW19 }} ==Connecting Databases and Ontologies: A Data Quality Perspective== https://ceur-ws.org/Vol-2373/paper-26.pdf
         Connecting Databases and Ontologies:
             A Data Quality Perspective

                       Horacio Tellez Perez and Jef Wijsen

             Département d’Informatique, University of Mons, Belgium
                {horacio.tellezperez, jef.wijsen}@umons.ac.be



      Abstract. Taking a database-theoretic perspective on the problem of
      mapping relational databases to ontologies, we come up with a new map-
      ping language that is inspired by the semijoin algebra. We illustrate the
      user friendliness of the mapping language by examples, and prove the de-
      cidability of some important reasoning problems by embedding our map-
      ping language into the guarded fragment of first-order logic. We argue
      that these reasoning problems are relevant in data quality explorations.


Keywords: data quality · guarded fragment · ontology-based data access ·
OBDA · semijoin algebra


1   Motivation
The literature contains many proposals for mapping relational databases to on-
tologies. The major motivation for these proposals is probably ontology-based
data access (OBDA) [20], i.e., the capability of interrogating databases by using
an ontological vocabulary. The current study, however, started with a different
purpose, which can be coined as ontology-based database repairing or ontology-
based database cleaning. Database repairing and cleaning are approaches for deal-
ing with dirty data, where dirtiness refers to the violation of integrity constraints
or, more abstractly, the non-conformity to rules that the data should obey. Ide-
ally, all such data rules should be declared at database design time and subse-
quently enforced by the database management system. In practice, however, we
seldom dispose of an exhaustive declaration of all data rules: some rules were
overlooked when the database schema was conceived, while others were hidden
in procedural programming code. Moreover, in the course of time, new rules
may emerge because of new legislation (e.g., GDPR), while existing rules may
be invalidated. Now let us assume that we have access to an ontology that talks
about objects and relations that also exist in some presumably dirty database.
Our hypothesis is that data quality problems may become more visible when
we succeed in connecting or mapping the database to the ontology, enabling us
to confront the stored data with the ontological “ground truth.” It should be
mentioned here that an ontologically based approach to data quality is not a
new idea: it already appeared in [19], was formalized in [10], and is mentioned
in [20] as an important direction for future research.
2       H. Tellez Perez and J. Wijsen

    In the problem of mapping relational databases to ontologies, we are given
a relational database schema (i.e., a set of relation names), a description logic
vocabulary (i.e., a set of unary and binary predicate names, called concept names
and role names), and a TBox in some description logic. Moreover, we are given
a database-to-ontology mapping, which defines a computable function from the
set of database instances (over the fixed database schema) to the set of ABoxes
in the description logic. If M denotes such a mapping and db denotes a database
instance that serves as input to M, then we write M(db) for the resulting ABox.
In a data cleaning context, we may be interested to know, for example, whether
the knowledge base (T , M(db)) is consistent, and if not, what data in db causes
inconsistency.
    In this paper, we introduce and study a language for specifying such map-
pings M, seeking a good balance between expressiveness and complexity. Four
design considerations are as follows.
    • First, we work in a perspective where columns in relations are not only
numbered, as in mathematical logic, but also named with attributes. We will not
assume that real-world entities have unique identifiers. Instead, we will use tuples
with attributes to identify entities. This allows us, for example, to distinguish
between the actress {Lastname : Hilton, Firstname : Paris} and the entity
{Hotel : Hilton, City : Paris}, which is a hotel in Paris.
    • Second, the language for mapping databases to ontologies will be a subset of
relational algebra. This leads to a succinct syntax without first-order variables.
A major convenience for end-users is that any syntactically correct combination
of the algebra operators is allowed in our mapping language. This would not be
achievable in predicate logic, where end-users would be troubled with syntactic
restrictions like safeness and guardedness.
    • Third, like with description logics, a major consideration in the design of
our mapping language is the balance between expressiveness and complexity.
For expressiveness considerations, we allow negation in our mapping language,
which is often considered useful [4]. On the other hand, the full expressive power
of predicate logic would result in the undecidability of some basic reasoning
problems.
    • Fourth, relational database schemas are often obtained from a conceptual
schema expressed in the Entity-Relationship model [9] or some variant of it.
In such database schemas, most database tables are in 3NF and correspond to
either an entity type or a relationship type in the conceptual schema. Intuitively,
concept names and role names in description logics also correspond, respectively,
to entity types and relationship types. Therefore, a plausible assumption is that
in a well-designed database, the same real-world entity will generally not be
spread out over multiple database tables, thus reducing the need for arbitrary
joins in the mapping rules of M. On the other hand, negation may be commonly
needed (for example, to compute foreign students as all students except Belgian
citizens).
    The above considerations have brought us to the semijoin algebra [11], a
fragment of the relational algebra which can be translated in GF , the guarded
          Connecting Databases and Ontologies: A Data Quality Perspective        3

fragment of first-order logic. In terms of expressiveness, our mapping language
is incomparable with the commonly used language of GLAV mappings.
    This paper is organized as follows. The next section discusses related work.
Section 3 illustrates the concepts of this paper by means of a simple example.
Section 4 introduces some preliminary definitions. Section 5 introduces Entity-
expressions and Relationship-expressions, which are the building blocks for our
mapping rules that are introduced in Section 6. The decidability of some im-
portant reasoning problems is established in Section 7. Section 8 concludes the
paper.


2   Related Work
Starting with the seminal work by Poggi et al. [13], recent years have seen active
research on disclosing relational databases to ontologies or the semantic Web [7,
14–16, 20]. The most commonly used rules used for mapping relational databases
to ontologies have the form

                            ∀x (ϕ(x) → ∃yψ(x, y)) ,                            (1)

where the left-hand side ϕ is a conjunction of atoms over the database schema,
and the right-hand side ψ is a conjunction of atoms over the vocabulary (concept
names and role names) of the ontology. A closed formula of the form (1) is called
a GLAV mapping or, in the database literature, a tuple-generating dependency
(tgd). A tgd is full if no existential quantifier occurs in it. A GAV tgd is a full
tgd whose right-hand side is a single atom. A LAV tgd is a tgd whose left-hand
side is a single atom. Bienvenue [4] uses GAV¬,6= tgds, which extend GAV tgds
by allowing negated atoms and inequalities in the left-hand side. In [13], the
left-hand side is allowed to be an arbitrary SQL query. Most studies in OBDA
have adopted the relational database model; a recent notable exception is [5]
which also considers NoSQL databases.
    As explained in the introduction, our incentive for studying OBDA is that
it can provide an ontologically based approach to data quality. This involves
identifying inconsistency and redundancy in OBDA mappings, as well as testing
for other (un)desirable properties [10, 12, 13]. A recent survey on OBDA [20]
mentions data quality as an important research direction.
    When mapping relational databases to ontologies, a difficulty is that the re-
lational database model uses value-based primary keys to identify tuples, while
description logics use abstract individual names to refer to objects, possibly in
combination with the Unique Name Assumption. This difficulty is nicely dis-
cussed in [13, p. 149], where a solution is proposed that uses Herbrand terms
of the form f (a) as individual names, where f is a function symbol and a is a
sequence of database values whose length is the arity of f . By allowing multiple
function symbols, this solution can distinguish between fperson (Hilton, Paris)
and fhotel (Hilton, Paris). Our approach resembles the latter solution, with one
significant difference: instead of using function symbols, we use attribute names
to distinguish between two sequences that contain the same data values. Such
4      H. Tellez Perez and J. Wijsen

attribute-based representation has recently appeared in the description logic
DLR+ [2]. We do not address the problem that the same entity may be identi-
fied by different identifiers [8, 21].


3   Introductory Example
Before starting the technical development, we introduce our mapping language
by means of a simple example. A fact ENROLLED(c, f, `, p, y) in our example
database means that student (f, `) is currently enrolled in course c and took the
prerequisite course p in the year y. A fact TAUGHT -BY (c, f, `, h, s) means that
the course c is taught by (f, `) and takes place at every hour h during semester s.
The same course can be taught more than once in a week.
         ENROLLED       Course     First   Last     Prerequisite   Year
                        CS402      Tom     Jones      CS311        2008
                        CS402      Tom     Jones      CS401        2009

         TAUGHT -BY       Course    First    Family       Hour        Semester
                          CS402     David    Maier      Mon. 10am     Spring
                          CS402     David    Maier      Tue. 10am     Spring
We will identify all persons by their first and last names, using the attributes
First and Last. The operator πFirst,Last takes the projection on First and Last.
Since the table TAUGHT -BY uses the attribute Family for last names, we
rename that attribute by means of the renaming operator δFamily→Last . Let

S := πFirst,Last ENROLLED and T := πFirst,Last (δFamily→Last TAUGHT -BY ).

Thus, S is the set of persons that are students, and T is the set of persons
that are teachers. We will identify all courses by the attribute Course, which
necessitates the use of the renaming operator δPrerequisite→Course . Let

     C := πCourse ENROLLED ∪ πCourse (δPrerequisite→Course ENROLLED)
                           ∪ πCourse TAUGHT -BY
Thus, C is the set of all courses. We are now ready to give three mapping rules
for populating concept names Student, Teacher, and Course:

                  S : Student,       T : Teacher,       C : Course.

Our mapping language captures negation by means of the difference operator −.
For example, one could declare

                        S − T : PersonWhoDoesNotTeach.

Finally, we show a mapping rule for roles. Assume we are in the spring semester,
and are interested in who attends which course in the current semester. We show
a mapping rule for the role name attends:
                                                              
        S, C n σSemester =Spring TAUGHT -BY , ENROLLED : attends             (2)
           Connecting Databases and Ontologies: A Data Quality Perspective                5

S gets all students. Next, C n σSemester =Spring (TAUGHT -BY ) gets all courses
that take place in the spring semester. Technically, n is the semijoin operator,
whose effect is to return those courses in C that join with some tuple in the
selection σSemester =Spring TAUGHT -BY . Then, the third argument ENROLLED
specifies that a student in the first argument has to be related to a course in the
second argument if they occur together in a same tuple of ENROLLED. The last
argument, attends, specifies the role name for student-course pairs so obtained.


4    Preliminaries

Preliminaries from database theory. We assume a denumerable set att of at-
tributes, a denumerable set dom of constants, and a denumerable set relname
of relation names. We assume a total order ≤att on att. We assume a total
function sort with domain relname that maps every relation name to a finite
set of attributes.
    Let U be a finite set of attributes. A tuple over U is a total mapping
from U to dom. A relation over U is a finite set of tuples over U . An at-
tribute renaming for U is a total injective function from U to att. We write
A1 , A2 , . . . , An → B1 , B2 , . . . , Bn for the attribute renaming f such that f (Ai ) =
Bi for i ∈ {1, . . . , n} and f is the identity on other attributes. If t is a tuple over U
and f is an attribute renaming, then f (t) denotes the tuple s over {f (A) | A ∈ U }
such that for every A ∈ U , s(f (A)) = t(A). For example, if t = {A : a, B : b,
C : c} and f = AB → BD, then f (t) = {B : a, D : b, C : c}.
    A database schema is a finite set of relation names. The following definitions
are relative to a fixed database schema. A database instance db associates, to
each relation name R, a finite relation over sort(R), denoted Rdb . A database
instance is also called a database.

Relational algebra. The operations of the relational algebra [1] are selection σ,
                             n, semijoin n, renaming δ, union ∪, and difference −.
projection π, (natural) join o
Conditions in selections can be equalities between attribute values and constants,
that is, σA=B and σA=c . A projection πX E takes the projection of E on the set
X of attributes. A renaming δf E, where f is an attribute renaming, applies f
to all tuples in E. A join E on F returns all tuples that can be constructed by
taking the union of two tuples, one from E and one from F , that agree on their
common attributes. A semijoin E n F returns every tuple of E that agrees with
some tuple of F on their common attributes. In the full relational algebra, n is
not a primitive operator, because it can be expressed as a projection of a join: En
F ≡ πsort(E) (E on F ). However, semijoin is a primitive operator in the semijoin
algebra, which allows semijoins but disallows joins. The formal semantics of all
operators can be found in [1]; they define eval (E, db), the relation to which an
algebra expression E on a database db evaluates.

The guarded fragment of first-order logic. We define GF as the following restric-
tion of predicate calculus, with equality:
6       H. Tellez Perez and J. Wijsen

 – every quantifier-free formula belongs to GF ;
 – if ϕ(x, y) belongs to GF and R(x, y) is a relation atom in which all free
    variables of ϕ actually occur, then the formulas ∃y (R(x, y) ∧ ϕ(x, y)) and
    ∀y (R(x, y) → ϕ(x, y)) belong to GF ; and
 – GF is closed under ∧, ∨, ¬, →, ↔.
A first-order formula is called guarded if it belongs to GF .
    When we use formulas in predicate logic as database queries, we will make
sure that these formulas are domain-independent [1, Definition 5.3.7], and we
assume that constant symbols occurring in these formulas are interpreted as
themselves, which incorporates the Unique Name Assumption.


5    Entity-Expressions and Relationship-Expressions

In this section, we introduce Entity-expressions and Relationship-expressions,
which will be used in Section 6 to construct mapping rules. The following defi-
nitions are relative to a fixed database schema and description logic vocabulary
(i.e., a finite set of concept names and role names).

Definition 1 (Entity-Expression). Entity-expressions (EEs) are recursively
defined as follows:
 1. Every relation name is an EE.
 2. If E is an EE and X ⊆ sort(E), then πX E is an EE with sort(πX E) = X.
 3. If E is an EE and f is an attribute renaming for sort(E), then δf E is an
    EE with sort(δf E) = {f (A) | A ∈ sort(E)}.
 4. If E is an EE, A, B ∈ sort(E), and c ∈ dom, then σA=c E and σA=B E are
    EEs with sort(σA=c E) = sort(σA=B E) = sort(E).
 5. If E1 and E2 are EEs such that sort(E1 ) = sort(E2 ), then E1 ∪ E2 and
    E1 − E2 are EEs with sort(E1 ∪ E2 ) = sort(E1 − E2 ) = sort(E1 ).
 6. If E1 and E2 are EEs, then E1 nE2 is an EE with sort(E1 n E2 ) = sort(E1 ).

    Note that Entity-expressions cannot use the join operator o  n. The fragment of
relational algebra that replaces the join operator o n with the semijoin operator n
is known as the semijoin algebra. An important result by Leinders et al. [11]
states that the semijoin algebra is contained in GF . Our setting slightly differs
from this earlier work because we have attribute renamings δf and selections of
the form σA=c , both of which are not present in [11]. The proof of the following
Theorem 1 translates Entity-expressions in domain-independent formulas in the
guarded fragment. It differs from [11] in that it uses constants and does not
use the formulas Gk (x1 , . . . , xk ) introduced by Leinders et al. for defining the
guarded k-tuples of a structure.

Theorem 1. For every Entity-expression E with sort(E) = {A1 , . . . , An }, it is
possible to construct a domain-independent formula ϕ(x1 , . . . , xn ) in GF such
that for every database db, for all a1 , . . . , an ∈ dom, {A1 : a1 , . . . , An : an } ∈
eval (E, db) if and only if db |= ϕ(a1 , . . . , an ). Furthermore, ϕ can be constructed
as a disjunction of formulas in GF , all of the form ∃y (R(x, y) ∧ ψ(x, y)).
          Connecting Databases and Ontologies: A Data Quality Perspective        7

   The join operator o
                     n can be used in Relationship-expressions, which captures
a common intuition that relationships are places where entities “join.”

Definition 2 (Relationship-Expression). Relationship-expressions (REs) are
recursively defined as follows:
  – Every Entity-expression is an RE.
  – If E1 and E2 are REs, then E1 o n E2 is an RE.
  – The set of REs is closed under the operators σA=c , σA=B , δf , ∪, and −.

    Note that the set of Relationship-expressions is not closed under projection
or semijoin; for example, T n (R o n S) and πAB (R o  n S) are not Relationship-
expressions. In this way, Theorem 1 remains valid if we replace “Entity-expression”
with “Relationship-expression” in its statement.


6   The Mapping Language
We will now introduce the notion of Database-to-ABox Dependency (DAD).
From here on, all definitions are relative to a database schema S, a descrip-
tion logic vocabulary C ∪ R (i.e., a set of concept names and role names), and
a description logic DL. A DAD can be of two sorts: a Concept DAD (CDAD)
takes as input a database and returns a set of concept assertions; a Role DAD
(RDAD) takes as input a database and returns a set of role assertions.

CDAD The following definition introduces the syntax and semantics for a Con-
cept DAD. The semantics of CDAD relies on a function ι from the set of all tuples
to I, the set of individual names.

Definition 3 (Concept DAD). A Concept DAD (CDAD) is an expression
E : C where E is an Entity-expression (over the database schema S) and C is a
concept name in C.
    We assume a denumerable set I of individual names. We assume an injective
function ι from the set of all tuples (taken over all finite subsets of att) to the
set of individual names. Let db be a database. The set of concept assertions
generated by E : C from db is the following:

                         { ι(t) : C | t ∈ eval (E, db) }.

   Note that if t1 = {Lastname : Hilton, Firstname : Paris} and t2 = {Hotel :
Hilton, City : Paris}, then ι(t1 ) 6= ι(t2 ), because ι is injective and t1 6= t2 .

RDAD The syntax for a Role DAD is slightly more complex: it is a sequence of
two Entity-expressions, one Relationship-expression, and a role name r in R. In-
formally, given a database, such a Role DAD generates a role assertion (a, b) : r
whenever a and b belong, respectively, to the result of the first and the sec-
ond Entity-expression, and together fit the Relationship-expression. We give an
example, and then provide a formal definition.
8       H. Tellez Perez and J. Wijsen

Example 1. Consider the following data from the Mathematics Genealogy Project
at http://www.genealogy.ams.org.
           PHD      First       Last      Year     AdvisorFirst   AdvisorLast
                    Jan      Chomicki     1990       Tomasz        Imielinski
                   Tomasz    Imielinski   1981       Witold          Lipski
                   Witold      Lipski     1968       Wiktor         Marek

    Assume that no two distinct persons in this database agree on their first
and last names. Let f be a renaming such that f (First) = AdvisorFirst and
f (Last) = AdvisorLast. Thus, the inverse of f , denoted f −1 , maps AdvisorFirst
and AdvisorLast to, respectively, First and Last. The following Entity-expression
P gets first and last names of all persons in the database:
                                                                        
          P := π{First,Last} PHD ∪ δf −1 π{AdvisorFirst,AdvisorLast} PHD .

It is significant to note that Wiktor Marek will be added with attributes First
and Last, even though he does not appear with these attributes in the PHD
table. The following RDAD populates the role name SupervisedBy:

                            [P, P/f , PHD] : SupervisedBy.                              (3)

Given a database db, this rule will add a role assertion (ι(s), ι(t)) : SupervisedBy
to the ABox whenever s, t ∈ eval (P, db) such that some tuple of eval (PHD, db)
includes both s and f (t). For example, if s0 = {First : Witold, Last : Lipski}
and t0 = {First : Wiktor, Last : Marek}, then (ι(s0 ), ι(t0 )) : SupervisedBy
is added to the ABox because s0 ∪ f (t0 ) = {First : Witold, Last : Lipski,
AdvisorFirst : Wiktor, AdvisorLast : Marek} is included in the last tuple of the
PHD table.
Definition 4 (Role DAD). A Role DAD (RDAD) is an expression of the form

                                 [E1 /f1 , E2 /f2 , E] : r

where E1 and E2 are Entity-expressions, f1 and f2 are attribute renamings, E
is a Relationship-expression such that sort(δf1 E1 ) ∪ sort(δf2 E2 ) ⊆ sort(E), and
r is a role name in R.
    If f1 or f2 is the identity, it can be omitted. Such an RDAD is called join-free
if o
   n does not occur in it (but n can occur). For example, the RDADs (2) and (3)
are both join-free. As for the semantics, the set of role assertions generated by
[E1 /f1 , E2 /f2 , E] : r from a database db is the following:

    { (ι(t1 ), ι(t2 )) : r | t1 ∈ eval (E1 , db), t2 ∈ eval (E2 , db),
                             and f1 (t1 ) ∪ f2 (t2 ) ⊆ t for some t ∈ eval (E, db) }.

7    Reasoning Problems
We now move from a single CDAD or a single RDAD to sets of CDADs and
RDADs, and introduce some reasoning problems.
          Connecting Databases and Ontologies: A Data Quality Perspective       9

Definition 5. Let db be a database. Let M be a set of CDADs and RDADs. We
write M(db) for the smallest ABox that contains all concept and role assertions
generated from db by the CDADs and RDADs in M.
   A CDAD is said to be active on db if it generates at least one concept
assertion from db; an RDAD is active on db if it generates at least one role
assertion from db.

    In the following definition, one may think of Σ as a set of database con-
straints. However, when studying problems like Satisfiability (see below), we
may add to Σ some desirable properties, like ∃xR(x) if we are asking for a
satisfying database in which R is nonempty.

Definition 6. A DB2KB (or OBDA specification) is a triple (Σ, M, T ) where
 – Σ is a set of closed first-order formulas over the database schema;
 – M is a set of CDADs and RDADs; and
 – T is a DL TBox.

    Console and Lenzerini [10] introduced the notions of faithfulness and protec-
tion for characterizing data quality in OBDA. These notions are recalled next,
together with the well-known notion of satisfiability. For aesthetic reasons, we
state all questions in the form “Is there a database such that. . . ?”. Therefore,
we are asking for the complement of faitfulness and protection as defined in [10].
Finally, the notion of global-consistency appeared in [12].

INPUT: A DB2KB (Σ, M, T ).
QUESTIONS:
   – Satisfiability: Is there a database db such that db |= Σ and the
     knowledge base (T , M(db)) is consistent?
   – Non-Faithfulness: Is there a database db such that the knowledge
     base (T , M(db)) is consistent but db 6|= Σ?
   – Non-Protection: Is there a database db such that db |= Σ but the
     knowledge base (T , M(db)) is inconsistent?
   – Global-Consistency: Is there a database db such that db |= Σ, all
     CDADs and RDADs of M are active on db, and the knowledge base
     (T , M(db)) is consistent?

    Informally, a “yes”-answer to Non-Protection tells us that the ontology
has some constraints not implied by Σ. Recall from Section 1 that the discovery
of such constraints may be significant in data quality assessments. As mentioned
just in front of Definition 6, Σ may contain desirable properties in addition to
database constraints. For example, for Satisfiability, we can use Σ to express
that the database db must be nonempty.
    The above problems can be shown to be undecidable in general [10, Theo-
rem 1]. The following theorem shows their decidability under some restrictions
on the input, which will be discussed after the theorem. A technical crux in the
proof of Theorem 2 concerns the switch from database tuples to DL individual
names.
10      H. Tellez Perez and J. Wijsen

Theorem 2. Satisfiability, Non-Faithfulness, Non-Protection, and
Global-Consistency are decidable problems if their inputs are restricted to
DB2KBs (Σ, M, T ) with the following properties:
 – T can be effectively expressed in GF ;
 – every formula in Σ is in GF ; and
 – all RDADs in M are join-free.
Proof (Sketch). The proof shows that the four mentioned problems can be effec-
tively reduced to satisfiability in GF . We first explain how the reduction deals
with the function ι in Definition 3 and with M. Let U := {A1 , . . . , Am } be
the set of attributes, totally ordered, such that U includes sort(E) for every
E : C in M, and U includes sort(E1 ) ∪ sort(E2 ) for every [E1 /f1 , E2 /f2 , E] : r
in M. Let ε be a fresh constant. For each S ⊆ U , we encode each tuple t
over S as (a1 , a2 , . . . , am ) where for 1 ≤ i ≤ m, ai := t(Ai ) if Ai ∈ S, and
ai := ε otherwise. For example, if U = {Lastname, Firstname, Hotel , City},
then (ε, ε, Hilton, Paris) and (Hilton, Paris, ε, ε) encode, respectively, a hotel
in Paris and an actress. The reduction first uses the construction of Theorem 1 to
translate Entity- and Relationship-expressions into GF . Next, the syntactic form
in Theorem 1 allows us to translate CDADs and RDADs in GF . For example, for
a CDAD E : C, every disjunct ∃y (R(x, y) ∧ ψ(x, y)) in E’s translation further
translates into the guarded formula ∀x∀y (R(x, y) → (ψ(x, y) → C(t1 , . . . , tm ))).
Here, C is a predicate symbol of arity m := |U | that uses the encoding explained
and illustrated before: for 1 ≤ i ≤ m, ti := xi if Ai ∈ sort(E), and ti := ε
otherwise. We next show how the reduction deals with T . By the hypothesis
of the theorem, T can be expressed by a formula ϕT in GF . Of course, in this
formula, concept names are unary, and role names are binary. In our reduction,
predicates for concept names and role names become, respectively, m-ary and
2m-ary. To this extent, the reduction replaces in ϕT every occurrence of every
variable x by x1 , . . . , xm . For example, the guarded formula ∀x (C(x) → D(x))
translates into ∀x1 · · · ∀xm (C(x1 , . . . , xm ) → D(x1 , . . . , xm )), a formula that is
also guarded.                                                                             t
                                                                                          u
    The satisfiability problem for GF with constants is EXPTIME-complete when
the arities of all relation names are fixed [17]. The EXPTIME-hard lower bound
obviously carries over to the problems in Theorem 2. The EXPTIME-upper bound
does not, insofar as Theorems 1 and 2 use exponential translations of CDADs
and RDADs in GF . However, it is a plausible conjecture that membership in
EXPTIME can be obtained along the lines of the proof of [11, Theorem 9].
    We briefly discuss the restrictions in the statement of Theorem 2. The restric-
tion that the input TBoxes T must be expressible in GF may be automatically
fulfilled by the description logic DL under consideration. Indeed, many expres-
sive description logics can be expressed in GF [18], an example being ALC [3,
p. 46]. The requirement that Σ is in GF still allows expressing many inter-
esting properties and database constraints, like non-emptiness of relations and
inclusion dependencies, as well as all Boolean combinations of these (because
GF is closed under Boolean combinations). On the other hand, GF does not in-
clude common database constraints like primary keys or functional dependencies.
          Connecting Databases and Ontologies: A Data Quality Perspective          11

Theorem 2 imposes no restrictions on CDADs, but RDADs are restricted to be
join-free. This restriction is unfortunate, because it means that all Relationship-
expression that occur in RDADs must actually be Entity-expressions. Informally,
join-freeness demands that whenever an RDAD puts two entities together in a
role, then these entities should already occur together in some database relation.
This restriction is plausibly satisfied for database schemas that are obtained from
Entity-Relationship diagrams that already capture such roles by relationships (as
is actually the case for our example RDADs (2) and (3)). The restriction can
be prohibitive though if one wants to combine in a role two entities that are
unrelated in the Entity-Relationship diagram. Anyway, our proof of Theorem 2
fails if we relax one of its hypotheses, because such relaxation would take us
outside GF .
    Finally, we show a result telling us that database constraints can be obtained
from an ontology, given a mapping M. As we argued in Section 1, this may be
of interest in data cleaning applications to infer missing database constraints.

Theorem 3. Let (Σ, M, T ) be a DB2KB such that Σ = ∅ and T is a DL-Lite core
TBox. It is possible to construct a finite set Σ 0 of closed first-order formulas such
that for every database db, db |= Σ 0 if and only if (T , M(db)) is a consistent
knowledge base. Moreover, if every RDAD in M is join-free, then Σ 0 is in GF .

Proof (Sketch). From [6], it follows that unsatisfiability in DL-Lite core can only
arise due to some negative inclusion C v ¬D implied by the TBox that is violated
in the ABox. The negative inclusion C v ¬D can be of four different forms, where
A, B denote concept names, and r, s role names: A v ¬B, A v ¬∃r, ∃r v ¬A, or
∃r v ¬∃s. We show here how to deal with A v ¬B (the other cases are similar):
for all CDADs E : A and F : B in M such that sort(E) = sort(F ), Σ 0 contains
a formula stating emptiness of π{} (E n F ). The latter algebra expression is an
Entity-expression, and thus, by Theorem 1, can be expressed in GF .              t
                                                                                 u



8   Conclusion

The language of CDADs and RDADs allows expressing database-to-ontology
mappings in a user-friendly way. The language is based on the semijoin algebra,
which is embedded in the guarded fragment of first-order logic. This results in
decidability of some important reasoning problems. Since CDADs and RDADs
allow full negation, they can express mappings that are not GLAV mappings. On
the other hand, the GLAV mapping ∀x∀y∀z (R(x, y) ∧ R(y, z) ∧ R(z, x) → C(x))
is not guarded and cannot be expressed as a CDAD. In future research, we plan
to explore in more depth the practice of using ontological knowledge in database
repairing, database cleaning, and consistent query answering. We also plan to
investigate how our framework can be reconciled with DLR+ [2], a description
logic tailored towards relational databases.
12      H. Tellez Perez and J. Wijsen

References
 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley,
    1995.
 2. A. Artale, E. Franconi, R. Peñaloza, and F. Sportelli. A decidable very expressive
    description logic for databases. In International Semantic Web Conference 2017,
    pages 37–52, 2017.
 3. F. Baader, I. Horrocks, C. Lutz, and U. Sattler. An Introduction to Description
    Logic. Cambridge University Press, 2017.
 4. M. Bienvenu. Inconsistency-tolerant ontology-based data access revisited: Taking
    mappings into account. In IJCAI 2018, pages 1721–1729, 2018.
 5. E. Botoeva, D. Calvanese, B. Cogrel, J. Corman, and G. Xiao. A generalized
    framework for ontology-based data access. In AI*IA 2018, pages 166–180, 2018.
 6. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable
    reasoning and efficient query answering in description logics: The DL-Lite family.
    J. Autom. Reasoning, 39(3):385–429, 2007.
 7. D. Calvanese and E. Franconi. First-order ontology mediated database querying
    via query reformulation. In S. Flesca, S. Greco, E. Masciari, and D. Saccà, editors,
    A Comprehensive Guide Through the Italian Database Research Over the Last 25
    Years., volume 31 of Studies in Big Data, pages 169–185. Springer International
    Publishing, 2018.
 8. D. Calvanese, M. Giese, D. Hovland, and M. Rezk. Ontology-based integration
    of cross-linked datasets. In International Semantic Web Conference 2015, pages
    199–216, 2015.
 9. P. P. Chen. The Entity-Relationship Model – Toward a unified view of data. ACM
    Trans. Database Syst., 1(1):9–36, 1976.
10. M. Console and M. Lenzerini. Data quality in ontology-based data access: The
    case of consistency. In AAAI 2014, 2014.
11. D. Leinders, M. Marx, J. Tyszkiewicz, and J. V. den Bussche. The semijoin algebra
    and the guarded fragment. Journal of Logic, Language and Information, 14(3):331–
    343, 2005.
12. D. Lembo, J. Mora, R. Rosati, D. F. Savo, and E. Thorstensen. Mapping analysis in
    ontology-based data access: Algorithms and complexity. In International Semantic
    Web Conference 2015, pages 217–234, 2015.
13. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati.
    Linking data to ontologies. J. Data Semantics, 10:133–173, 2008.
14. J. F. Sequeda. Integrating relational databases with the semantic web: A reflection.
    In Reasoning Web 2017, pages 68–120, 2017.
15. J. F. Sequeda, S. H. Tirmizi, Ó. Corcho, and D. P. Miranker. Survey of directly
    mapping SQL databases to the semantic web. Knowledge Eng. Review, 26(4):445–
    486, 2011.
16. D. Spanos, P. Stavrou, and N. Mitrou. Bringing relational databases into the
    semantic web: A survey. Semantic Web, 3(2):169–209, 2012.
17. B. ten Cate and M. Franceschet. Guarded fragments with constants. Journal of
    Logic, Language and Information, 14(3):281–288, 2005.
18. C. Thorne, R. Bernardi, and D. Calvanese. Designing efficient controlled languages
    for ontologies. In Computing Meaning: Volume 4, pages 149–173. Springer Nether-
    lands, Dordrecht, 2014.
19. Y. Wand and R. Y. Wang. Anchoring data quality dimensions in ontological
    foundations. Commun. ACM, 39(11):86–95, 1996.
          Connecting Databases and Ontologies: A Data Quality Perspective       13

20. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Za-
    kharyaschev. Ontology-based data access: A survey. In IJCAI 2018, pages 5511–
    5519, 2018.
21. G. Xiao, D. Hovland, D. Bilidas, M. Rezk, M. Giese, and D. Calvanese. Efficient
    ontology-based data integration with canonical IRIs. In ESWC 2018, pages 697–
    713, 2018.