=Paper=
{{Paper
|id=Vol-2373/paper-26
|storemode=property
|title=Connecting Databases and Ontologies: A Data Quality Perspective
|pdfUrl=https://ceur-ws.org/Vol-2373/paper-26.pdf
|volume=Vol-2373
|authors=Horacio Tellez Perez,Jef Wijsen
|dblpUrl=https://dblp.org/rec/conf/dlog/PerezW19
}}
==Connecting Databases and Ontologies: A Data Quality Perspective==
Connecting Databases and Ontologies: A Data Quality Perspective Horacio Tellez Perez and Jef Wijsen Département d’Informatique, University of Mons, Belgium {horacio.tellezperez, jef.wijsen}@umons.ac.be Abstract. Taking a database-theoretic perspective on the problem of mapping relational databases to ontologies, we come up with a new map- ping language that is inspired by the semijoin algebra. We illustrate the user friendliness of the mapping language by examples, and prove the de- cidability of some important reasoning problems by embedding our map- ping language into the guarded fragment of first-order logic. We argue that these reasoning problems are relevant in data quality explorations. Keywords: data quality · guarded fragment · ontology-based data access · OBDA · semijoin algebra 1 Motivation The literature contains many proposals for mapping relational databases to on- tologies. The major motivation for these proposals is probably ontology-based data access (OBDA) [20], i.e., the capability of interrogating databases by using an ontological vocabulary. The current study, however, started with a different purpose, which can be coined as ontology-based database repairing or ontology- based database cleaning. Database repairing and cleaning are approaches for deal- ing with dirty data, where dirtiness refers to the violation of integrity constraints or, more abstractly, the non-conformity to rules that the data should obey. Ide- ally, all such data rules should be declared at database design time and subse- quently enforced by the database management system. In practice, however, we seldom dispose of an exhaustive declaration of all data rules: some rules were overlooked when the database schema was conceived, while others were hidden in procedural programming code. Moreover, in the course of time, new rules may emerge because of new legislation (e.g., GDPR), while existing rules may be invalidated. Now let us assume that we have access to an ontology that talks about objects and relations that also exist in some presumably dirty database. Our hypothesis is that data quality problems may become more visible when we succeed in connecting or mapping the database to the ontology, enabling us to confront the stored data with the ontological “ground truth.” It should be mentioned here that an ontologically based approach to data quality is not a new idea: it already appeared in [19], was formalized in [10], and is mentioned in [20] as an important direction for future research. 2 H. Tellez Perez and J. Wijsen In the problem of mapping relational databases to ontologies, we are given a relational database schema (i.e., a set of relation names), a description logic vocabulary (i.e., a set of unary and binary predicate names, called concept names and role names), and a TBox in some description logic. Moreover, we are given a database-to-ontology mapping, which defines a computable function from the set of database instances (over the fixed database schema) to the set of ABoxes in the description logic. If M denotes such a mapping and db denotes a database instance that serves as input to M, then we write M(db) for the resulting ABox. In a data cleaning context, we may be interested to know, for example, whether the knowledge base (T , M(db)) is consistent, and if not, what data in db causes inconsistency. In this paper, we introduce and study a language for specifying such map- pings M, seeking a good balance between expressiveness and complexity. Four design considerations are as follows. • First, we work in a perspective where columns in relations are not only numbered, as in mathematical logic, but also named with attributes. We will not assume that real-world entities have unique identifiers. Instead, we will use tuples with attributes to identify entities. This allows us, for example, to distinguish between the actress {Lastname : Hilton, Firstname : Paris} and the entity {Hotel : Hilton, City : Paris}, which is a hotel in Paris. • Second, the language for mapping databases to ontologies will be a subset of relational algebra. This leads to a succinct syntax without first-order variables. A major convenience for end-users is that any syntactically correct combination of the algebra operators is allowed in our mapping language. This would not be achievable in predicate logic, where end-users would be troubled with syntactic restrictions like safeness and guardedness. • Third, like with description logics, a major consideration in the design of our mapping language is the balance between expressiveness and complexity. For expressiveness considerations, we allow negation in our mapping language, which is often considered useful [4]. On the other hand, the full expressive power of predicate logic would result in the undecidability of some basic reasoning problems. • Fourth, relational database schemas are often obtained from a conceptual schema expressed in the Entity-Relationship model [9] or some variant of it. In such database schemas, most database tables are in 3NF and correspond to either an entity type or a relationship type in the conceptual schema. Intuitively, concept names and role names in description logics also correspond, respectively, to entity types and relationship types. Therefore, a plausible assumption is that in a well-designed database, the same real-world entity will generally not be spread out over multiple database tables, thus reducing the need for arbitrary joins in the mapping rules of M. On the other hand, negation may be commonly needed (for example, to compute foreign students as all students except Belgian citizens). The above considerations have brought us to the semijoin algebra [11], a fragment of the relational algebra which can be translated in GF , the guarded Connecting Databases and Ontologies: A Data Quality Perspective 3 fragment of first-order logic. In terms of expressiveness, our mapping language is incomparable with the commonly used language of GLAV mappings. This paper is organized as follows. The next section discusses related work. Section 3 illustrates the concepts of this paper by means of a simple example. Section 4 introduces some preliminary definitions. Section 5 introduces Entity- expressions and Relationship-expressions, which are the building blocks for our mapping rules that are introduced in Section 6. The decidability of some im- portant reasoning problems is established in Section 7. Section 8 concludes the paper. 2 Related Work Starting with the seminal work by Poggi et al. [13], recent years have seen active research on disclosing relational databases to ontologies or the semantic Web [7, 14–16, 20]. The most commonly used rules used for mapping relational databases to ontologies have the form ∀x (ϕ(x) → ∃yψ(x, y)) , (1) where the left-hand side ϕ is a conjunction of atoms over the database schema, and the right-hand side ψ is a conjunction of atoms over the vocabulary (concept names and role names) of the ontology. A closed formula of the form (1) is called a GLAV mapping or, in the database literature, a tuple-generating dependency (tgd). A tgd is full if no existential quantifier occurs in it. A GAV tgd is a full tgd whose right-hand side is a single atom. A LAV tgd is a tgd whose left-hand side is a single atom. Bienvenue [4] uses GAV¬,6= tgds, which extend GAV tgds by allowing negated atoms and inequalities in the left-hand side. In [13], the left-hand side is allowed to be an arbitrary SQL query. Most studies in OBDA have adopted the relational database model; a recent notable exception is [5] which also considers NoSQL databases. As explained in the introduction, our incentive for studying OBDA is that it can provide an ontologically based approach to data quality. This involves identifying inconsistency and redundancy in OBDA mappings, as well as testing for other (un)desirable properties [10, 12, 13]. A recent survey on OBDA [20] mentions data quality as an important research direction. When mapping relational databases to ontologies, a difficulty is that the re- lational database model uses value-based primary keys to identify tuples, while description logics use abstract individual names to refer to objects, possibly in combination with the Unique Name Assumption. This difficulty is nicely dis- cussed in [13, p. 149], where a solution is proposed that uses Herbrand terms of the form f (a) as individual names, where f is a function symbol and a is a sequence of database values whose length is the arity of f . By allowing multiple function symbols, this solution can distinguish between fperson (Hilton, Paris) and fhotel (Hilton, Paris). Our approach resembles the latter solution, with one significant difference: instead of using function symbols, we use attribute names to distinguish between two sequences that contain the same data values. Such 4 H. Tellez Perez and J. Wijsen attribute-based representation has recently appeared in the description logic DLR+ [2]. We do not address the problem that the same entity may be identi- fied by different identifiers [8, 21]. 3 Introductory Example Before starting the technical development, we introduce our mapping language by means of a simple example. A fact ENROLLED(c, f, `, p, y) in our example database means that student (f, `) is currently enrolled in course c and took the prerequisite course p in the year y. A fact TAUGHT -BY (c, f, `, h, s) means that the course c is taught by (f, `) and takes place at every hour h during semester s. The same course can be taught more than once in a week. ENROLLED Course First Last Prerequisite Year CS402 Tom Jones CS311 2008 CS402 Tom Jones CS401 2009 TAUGHT -BY Course First Family Hour Semester CS402 David Maier Mon. 10am Spring CS402 David Maier Tue. 10am Spring We will identify all persons by their first and last names, using the attributes First and Last. The operator πFirst,Last takes the projection on First and Last. Since the table TAUGHT -BY uses the attribute Family for last names, we rename that attribute by means of the renaming operator δFamily→Last . Let S := πFirst,Last ENROLLED and T := πFirst,Last (δFamily→Last TAUGHT -BY ). Thus, S is the set of persons that are students, and T is the set of persons that are teachers. We will identify all courses by the attribute Course, which necessitates the use of the renaming operator δPrerequisite→Course . Let C := πCourse ENROLLED ∪ πCourse (δPrerequisite→Course ENROLLED) ∪ πCourse TAUGHT -BY Thus, C is the set of all courses. We are now ready to give three mapping rules for populating concept names Student, Teacher, and Course: S : Student, T : Teacher, C : Course. Our mapping language captures negation by means of the difference operator −. For example, one could declare S − T : PersonWhoDoesNotTeach. Finally, we show a mapping rule for roles. Assume we are in the spring semester, and are interested in who attends which course in the current semester. We show a mapping rule for the role name attends: S, C n σSemester =Spring TAUGHT -BY , ENROLLED : attends (2) Connecting Databases and Ontologies: A Data Quality Perspective 5 S gets all students. Next, C n σSemester =Spring (TAUGHT -BY ) gets all courses that take place in the spring semester. Technically, n is the semijoin operator, whose effect is to return those courses in C that join with some tuple in the selection σSemester =Spring TAUGHT -BY . Then, the third argument ENROLLED specifies that a student in the first argument has to be related to a course in the second argument if they occur together in a same tuple of ENROLLED. The last argument, attends, specifies the role name for student-course pairs so obtained. 4 Preliminaries Preliminaries from database theory. We assume a denumerable set att of at- tributes, a denumerable set dom of constants, and a denumerable set relname of relation names. We assume a total order ≤att on att. We assume a total function sort with domain relname that maps every relation name to a finite set of attributes. Let U be a finite set of attributes. A tuple over U is a total mapping from U to dom. A relation over U is a finite set of tuples over U . An at- tribute renaming for U is a total injective function from U to att. We write A1 , A2 , . . . , An → B1 , B2 , . . . , Bn for the attribute renaming f such that f (Ai ) = Bi for i ∈ {1, . . . , n} and f is the identity on other attributes. If t is a tuple over U and f is an attribute renaming, then f (t) denotes the tuple s over {f (A) | A ∈ U } such that for every A ∈ U , s(f (A)) = t(A). For example, if t = {A : a, B : b, C : c} and f = AB → BD, then f (t) = {B : a, D : b, C : c}. A database schema is a finite set of relation names. The following definitions are relative to a fixed database schema. A database instance db associates, to each relation name R, a finite relation over sort(R), denoted Rdb . A database instance is also called a database. Relational algebra. The operations of the relational algebra [1] are selection σ, n, semijoin n, renaming δ, union ∪, and difference −. projection π, (natural) join o Conditions in selections can be equalities between attribute values and constants, that is, σA=B and σA=c . A projection πX E takes the projection of E on the set X of attributes. A renaming δf E, where f is an attribute renaming, applies f to all tuples in E. A join E on F returns all tuples that can be constructed by taking the union of two tuples, one from E and one from F , that agree on their common attributes. A semijoin E n F returns every tuple of E that agrees with some tuple of F on their common attributes. In the full relational algebra, n is not a primitive operator, because it can be expressed as a projection of a join: En F ≡ πsort(E) (E on F ). However, semijoin is a primitive operator in the semijoin algebra, which allows semijoins but disallows joins. The formal semantics of all operators can be found in [1]; they define eval (E, db), the relation to which an algebra expression E on a database db evaluates. The guarded fragment of first-order logic. We define GF as the following restric- tion of predicate calculus, with equality: 6 H. Tellez Perez and J. Wijsen – every quantifier-free formula belongs to GF ; – if ϕ(x, y) belongs to GF and R(x, y) is a relation atom in which all free variables of ϕ actually occur, then the formulas ∃y (R(x, y) ∧ ϕ(x, y)) and ∀y (R(x, y) → ϕ(x, y)) belong to GF ; and – GF is closed under ∧, ∨, ¬, →, ↔. A first-order formula is called guarded if it belongs to GF . When we use formulas in predicate logic as database queries, we will make sure that these formulas are domain-independent [1, Definition 5.3.7], and we assume that constant symbols occurring in these formulas are interpreted as themselves, which incorporates the Unique Name Assumption. 5 Entity-Expressions and Relationship-Expressions In this section, we introduce Entity-expressions and Relationship-expressions, which will be used in Section 6 to construct mapping rules. The following defi- nitions are relative to a fixed database schema and description logic vocabulary (i.e., a finite set of concept names and role names). Definition 1 (Entity-Expression). Entity-expressions (EEs) are recursively defined as follows: 1. Every relation name is an EE. 2. If E is an EE and X ⊆ sort(E), then πX E is an EE with sort(πX E) = X. 3. If E is an EE and f is an attribute renaming for sort(E), then δf E is an EE with sort(δf E) = {f (A) | A ∈ sort(E)}. 4. If E is an EE, A, B ∈ sort(E), and c ∈ dom, then σA=c E and σA=B E are EEs with sort(σA=c E) = sort(σA=B E) = sort(E). 5. If E1 and E2 are EEs such that sort(E1 ) = sort(E2 ), then E1 ∪ E2 and E1 − E2 are EEs with sort(E1 ∪ E2 ) = sort(E1 − E2 ) = sort(E1 ). 6. If E1 and E2 are EEs, then E1 nE2 is an EE with sort(E1 n E2 ) = sort(E1 ). Note that Entity-expressions cannot use the join operator o n. The fragment of relational algebra that replaces the join operator o n with the semijoin operator n is known as the semijoin algebra. An important result by Leinders et al. [11] states that the semijoin algebra is contained in GF . Our setting slightly differs from this earlier work because we have attribute renamings δf and selections of the form σA=c , both of which are not present in [11]. The proof of the following Theorem 1 translates Entity-expressions in domain-independent formulas in the guarded fragment. It differs from [11] in that it uses constants and does not use the formulas Gk (x1 , . . . , xk ) introduced by Leinders et al. for defining the guarded k-tuples of a structure. Theorem 1. For every Entity-expression E with sort(E) = {A1 , . . . , An }, it is possible to construct a domain-independent formula ϕ(x1 , . . . , xn ) in GF such that for every database db, for all a1 , . . . , an ∈ dom, {A1 : a1 , . . . , An : an } ∈ eval (E, db) if and only if db |= ϕ(a1 , . . . , an ). Furthermore, ϕ can be constructed as a disjunction of formulas in GF , all of the form ∃y (R(x, y) ∧ ψ(x, y)). Connecting Databases and Ontologies: A Data Quality Perspective 7 The join operator o n can be used in Relationship-expressions, which captures a common intuition that relationships are places where entities “join.” Definition 2 (Relationship-Expression). Relationship-expressions (REs) are recursively defined as follows: – Every Entity-expression is an RE. – If E1 and E2 are REs, then E1 o n E2 is an RE. – The set of REs is closed under the operators σA=c , σA=B , δf , ∪, and −. Note that the set of Relationship-expressions is not closed under projection or semijoin; for example, T n (R o n S) and πAB (R o n S) are not Relationship- expressions. In this way, Theorem 1 remains valid if we replace “Entity-expression” with “Relationship-expression” in its statement. 6 The Mapping Language We will now introduce the notion of Database-to-ABox Dependency (DAD). From here on, all definitions are relative to a database schema S, a descrip- tion logic vocabulary C ∪ R (i.e., a set of concept names and role names), and a description logic DL. A DAD can be of two sorts: a Concept DAD (CDAD) takes as input a database and returns a set of concept assertions; a Role DAD (RDAD) takes as input a database and returns a set of role assertions. CDAD The following definition introduces the syntax and semantics for a Con- cept DAD. The semantics of CDAD relies on a function ι from the set of all tuples to I, the set of individual names. Definition 3 (Concept DAD). A Concept DAD (CDAD) is an expression E : C where E is an Entity-expression (over the database schema S) and C is a concept name in C. We assume a denumerable set I of individual names. We assume an injective function ι from the set of all tuples (taken over all finite subsets of att) to the set of individual names. Let db be a database. The set of concept assertions generated by E : C from db is the following: { ι(t) : C | t ∈ eval (E, db) }. Note that if t1 = {Lastname : Hilton, Firstname : Paris} and t2 = {Hotel : Hilton, City : Paris}, then ι(t1 ) 6= ι(t2 ), because ι is injective and t1 6= t2 . RDAD The syntax for a Role DAD is slightly more complex: it is a sequence of two Entity-expressions, one Relationship-expression, and a role name r in R. In- formally, given a database, such a Role DAD generates a role assertion (a, b) : r whenever a and b belong, respectively, to the result of the first and the sec- ond Entity-expression, and together fit the Relationship-expression. We give an example, and then provide a formal definition. 8 H. Tellez Perez and J. Wijsen Example 1. Consider the following data from the Mathematics Genealogy Project at http://www.genealogy.ams.org. PHD First Last Year AdvisorFirst AdvisorLast Jan Chomicki 1990 Tomasz Imielinski Tomasz Imielinski 1981 Witold Lipski Witold Lipski 1968 Wiktor Marek Assume that no two distinct persons in this database agree on their first and last names. Let f be a renaming such that f (First) = AdvisorFirst and f (Last) = AdvisorLast. Thus, the inverse of f , denoted f −1 , maps AdvisorFirst and AdvisorLast to, respectively, First and Last. The following Entity-expression P gets first and last names of all persons in the database: P := π{First,Last} PHD ∪ δf −1 π{AdvisorFirst,AdvisorLast} PHD . It is significant to note that Wiktor Marek will be added with attributes First and Last, even though he does not appear with these attributes in the PHD table. The following RDAD populates the role name SupervisedBy: [P, P/f , PHD] : SupervisedBy. (3) Given a database db, this rule will add a role assertion (ι(s), ι(t)) : SupervisedBy to the ABox whenever s, t ∈ eval (P, db) such that some tuple of eval (PHD, db) includes both s and f (t). For example, if s0 = {First : Witold, Last : Lipski} and t0 = {First : Wiktor, Last : Marek}, then (ι(s0 ), ι(t0 )) : SupervisedBy is added to the ABox because s0 ∪ f (t0 ) = {First : Witold, Last : Lipski, AdvisorFirst : Wiktor, AdvisorLast : Marek} is included in the last tuple of the PHD table. Definition 4 (Role DAD). A Role DAD (RDAD) is an expression of the form [E1 /f1 , E2 /f2 , E] : r where E1 and E2 are Entity-expressions, f1 and f2 are attribute renamings, E is a Relationship-expression such that sort(δf1 E1 ) ∪ sort(δf2 E2 ) ⊆ sort(E), and r is a role name in R. If f1 or f2 is the identity, it can be omitted. Such an RDAD is called join-free if o n does not occur in it (but n can occur). For example, the RDADs (2) and (3) are both join-free. As for the semantics, the set of role assertions generated by [E1 /f1 , E2 /f2 , E] : r from a database db is the following: { (ι(t1 ), ι(t2 )) : r | t1 ∈ eval (E1 , db), t2 ∈ eval (E2 , db), and f1 (t1 ) ∪ f2 (t2 ) ⊆ t for some t ∈ eval (E, db) }. 7 Reasoning Problems We now move from a single CDAD or a single RDAD to sets of CDADs and RDADs, and introduce some reasoning problems. Connecting Databases and Ontologies: A Data Quality Perspective 9 Definition 5. Let db be a database. Let M be a set of CDADs and RDADs. We write M(db) for the smallest ABox that contains all concept and role assertions generated from db by the CDADs and RDADs in M. A CDAD is said to be active on db if it generates at least one concept assertion from db; an RDAD is active on db if it generates at least one role assertion from db. In the following definition, one may think of Σ as a set of database con- straints. However, when studying problems like Satisfiability (see below), we may add to Σ some desirable properties, like ∃xR(x) if we are asking for a satisfying database in which R is nonempty. Definition 6. A DB2KB (or OBDA specification) is a triple (Σ, M, T ) where – Σ is a set of closed first-order formulas over the database schema; – M is a set of CDADs and RDADs; and – T is a DL TBox. Console and Lenzerini [10] introduced the notions of faithfulness and protec- tion for characterizing data quality in OBDA. These notions are recalled next, together with the well-known notion of satisfiability. For aesthetic reasons, we state all questions in the form “Is there a database such that. . . ?”. Therefore, we are asking for the complement of faitfulness and protection as defined in [10]. Finally, the notion of global-consistency appeared in [12]. INPUT: A DB2KB (Σ, M, T ). QUESTIONS: – Satisfiability: Is there a database db such that db |= Σ and the knowledge base (T , M(db)) is consistent? – Non-Faithfulness: Is there a database db such that the knowledge base (T , M(db)) is consistent but db 6|= Σ? – Non-Protection: Is there a database db such that db |= Σ but the knowledge base (T , M(db)) is inconsistent? – Global-Consistency: Is there a database db such that db |= Σ, all CDADs and RDADs of M are active on db, and the knowledge base (T , M(db)) is consistent? Informally, a “yes”-answer to Non-Protection tells us that the ontology has some constraints not implied by Σ. Recall from Section 1 that the discovery of such constraints may be significant in data quality assessments. As mentioned just in front of Definition 6, Σ may contain desirable properties in addition to database constraints. For example, for Satisfiability, we can use Σ to express that the database db must be nonempty. The above problems can be shown to be undecidable in general [10, Theo- rem 1]. The following theorem shows their decidability under some restrictions on the input, which will be discussed after the theorem. A technical crux in the proof of Theorem 2 concerns the switch from database tuples to DL individual names. 10 H. Tellez Perez and J. Wijsen Theorem 2. Satisfiability, Non-Faithfulness, Non-Protection, and Global-Consistency are decidable problems if their inputs are restricted to DB2KBs (Σ, M, T ) with the following properties: – T can be effectively expressed in GF ; – every formula in Σ is in GF ; and – all RDADs in M are join-free. Proof (Sketch). The proof shows that the four mentioned problems can be effec- tively reduced to satisfiability in GF . We first explain how the reduction deals with the function ι in Definition 3 and with M. Let U := {A1 , . . . , Am } be the set of attributes, totally ordered, such that U includes sort(E) for every E : C in M, and U includes sort(E1 ) ∪ sort(E2 ) for every [E1 /f1 , E2 /f2 , E] : r in M. Let ε be a fresh constant. For each S ⊆ U , we encode each tuple t over S as (a1 , a2 , . . . , am ) where for 1 ≤ i ≤ m, ai := t(Ai ) if Ai ∈ S, and ai := ε otherwise. For example, if U = {Lastname, Firstname, Hotel , City}, then (ε, ε, Hilton, Paris) and (Hilton, Paris, ε, ε) encode, respectively, a hotel in Paris and an actress. The reduction first uses the construction of Theorem 1 to translate Entity- and Relationship-expressions into GF . Next, the syntactic form in Theorem 1 allows us to translate CDADs and RDADs in GF . For example, for a CDAD E : C, every disjunct ∃y (R(x, y) ∧ ψ(x, y)) in E’s translation further translates into the guarded formula ∀x∀y (R(x, y) → (ψ(x, y) → C(t1 , . . . , tm ))). Here, C is a predicate symbol of arity m := |U | that uses the encoding explained and illustrated before: for 1 ≤ i ≤ m, ti := xi if Ai ∈ sort(E), and ti := ε otherwise. We next show how the reduction deals with T . By the hypothesis of the theorem, T can be expressed by a formula ϕT in GF . Of course, in this formula, concept names are unary, and role names are binary. In our reduction, predicates for concept names and role names become, respectively, m-ary and 2m-ary. To this extent, the reduction replaces in ϕT every occurrence of every variable x by x1 , . . . , xm . For example, the guarded formula ∀x (C(x) → D(x)) translates into ∀x1 · · · ∀xm (C(x1 , . . . , xm ) → D(x1 , . . . , xm )), a formula that is also guarded. t u The satisfiability problem for GF with constants is EXPTIME-complete when the arities of all relation names are fixed [17]. The EXPTIME-hard lower bound obviously carries over to the problems in Theorem 2. The EXPTIME-upper bound does not, insofar as Theorems 1 and 2 use exponential translations of CDADs and RDADs in GF . However, it is a plausible conjecture that membership in EXPTIME can be obtained along the lines of the proof of [11, Theorem 9]. We briefly discuss the restrictions in the statement of Theorem 2. The restric- tion that the input TBoxes T must be expressible in GF may be automatically fulfilled by the description logic DL under consideration. Indeed, many expres- sive description logics can be expressed in GF [18], an example being ALC [3, p. 46]. The requirement that Σ is in GF still allows expressing many inter- esting properties and database constraints, like non-emptiness of relations and inclusion dependencies, as well as all Boolean combinations of these (because GF is closed under Boolean combinations). On the other hand, GF does not in- clude common database constraints like primary keys or functional dependencies. Connecting Databases and Ontologies: A Data Quality Perspective 11 Theorem 2 imposes no restrictions on CDADs, but RDADs are restricted to be join-free. This restriction is unfortunate, because it means that all Relationship- expression that occur in RDADs must actually be Entity-expressions. Informally, join-freeness demands that whenever an RDAD puts two entities together in a role, then these entities should already occur together in some database relation. This restriction is plausibly satisfied for database schemas that are obtained from Entity-Relationship diagrams that already capture such roles by relationships (as is actually the case for our example RDADs (2) and (3)). The restriction can be prohibitive though if one wants to combine in a role two entities that are unrelated in the Entity-Relationship diagram. Anyway, our proof of Theorem 2 fails if we relax one of its hypotheses, because such relaxation would take us outside GF . Finally, we show a result telling us that database constraints can be obtained from an ontology, given a mapping M. As we argued in Section 1, this may be of interest in data cleaning applications to infer missing database constraints. Theorem 3. Let (Σ, M, T ) be a DB2KB such that Σ = ∅ and T is a DL-Lite core TBox. It is possible to construct a finite set Σ 0 of closed first-order formulas such that for every database db, db |= Σ 0 if and only if (T , M(db)) is a consistent knowledge base. Moreover, if every RDAD in M is join-free, then Σ 0 is in GF . Proof (Sketch). From [6], it follows that unsatisfiability in DL-Lite core can only arise due to some negative inclusion C v ¬D implied by the TBox that is violated in the ABox. The negative inclusion C v ¬D can be of four different forms, where A, B denote concept names, and r, s role names: A v ¬B, A v ¬∃r, ∃r v ¬A, or ∃r v ¬∃s. We show here how to deal with A v ¬B (the other cases are similar): for all CDADs E : A and F : B in M such that sort(E) = sort(F ), Σ 0 contains a formula stating emptiness of π{} (E n F ). The latter algebra expression is an Entity-expression, and thus, by Theorem 1, can be expressed in GF . t u 8 Conclusion The language of CDADs and RDADs allows expressing database-to-ontology mappings in a user-friendly way. The language is based on the semijoin algebra, which is embedded in the guarded fragment of first-order logic. This results in decidability of some important reasoning problems. Since CDADs and RDADs allow full negation, they can express mappings that are not GLAV mappings. On the other hand, the GLAV mapping ∀x∀y∀z (R(x, y) ∧ R(y, z) ∧ R(z, x) → C(x)) is not guarded and cannot be expressed as a CDAD. In future research, we plan to explore in more depth the practice of using ontological knowledge in database repairing, database cleaning, and consistent query answering. We also plan to investigate how our framework can be reconciled with DLR+ [2], a description logic tailored towards relational databases. 12 H. Tellez Perez and J. Wijsen References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. 2. A. Artale, E. Franconi, R. Peñaloza, and F. Sportelli. A decidable very expressive description logic for databases. In International Semantic Web Conference 2017, pages 37–52, 2017. 3. F. Baader, I. Horrocks, C. Lutz, and U. Sattler. An Introduction to Description Logic. Cambridge University Press, 2017. 4. M. Bienvenu. Inconsistency-tolerant ontology-based data access revisited: Taking mappings into account. In IJCAI 2018, pages 1721–1729, 2018. 5. E. Botoeva, D. Calvanese, B. Cogrel, J. Corman, and G. Xiao. A generalized framework for ontology-based data access. In AI*IA 2018, pages 166–180, 2018. 6. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reasoning, 39(3):385–429, 2007. 7. D. Calvanese and E. Franconi. First-order ontology mediated database querying via query reformulation. In S. Flesca, S. Greco, E. Masciari, and D. Saccà, editors, A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years., volume 31 of Studies in Big Data, pages 169–185. Springer International Publishing, 2018. 8. D. Calvanese, M. Giese, D. Hovland, and M. Rezk. Ontology-based integration of cross-linked datasets. In International Semantic Web Conference 2015, pages 199–216, 2015. 9. P. P. Chen. The Entity-Relationship Model – Toward a unified view of data. ACM Trans. Database Syst., 1(1):9–36, 1976. 10. M. Console and M. Lenzerini. Data quality in ontology-based data access: The case of consistency. In AAAI 2014, 2014. 11. D. Leinders, M. Marx, J. Tyszkiewicz, and J. V. den Bussche. The semijoin algebra and the guarded fragment. Journal of Logic, Language and Information, 14(3):331– 343, 2005. 12. D. Lembo, J. Mora, R. Rosati, D. F. Savo, and E. Thorstensen. Mapping analysis in ontology-based data access: Algorithms and complexity. In International Semantic Web Conference 2015, pages 217–234, 2015. 13. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. Data Semantics, 10:133–173, 2008. 14. J. F. Sequeda. Integrating relational databases with the semantic web: A reflection. In Reasoning Web 2017, pages 68–120, 2017. 15. J. F. Sequeda, S. H. Tirmizi, Ó. Corcho, and D. P. Miranker. Survey of directly mapping SQL databases to the semantic web. Knowledge Eng. Review, 26(4):445– 486, 2011. 16. D. Spanos, P. Stavrou, and N. Mitrou. Bringing relational databases into the semantic web: A survey. Semantic Web, 3(2):169–209, 2012. 17. B. ten Cate and M. Franceschet. Guarded fragments with constants. Journal of Logic, Language and Information, 14(3):281–288, 2005. 18. C. Thorne, R. Bernardi, and D. Calvanese. Designing efficient controlled languages for ontologies. In Computing Meaning: Volume 4, pages 149–173. Springer Nether- lands, Dordrecht, 2014. 19. Y. Wand and R. Y. Wang. Anchoring data quality dimensions in ontological foundations. Commun. ACM, 39(11):86–95, 1996. Connecting Databases and Ontologies: A Data Quality Perspective 13 20. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Za- kharyaschev. Ontology-based data access: A survey. In IJCAI 2018, pages 5511– 5519, 2018. 21. G. Xiao, D. Hovland, D. Bilidas, M. Rezk, M. Giese, and D. Calvanese. Efficient ontology-based data integration with canonical IRIs. In ESWC 2018, pages 697– 713, 2018.