-

Synthesizing Extensional Constraints in Ontology-based Data Access

Marco Console

Maurizio Lenzerini

Riccardo Mancini

Riccardo Rosati

Marco Ruzzi

0 0 Dipartimento di Ing. Informatica, Automatica e Gestionale “Antonio Ruberti” S

Several recent techniques and tools for Ontology-based Data Access (OBDA) make use of the so-called extensional constraints (a.k.a. ABox dependencies). So far, extensional constraints have been mainly considered in a setting where data are represented in an ABox, instead of external data sources connected to the ontology through declarative mappings. Moreover, the issue of how to generate extensional constraints in practice has not been addressed yet. In this paper we first provide a formal account of the notion of extensional constraints in a full-fledged OBDA setting, where an ontology is connected to the data sources of the information system by means of mappings, and then present an approach to the automatic generation of extensional constraints in such a setting. The technique we propose is based on the use of a first-order theorem prover that checks validity of relevant formulas built over the mapping views. The experiments we have carried out in real-world OBDA projects show the effectiveness of our approach in discovering large collections of extensional constraints entailed by the OBDA specification.

Ontology-based data management [ 11 ] is a novel paradigm for managing, governing and querying the data sources of an organization by means of an ontology. One of the key aspects of ontology-based data management is ontology-based data access (OBDA), where the ontology provides a formal representation of the domain of interest, and is used as a tool provided to the user to express queries. A suitable system is able to answering such queries by reasoning on the ontology and by extracting the appropriate data from the data sources.

Query answering in OBDA is currently a hot research topic (see e.g., [ 5, 12, 4, 9, 3, 8 ]). One of the outcomes of this research is a detailed study of the complexity of query answering. In particular, the study of OBDA has focused on understanding which ontology languages allow query answering to be performed with reasonable computational complexity with respect to the size of the data. It is now well-known that efficiency of query answering cannot be guaranteed if the ontology is expressed in a logic whose expressive power exceeds the one of lightweight language [ 6 ]. With such languages, in particular with the languages of the DL-Lite family, conjunctive queries are first-order rewritable, i.e., answering (unions of) conjunctive queries, (U)CQs, expressed over the ontology can be reduced to the evaluation of a suitable first-order query (called the perfect rewriting of the original query) expressed over the data sources.

Recent works on this subject go beyond the mere rewritability property, and study optimization techniques aiming at minimizing the size of the rewritten query. Many proposals of this type assume that the query answering algorithm make use of additional knowledge with respect to the usual input, which is constituted by the TBox representing the ontology and the data at the sources. In [ 13, 15, 14 ] such additional knowledge is expressed in terms of the so-called extensional constraints, where an extensional constraint is an axiom that has the same form of a TBox assertion, but is interpreted as an integrity constraint for the ABox (i.e., the set of asserted instances). An example of an extensional constraint is an inclusion assertion A1 vext A2 specifying that the instances that have been asserted for one concept A1 in the ABox form a subset of the instances that have been asserted for the concept A2. By using such assertions, the query answering algorithm is able to prune the space of query rewriting, for example by removing those queries that are contained into other queries under the extensional constraints [ 13, 15 ]. For a very simple example, if we know that the extension of the concept ForeignStudent is a subset of the extension of the concept Student, and if the query asks for all instances of Student, then the query answering algorithm can simply avoid accessing the instances of ForeignStudent.

So far, extensional constraints have been mainly considered in the simplified setting where data in the OBDA system are represented in an ABox, i.e., they are stored in the form of a set of facts (possibly managed by a Database Management System DBMS) in the alphabet of the ontology. Conversely, in a full-fledged OBDA system, the ABox does not really exist. Rather, the TBox is mapped (through mapping assertions) to existing data sources that are tipically managed by external systems (often, DBMSs). In our experience, full-fledged OBDA is a very effective paradigm in practice, because it allows mapping the ontology to the existing data sources of the organization (which, therefore, need not to be modified). Since the data sources and the ontology are not necessarily mutually coherent, the mapping mechanisms should be able to connect any query over the data sources to the elements of the ontology. In this sense, OBDA can be seen as a form of information integration, where data are not duplicated or moved to an ABox, but remain in the data sources of the organization, continuing to serve their purposes in the applications. Once we accept that data remain at the sources, we also accept that they are subject to updates, according to the applications using them. It follows that deriving extensional constraints by inspecting the current state of data might be too expensive, or even unfeasible at all. For this reason, we follow the approach alread implicit in [ 13 ], based on static analysis of the OBDA specification. In other words, we aim at deriving the extensional constraints from the OBDA specification, independently of the specific state (i.e., on the specific data) of the current database.

We also observe that the issue of how to generate extensional constraints has not been addressed extensively. For example, while [ 13 ] discusses the idea of inducing extensional constraints from mapping assertions, it does not present any systematic techniques for doing so.

Based on the above considerations, in this paper we provide the following contributions.

– We present (Section 3) a formal account of the notion of extensional constraints in a full-fledged OBDA setting, where an ontology is connected to the data sources of the information system by means of mappings. In particular, we define the notion of valid extensional constraint in an OBDA specification. – We observe that, due to the high expressive power of the mapping language, deriving all the extensional constraints that are valid in an OBDA specification is undecidable. Therefore, we propose (Section 4) a “best effort” approach to the automatic generation of extensional constraints in such setting. The technique we present is based on the use of a first-order theorem prover that checks validity of relevant formulas built over the mapping views. – We report (Section 5) on a first set of experiments we have carried out in real-world OBDA projects. Such experiments demonstrate the effectiveness of our approach in discovering large collections of extensional constraints entailed by the OBDA specification, and prove that the use of theorem provers for this kind of task is very promising. 2

Preliminaries

An OBDA system specification B = hT ; M; Si is characterized by three components: the ontology T , the source schema S and the mapping M between the source schema and the ontology.

In this work, we assume that T is specified as a TBox in DL-LiteA [ 12 ], which is a logic of the DL-Lite family1 allowing for UCQ-rewritability of query answering, i.e., answering unions of conjunctive queries (UCQ) posed over it can be reduced to evaluation of a UCQ over the underlying data. DL-LiteA allows for specifying concepts, representing sets of objects, roles, representing binary relations between objects, and attributes, representing binary relations between objects and values. The syntax of concept, role and attribute expressions in DL-LiteA is as follows:

B C ! A j 9Q j (U ) ! B j :B

R Q ! Q j :Q ! P j P

E ! (U ) F ! T1 j V ! U j :U j Tn

In such rules, A, P , and U denote a concept name, a role name, and an attribute name, respectively. P denotes the inverse of a role. B and R are called basic concept and basic role, respectively. :B (resp. :Q, :U ) denotes the negation of a basic concept B (resp. basic role, or attribute). The concept 9Q, also called unqualified existential restriction, denotes the domain of a role Q, i.e., the set of objects that Q relates to some object. Similarly, the concept (U ) denotes the domain of an attribute U , i.e., the set of objects that U relates to some value. Conversely, (U ) denotes the range of an attribute U , i.e., the set of values to which U relates some object. T1; : : : ; Tn are unbounded pairwise disjoint predefined value-domains.

A DL-LiteA TBox T is a finite set of assertions of the form

B v C

Q v R

E v F

U v V (funct Q) (funct U ) 1 Not to be confused with the set of DLs studied in [ 2 ], which form the extended DL-Lite family, also called the DL-Litebool.

From left to right, assertions denote inclusions between concepts, roles, value-domains, attributes, and functionality on roles and on attributes, respectively. Notice that in DL-LiteA TBoxes we impose that roles and attributes occurring in functionality assertions cannot be specialized (i.e., they cannot occur in the right-hand side of inclusions).

The source schema S is a relational database schema, specifying both the structure of the relational tables in the database, and the integrity constraints that the tables should obey. A database D for S is called legal if it satisfies all the integrity constraints in S [ 1 ]. To express queries over the source schema, we use SQL. Given an n-tuple of variables x = x1; : : : ; xn, we use the notation QDB (x) for an SQL query of arity n, where xi denotes the i-th attribute in the target list of the query. To simplify notation, we assume that query attribute names are obtained by renaming attributes in the target list of the SQL query with variable symbols. When the specification of the attributes x is not necessary, we simply refer to the symbol QDB .

As for the mapping M between the ontology and the source schema, we will use a form of mapping assertion that slightly generalizes the well-known GAV mappings [ 10 ]. An extended GAV mapping assertion m is an expression of the form QDB (x) cq (x) where QDB (x) is an SQL query over S (called the view associated to m), and cq (x) is a conjunctive query over T whose variables are those in x. The intuitive meaning of a mapping assertion m of the above form is that every tuple satisfying view QDB also satisfies the ontology query cq . Note that the extended GAV mapping language has the same expressive power as ordinary GAV. However, in practice, the possibility of using more than one atom in the head of a mepping assertion greatly simplifies the task of the ontology engineer.

As for the semantics of the ontology we adopt the standard semantics based on firstorder interpretations, and we refer the reader to [ 7 ] for a formal, detailed presentation of the semantics of the whole OBDA specification.

Example 1. We now present an OBDA system specification which is an excerpt of the one we used in a real-world experimental project carried out in collaboration with the Ministry of Economy and Finance.

PublicOrg v Organization 9worksWith v Organization PublicDep v PublicOrg 9worksWith v Organization

(funct name) (funct address) The concepts in the this fragment of the ontology are Organization (denoting organizations), PublicOrg (denoting public organizations), and PublicDep (denoting public departments). The inclusion assertions in the first row state that public organizations are particular organizations, and public departments are particular public organizations. The role worksWith relates organizations that work together (second row). The attributes of the ontology are name, address, prjName. The axioms in the third column specify that name and address are functional.

The source schema S contains four relational tables.

Dept MinistryA(dep id,dep name) and Dept MinistryB(dep id,dep addr)

store data about departments belonging to MinistryA and MinistryB respectively. Works On(dep id,proj name) store data about projects carried out by departments and Cooperate(dept1,dept2) specify pairs of cooperating departments.

M1 SELECT dep idASx; dep nameASy

F ROM Dept M inistryA M2 SELECT dep idASx; dep addrASy

F ROM Dept M inistryB M3 SELECT w1:dep idasx; w2:dep idasy; w2:proj nameaszF ROM W orks Onw1; W orks Onw2; Dept M inistryAd1; Dept M inistryAd2 W HEREd1:dep id = w1:dep id AN Dd2:dep id = w2:dep id AN Dw1:proj = w2:proj

AN Dw1:dep id <> w2:dep id M3 SELECT d1:dep idasx; d2:dep idasy

F ROM Cooperatec; Dept M inistryBd1; Dept M inistryBd2 W HEREc:dept1 = d1:dep id AN Dc:dept2 = d2:dep id fx; y j PublicDep(x)^ name(x; y)g fx; y j PublicDep(x)^ address(x; y)g fx; y; z j worksWith(x; y)^ prjName(x; z)^ prjName(y; z)g fx; y j worksWith(x; y)g

The mapping M between the sources and the ontology contains the mapping assertions illustrated in Figure 1. In words, M1 relates table Dept MinistryA to the instances of concept PublicDep and their names. M2 relates table Dept MinistryB to the instances of PublicDep and their addresses; M3 specifies that departments from MinistryA that work on the same project actually work together, and therefore are mapped to worksWith, together with the indication of the project they work for (attribute prjName). Finally, M4 maps the notion of cooperation represented in table Cooperate to the role worksWith in the ontology.

Query answering in current OBDA systems [ 4 ] is performed by means of a twophase query rewriting technique: the user query is first rewritten taking into account the ontology, and then considering the mappings. The first phase, called ontology rewriting phase, encodes part of the knowledge expressed by the TBox in the rewriting. More precisely, given an OBDA specification B = hT ; M; Si and a query Q over T , an ontology rewriting of Q under B is a query Q0 over T such that the certain answers to Q0 with respect to h;; M; Si are exactly the certain answers to Q with respect to B. Once ontology rewriting has been performed, the TBox can be disregarded and the second phase, called mapping rewriting phase, can consider only mapping assertions. The outcome of this second step produces an SQL query that is evaluated directly over the sources: given and OBDA specification B = hT ; M; Si and a query Q0 over T , a mapping rewriting is a query Q00 over S such that, for every legal database D for S, the answers to Q00 over D are exactly the certain answers of Q0 with respect to h;; M; Si. For further details on OBDA query rewriting we refer the reader to e.g. [ 7 ]. 3

Extensional constraints

In this section we formally define the notion of extensional constraint in an OBDA specification.

Definition 1. Let B = hT ; M; Si be an OBDA specification. An extensional constraint for B is a statement of one of the following forms

B1 vext B2 U1 vext U2

B1 vext :B2 U1 vext :U2

Q1 vext Q2 (functext Q)

Q1 vext :Q2 (functext U ) where B1; B2 are basic concepts, Q; Q1; Q2 are roles, and U; U1; U2 are attributes.

Notice that, by virtue of the characteristics of DL-LiteA, the set of all possible extensional constraints for an OBDA specification is finite. Notice also that the syntax of extensional constraints is essentially the same as the syntax of TBox assertions in DL-LiteA. Intuitively, the difference between the two types of assertion is that, while a TBox assertion states a general condition that is to be satisfied in all the models of the TBox, an extensional constraint states a condition that the instances that have been explicitly asserted in the knowledge base must satisfy. For example, an extensional inclusion assertion between concepts (first type of assertions in the above list) specifies that the instances that have been asserted for the left-hand side concept form a subset of the instances that have been asserted for the right-hand side concept. In other words, an extensional constraint is a sort of meta-level constraint regarding the instances explicitly asserted in the knowledge base.

The question now is to specify what does it mean for an instance to be explicitly asserted in the knowledge base. Recent papers dealing with extensional constraints assume that the knowledge base is constituted by a TBox and an ABox. In that context, the explicitly asserted instances of an element of the knowledge base are those appearing in the ABox. In this paper, the notion of OBDA specification is such that the knowledge base is constituted by a TBox, a set of mappings and a database schema, and this has to be taken into account in specifying the semantics of extensional constraints.

To formally define the semantics of an extensional constraint, we make use of several notions, that we now introduce. First, let us remind the reader about the meaning of an extensional constraint in the case where we have a DL knowledge base hT ; Ai constituted by a TBox T and an ABox A. If is an extensional constraint, let us indicate with int the TBox axiom obtained by dropping the subscript ext from . For example, if is A1 vext A2, then int is simply A1 v A2. We define the A-interpetation IA for T as the interpretation for T such that a fact is true in IA if and only if 2 A. It is now immediate to define the notion of satisfaction for extensional constraints: an extensional constraint is said to hold in an ABox A if int is satisfied by IA.

Let us now turn our attention to extensional constraints in an OBDA specification. Since an extensional constraint says something about the extension of predicates, the notion of satisfaction requires to fix such extensions. In an OBDA system, the extension of predicates depends upon the data at the sources. It follows that in order to check whether an extensional constraint is satisfied or not, we should resort to a database D for S, and then verify whether the condition corresponding to the extensional constraint is satisfied by the instances that the mapping retrieves from the data sources. To make this idea precise, we introduce the notion of “ABox retrieved from a database through a mapping”.

Given an OBDA specification B = hT ; M; Si, and a database D that is legal for S, a mapping m : QDB (x) cq (x) in M is said to generate a fact P (t) (where P is a predicate in T and t is a tuple of constants of the same arity as P ) from D if (i) t is a tuple in QDDB , (ii) cq (t) is the set of facts obtained by substituting t for x in cq(x), (iii) P (t) is an atom in cq (t).

The retrieved ABox associated to B and D, denoted M(D), is the set of ABox assertions defined as fP (t) j there is m 2M that generates P (t) from Dg. With the notion of retrieved ABox, we can now specify when an extensional constraint is satisfied by an OBDA specification and a database.

Definition 2. Let B = hT ; M; Si be an OBDA specification, D a legal database for S, and an extensional constraint for B. Then is satisfied by (B; D) if holds in M(D).

The notion of satisfaction of an extensional constraint that we have just illustrated is relative to a specific database D. In OBDA, the underlying database may be subject to frequent changes. Therefore, basing our knowledge about the extensional constraints on the specific database D at hand would imply to compute the set of relevant constraints every time the database is modified. Since this is unfeasible, our approach is based on the idea of considering the problem of deriving extensional constraints as a static analysis problem, i.e., as a reasoning task over the OBDA specification. More specifically, the extensional constraints that we will use for the optimization of query rewriting will be those that are satisfied “independently” of the specific database at hand. To capture this intuition, we introduce the notion of valid extensional constraints. Definition 3. Let B = hT ; M; Si be an OBDA specification, and an extensional constraint for B. Then is valid in B (or entailed by B) if for every legal database D for S, we have that is satisfied by (B; D).

The main problem addressed in this paper is defined as follows.

Definition 4. The problem of computing the extensional closure of an OBDA specification is the following: given an OBDA specification B = hT ; M; Si, compute the set of all extensional constraints for B that are valid in B.

Obviously, at the basis of the problem of computing the extensional closure of an OBDA specification, there is an associated decision problem, defined as follows. Definition 5. Extensional constraint entailment is the following decision problem: given an OBDA specification B = hT ; M; Si, and an extensional constraint for B, check whether is valid in B.

Based on the fact the deciding containment between SQL queries is undecidable, it is immediate to verify that the extensional constraint entailment problem is undecidable. Theorem 1. Extensional constraint entailment is undecidable.

So, on the one hand we aim at computing all the extensional constraints that are valid in an OBDA specification; on the other hand, this task is undecidable. To address this problem, we could in principle decide to resort to a less powerful query language in the mapping M. For example, if we use conjunctive queries in M, then it is easy to see that computing the closure of extensional constraints becomes feasible (since query containment becomes decidable). Unfortunately, real world OBDA projects show that one needs the full expressive power of SQL (or, at least, the fragment of SQL corresponding to first-order logic queries) in modeling the mapping between the data sources and the ontology. In order to provide a solution to this problem, we propose to use a “best effort” approach, with the goal of deriving as many valid extensional constraints as possible. In the rest of this paper, we show an approach based on the use of a first-order theorem prover. 4

A method for synthesizing extensional constraints

We now describe our method for synthesizing valid extensional constraints from an OBDA specification.

Translation of source schema constraints Let S be a relational schema with integrity constraints. We translate (a subset of) the integrity constraints of S into FO-sentences. For ease of exposition, we consider the most used integrity constraints in relational databases, i.e., key constraints and foreign key constraints [ 1 ] (other kinds of constraints that can be captured by first-order sentences could be considered as well): – every key constraint key(r) = 1; : : : ; k is translated into the sentence 8x1; : : : ; xn; yk+1; : : : ; yn

(r(x1; : : : ; xn) ^ r(x1; : : : ; xk; yk+1; : : : ; yn) ! xk+1 = yk+1 ^ : : : ^ xn = yn) – every foreign key constraint r1[a1; : : : ; ak] r2[b1; : : : ; bk] is translated into the sentence 8x1; : : : ; xn(r1(x1; : : : ; xn) ! 9y1; : : : ; yn k(r2(t1; : : : ; tm))), where, for every j 2 f1; : : : ; kg, tbj = xaj (and ti = yi for the other positions). Let ic(S) be the first-order sentence corresponding to the conjunction of all the above sentences.

Translation of extensional constraints In the following, and without loss of generality, we will consider a normalized, “pure GAV” version of the set of mappings, in which the head of every mapping assertion is constituted by a single atom (and the SQL query is projected onto the attributes mentioned in such an atom). In fact, given an extended GAV mapping M, it is easy to compute (in polynomial time) a pure GAV mapping split (M) that is equivalent to M.

Given a first-order formula whose free variables are x1; : : : ; xn, and given an ntuple of terms t, we denote by (t) the formula obtained from by replacing the free variables x1; : : : ; xn with the terms t.

Given an SQL query QDB (x) used in the mapping m 2 M, there are two cases: (i) QDB (x) corresponds to a first-order query, i.e., a first-order formula with free variables x. In this case, let fol (QDB (x)) be such a first-order query; (ii) QDB (x) does not correspond to any first-order query (e.g., because of aggregation operations). In this case, we make use of a new relation name rm (i.e., a relation name not used in S) and define fol (QDB (x)) = rm(x). The intuition behind this forced translation of nonfirst-order SQL queries is to encode them with new, generic predicates (using a different predicate for every such query) that have no special property and that are used to encode queries whose semantics is “unknown”. Then, given a mapping assertion of the form QDB (x) A(x), we define fol ( ) as fol (QDB (x)).

Given a set of mappings M and a concept name A, the set of relevant mappings for A (denoted by RelMap(A; M)) is the set of mapping assertions 2 M such that the head of has the form A(x). Analogously, given a role name P , RelMap(P; M)) is the set of mapping assertions 2 M such that the head of has the form P (x1; x2), and given an attribute name U , RelMap(U; M)) is the set of mapping assertions 2 M such that the head of has the form U (x1; x2).

Given a concept A, the FOL-unfolding unf ( ; M) of a basic concept atom, basic role atom, or attribute atom with respect to a set of mappings M is defined as follows: unf (A(x); M) = W unf (9P (x); M) = W 2RelMap(A;M) fol ( )(x) unf (9P (x); M) = W 2RelMap(P;M) 9y( fol ( )(x; y)) unf (P (x1; x2); M) = W 2RelMap(P;M) 9y( fol ( )(y; x)) unf (P (x1; x2); M) = W 2RelMap(P;M) fol ( )(x1; x2) unf (U (x1; x2); M) = W 2RelMap(P;M) fol ( )(x2; x1)

2RelMap(U;M) fol ( )(x1; x2)

Then, we define the function ec that translates DL-LiteA extensional constraints with respect to a set of mappings M as follows.

– inclusion axioms between basic concepts: let be the axiom B1 v B2. Then, ec ( ; M) = 8x( unf (B1(x); M) ! unf (B2(x); M)); – disjointness axioms between basic concepts: let be the axiom B1 v :B2. Then, ec ( ; M) = 8x( unf (B1(x); M) ! : unf (B2(x); M)); – inclusion axioms between basic roles (and attributes): let be the axiom R1 v R2.

Then, ec ( ; M) = 8x; y( unf (R1(x; y); M) ! unf (R2(x; y); M)) (analogously for inclusions axioms between attributes); – disjointness axioms between basic roles (and attributes): let be the axiom R1 v :R2. Then, ec ( ; M) = 8x; y( unf (R1(x; y); M) ! : unf (R2(x; y); M)) (analogously for disjointness axioms between attributes); – role (and attribute) functionality axioms: let be the axiom (funct R). Then, ec ( ; M) = 8x; y; z( unf (R(x; y); M) ^ unf (R(x; z); M) ! y = z) (analogously for attribute functionality axioms).

Verification of extensional constraints The above first-order translation of both database integrity constraints and extensional constraints is related to the entailment of extensional constraints by the following theorem.

Theorem 2. Let B = hT ; M; Si be an OBDA specification and let be an extensional constraint defined over the alphabet of T . If ic(S) ! ec ( ; M) is a valid first-order sentence, then is valid in B.

Notice that the converse of above theorem holds in the case when: (i) all constraints on the database schema S are keys or foreign key constraints; (ii) all the SQL queries used in the mapping M correspond to first-order queries.

An (incomplete) algorithm for synthesizing extensional constraints Based on Theorem 2, we are now able to define a technique for automatically deriving a set of extensional constraints entailed by an OBDA specification B = hT ; M; Si.

The algorithm for the automatic processing of the OBDA specification and generation of the extensional constraints is the following.

Algorithm computeObdaEC

Input: an OBDA specification B = hT ; M; Si Output: a set of extensional constraints entailed by B begin

= ;; compute M0 = split (M); compute ic(S); for each extensional constraint over the alphabet of T do begin compute ec( ; M0); ask a FO theorem prover whether the sentence ic(S) ! ec( ; M0) is valid; if the FO theorem prover answers YES before a fixed timeout then = [ f g end; return end

The algorithm computeObdaEC starts by computing the pure GAV version M0 of the mapping Mough the function Split . Then, it generates all the possible extensional constraints that can be formulated in DL-LiteA over the alphabet of the TBox T ; such a set is finite and can be computed in polynomial time with respect to the size of T . For every such extensional constraint , the algorithm builds the first-order sentence ic(S) ! ec( ; M0) that encodes the entailment of with respect to B, and verifies the validity of such a formula. To solve the first-order validity problems mentioned in Theorem 2, the algorithm resorts to an external first-order theorem prover.

Of course, Theorem 1 implies that we cannot hope to actually derive the whole extensional closure of the OBDA specification: however, we can compute a sound approximation (i.e., a subset) of such an extensional closure. This incomplete solution is certainly acceptable, since extensional constraints are used to optimize reasoning about the OBDA, and even an incomplete set of such constraints can effectively be used for such a goal. 5

Experiments

We have implemented the algorithm computeObdaEC in a tool for discovering extensional constraints entailed by an OBDA specification. This tool provides a multithreading implementation which takes advantage of the modern multi-core architectures, performing several entailment tests at the same time. With respect to the algorithm presented in Section 4, our implementation only considers extensional inclusions.

We experimented the tool within a joint project between Sapienza University of Rome and the Italian Ministry of Economy and Finance. The project includes an ontology defined over an alphabet containing 164 concept names, 47 role names, 86 attribute names and a set of 263 mapping assertions involving around 60 relational tables. The full list of mapping assertions used for the experiments, along with the source database schema, are available at http://hyper.dis.uniroma1.it/ ˜quonto/dl2013/. In the same page it is possible to find the whole set of newly discovered extensional constraints (expressed in OWL functional-style syntax). We remark that no constraints are asserted on the original database. Therefore the discovery of extensional constraints is based only on mapping assertions.

We ran the tests over an 8 Core CPU machine equipped with 8 GB RAM: such an architecture allowed us to run 8 threads simultaneously. The theorem prover used was E-Prover2, configured with 1024 MB of maximum memory and a time-out of 30 seconds. We used the computeObdaEC algorithm to generate all the possible entailment problems for inclusion dependencies and submitted them to the theorem prover. The algorithm ran 798336 tests in total, and only in the 5,45% of the cases the proof exceeded the 30 seconds time out and was terminated, whereas in the 0.019% of cases the presence of non-FOL operators (e.g., use of the group by clause) prevented the formulation of the containment problem.

The experimental results show the effectiveness of the proposed methodology with respect to the goal of discovering extensional constraints: 177 extensional inclusion dependencies were discovered. The whole analysis took 46h 47m 51s. We consider this amount of time reasonable, although considerably high, because this formal analysis of mappings is conducted off-line; moreover, the set of derived extensional constraints can be easily adapted to changes in the set of mappings, by adding incrementally new entailed constraints when new mapping assertions are added. Similarly, extensional constraints can be deleted when mappings that generated it are removed from the set of mappings. 6

Conclusions

We have introduced the notion of extensional constraints in a full-fledged OBDA setting, and we have illustrated a technique for synthesizing a set of extensional constraints that are valid in an OBDA specification. Our technique is based on the use of a firstorder theorem prover, and the experiments we have carried out in real-world projects show that the approach is very promising.

In the future, we aim at continuing our work along several directions. First, we want to extend our approach to deal with the integrity contraints defined in the database schema, based on the observation that such constraints may help in deriving new extensional constraints. We also plan to check how the extensional constraints derived by our method affect the whole query answering process (using, e.g., the Prexto query rewriting algorithm [ 15 ]). The first experiments in this respect are promising. For example the rewriting of the query q(x) : senza underlying(x); ISIN (x; y) computed without taking into account extensional constraints contains 186 disjuncts, while such number drops to 18 if we consider the derived extensional constraints (the two rewritings are shown in http://hyper.dis.uniroma1.it/˜quonto/dl2013/). Finally, we want to consider even more features of SQL, and study mechanisms for using first-order theorem provers for approximating the reasoning tasks needed to derive extensional constraints.

Acknowledgments. This research has been partially supported by the EU under FP7 project Optique – Scalable End-user Access to Big Data (grant n. FP7-318338). 2 http://www.eprover.org

Abiteboul ,

Hull , and

Vianu . Foundations of Databases. Addison Wesley Publ. Co. , 1995 .

Artale ,

Calvanese ,

Kontchakov , and

Zakharyaschev . The DL-Lite family and relations . J. of Artificial Intelligence Research , 36 : 1 - 69 , 2009 .

Cal

`ı, G. Gottlob, and

Pieris . New expressive languages for ontological query answering . In Proc. of AAAI 2011 , pages 1541 - 1546 , 2011 .

Calvanese , G. De Giacomo,

Lembo ,

Lenzerini ,

Poggi ,

Rodriguez-Muro ,

Rosati ,

Ruzzi , and

D. F.

Savo . The Mastro system for ontology-based data access . Semantic Web J. , 2 ( 1 ): 43 - 53 , 2011 .

Calvanese , G. De Giacomo,

Lembo ,

Lenzerini , and

Rosati . Tractable reasoning and efficient query answering in description logics: The DL-Lite family . J. of Automated Reasoning , 39 ( 3 ): 385 - 429 , 2007 .

Calvanese , G. De Giacomo,

Lembo ,

Lenzerini , and

Rosati . Conceptual modeling for data integration . In A. T. Borgida , V.

Chaudhri , P.

Giorgini , and E. Yu, editors, Conceptual Modeling: Foundations and Applications - Essays in Honor of John Mylopoulos , volume 5600 of LNCS , pages 173 - 197 . Springer, 2009 .

Di Pinto ,

Lembo ,

Lenzerini ,

Mancini ,

Poggi ,

Rosati ,

Ruzzi , and

D. F.

Savo . Optimizing query rewriting in ontology-based data access . In Proc. of EDBT 2013 , 2013 .

Gottlob , G.

Orsi, and

Pieris . Ontological queries: Rewriting and optimization . In Proc. of ICDE 2011 , pages 2 - 13 , 2011 .

Kontchakov ,

Lutz ,

Toman ,

Wolter , and

Zakharyaschev . The combined approach to query answering in DL-Lite . In Proc. of KR 2010 , pages 247 - 257 , 2010 .

10.

Lenzerini . Data integration: A theoretical perspective . In Proc. of PODS 2002 , pages 233 - 246 , 2002 .

11.

Lenzerini . Ontology-based data management . In Proc. of CIKM 2011 , pages 5 - 6 , 2011 .

12.

Poggi ,

Lembo ,

Calvanese , G. De Giacomo,

Lenzerini , and

Rosati . Linking data to ontologies . J. on Data Semantics , X: 133 - 173 , 2008 .

13. M. Rodr´ıguez-

Muro and D.

Calvanese . Dependencies: Making ontology based data access work in practice . In Proc. of AMW 2011 , volume 749 of CEUR, ceur-ws.org , 2011 .

14. M. Rodriguez-Muro and D. Calvanese. High performance query answering over DL-Lite ontologies . In Proc. of KR 2012 , pages 308 - 318 , 2012 .

15.

Rosati . Prexto: Query rewriting under extensional constraints in DL-Lite . In Proc. of ESWC 2012 , 2012 .