=Paper=
{{Paper
|id=None
|storemode=property
|title=The Combined Approach to OBDA: Taming Role Hierarchies using Filters
|pdfUrl=https://ceur-ws.org/Vol-943/SSWS_HPCSW2012_paper2.pdf
|volume=Vol-943
|dblpUrl=https://dblp.org/rec/conf/semweb/LutzSTW12
}}
==The Combined Approach to OBDA: Taming Role Hierarchies using Filters==
The Combined Approach to OBDA: Taming Role Hierarchies using Filters Carsten Lutz1 , İnanç Seylan1 , David Toman2 , and Frank Wolter3 1 Universität Bremen, Germany {clu,seylan}@informatik.uni-bremen.de 2 Cheriton School of CS, University of Waterloo, Canada david@cs.uwaterloo.ca 3 University of Liverpool, United Kingdom wolter@liverpool.ac.uk Abstract. There are several approaches to implementing query answer- ing over instance data in the presence of an ontology that target conven- tional relational database systems (SQL databases) as their back-end. They all share the limitation that, for ontologies formulated in OWL2 QL or versions of the description logic DL-Lite that admit both role hi- erarchies and inverse roles, it seems impossible to avoid an exponential blowup of the query (and sometimes this is even provable). We consider the combined approach and propose to replace its query rewriting part with a filtering technique. This is natural from an implementation per- spective and allows us to handle both inverse roles and role hierarchies without an exponential blowup. We also carry out an experimental eval- uation that demonstrates the scalability of this approach. 1 Introduction In recent years, ontology-based data access (OBDA) has emerged as a promising and challenging application of ontologies. The idea is to enrich instance data with a ‘semantic layer’ in the form of an ontology, used as an interface for querying and to derive additional answers. A central research problem in this area is to design query answering engines that can deal with sufficiently expressive ontology languages yet scale to large data sets. The most popular ontology languages that have been considered for OBDA include the three profiles OWL2 RL, OWL2 QL, and OWL2 EL, as well as various description logics and datalog variants related to these profiles [2, 3, 5, 11, 15]. Currently, there are two major methodologies for answering queries in an OBDA setting: rewriting-based approaches (also called backward chaining) and materialization-based approaches (also called forward chaining). In the former, one compiles the ontology T and the query q into a new query qT that contains the relevant knowledge from the ontology, i.e., the answers to q over A and T coincide with the answers to qT over A. One can thus store A in a relational database management system RDBMS and execute qT over A. In materialization 16 2 The Combined Approach to OBDA: Taming Role Hierarchies using Filters approaches, the instance data A is completed with the relevant knowledge from the ontology T , i.e. for any query q, the answers given to q over A and T coincide with the answers given to q over the completed instance data AT ⊇ A without any ontology. Thus, one can store AT in a relational database management system (RDBMS) and execute q over AT . A technical problem that arises in materialization approaches is that the completed data AT easily becomes infinite; in particular, this may happen when the ontology expresses cyclic dependencies and has existential quantifiers in the heads of its concept inclusions, which is allowed in most ontology languages in- cluding the ones mentioned above. To overcome this problem, an economic way of reusing individuals introduced for existential quantifiers has been proposed in [8, 9] for the case where ontologies are formulated in description logics from the EL and DL-Lite families, which are the logical cores of the OWL2 EL and OWL2 QL ontology languages. While the resulting completed data sets are finite and give correct answers to instance queries, they can give spurious answers to conjunctive queries (CQs). To recover soundness for CQs, it is thus necessary to include an additional step, resulting in the combined approach to query answer- ing: the original query is rewritten in a way that eliminates spurious answers. In sharp contrast to pure rewriting, the auxiliary query rewriting required in the combined approach turns out to be rather simple—an additional selection condition applied to the results of the original CQ over the completed data— and often of polynomial size. Indeed, experiments indicate that the combined approach admits very efficient query execution for expressive variants of EL and DL-Lite [8, 9]. Unfortunately, there are certain combinations of logical operators that are important from an application perspective, but for which an exponential blowup of the query seems to be unavoidable both in the query rewriting approach and in the combined approach. In particular, this is the case for the combination of inverse roles and role hierarchies as found in DL-LiteR [3], the extension of basic DL-Lite with role hierarchies that underpins OWL2 QL. It has been shown that, in the query rewriting approach, an exponential blowup of the query size is unavoidable when the ontology is formulated in DL-LiteR [7]. For the combined approach, an auxiliary query rewriting strategy for DL-LiteR ontologies and CQs is presented in [8], but it incurs an exponential blowup and it seems unlikely that the rewriting can be improved to a poly-sized one (although this question is yet to be resolved). In this paper, we present a new variation on the combined approach that can handle CQs and DL-LiteR ontologies and eliminates the need for auxiliary query rewriting altogether, thus also eliminating the need to deal with exponentially sized queries. Specifically, we replace auxiliary query rewriting with a filtering component: spurious answers are eliminated by a polynomial-time filtering pro- cedure (called a filter in the rest of the paper) that is installed as a user-defined function in the underlying RDBMS. Our main contributions are as follows. (1) We develop a polynomial time procedure for filtering out spurious answers to CQs for ontologies formulated in DL-LiteR . Interestingly, the existence of such 17 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter 3 a filtering procedure appears to be quite sensitive to how exactly the instance data is completed. Compared to the data completion for the original combined approach [8], the filtering technique requires subtle modifications to the data completion in order to obtain a polytime filter. (2) To analyze the performance of our approach and to compare it with the query rewriting approach, we modify the Lehigh University Benchmark (LUBM) [6] by introducing additional existential restrictions and subconcepts into the LUBM ontology and incompleteness into the LUBM Data Generator. Additional queries that are designed to stress-test the performance of the filtering procedure are introduced as well. (3) Finally, the resulting ontologies, data, and queries are used to demonstrate the feasibility of our approach, and to show that it scales to large amounts of instance data. Some technical proofs and details of our experimental evaluation are presented in the appendix of the full version of this paper, available at http://informatik. uni-bremen.de/~clu/combined/ 2 Preliminaries We introduce DL-LiteR -TBoxes, ABoxes, and conjunctive queries. Let NI , NC , and NR be countably infinite sets of individual names, concept names and role names. Roles R, simple concepts C, and concepts D are built according to the following syntax rules, where P ranges over NR and A over NC : R ::= P | P − , C ::= A | ∃R, D ::= C | ¬C | ∃R.A. As usual, we use N− − − R to denote the set of all roles and identify (P ) with P . In DL-LiteR , a TBox is a finite set T of concept inclusions (CIs) C � D with C a simple concept and D a concept, and role inclusions (RIs) R1 � R2 with R1 , R2 roles.1 An ABox is a finite set of concept assertions A(a) and role assertions P (a, b), where A ∈ NC , P ∈ NR and a, b ∈ NI . We denote by Ind(A) the set of individual names that occur in A, and write P − (a, b) ∈ A instead of P (b, a) ∈ A whenever convenient. A knowledge base (KB) is a pair (T , A) with T a TBox and A an ABox. The semantics of TBoxes and ABoxes is defined in the standard way based on interpretations I = (∆I , ·I ), where ∆I is a non-empty domain and ·I an interpretation function that maps each A ∈ NC to a subset AI ⊆ ∆I , each P ∈ NR to a relation P I ⊆ ∆I × ∆I , and each a ∈ NI to an element aI ∈ ∆I ; for details consult [1, 3]. An interpretation is a model of a TBox T if it satisfies all inclusions in T ; models of ABoxes and knowledge bases are defined analogously. A knowledge base is consistent if it has a model. For a CI or RI α, we write T |= α when α is a consequence of T (satisfied in all models of T ). Instead of 1 A set of role inclusions is also called a role hierarchy. 18 4 The Combined Approach to OBDA: Taming Role Hierarchies using Filters T |= R � S, we will usually write R �∗T S to clearly distinguish consequences of this form (which are RIs) from consequences of the form T |= ∃R � ∃S (which are CIs). Note that, in DL-LiteR , deciding consistency and logical consequence amounts to computing a form of transitive closure [3]. Let NV be a countably infinite set of variables. Taken together, the sets NV and NI form the set NT of terms. A first-order (FO) query is a first-order formula q = ϕ(x) in the signature NC ∪ NR and with terms from NT , where the concept and role names are treated as unary and binary predicates, respectively, and x are the free variables of ϕ called the answer variables; we say that q is k-ary if x comprises k variables. If k = 0, then q is a Boolean query. A conjunctive query (CQ) is an FO query of the form q = ∃y ψ(y, x), where ψ is a conjunction of concept atoms A(t) and role atoms P (t, t� ) where t, t� ∈ NT . As in the case of ABox assertions, we do not distinguish between P − (t, t� ) and P (t� , t). We denote by term(q) the set of terms in q. Let q = ϕ(x) be a k-ary FO query with x = x1 , . . . , xk , and I an interpre- tation. A mapping π : term(q) → ∆I with π(a) = aI for all a ∈ term(q) ∩ NI is a match for q in I if I satisfies q under the variable assignment that maps each t ∈ term(q) to π(t); in this case, we write I |=π q. For a k-tuple of individual names a = a1 , . . . , ak , a match π for q in I is an a-match if π(xi ) = aIi for i ≤ k. We say that a is an answer to q in an interpretation I if there is an a-match for q in I and use ans(q, I) to denote the set of all answers to q in I. Finally, a is a certain answer to q over a KB K = (T , A) if a ⊆ Ind(A) and I |= q[a] for all models I of K. The set of all certain answers to q over K is denoted by cert(q, K). The query answering problem considered in this paper is: given a DL-LiteR knowledge base K and a CQ q, compute cert(q, K). To simplify notation, throughout the paper we adopt the unique name as- sumption (UNA), i.e., require that aI �= bI for distinct a, b ∈ NI . This assumption has no impact on the query answering problem. 3 ABox Completion As explained in the introduction, the central idea of the combined approach is to materialize consequences of the TBox in the ABox as a preprocessing step, and then to execute queries over the completed data stored in an RDBMS as a plain table. We illustrate this using two examples from the university domain, similar in spirit to the LUBM ontology from [6] used in the experimental evaluation. Example 1. For any ABox A, the concept inclusions Student � Person (1) Student � ∃takesCourse (2) lead to the following additions: (1) for every assertion Student(a) ∈ A, add (1) Person(a) and (2) takesCourse(a, b) for some fresh individual b (unless such assertions are already present). After this completion, a CQ such as q1 (x) = ∃y Person(x) ∧ takesCourse(x, y) 19 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter 5 Faculty a degreeFrom b Univ degreeFrom deptOf c d Faculty teachesAt Dept Fig. 1. Completed ABox for Example 3. correctly returns each a with Student(a) ∈ A as a certain answer. The following example shows that naive completion can result in infinite ABoxes. Example 2. Completed naively, the ABox {Faculty(a)} and LUBM inclusions Faculty � ∃degreeFrom ∃degreeFrom− � Univ (3) − Univ � ∃deptOf ∃deptOf � Dept (4) − Dept � ∃teachesAt ∃teachesAt � Faculty (5) result in an infinite role chain that indefinitely repeats the roles degreeFrom, deptOf − , and teachesAt− . The problem can be overcome by reusing fresh individuals in an economic way. Example 3. Consider again the TBox (3)-(5). By reusing individuals, the ABox {Faculty(a)} can be completed as shown in Figure 1, replacing the infinite role chain with a cycle. Individual reuse compromises soundness of query answering as some queries now have spurious answers; for example, the CQ q2 (x) = ∃y, z Faculty(x) ∧ degreeFrom(x, y) ∧ Univ(y) ∧ deptOf(z, y) ∧ Dept(z) ∧ teachesAt(x, z) returns c as an answer when executed over the completed ABox shown in Fig- ure 1. This answer is spurious for two reasons: first, the cycle in Figure 1 is present only due to individual reuse and thus should be disregarded for query matches; and second, the freshly introduced individuals b, c, d are ‘labeled nulls’ and thus can never be returned as answers. To recover soundness, it is necessary to eliminate the spurious answers. In the original combined approach, this was achieved by query rewriting [8, 9]. In this paper, the spurious answers are eliminated by a filtering procedure that is in- stalled as a user-defined function in the RDBMS. In the remainder of this sec- tion, we introduce ABox completion in full detail. In the subsequent section, we describe the filtering procedure. From a conceptual perspective, the ABox completion step can be viewed as replacing the original ABox with the canonical model IK of the knowledge 20 6 The Combined Approach to OBDA: Taming Role Hierarchies using Filters base K [8]. To define IK , we need a few preliminaries. From now on, we will generally disallow concepts of the form ∃R.C. This can be done without loss of generality since each CI D � ∃R.C can be replaced with D � ∃RC , RC � R, − and ∃RC � C, where RC is a fresh role name. Let K = (T , A) be a DL-LiteR KB. We use rol(T ) to denote the set of all role names in T plus their inverses. The canonical model comprises at most two fresh individuals for every role in rol(T ). However, we only want to introduce the fresh individuals for a given role when necessary. Formally, we call a role R ∈ rol(T ) generating in T if there exist an a ∈ Ind(A) and R0 , . . . , Rn ∈ rol(T ) such that Rn = R and the following conditions hold: / A for all b ∈ Ind(A) (written a � ∃R0 ), (agen) K |= ∃R0 (a) and R0 (a, b) ∈ (rgen) for i < n, T |= ∃Ri− � ∃Ri+1 and Ri− �= Ri+1 (written ∃Ri− � ∃Ri+1 ). To facilitate the implementation of efficient filters, we refine the definition of canonical models as given in [8]: in some cases, we introduce two fresh individ- uals for a given role instead of only a single one. The need for the additional individual is related to particular role configurations in the TBox called a loop: a set {R, S} ⊆ rol(T ) (where potentially R = S) is a loop in T if R �= S − , T |= ∃R− � ∃S, T |= ∃S − � ∃R, and there is some T ∈ rol(T ) such that S − �∗T T and R �∗T T . Let LT denote the set of all roles that occur in a loop in T . The canonical model IK is then based on the domain ∆IK = Ind(A) ∪ {cR,0 | R ∈ rol(T ) \ LT is generating in K} ∪ {cR,0 , cR,1 | R ∈ LT is generating in K}. To define the extension of roles in IK , we need some additional preparation. Let “≺” be an arbitrary, but fixed total ordering on rol(T ). For all d, d� ∈ ∆IK and each role R, we write d �R d� whenever there is an S such that S �∗T R and one of the following cases applies: – d = a ∈ Ind(A), a � ∃S, and d� = cS,0 ; – d = cT,i , ∃T − � ∃S, d� = cS,j , and one of the following holds • i = j and {S, T } is not a loop in T ; • i = j, {S, T } is a loop in T , and S ≺ T ; • i = j, {S, T } is a loop in T , and T = S or T ≺ S (for 0 = 1 and 1 = 0). The canonical model IK for K is now defined as follows, based on the domain ∆IK introduced above: AIK = {a ∈ Ind(A) | K |= A(a)} ∪ {cR,i ∈ ∆IK | T |= ∃R− � A}, RIK = {(a, b) ∈ Ind(A) × Ind(A) | ∃S : S(a, b) ∈ A and S �∗T R} ∪ {(d, d� ) ∈ ∆IK | d �R d� or d� �R− d}, aIK = a. The ABox completion consists of replacing the ABox A originally stored in the RDBMS with its canonical model IK . This can be achieved by executing a set of FO/SQL-queries whose size is polynomial in the size of T [8]. 21 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter 7 a a w i i w acw,0 p p cw,0 cp,0 i i w acw,0 cp,0 i w w i p i ac c c cw,1 cp,1 .. w,0 p,0 w,1 i . Fig. 2. Canonical model IK and unraveled canonical model UK for Example 5. It can be shown that IK is a model of K whenever K is consistent. Note that one can find a Boolean CQ qT of size polynomial in the size of T such that for any ABox A stored in the RDBMS, qT gives a positive answer iff K = (T , A) is consistent [8]. We can thus safely assume that the knowledge base has been tested for consistency before query answering. Example 4. Reconsider Examples 2 and 3. The canonical model IK for the ABox {Faculty(a)} and TBox (3)-(5) is the structure displayed in Figure 1. Follow- ing our construction above, the fresh individuals b, c, d are named cdegreeFrom,0 , cteachesAt− ,0 , and cdeptOf − ,0 . Note that the TBox (3)-(5) does not give rise to any loops, and thus all cR,i have index i = 0. Example 5. The following TBox gives rise to the loop {worksFor, paysSalaryOf}: Employee � ∃worksFor ∃worksFor− � Employer (6) − Employer � ∃paysSalaryOf ∃paysSalaryOf � Employee (7) − worksFor � isAffiliatedWith paysSalaryOf � isAffiliatedWith. (8) The canonical model IK for the ABox {Employee(a)} and the TBox (6)-(8) with paysSalaryOf ≺ worksFor is shown on the left-hand side of Figure 2, where concept names are omitted and role names are abbreviated by their first letter. Note that a more straightforward version of the canonical model could be ob- tained by identifying all elements cR,0 and cR,1 . Section 4 explains why this leads to problems for efficient filtering. Also note that real world ontologies seem to contain only very few loops, if any, and thus having two domain elements per role that participates in a loop should not significantly increase the size of canonical models in practical cases. To characterize the spurious answers that have to be filtered out, it is useful to introduce an unraveled (infinite) version of canonical models. Let K be a knowledge base. A path is a finite sequence ad1 · · · dn , n ≥ 0, such that a ∈ Ind(A), d1 , . . . , dn ∈ ∆IK \ Ind(A), a �R d1 for some R ∈ N− R , and di �R di+1 for some R ∈ N− R , 1 ≤ i < n. We denote by tail(σ) the last element of the path σ. 22 8 The Combined Approach to OBDA: Taming Role Hierarchies using Filters The unraveled canonical model UK is then defined by taking: ∆UK is the set of all paths in IK , aUK = a, for all a ∈ Ind(A), AUK = {σ ∈ ∆UK | tail(σ) ∈ AIK }, RUK = {(a, b) ∈ Ind(A) × Ind(A) | ∃S : S(a, b) ∈ A and S �∗T R} ∪ {(σ, σd) | σd ∈ ∆UK and tail(σ) �R d} ∪ {(σd, σ) | σd ∈ ∆UK and tail(σ) �R− d}. As an example, the canonical model UK for the KB from Example 4 is shown on the right-hand side of Figure 2. The following result shows that, as one would expect, UK does not suffer from spurious answers. Theorem 1. For every consistent DL-LiteR -KB K and every CQ q, we have cert(q, K) = ans(q, UK ). The proof of Theorem 1 is standard and omitted, see [8] for a similar proof. 4 Filtering To remove spurious answers, we install a filtering procedure as a user-defined function in the RDBMS. The procedure takes as input a match of the query in the canonical model IK stored in the RDBMS and returns “false” if this match is spurious and “true” otherwise. We assume that the filtering procedure has access to the query and the TBox, but not to the data. To define its behavior more precisely, we formally define spurious matches based on unraveled canonical models UK and Theorem 1. Let K be a KB and q(x) a CQ. A match π of q in IK is reproduced by a match τ of q in UK if for all t ∈ term(q), we have π(t) = tail(τ (t)). We say that π is spurious if it is not reproduced by any match τ of q in UK . The following lemma, which is an immediate consequence of Theorem 1, shows that IK can be used for query answering when spurious matches are filtered out. Lemma 1. a ∈ cert(q, K) iff there is an a-match π of q in IK that is not spurious. We want to show that it can be decided in time polynomial in the size of q and T whether a given match in IK is spurious. Clearly, it is enough to test for each maximally connected component of q whether the given match is spurious on that component. We can thus w.l.o.g. assume that q is connected. We need a few preliminaries. An anonymous path is a path without the leading individual name, i.e., it is a finite sequence d1 · · · dn , n ≥ 1, such that d1 , . . . , dn ∈ ∆IK \ Ind(A) and di �R di+1 for some R ∈ N− R , 1 ≤ i < n. We use Paths to denote the set of all paths, both anonymous and non-anonymous. A root configuration for q given π is a set ρ ⊆ term(q) such that one of the following conditions is true: – ρ is the set of those t ∈ term(q) such that π(t) ∈ NI and this set is non-empty; 23 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter 9 Student Student Student a 1 a2 ··· an takesCourse takesCourse ctakesCourse,0 Fig. 3. Canonical model IK for Example 6. – the above set is empty and ρ contains exactly one term (actually a variable). The filtering procedure first checks whether all answer variables are mapped to elements of Ind(A), i.e., individuals of the original ABox A. If this is not the case, it immediately returns “false”. Then the procedure iterates through all root configurations ρ. For each ρ, it constructs a sequence Sρ0 , Sρ1 , . . . of relations Sρi ⊆ term(q) × Paths as follows: – Sρ0 contains all pairs (t, π(t)) with t ∈ ρ; – Sρi+1 is Sρi extended with the following pairs: (a) (t, σπ(t)) for all R(s, t) ∈ q with (s, σ) ∈ Sρi and π(s) �R π(t); (b) (t, σπ(t)) for all R(s, t) ∈ q with (s, σπ(t)π(s)) ∈ Sρi and π(t) �R− π(s). The computation stops as soon as the sequence stabilizes or Sρi becomes non- functional which happens after at most |term(q)| iterations. The procedure re- turns “true” if the final Sρi is a function with domain term(q) for some root configuration ρ, and “false” otherwise. Example 6. Consider the TBox (1)-(2) from Example 1, the query q3 (x, y) = ∃z Student(x) ∧ Student(y) ∧ takesCourse(x, z) ∧ takesCourse(y, z), (9) and the ABox {Student(a1 ), . . . , Student(an )}. (10) The canonical model IK is shown in Figure 3. Suppose the filter gets as input the match π = {x �→ a1 , y �→ a2 , z �→ ctakesCourse,0 }. There is only one possible root configuration for π, which is ρ = {x, y}. The procedure computes Sρ = {(x, a1 ), (y, a2 ), (z, a1 ctakesCourse,0 ), (z, a2 ctakesCourse,0 )} which is not a function; thus, the match is identified as spurious and “false” is returned. Example 7. Consider the ABox {Faculty(a)}, TBox (3)-(5), and query q2 from Example 3. To make things a bit more interesting, assume that x is a quantified variable in q2 rather than an answer variable. Recall that the canonical model IK is shown in Figure 1, modulo the names of fresh individuals. Given the match π = {x �→ c, y �→ b, z �→ d} and considering the root configuration ρ = {x}, the procedure computes Sρ = {(x, c), (y, cb), (z, cbd), (x, cbdc)} and stops because of non-functionality. For the other root configurations ρ = {y} and ρ = {z}, the procedure fails in a similar way and thus returns “false”. 24 10 The Combined Approach to OBDA: Taming Role Hierarchies using Filters Similar to the “tree witnesses” from [8], the filtering procedure follows a simple idea for reproducing the input match π in IK as a match τ in UK : when we have already decided that τ (x) = σ ∈/ Ind(A) and R(x, y) ∈ q, then there is a uniquely determined individual σ � to which y can be matched. This follows from requiring π(y) = tail(τ (y)) and the following property of UK : if (σ, σ � ) ∈ RUK and (σ, σ �� ) ∈ RUK with σ � �= σ �� , then tail(σ � ) �= tail(σ �� ). In fact, it is this determinism of matches that is made explicit by Conditions (a) and (b) of the filtering procedure. Note that, without introducing two individual names cR,0 and cR,1 whenever R is involved in a loop, the above crucial property of UK fails. In fact, we do not know whether polytime filtering is possible based on the variation of the canonical model where all individuals cR,0 and cR,1 are identified. The problem is illustrated by the following example. Example 8. Consider the ABox {Employee(a)} and TBox (6)-(8) from Exam- ple 5 and the CQ q4 (x) = ∃y, z, u w(x, y) ∧ p(y, z) ∧ i(u, z). Let π = {x �→ a, y �→ cw,0 , z �→ cp,0 , u �→ cw,1 }. The only root configuration is ρ = {x}. During the first two iterations, the filtering procedure produces Sρ2 = {(x, a), (y, acw,0 ), (z, acw,0 cp,0 )}. Sρ2 says that z has to be mapped to acw,0 cp,0 . Due to the atom i(z, u) ∈ q4 and the two i-edges outgoing from z in UK , the possible targets for u are acw,0 and acw,1 . However, to produce a match in UK that is compatible with π, we can only choose a target that ends with π(u) = cw,1 and obtain Sρ3 = {(x, a), (y, acw,0 ), (z, acw,0 cp,0 ), (z, acw,0 cp,0 cw,1 )} which is functional, showing that the match π is not spurious. In a canonical model IK where cw,0 and acw,1 are identified, there are indeed two choices for mapping of u. This makes it non-obvious how to find a polytime filtering proce- dure in this case, if one exists at all. We now analyze the runtime and correctness of the filtering procedure. First note that, in Conditions (a) and (b), the filtering procedure has to check whether π(s) �R π(t) and π(t) �R− π(s), respectively. As required, both conditions can be tested without access to the ABox A. For example, in Condition (a) we have: – if π(t) ∈ Ind(A), then π(s) �R π(t) does not hold and checking whether π(t) ∈ Ind(A) requires only to check whether or not π(t) is of the form cR,i ; – if π(s) ∈ Ind(A) and π(t) ∈ / Ind(A), then π(s) �R π(t) holds by the con- struction of IK since π is a match of q in IK , and; – if π(s) ∈ / Ind(A) and π(t) ∈ / Ind(A), then π(s) �R π(t) can be checked by using only π and T based on the definition of “�R ”. It is not hard to see that the algorithm runs in polynomial time. The runtime is quadratic in the size of q because we first have to iterate over all root con- figurations ρ and then need to compute Sρ , essentially a breadth-first search 25 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter 11 of (the graph of) q. We conjecture that iterating over all root configurations is avoidable at the cost of a less transparent filtering procedure, improving the runtime to linear in the size of q. The runtime also depends on T as checking the applicability of Conditions (a) and (b) involves testing consequences of the forms T |= ∃R � ∃S and S �∗T R. Since it is efficient to pre-compute all these consequences in practical cases, this amounts to a simple lookup. The following lemma asserts correctness of the filtering procedure. It is proved in Appendix A of the full version. Lemma 2. Given a match π of q in IK , the filtering procedure returns “true” iff π is not spurious. 5 Experiments We carry out an experimental evaluation to analyze the performance of the com- bined approach with filtering, based on the IBM DB2 relational database system. We use a modified version of the ontology from the Lehigh University Benchmark (LUBM) [6] and produce test ABoxes using a modified version of the LUBM data generator. We execute five queries from the LUBM suite as well as six hand- crafted ones that were designed to test the proposed approach more fully. Note that LUBM was not specifically designed for evaluating OBDA with DL-LiteR and in its original form is not too useful for this purpose, for reasons explained in more detail below. Our modifications of the LUBM ontology, data generator, and query set all aim at making the LUBM suite more realistic for OBDA evaluation. We believe that this material might be interesting also for future experiments and provide it online at http://informatik.uni-bremen.de/~clu/combined/. 5.1 Ontology, Data, and Queries The LUBM ontology comprises 42 concept names and 25 role names and is formulated in the description logic ELI extended with transitive roles, role hier- archies, and datatypes. The TBox contains concept inclusions of the form A � C, concept definitions A ≡ C as abbreviations for A � C, C � A, and domain and range restrictions of the form ∃R � A and ∃R− � A. We converted this on- tology to DL-LiteR by dropping all datatypes, treating the only transitive role subOrganizationOf as a standard role, replacing concept equations A ≡ C with A � C, and breaking up conjunctions A � C1 � C2 into A � C1 , A � C2 . While the resulting TBox is formulated in DL-LiteR as required, it is only moderately interesting for evaluating query answering techniques: first, there is a lack of existential restrictions ∃R and ∃R.C on the right-hand side of con- cept inclusions, which leads to extremely few fresh anonymous individuals being generated during the ABox completion, and consequently to very few role edges between those individuals (from now on, we call this part of the canonical model IK the anonymous part); second, the overall size of the TBox is too small to be representative for real-world ontologies. To attenuate these deficiencies while still 26 12 The Combined Approach to OBDA: Taming Role Hierarchies using Filters being able to use the LUBM data generator, we extended the DL-LiteR -version of LUBM in two directions: (1) We added 26 concept inclusions, many of which have existential restrictions on the right-hand side, to generate a more interesting anonymous part of canon- ical models. A complete list of these CIs can be found in Appendix B of the full version of this paper. (2) With reasonable effort, it does not seem possible to significantly increase the size of LUBM (to hundreds or thousands of concepts) while retaining a careful modeling. One particularly unrealistic aspect of LUBM and a striking differ- ence to more comprehensive ontologies is its limited concept hierarchy, where each concept has only very few subconcepts. To alleviate this shortcoming, we added subconcepts to each of the LUBM concepts Course, Department, Professor, and Student by introducing subject areas, such as MathCourse, BioCourse, and CSCourse for courses, MathProfessor, BioProfessor for professors, etc. We call the resulting TBox LUBM∃n with n indicating the number of sub- concepts introduced in Point 2 above (20 by default). For example, LUBM∃20 contains 106 concept names and 27 role names. To generate ABoxes, we use the LUBM Data Generator (UBA) version 1.7, modified so as to complement our modifications to the TBox. Specifically, the original UBA generates data that is complete w.r.t. existential restrictions in the LUBM ontology: it produces ABoxes A such that for every assertion A(a) ∈ A and CI A � ∃R (and A � ∃R.B) in LUBM∃n , there is already an r-successor of a in A. Our modifications introduce a controlled amount of incompleteness: the modified data generator takes a probability p as a parameter and, in selected parts of the data, drops generated role assertions with probability p. More infor- mation can be found in Appendix C of the full version. The second modification of the data generator is linked to the subconcepts introduced in Point 2 above. Whenever the original generator produces an instance a of Student, the new generator randomly chooses a value between 1 and n and generates an asser- tion for the i-th subject, SubjiStudent(a); similarly for Course, Department, and Professor. The main aim of our experiments is to show that our approach is feasible on realistic ontologies, data, and queries. Additionally, we also provide a prelimi- nary comparison with the query rewriting approach, using the Requiem tool for producing those rewritings [10]. We use 11 queries, six of which we have hand- crafted specifically for our experiments and five originating from the evaluation of Requiem presented in [10]. The latter queries are extremely simple and do in most cases neither pose a serious challenge for the filtering approach nor for pure rewriting. The former are shown in Figure 4. Note that q3 is very similar to the query discussed in Examples 3, 4, and 7; and is designed in such a way that spurious cycles in the anonymous part of canonical models produce spurious matches that have to be filtered out. Query q2 is essentially the query discussed in Example 6 and is designed to stress-test the filtering approach: based on the data generation scheme, it is expected to produce a very large number of spurious answers. 27 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter 13 q1(x,y) <- Student(x), takesCourse(x,z), Course(z), teacherOf(y,z), Faculty(y), worksFor(y,u), Department(u), memberOf(x,u) q2(x,y) <- Subj3Student(x), Subj4Student(y), takesCourse(x,z), takesCourse(y,z) q3(x) <- Faculty(x), degreeFrom(x,y), University(y), subOrganizationOf(z,y), Department(z), memberOf(x,z) q4(x,y) <- Subj3Department(x), Subj4Department(y), Professor(z), memberOf(z,x), publicationAuthor(u,z), Professor(v), memberOf(v,y), publicationAuthor(u,v) q5(x) <- Publication(x), publicationAuthor(x,y), Professor(y), publicationAuthor(x,z), Student(z) q6(x,y) <- University(x), University(y), memberOf(z,x), Student(z), memberOf(u,y), Professor(u), advisor(z,u) Fig. 4. Queries q1 to q6 . Universities CA CA (compl) RA RA (compl) Inds Inds (compl) 10 373K 636K 593K 1.3M 201K 201K 25 984K 1.6M 1.5M 3.6M 528K 528K 50 1.9M 3.3M 3.1M 7.2M 1M 1M 75 3M 5.1M 4.7M 10.9M 1.6M 1.6M 100 4M 6.8M 6.3M 14.6M 2.1M 2.1M 125 5M 8.5M 7.9M 18M 2.7M 2.7M 150 6M 10.1M 9.5M 21.8M 3.2M 3.2M 200 8M 13.5M 12.6M 29M 4.3M 4.3M Fig. 5. Number of concepts, roles, and individuals in original and completed ABoxes. 5.2 Results We report on three experiments, in each experiment varying a different param- eter: in experiment one, we vary the size of the ABox via the number of univer- sities generated by the data generator; in experiment two, we vary the degree of incompleteness of the data; and in experiment three, we vary the number of subclasses, i.e., the parameter n of the ontology LUBM∃n . All experiments were carried out on a Linux (3.2.0) machine with a 3.5Ghz quad-core processor and 8GB of RAM, using IBM DB2 version 9.7.5. The size of the test data is detailed in Figure 5 and query execution times for the first experiment are reported in Figure 7. Here we use 5% incompleteness of the data and n = 20 subclasses in LUBM∃n . The orange curves are for the combined approach with filtering while the blue ones indicate the pure rewriting approach for those cases in which Requiem succeeded generating a rewriting. Using the combined approach with filtering, all queries were answered within very reasonable time. To better understand the results, it is interesting to con- sider the number of spurious and valid answers for each query shown in Figure 6 for the 200 universities experiment. Query q2 , designed specifically to stress-test the filter, as expected produces a huge number of spurious answers. Indeed, the 28 14 The Combined Approach to OBDA: Taming Role Hierarchies using Filters q1 q2 q3 q4 q5 q6 req1 req2 req3 req4 req5 spurious answers 2 28M 2 24K 0 0 0 22K 0 163K 0 valid answers 4.6M 0 0 0 8.3M 0 0 410K 48K 137K 0 Fig. 6. Number of answers for 200 universities, 5% incompleteness, 20 subclasses. comparably long execution time of this query appears to be mainly due to the fact that DB2 has to handle a large number of spurious answers before the filter- ing takes place, and not to a poor performance of the filter itself. The execution times of q1 and q5 can also be explained by a large number of answers. Note that the number of filter calls is actually the sum of the numbers of answers, both spurious and valid. Also note that, in principle, it is possible to avoid an extremely large number of spurious answers in q2 (and any other query) at the cost of slightly increasing the size of the canonical model: duplicate the anony- mous part of the canonical model so that no two individuals in the original ABox ‘share’ an anonymous part of the canonical model. Analyzing this further in experiments is left for future work. Experiments two and three are reported about in Figures 8 and 9. Here, we only tested the filtering approach. In both cases, we use 100 universities. In experiment two, the number of subclasses is fixed to 20 while in experiment three, the degree of incompleteness is fixed to 5%. In general, the degree of incompleteness has virtually no effect on the execution time of queries. Again, q2 is an exception as the number of spurious answers increases dramatically from 3M for 1% incompleteness to 125M for 20% incompleteness. The number of subclasses also has essentially no effect on query execution times (in contrast to the pure query rewriting approach for which a non-trivial number subclasses can dramatically increase the size of the rewritten query). Note that the execution time of q2 becomes shorter with an increasing number of subclasses because the number of spurious matches decreases from 6.5M for 5 subclasses to 211K for 100 subclasses: this is due to the atoms Subj3Student and Subj4Student in q2 and the fact that the number of assertions for these two concepts decreases as the number of subject areas increases. 6 Conclusion We have modified the combined approach to OBDA by replacing the query rewriting part with a filtering technique. This step is natural from an implemen- tation perspective and allows to circumvent an exponential blowup of the query. Based on experiments with an improved version of the LUBM ontology, we have demonstrated the scalability of our approach. As future work, we plan to extend the combined approach with filtering to other description logics for which, until now, it is unknown how to avoid an exponential blowup. For example, we believe that polytime filtering is possible for the extension of EL with transitive roles, as found in the OWL2 EL pro- file. It would also be interesting to better understand the impact of modifying the canonical model on query answering, both from a theoretical and from a 29 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter 15 100 time (s) 50 200 150 s IV 100 N 0 U q1 q2 q3 q4 q5 q6 req1 req2 req req req 50 of 3 4 5 # Fig. 7. Query run times for varying numbers of Universities. 400 time (s) 200 20 15 ) (% 10 e 0 p let q1 q2 q3 q4 q5 5 m q6 req1 req2 req req req 3 4 5 02 i nc o Fig. 8. Query run times for varying incompleteness (in %). time (s) 20 100 80 es ss 0 60 b cla q1 q2 q3 q4 q5 q6 req1 req2 req req req 40 su 3 4 5 20 of # Fig. 9. Query run times for varying number of subclasses. 30 16 The Combined Approach to OBDA: Taming Role Hierarchies using Filters practical perspective. For example, we do not know whether the more natural canonical model obtained by identifying all individuals cR,0 and cR,1 admits polytime filtering. Moreover, as discussed above it is conceivable that versions of the canonical model that are less economic regarding individual reuse result in better runtime in practice. We also plan to compare the performance of our approach more thoroughly with the performance of pure query rewriting, using other state-of-the-art query rewriting tools such as Quest [12], Presto [13], OWLgres [14], CLIPPER [4]. In this context, it is interesting to note that promising new optimization techniques have recently been developed in [11] and implemented in the Quest system. While some of them (such as the exploitation of ABox integrity constraints) aim specifically at the query rewriting approach, others (such as semantic indexing) can easily be combined with the filtering approach proposed in this paper. References 1. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation and Applications. Cam- bridge University Press (2003) 2. Calı̀, A., Gottlob, G., Pieris, A.: New expressive languages for ontological query answering. In: AAAI (2011) 3. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning 39(3), 385–429 (2007) 4. Eiter, T., Ortiz, M., Simkus, M., Tran, T.K., Xiao, G.: Query rewriting for Horn- SHIQ plus rules. In: AAAI (2012) 5. Eiter, T., Ortiz, M., Simkus, M., Tran, T.K., Xiao, G.: Towards practical query answering for Horn-SHIQ. In: Description Logics (2012) 6. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base sys- tems. J. Web Sem. 3(2-3), 158–182 (2005) 7. Kikot, S., Kontchakov, R., Podolskii, V.V., Zakharyaschev, M.: Exponential lower bounds and separation for query rewriting. In: ICALP (2). pp. 263–274 (2012) 8. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com- bined approach to query answering in DL-Lite. In: KR (2010) 9. Lutz, C., Wolter, F., Toman, D.: Conjunctive query answering in the description logic EL using a relational database systems. In: IJCAI. pp. 2070–2075 (2009) 10. Pérez-Urbina, H., Horrocks, I., Motik, B.: Efficient query answering for OWL 2. In: International Semantic Web Conference. pp. 489–504 (2009) 11. Rodriguez-Muro, M., Calvanese, D.: High performance query answering over DL- Lite ontologies. In: KR (2012) 12. Rodriguez-Muro, M., Calvanese, D.: Quest, an OWL 2 QL reasoner for ontology- based data access. In: OWLED (2012) 13. Rosati, R., Almatelli, A.: Improving query answering over DL-Lite ontologies. In: KR (2010) 14. Stocker, M., Smith, M.: Owlgres: A scalable OWL reasoner. In: OWLED (2008) 15. Thomazo, M., Baget, J.F., Mugnier, M.L., Rudolph, S.: A generic querying algo- rithm for greedy sets of existential rules. In: KR (2012) 31