Heuristics for Applying Cached OBDA Rewritings Andreas Nakkerud and Evgenij Thorstensen Department of Informatics, University of Oslo Abstract. In OBDA systems, cached rewritings can be used to significantly shorten the process of finding rewritings of bigger queries. The main challenge in using cached rewritings is finding the optimal combination to apply, a problem which is NP-complete. In this paper, we present a way to calculate the value of a cached rewriting, and propose using this value to decide which cached rewritings to apply. The idea behind our technique is to estimate how much computation, in particular optimization, has gone into the rewriting. This can tell us something about which cached rewritings we should prioritise, and also when to cache a rewriting. In order to quantify optimization, we define a measure of UCQs, and calculate this measure at different stages of the rewriting. 1 Introduction Ontology-based data access (OBDA) [11] is a recent paradigm for accessing data sources through an ontology that acts as a conceptual, integrated view of the data. The data sources are connected to the ontology via mappings that specify how to retrieve the appropriate data from the sources. The framework of OBDA has received a lot of attention in the last years: many theoretical studies have paved the way for the construction of OBDA systems (e.g., [3, 5, 13]) and the development of OBDA projects for enterprise data management in various domains [2]. The usual way of answering a query over the ontology in the OBDA framework [11] is to transform it into a set of queries to be executed over the data sources. This process is computationally expensive, as it requires logical reasoning over knowledge represented in the ontology (rewriting) followed by the application of mappings to the ontological query (unfolding). Furthermore, as the resulting set of queries is likely to contain redundancies, optimization techniques are typically applied along the way [10]. It therefore makes sense to cache the results of a query rewriting and unfolding, in the hope of using them to speed up the answering of future queries [10]. However, the problem of finding the cached rewritings to use is itself hard, as it is a variant of the well-known problem of query answering using views [8]. In particular, checking whether a cached rewriting can be applied to a query is in general NP-complete. In this paper, we therefore study the following problem: Given a query Q to answer and a set of cached rewritings of previous queries, how useful is each cached rewriting likely to be, if it is applicable to Q? Knowing this, the system can make sensible decisions as to which cached rewritings to try and apply, and in what order. As a measure of the usefulness of an existing cached rewriting, previous work by Di Pinto et al. [10] use the number of atoms in the queries. This approach does not take into account the details of the rewriting algorithm used, nor the specification of the OBDA system the queries are answered over. In order to get a finer picture of the usefulness of a cached rewriting, we define a measure for UCQs, and apply it to an intermediate stage of the rewriting. We then use this measure, as well as measures of the cached rewriting, to define a heuristic. In addition to analysing the queries of a cached rewriting directly, this heuristic also takes into account how ontology queries are rewritten by the OBDA system. 2 Preliminaries In this section, we define basic notions related to OBDA systems and their components: databases, ontologies, and mappings. We then give a definition of query rewriting and unfolding, followed by a formal definition of cached query rewritings as mappings that possess specific properties. 2.1 Databases In this paper, we assume a fixed relational schema S, and adopt the standard notions for conjunctive queries (CQs) over S [1]. To every conjunctive query CQ we associate a measure of its size, denoted by size(CQ). There are several ways to measure the size of a conjunctive query; we will discuss using the number of atoms or the number of variables in the query. However, it is possible to also use more advanced measures of query size, such as tree and hypertree width [6, 7]. Given a conjunctive query Q and a tuple of constants t, we write Q[t] for the query obtained by replacing the free variables of Q with the constants in t. Given a database instance D and a query Q, we write ans(Q, D) for the set of answers to Q over D, defined in the standard way. 2.2 Ontologies An ontology is a description of a domain of interest in some formal language. Here, we consider the languages of Description Logics (DLs). In general, an ontology expressed in a description logic is a pair O = hT , Ai, where the TBox T contains axioms specifying universal properties of the concepts and roles in the domain, while the ABox A specifies instances of concepts and roles. In the OBDA setting, the ABox is given by mappings rather than explicitly, and so the TBox is the only relevant component. We therefore set O = T . In the examples discussed in this paper, we will use the description logic DL-LiteA [4, 12]. The syntax of DL-LiteA is based on concepts, value-domains, roles, and attributes, and can be defined using the following grammar [10]: B → A | ∃Q | δ(UC ) E → ρ(U ) C → B | ¬B F → T1 | . . . | Tn − Q→P |P V → U | ¬U R → Q | ¬Q, where A is a concept name, P is a role name, U is an attribute name, and T1 , . . . , Tn are value-domains. We let ΣO be a set of ontology predicates, and ΓC a set of constants. The semantics of DL-LiteA is defined in terms of first- order interpretations I = (∆I , ·I ) over ΣO ∪ ΓC . The non-empty domain ∆I is the union of the disjoint sets ∆V and ∆IO , where ∆V is the domain for interpreting data values, and ∆IO is the domain for interpreting object constants. The interpretation function ·I is defined as follows: AI ⊆ ∆IO (¬U )I = (∆IO × ∆V ) \ U I P I ⊆ ∆IO × ∆IO (P − )I = {(o, o0 ) | ∃v.[(o0 , o) ∈ P I ]} U I ⊆ ∆IO × ∆V (∃Q)I = {o | ∃o0 .[(o, o0 ) ∈ QI ]} (¬B)I = ∆IO \ B I (δ(U ))I = {o | ∃v.[(o, v) ∈ U I ]} (¬Q)I = (∆IO × ∆IO ) \ QI (ρ(U ))I = {v | ∃o.[(o, v) ∈ U I ]}. A DL-LiteA TBox T is a finite set of axioms of the form BvC QvR EvF U vV (funct Q) (funct U ). The interpretation I satisfies an axiom X v Y if X I ⊆ Y I . I satisfies (funct Z) if for every o, o0 , o00 ∈ ∆IO , if (o, o0 ) ∈ Z I and (o, o00 ) ∈ Z I , then o0 = o00 . 2.3 Mappings In the general case, a mapping assertion m between a database schema S and an ontology O has the form Q q, where the body Q is a CQ over S, and the head q a CQ over the vocabulary of O [11], possibly with shared free variables. To define the semantics of mapping assertions, we first need to define the notion of an OBDA system. 2.4 OBDA systems An OBDA system specification is a triple B = hO, S, Mi where S is a database schema, M a set of mapping assertions, and O an ontology. The semantics of query answering in OBDA systems are usually defined using first-order interpretations. Definition 1 (OBDA semantics). Let B = hO, S, Mi be an OBDA system specification, and D a database for S. A first order interpretation I with I |= D is a model for B if – I |= O, and – for every tuple of constants t from D, and every mapping assertion Q q ∈ M, we have that I |= q[t] whenever I |= Q[t]. The set of answers to a query q over B and D is the set of tuples t from D such that q[t] is true in every model of B and D. We write ans(q, B, D) for the answers to q over B and D. As mentioned in the introduction, answering a query in an OBDA system is usually done by rewriting and unfolding the query into the correct database queries to execute. Definition 2 (Rewriting and unfolding). Let B = hO, S, Mi be an OBDA system specification, and q a CQ over the vocabulary of O. A rewriting of q under B is a query q 0 over the same vocabulary such that ans(q 0 , h∅, S, Mi, D) = ans(q, B, D) for every database instance D. An unfolding of the rewriting q 0 of q is a query Q over S such that ans(Q, D) = ans(q, B, D) for every database instance D. We can now define the notion of a perfect mapping assertion, which captures the idea of caching the rewriting and unfolding of a query. Definition 3 (Perfect mapping assertion [10]). Let B = hO, S, Mi be an OBDA system specification. A mapping assertion Q q is a perfect mapping assertion if for every database instance D, we have ans(q, B, D) = ans(Q, D). Each cached rewriting and unfolding is a perfect mapping assertion. In fact, caching is one of the primary methods for obtaining perfect mapping assertions [10]. 3 Applying Perfect Mapping Assertions As part of their full OBDA rewriting algorithm, Di Pinto et al. [10] present the algorithm ReplaceSubqueryR for applying a perfect mapping assertion before the regular rewriting procedure starts. Using the notion of restricted homomorphisms, they give an exact definition of when a perfect mapping assertion can be applied. Given a conjunctive query q, ReplaceSubqueryR goes through the perfect mapping assertions in the order specified by some heuristic, modifying q whenever the perfect mapping assertion is applicable. The order of application is significant, since the application of one perfect mapping assertion can prohibit the subsequent application of another. Finding the optimal combination of perfect mapping assertions to apply is NP-hard [10]. Di Pinto et al. use a greedy strategy. The heuristic for this strategy is the number of atoms in the heads of the perfect mapping assertions. Our goal is to define an improved heuristic for the order of application of perfect mapping assertions. The following example, based on one of the challenges faced by the Optique project1 , shows how we create perfect mapping assertions. Example 4. Let company(name, owner, manager, accountant) be a database table. We define the concepts Company and Owner, M anager and Accountant, and the roles hasOwner, hasM anager, and hasAccountant. We define the TBox,     ∃hasOwner v Company,  T = ∃hasM anager v Company, ,   ∃hasAccountant v Company   and mapping assertions ∃y, z, w.company(x, y, z, w)    Company(x)     ∃z, w.company(x, y, z, w)  hasOwner(x, y)   M= .   ∃y, w.company(x, y, z, w) hasM anager(x, z)     ∃y, z.company(x, y, z, w)   hasAccountant(x, w) We now look at the query q(x) = Company(x). The ontology rewriting of Company(x) is q 0 (x) = Company(x) ∨ ∃v.hasOwner(x, y) ∨ ∃v.hasM anager(x, y) ∨ ∃v.hasAccountant(x, y). When unfolding q 0 (x) we get the UCQ ∃y, z, w.company(x, y, z, w) ∨ ∃v, z, w.company(x, v, z, w) ∨ ∃y, v, w.company(x, y, v, w) ∨ ∃y, z, v.company(x, y, z, v), which is obviously equivalent to ∃y, z, w.company(x, y, z, w). We cache this rewriting by saving the perfect mapping assertion ∃y, z, w.company(x, y, z, w) Company(x). Example 4 illustrates one situation where perfect mapping assertions are useful. An alternative way of dealing with his particular example is by modifying the set M of mapping assertions according to the TBox [13], and then to optimise it [9, 13]. This approach is not as general as the perfect mapping assertion approach, because it only works when there are redundancies in the mapping assertions. The perfect mapping assertions can represent optimisations that are only valid in special cases. 1 http://optique-project.eu/ 4 Query Measure In order to evaluate the quality of a perfect mapping assertion, we need some way of measuring a UCQ. We will use this measure to quantify the amount of optimization that has gone into creating a perfect mapping assertion. There are several ways to measure the size of a conjunctive query. For ease of exposition, we will use the number of atoms in the query in our examples, although the number of variables is usually more relevant to the exact cost of optimising a query, since query optimisation amounts to finding homomorphisms between queries. If the number of variables is approximately linear in the number of atoms, the results of using either will be similar. Furthermore, the number of atoms in each conjunction of the unfolding of a conjunctive query can be approximated from the number of mapping assertions that mentions conjunct. On the other hand, finding the number of variables requires an analysis of each mapping assertion. The size of a UCQ is harder to describe with a single number, because UCQs have two dimensions: the conjunctions and the conjuncts in them. Definition 5 (Measure of UCQs). Let Q = CQ1 ∨ . . . ∨ CQk be a UCQ, size(CQi ) the size (e.g. number of atoms) of CQi , and f a weighting function. Define g(x) = x if f is at most linear, and g = f −1 otherwise. The measure of Q is k ! X SQ = g f (size(CQi )) . i=1 The function f defines how SQ depends on the size of the conjunctive queries, the function g makes SQ at most linear in the sum of the sizes. Our choice of g will make SQ of the order O(k · maxi [size(CQi )]), where, and k is the number of conjunctive queries in the UCQ. We choose f according to the following observations. – f constant: We disregard the size of disjuncts. The cost of performing optimizations is assumed to be insignificant compared to the cost of rewriting, unfolding and storing the query. – f linear: The cost of optimization is assumed to be on the same order as the cost of rewriting, unfolding and storing the query. – f polynomial or exponential: The cost of optimization is assumed to be more important than the cost of rewriting, unfolding and storing the query. The above list is explains how different definitions of f affects the measure SQ . It gives a strong indication of how we should define f , depending on what we want to use the measure SQ for. In Sections 5 and 6, we will use linear f . We get back to the choice of f briefly in Section 7. 5 Maximal expansion Having defined a measure for UCQs, we are now in a position to assign a value to the body Q of a perfect mapping assertion Q q. Such an analysis gives us some information about how much we stand to gain by applying this perfect mapping assertion. We can, however, get an even better picture by looking at the process of rewriting q. Doing this, we can get a measure of how complex q really is in the relevant OBDA system. Definition 6 (Maximal expansion). Let q be a query, B = hO, S, Mi an OBDA system specification, and R an ontology rewriting algorithm. The maximal expansion of q over B and R, denoted me(q, B, R), is the unoptimised ontology rewriting and unfolding of q over B using R. We write me(q, B) when the choice of R is clear from the context. When the OBDA system is also understood from the context, we let Sme(q) denote the measure of me(q, B). The following example shows how we calculate Sme(q) . Example 7. We look at an OBDA system specification with TBox axioms A v ∃R and S v T , and the mapping    QA1 A, QA2 A, QA3 A,      Q R, QR2 R, QR3 R, QR4 R, R1 M=   QS1 S, QS2 S,       QT1 T, QT2 T, QT3 T and rewrite the query q(x, y) = ∃z.R(x, z) ∧ T (x, y). The ontology rewriting of q according to rewriting algorithm R is q 0 (x, y) = [∃z.R(x, z) ∧ T (x, y)] ∨ [∃z.R(x, z) ∧ S(x, y)] ∨ [A(x) ∧ T (x, y)] ∨ [A(x) ∧ S(x, y)] We choose f (x) = g(x) = x, let size(CQ) be the number of atoms in CQ, and calculate Sme(q) . If each QXn is a query without joins, then the resulting maximal expansion is a UCQ with joins of size 2. The first disjunct in q 0 is unfolded into 12 conjunctive queries of size 2, since there are four queries mapped to R and 3 to T . We do similar calculations with each disjunct of q 0 , and end up with a total of 35 conjunctive queries, each of size 2. Therefore Sme(q) = 70. In Example 7, we assumed that there were no conflicts between mapping assertions during the unfolding. Such conflicts can arise when there the mapping assertions contain constants in place of some variables. In this case, only those mapping assertions that have constants replacing the same variables can create conflict, and this will usually only be the case for a very few combinations of assertions. If constants are common in the used mapping assertions, then greater care must be taken when using approximations of Sme(q) . 6 Heuristic With a measure for UCQs and the notion of maximal expansions, we are now in a position to expand on the heuristic suggested by Di Pinto et al. [10]. We design a heuristic where large values are better. In the following we assume a fixed OBDA system and ontology rewriting algorithm. For a perfect mapping assertion Q q, we calculate the measures Sq , SQ and Sme(q) . Note that Sq = size(q), since q is a conjunctive query.During both ontology rewriting and unfolding, the size of a conjunctive query can grow exponentially. In both cases there is generally multiple options for dealing with every conjunct, and the result is a very large UCQ. For this reason, we will use log SQ and log Sme(q) , since they will be approximately proportional to Sq . We also calculate the ratio log Sme(q) / log SQ , which tells us how much optimisation has gone into the creation of the perfect mapping assertion. We use this optimization ratio to adjust other measures of the value of the perfect mapping assertion. If Sq is large for a perfect mapping assertion Q q, that is, the size of q is large, then applying it cuts a large portion of the original query. There is, however, a risk that this portion could be rewritten directly without much difficulty. In order to compensate for this effect, we scale Sq by the optimisation ratio log Sme(q) / log SQ . A perfect mapping assertion is also valuable if the corresponding Sme(q) is large, no matter the size of Sq . Again, we scale log Sme(q) by the optimisation ratio log Sme(q) / log SQ . Assigning a weight parameter to each of these compound measures, we get the heuristic (log Sme(q) )2 Sq · log Sme(q) a +b + cSq log SQ log SQ We assume Sq , log SQ and log Sme(q) have the same units, since they are all approximately linear in the size of the head q of the perfect mapping assertion Q q. Then, the unit of each term in the above sum is the same, and also approximately linear in the size(q). This means that the ratio between the terms will be fairly stable in respect to typical query size, and as such, the tuning of the parameters a, b, and c will not be very sensitive to the typical query size. Example 8. We look at an ontology with an empty TBox, and the four roles R1 , R2 , R3 , and R4 . We define the set of mapping assertions M = {Qji Ri | j = 1, . . . , i2 }, where i = 1, . . . , 4. We let size(CQ) be the number of atoms in CQ, and assume that each Qji is atomic. We let f (x) = x, and calculate Sq and Sme(q) for different conjunctions of Ri , with each role occurring at most once in each query. The results are shown in Table 1. We see that Sme(q) gives us a much finer division than Sq , which only divides the queries into 3 groups. Given the query R1 (x1 , x2 ) ∧ R2 (x2 , x3 ) ∧ R3 (x3 , x4 ) ∧ R4 (x4 , x5 ), and the perfect mapping assertions QA (y1 , y2 , y3 , y4 ) R1 (y1 , y2 ) ∧ R2 (y2 , y3 ) ∧ R3 (y3 , y4 ) QB (z1 , z2 , z3 ) R3 (z1 , z2 ) ∧ R4 (z3 , z4 ), there is a good chance that our best option is to apply the second, shorter perfect mapping assertion, because the maximal expansion of R3 (x3 , x4 ) ∧ R4 (x4 , x5 ) is so large. In order to be more precise, we would need to obtain the measures SQA and SQB . R1 R1 R1 R2 R1 R1 R1 R2 R2 R3 R2 R2 R3 R3 q R1 R2 R3 R4 R2 R3 R4 R3 R4 R4 R3 R4 R4 R4 Sq 1 1 1 1 2 2 2 2 2 2 3 3 3 3 Sme(q) 1 4 9 16 8 18 32 72 128 288 108 192 432 1728 Table 1. The measure Sme(q) can be used to fine order the groups defined by Sq , but there are also pairs of queries where Sq and Sme(q) disagree on ordering. 7 Discussion and Conclusion Di Pinto et al. [10] have found that using cached rewritings can significantly reduce the cost of answering queries. Their well performing algorithm, ReplaceSubqueryR, relies on a heuristic for determining the order in which to apply cached rewritings. Di Pinto et al. have chosen a simple heuristic based on the sizes of the heads of the cached rewritings. This is a good choice if all subqueries have approximately the same complexity when seen together with the TBox and the mapping. If some ontology predicates are easier to rewrite and unfold than others, then the measures of the maximal expansion Sme(q) and the optimised rewriting SQ become relevant to the choice of cached rewriting. We suggest the heuristic (log Sme(q) )2 Sq · log Sme(q) a +b + cSq log SQ log SQ If we let a = b = 0, c 6= 0, and define size(CQ) to be the number of atoms in CQ, then our heuristic becomes the same as the one used in [10]. By tweaking f in Definition 5, and the parameters a, b and c, we can shift importance away from Sq , and towards maximal expansion Sme(q) and optimisation SQ . The ideal setup will depend on the OBDA system specification, and should be decided experimentally. For the heuristic presented here to be more accurate than the one in [10], the OBDA system specification needs to be uneven in terms of how each ontology predicate is rewritten. If the mapping has very many assertions for some ontology predicates and few for others, or if some ontology predicates occur often in the TBox while others don’t, then the count of predicates in a query need not reflect how it behaves during rewriting. Also, if the mapping is very large, the optimisation ratios are more likely to be significant, since the unfolding and subsequent optimisation will be a large part of query rewriting. Since Sq , Sme(q) , and SQ are relatively cheap to calculate or approximate during rewriting, and cheap to store in a cache, we claim our suggested heuristic, in many settings, will outperform the simpler heuristic provided by [10]. Even when the simple heuristic performs very well, we can let a = 0 and b  c, so that Sme(q) is used for a fine splitting as illustrated by Example 8. 8 Future Work We plan to continue this work by experimentally verifying our results. In particular, we would like to compare the different weighting and scaling functions discussed here on real-world datasets. Another line of enquiry would be to see how well these heuristics can substitute for e.g. techniques to eliminate mapping redundancy, as discussed in Section 3. References 1. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995. 2. Natalia Antonioli, Francesco Castanò, Cristina Civili, Spartaco Coletta, Stefano Grossi, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Domenico Fabio Savo, and Emanuela Virardi. Ontology-based data access: the experience at the Italian Department of Treasury. volume 1017, pages 9–16, 2013. 3. D. Calvanese, M. Giese, P. Haase, I. Horrocks, T. Hubauer, Y. Ioannidis, E. Jiménez-Ruiz, E. Kharlamov, H. Kllapi, J. Klüwer, M. Koubarakis, S. Lamparter, R. Möller, C. Neuenstadt, T. Nordtveit, Ö. Özcep, M. Rodriguez- Muro, M. Roshchin, F. Savo, M. Schmidt, A. Soylu, A. Waaler, and D. Zheleznyakov. Optique: OBDA solution for big data. In Revised Selected Papers of ESWC 2013 Satellite Events, volume 7955 of LNCS, pages 293–295. Springer, 2013. 4. Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reasoning, 39(3):385–429, 2007. 5. Cristina Civili, Marco Console, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Lorenzo Lepore, Riccardo Mancini, Antonella Poggi, Riccardo Rosati, Marco Ruzzi, Valerio Santarelli, and Domenico Fabio Savo. MASTRO STUDIO: Managing ontology-based data access applications. PVLDB, 6:1314–1317, 2013. 6. Georg Gottlob, Nicola Leone, and Francesco Scarcello. Hypertree decompositions and tractable queries. Journal of Computer and System Sciences, 64(3):579–627, 2002. 7. Georg Gottlob, Reinhard Pichler, and Fang Wei. Tractable database design through bounded treewidth. In Proceedings of the 25th ACM SIGACT-SIGMOD- SIGART Symposium on Principles of Database Systems (PODS’06), pages 124– 133. ACM, 2006. 8. Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001. 9. Domenico Lembo, José Mora, Riccardo Rosati, Domenico Fabio Savo, and Evgenij Thorstensen. Towards mapping analysis in ontology-based data access. In Roman Kontchakov and Marie-Laure Mugnier, editors, Web Reasoning and Rule Systems - 8th International Conference, RR 2014, Athens, Greece, September 15-17, 2014. Proceedings, volume 8741 of Lecture Notes in Computer Science, pages 108–123. Springer, 2014. 10. Floriana Di Pinto, Domenico Lembo, Maurizio Lenzerini, Riccardo Mancini, Antonella Poggi, Riccardo Rosati, Marco Ruzzi, and Domenico Fabio Savo. Optimizing query rewriting in ontology-based data access. In Giovanna Guerrini and Norman W. Paton, editors, Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pages 561–572. ACM, 2013. 11. Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Linking data to ontologies. In Stefano Spaccapietra, editor, Journal on Data Semantics X, volume 4900 of Lecture Notes in Computer Science, pages 133–173. Springer, 2008. 12. Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Linking data to ontologies. J. Data Semantics, 10:133–173, 2008. 13. Mariano Rodriguez-Muro, Roman Kontchakov, and Michael Zakharyaschev. Ontology-based data access: Ontop of databases. In Harith Alani, Lalana Kagal, Achille Fokoue, Paul T. Groth, Chris Biemann, Josiane Xavier Parreira, Lora Aroyo, Natasha F. Noy, Chris Welty, and Krzysztof Janowicz, editors, The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I, pages 558–573. Springer, 2013.