Reverse Engineering of Data Services (Extended Abstract) Gianluca Cima1 , Maurizio Lenzerini1 , Antonella Poggi1,2 Sapienza Università di Roma cima, lenzerini, poggi@diag.uniroma1.it Abstract. We study the problem of designing reverse engineering techniques for associating semantic descriptions to existing data services. We base our proposal on the Ontology-Based Data Access paradigm, where a domain ontology is used to provide a semantic layer mapped to the data sources of an organization. The basic idea is to perform the reverse engineering of a data service, expressed a query over the data sources, by deriving a query over the ontology that explains the semantics of the data service in terms of the element of the ontology. We illustrate a formal framework for this problem, based on the notion of source-to- ontology rewriting, which comes in three variants, called sound, complete and perfect, respectively. We present a thorough complexity analysis of two compu- tational problems, namely verification (checking whether a query is a source- to-ontology rewriting of a given data service), and computation (computing a source-to-ontology rewriting of a data service). Introduction The architecture of many modern Information Systems is based on data services [13], i.e., services deployed on top of data stores, other services, and/or applications to encap- sulate a wide range of data-centric operations. In order to realize the promises of data services, in particular to foster their reuse, it is of vital importance to well document and clearly specify their semantics. While most current techniques manually associate APIs (Application Programming Interface) to data services, and describe their intended meaning with ad-hoc methods, often using natural language or complex metadata [5], we propose a new approach, whose goal is to automatically associate formal semantic descriptions to data services. We base our proposal on the Ontology-Based Data Access (OBDA) paradigm [11]. An OBDA specification consists of an ontology expressed in Description Logic (DL) [2], the schema of the data sources forming the information system, and a mapping between the source schema and the ontology. The ontology is a formal representation of the underlying domain, and the mapping specifies the re- lationship between the data at the sources and the concepts in the ontology. With the OBDA specification at hand, we pursue the idea of expressing the semantics of data services using the elements of the domain ontology, which is assumed to be familiar to the consumer of data services. Copyright © 2019 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy. But how can we automatically produce a semantic characterization of a data service, having an OBDA specification available? The method we propose is to exploit a new reasoning task over the OBDA specification, that works as follows: we express the data service in terms of a query over the sources, and we aim at automatically deriving the query over the ontology that best describes the data service, given the mapping. The following example illustrates this idea. Example 1. Let Σ = hO, S, Mi be as follows: – O = { ErasmusStudent v Student, MathStudent v Student } – S = { s1 , s2 , s3 , s4 } {(x) | s1 (x)} → {(x) | Student(x)} {(x) | s2 (x)} → {(x) | Student(x)} – M= {(x) | s1 (x), s3 (x, y)} → {(x) | ErasmusStudent(x)} {(x) | s1 (x), s4 (x, y)} → {(x) | MathStudent(x)} and consider the data service expressed as the source query qS (x) = {(x) | s1 (x) ∨ s3 (x)}. It is easy to see that the query that best describes qS in terms of O is qO (x) = {(x) | Student(x)}. Note that most of (if not all) the literature about managing data sources through an ontology [9,12] deals with user queries expressed over the ontology, and studies the problem of finding an ontology-to-source rewriting, i.e., a query over the source schema that, once executed over the data, provides the answers to the original query. Here, the problem is reversed, because we start with a source query and we aim at deriving a corresponding query over the ontology, called a source-to-ontology rewriting. Thus, we deal with a sort of reverse engineering problem, which is novel in the investigation of both OBDA and data integration. The notions introduced in this paper are relevant in a plethora of scenarios. For the sake of brevity, we mention only two of them. Following the ideas in [6,7], it can be shown that our notions of source-to-ontology rewriting can be used to provide the se- mantics of open datasets and open APIs published by organizations, which is a crucial aspect for unchaining all the potentials of open data. In [10], the concept of realization of source queries, corresponding to one of the notions studied here, is used for check- ing whether the mapping provides the right coverage for expressing the relevant data services at the ontology level. The contributions provided by this paper can be summarized as follows. We pro- pose a formal framework for the problem of semantically characterizing a data service through an ontology (Section 3). We introduce the notions of perfect, sound, and com- plete source-to-ontology rewritings, and we define two basic reasoning tasks, namely verification and computation. The former checks whether a given query is a source-to- ontology rewriting of a data service, whereas the latter computes one such rewriting. We show that, although the ideal notion is the one of perfect source-to-ontology rewriting, there are cases where, with the given mapping, no query over the ontology can pre- cisely characterize the data service at hand. Thus, we introduce maximally sound and minimally complete source-to-ontology rewritings, which intuitively aim at approxi- mating the perfect rewriting of a data service at best, with the goal of either precision (sound rewriting), or recall (complete rewriting). We study the verification and the computation problem for complete (Section 4) and sound (Section 5) source-to-ontology rewritings in one of the most popular OBDA set- ting considered in the literature, namely where the ontology language is DL-LiteR [4,3], each mapping assertion maps a conjunctive query (CQ) over the source to a CQ over the ontology, and both the data service and the source-to-ontology rewriting are expressed as unions of CQs. For complete source-to-ontology rewritings we present algorithms for verification and computation, and characterize the complexity of both tasks. For the case of sound rewritings, we do the same for verification, and we precisely determine the cases where a maximally sound rewriting is not guaranteed to exist. For the lack of space, we do not tackle here the case of perfect rewritings, but we point out that results for complete and sound rewritings can be combined to study this case. To the best of our knowledge, the problem studied in this paper has been (partially) addressed only in [6,10]. The former provides upper bound complexity results for com- plete rewritings, and the latter focuses on both DL-LiteR and the EL family of ontology languages, and studies perfect rewritings only, under a slightly different semantics with respect to the one proposed here. This paper is an extended abstract of [8]. Hence, while we assume basic knowl- edge about databases [1] and Description Logics (DL) [2], for specific concepts and notations, we refer to the Preliminaries section of [8]. Framework We implicitly refer to an OBDA specification Σ = hO, S, Mi. Intuitively, given a data service expressed as a query qS over S, we aim at finding the query over O that precisely characterizes qS w.r.t. Σ. Since the evaluation of queries over O is based on certain answers, this means that we aim at finding a query over O whose certain answers w.r.t. Σ and D exactly capture the answers of qS w.r.t. D for every S-database D. So, we are naturally led to the notion of perfect source-to-ontology rewriting. In what follows, qS refers to a query over S, and qO to a query over O of the same arity. Definition 1. qO is a perfect S-to-O Σ-rewriting of qS if for every S-database D, M odD (Σ) 6= ∅ implies qSD = certD qO ,Σ . As noted in [6,10] and illustrated in the next example, a perfect source-to-ontology rewriting of qS may not exist. Example 2. Refer to Example 1, and consider the data service exprresed as the source query qS (x) = {(x) | s1 (x)}. By inspecting the mappings, one can see that, since the certain answers of Student include the values stored both in s1 and in s3 , such concept is too general for exactly characterizing qS . On the other hand, it can also seen that both ErasmusStudent and MathStudent are too specific, and therefore we can conclude that no perfect S-to-O Σ-rewriting of qS exists. In order to cope with the situations illustrated in the example, we introduce the no- tions of sound and complete source-to-ontology rewritings, which, intuitively, provide sound and complete approximations of perfect rewritings, respectively. Definition 2. qO is a sound (respectively, complete) S-to-O Σ-rewriting of qS if for every S-database D, M odD (Σ) 6= ∅ implies cert D D D D qO ,Σ ⊆ qS (resp., qS ⊆ cert qO ,Σ ). Example 3. We refer to Example 2, and observe that {(x) | ErasmusStudent(x) ∧ MathStudent(x)} is a sound S-to-O Σ-rewriting of qS , whereas {(x) | Student(x) } is a complete S-to-O Σ-rewriting of qS . Obviously, qO is a perfect S-to-O Σ-rewriting of qS if and only if qO is both a sound, and a complete S-to-O Σ-rewriting of qS . There are also interesting relation- ships between the notions of S-to-O Σ-rewritings introduced here and the usual notions of rewritings studied in OBDA. Proposition 1. qO is a complete S-to-O Σ-rewriting of qS if and only if qS is an O-to- S Σ-rewriting of qO . If qS is a perfect O-to-S Σ-rewriting of qO , then qO is a perfect S-to-O Σ-rewriting of qS . It is easy to see that different sound or complete source-to-ontology rewritings of qS may exist, and therefore it is reasonable to look for the “best” approximations of qS , at least relative to a certain class of queries. Definition 3. qO ∈ L is an L-maximally sound (respectively, L-minimally complete) S-to-O Σ-rewriting of qS if qO is a sound (resp. complete) S-to-O Σ-rewriting of qS and no q 0 ∈ L exists such that (i) q 0 is a sound (resp., complete) S-to-O Σ-rewriting of qS , (ii) cert qO ,Σ v cert q0 ,Σ (resp., cert q0 ,Σ v cert qO ,Σ ), and (iii) there exists an S-database D s.t. cert D D D D qO ,Σ ⊂ cert q 0 ,Σ (resp., cert q 0 ,Σ ⊂ cert qO ,Σ ). Example 4. We refer again to Example 2, and observe that while {(x) | Student(x) } is the minimally complete S-to-O Σ-rewriting of qS in the class of UCQs, both {(x) | ErasmusStudent(x)}, and {(x) | MathStudent(x)} are maximally sound S-to-O Σ- rewritings of qS in the class of CQs, while {(x) | ErasmusStudent(x)∨MathStudent(x)} is so in the class of UCQs. Given the general framework presented so far, it is natural to consider the following two basic computational problems, for classes LS and LO of queries: – Verification: given Σ = hO, S, Mi, qS ∈ LS over S and qO ∈ LO over O of the same arity as qS , verify whether qO is a sound (resp., complete, perfect) S-to-O Σ-rewritings of qS . – Computation: given Σ = hO, S, Mi, and qS ∈ LS over S compute any LO - maximally sound (resp., LO -minimally complete, perfect) S-to-O Σ-rewriting of qS , if it exists. In the rest of this paper, if not otherwise stated, we refer to the most common setting studied in OBDA, i.e., where (i) the ontology is expressed in DL-LiteR , (ii) S is a relational database schema without integrity constraints, and (iii) both LO and LS denote the class of UCQs. Interestingly, in this case, we have the following. Proposition 2. If q1 and q2 are two UCQ-minimally complete, or UCQ-maximally sound S-to-O Σ-rewritings of qS , then they are equivalent w.r.t. Σ. Complete source-to-ontology rewritings In this section, we study both the verification and the computation problem for complete source-to-ontology rewritings. Verification. Suppose we want to check whether qO is a complete S-to-O Σ-rewriting of qS . Obviously, if qS is contained in PerfRefqO ,Σ , then for every S-database D consis- tent with Σ, we have that qSD ⊆ cert D qO ,Σ and therefore the answer is positive. However, if qS is not contained in PerfRefqO ,Σ , it might be that qO is still a complete S-to-O Σ- rewriting of qS , in the case where the non-emptiness of qS in D reveals the presence of inconsistencies. From this observation, we derive the following characterization. Proposition 3. qO is a complete S-to-O Σ-rewriting of qS (t) if and only if qS v PerfRefqO ,Σ ∨ PerfRefV t ,Σ . O The following theorem characterizes the complexity of verification for complete source-to-ontology rewritings. Theorem 1. Verification for complete source-to-ontology rewritings is NP-complete. Computation. Our algorithm for the computation of minimally complete source-to- ontology rewritings is below. Algorithm 1 Input: Σ = hO, S, Mi, qS (t) = qS1 (t) ∨ . . . ∨ qSn (t) over S Output: qO (t) over O begin return qO = {t | M(qS1 ) ∧ >(t) ∨ . . . ∨ M(qSn ) ∧ >(t)} end Intuitively, the algorithm computes the output query as union of CQs obtained by simply applying the mapping M to each CQ qSi in qS , using > to bind the variables that are not involved in the application of M to qSi . Theorem 2. Algorithm 1 computes the UCQ-minimally complete S-to-O Σ-rewriting of qS . The algorithm shows that the UCQ-minimally complete S-to-O Σ-rewriting of qS always exists. Moreover, if qS is a CQ, then it is a CQ. Finally, we observe that the complexity of Algorithm 1 does not depend on O, is in E XP T IME in σ(M), and in PT IME in σ(qS ), where we use σ(x) to denotes the size of x. It can be shown that a PT IME algorithm for computing the UCQ-minimally complete source-to-ontology rewriting would imply a PT IME algorithm for CQ containment. So, assuming PT IME 6= NP, computation cannot be solved in PT IME. Sound source-to-ontology rewritings We now turn to verification and computation of sound source-to-ontology rewritings. Verification. Notice that, since for an S-database D consistent with Σ, PerfRefD qO ,Σ computes exactly cert D qO ,Σ , checking whether qO is a sound S-to-O Σ-rewriting of qS means checking whether for all S-databases D, either M odD (Σ) = ∅ or PerfRefD qO ,Σ ⊆ qSD . This observation leads to the following characterization. Proposition 4. qO (t) is a sound S-to-O Σ-rewriting of qS if and only if PerfRefqO ,Σ v qS ∨ PerfRefVOt ,Σ . The following theorem characterizes the complexity of the verification problem for sound source-to-ontology rewritings. Theorem 3. Verification for sound source-to-ontology rewritings is Π2p -complete. Computation. We address the problem of computing UCQ-maximally sound source- to-ontology rewritings. Our main result is that there are many cases where a UCQ- maximally sound source-to-ontology rewriting of a query is not guaranteed to exist. To illustrate the result, we introduce a specific setting, that we call restricted, obtained from the general one by: (i) limiting the ontology language to DL-LiteRDFS , (ii) limiting the mapping to pure GAV, and (iii) limiting qS to UCQJFEs. The following shows that, surprisingly, as soon as we try to extend such setting, we lose the guarantee of the existence of source-to-ontology rewritings that are maximally sound. Theorem 4. UCQ-maximally sound source-to-ontology rewritings of a query qS may not exist if we extend the restricted setting with each of the following features: (i) dis- jointness axioms in the ontology; (ii) inclusion axioms with ∃R as right-hand side in the ontology; (iii) LAV mapping assertions, even without joins involving existential vari- ables in the right-hand side; (iv) non-pure GAV mapping assertions; (v) the fragment of UCQ for qS beyond UCQJFE. Proof sketch. We present the proof for case 5. Consider the OBDA specification Σ = hO, S, Mi, where O has no axiom, and M is the following pure GAV mapping: {(x, y) | s1 (y, y) ∧ s3 (x)} → {(x, y) | R(x, y)} {(x, y) | s1 (x, y)} → {(x, y) | R(x, y)} 0 and let qS be the query {() | s1 (x, y) ∧ s1 (y, z)}. Observe that qO = {() | R(x, y) ∧ R(y, z)} is the complete S-to-O Σ-rewriting of qS , but is not sound, because the query qs0 = {() | s1 (x, y) ∧ s1 (z, z) ∧ s3 (y)} is a disjunct of PerfRefqO 0 ,Σ such that q 0 S 6v qS . Conversely, one can verify that each of the following queries is a sound S-to-O Σ- rewriting of qS : – q0 = {() | R(x, y) ∧ R(y, y)}, – q1 = {() | R(x, y) ∧ R(y, z1 ) ∧ R(z1 , y)}, – q2 = {() | R(x, y) ∧ R(y, z1 ) ∧ R(z1 , z2 ) ∧ R(z2 , y)}, – ... More precisely, if we define qn to be {() | R(x, y) ∧ R(y, z1 ) ∧ R(z1 , z2 ) ∧ . . . ∧ R(zn−1 , zn )∧R(zn , y)}, for n ≥ 2,, then it can be shown that every qn is a sound S-to- O Σ-rewriting of qS , and for no pair (i, j), with i 6= j, i, j ≥ 0, cert qi ,Σ v cert qj ,Σ . It follows that the infinite union q of q0 , q1 , and all qn ’s can be shown to be the maximally sound S-to-O Σ-rewriting of qS in the class of positive queries. We observe that, despite its limitations, the expressive power of the restricted setting is sufficient for several meaningful applications. Indeed, several popular ontologies are expressible in DL-LiteRDFS , and the form of pure GAV mapping is exactly the one originally defined in the literature of data integration. Moreover, UCQJFEs capture data services expressible in the famous SPJ (Select, Project, Join) fragment of Relational Algebra, with the only limitation of forbidding projection outside join, which makes them suitable for all tasks related to source profiling. Interestingly, it can be shown that in the restricted setting maximally sound source-to-ontology rewritings always exist. Conclusion We have presented a framework for semantically characterizing data services through ontologies, and carried out a comprehensive analysis for the most common OBDA set- ting. We plan to continue this work along several directions. For example, in the un- restricted setting, we aim at (i) studying the problem of checking for the existence of a UCQ-maximally sound source-to-ontology rewriting of a query, and computing it in case it exists, and (ii) singling out the minimal class LO of queries that guarantees the existence of an LO -maximally sound source-to-ontology rewritings. Furthermore, we will extend our analysis to OBDA settings based on other DLs as ontology languages. Acknowlegments. Work supported by MIUR under the SIR project “MODEUS” – grant n. RBSI14TQHQ, and by Sapienza research project “PRE-O-PRE”. References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley Publ. Co., 1995. 2. F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge Uni- versity Press, 2003. 3. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385–429, 2007. 4. D. Calvanese, G. De Giacomo, M. Lenzerini, R. Rosati, and G. Vetere. DL-Lite: Practical reasoning for rich DLs. In Proc. of DL 2004, volume 104 of CEUR, ceur-ws.org, 2004. 5. M. J. Carey, N. Onose, and M. Petropoulos. Data services. Comm. of the ACM, 55(6):86–97, 2012. 6. G. Cima. Preliminary results on ontology-based open data publishing. In Proc. of DL 2017, volume 1879 of CEUR, ceur-ws.org, 2017. 7. G. Cima, M. Lenzerini, and A. Poggi. Semantic technology for open data publishing. In Proc. of WIMS 2017, 2017. 8. G. Cima, M. Lenzerini, and A. Poggi. Semantic characterization of data services through ontologies. In Proc. of IJCAI 2019, 2019. 9. M. Lenzerini. Managing data through the lens of an ontology. AI Magazine, 39(2):65–74, 2018. 10. C. Lutz, J. Marti, and L. Sabellek. Query expressibility and verification in ontology-based data access. In Proc. of KR 2018, pages 389–398, 2018. 11. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. on Data Semantics, X:133–173, 2008. 12. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Za- kharyaschev. Ontology-based data access: A survey. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., pages 5511–5519, 2018. 13. Z. Zheng, J. Zhu, and M. R. Lyu. Service-generated big data and big data-as-a-service: An overview. In Proc. of the 2013 IEEE Int. Conf. on Big Data, pages 403–410, 2013.