Reverse Engineering of Data Services
                       (Extended Abstract)

               Gianluca Cima1 , Maurizio Lenzerini1 , Antonella Poggi1,2

                                Sapienza Università di Roma
                           cima, lenzerini, poggi@diag.uniroma1.it


       Abstract. We study the problem of designing reverse engineering techniques for
       associating semantic descriptions to existing data services. We base our proposal
       on the Ontology-Based Data Access paradigm, where a domain ontology is used
       to provide a semantic layer mapped to the data sources of an organization. The
       basic idea is to perform the reverse engineering of a data service, expressed a
       query over the data sources, by deriving a query over the ontology that explains
       the semantics of the data service in terms of the element of the ontology. We
       illustrate a formal framework for this problem, based on the notion of source-to-
       ontology rewriting, which comes in three variants, called sound, complete and
       perfect, respectively. We present a thorough complexity analysis of two compu-
       tational problems, namely verification (checking whether a query is a source-
       to-ontology rewriting of a given data service), and computation (computing a
       source-to-ontology rewriting of a data service).


Introduction
The architecture of many modern Information Systems is based on data services [13],
i.e., services deployed on top of data stores, other services, and/or applications to encap-
sulate a wide range of data-centric operations. In order to realize the promises of data
services, in particular to foster their reuse, it is of vital importance to well document
and clearly specify their semantics. While most current techniques manually associate
APIs (Application Programming Interface) to data services, and describe their intended
meaning with ad-hoc methods, often using natural language or complex metadata [5],
we propose a new approach, whose goal is to automatically associate formal semantic
descriptions to data services. We base our proposal on the Ontology-Based Data Access
(OBDA) paradigm [11]. An OBDA specification consists of an ontology expressed in
Description Logic (DL) [2], the schema of the data sources forming the information
system, and a mapping between the source schema and the ontology. The ontology is
a formal representation of the underlying domain, and the mapping specifies the re-
lationship between the data at the sources and the concepts in the ontology. With the
OBDA specification at hand, we pursue the idea of expressing the semantics of data
services using the elements of the domain ontology, which is assumed to be familiar to
the consumer of data services.
  Copyright © 2019 for the individual papers by the papers’ authors. Copying permitted for
  private and academic purposes. This volume is published and copyrighted by its editors. SEBD
  2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
    But how can we automatically produce a semantic characterization of a data service,
having an OBDA specification available? The method we propose is to exploit a new
reasoning task over the OBDA specification, that works as follows: we express the data
service in terms of a query over the sources, and we aim at automatically deriving the
query over the ontology that best describes the data service, given the mapping. The
following example illustrates this idea.

Example 1. Let Σ = hO, S, Mi be as follows:
 – O = { ErasmusStudent v Student, MathStudent v Student }
 – S = { s1 , s2 , s3 , s4 }
        {(x) | s1 (x)} → {(x) | Student(x)}
        {(x) | s2 (x)} → {(x) | Student(x)}
 – M=
        {(x) | s1 (x), s3 (x, y)} → {(x) | ErasmusStudent(x)}
        {(x) | s1 (x), s4 (x, y)} → {(x) | MathStudent(x)}
and consider the data service expressed as the source query qS (x) = {(x) | s1 (x) ∨
s3 (x)}. It is easy to see that the query that best describes qS in terms of O is qO (x) =
{(x) | Student(x)}.

    Note that most of (if not all) the literature about managing data sources through
an ontology [9,12] deals with user queries expressed over the ontology, and studies the
problem of finding an ontology-to-source rewriting, i.e., a query over the source schema
that, once executed over the data, provides the answers to the original query. Here, the
problem is reversed, because we start with a source query and we aim at deriving a
corresponding query over the ontology, called a source-to-ontology rewriting. Thus, we
deal with a sort of reverse engineering problem, which is novel in the investigation of
both OBDA and data integration.
    The notions introduced in this paper are relevant in a plethora of scenarios. For the
sake of brevity, we mention only two of them. Following the ideas in [6,7], it can be
shown that our notions of source-to-ontology rewriting can be used to provide the se-
mantics of open datasets and open APIs published by organizations, which is a crucial
aspect for unchaining all the potentials of open data. In [10], the concept of realization
of source queries, corresponding to one of the notions studied here, is used for check-
ing whether the mapping provides the right coverage for expressing the relevant data
services at the ontology level.
    The contributions provided by this paper can be summarized as follows. We pro-
pose a formal framework for the problem of semantically characterizing a data service
through an ontology (Section 3). We introduce the notions of perfect, sound, and com-
plete source-to-ontology rewritings, and we define two basic reasoning tasks, namely
verification and computation. The former checks whether a given query is a source-to-
ontology rewriting of a data service, whereas the latter computes one such rewriting. We
show that, although the ideal notion is the one of perfect source-to-ontology rewriting,
there are cases where, with the given mapping, no query over the ontology can pre-
cisely characterize the data service at hand. Thus, we introduce maximally sound and
minimally complete source-to-ontology rewritings, which intuitively aim at approxi-
mating the perfect rewriting of a data service at best, with the goal of either precision
(sound rewriting), or recall (complete rewriting).
    We study the verification and the computation problem for complete (Section 4) and
sound (Section 5) source-to-ontology rewritings in one of the most popular OBDA set-
ting considered in the literature, namely where the ontology language is DL-LiteR [4,3],
each mapping assertion maps a conjunctive query (CQ) over the source to a CQ over the
ontology, and both the data service and the source-to-ontology rewriting are expressed
as unions of CQs. For complete source-to-ontology rewritings we present algorithms
for verification and computation, and characterize the complexity of both tasks. For the
case of sound rewritings, we do the same for verification, and we precisely determine
the cases where a maximally sound rewriting is not guaranteed to exist. For the lack of
space, we do not tackle here the case of perfect rewritings, but we point out that results
for complete and sound rewritings can be combined to study this case.
    To the best of our knowledge, the problem studied in this paper has been (partially)
addressed only in [6,10]. The former provides upper bound complexity results for com-
plete rewritings, and the latter focuses on both DL-LiteR and the EL family of ontology
languages, and studies perfect rewritings only, under a slightly different semantics with
respect to the one proposed here.
    This paper is an extended abstract of [8]. Hence, while we assume basic knowl-
edge about databases [1] and Description Logics (DL) [2], for specific concepts and
notations, we refer to the Preliminaries section of [8].

Framework

We implicitly refer to an OBDA specification Σ = hO, S, Mi. Intuitively, given a
data service expressed as a query qS over S, we aim at finding the query over O that
precisely characterizes qS w.r.t. Σ. Since the evaluation of queries over O is based
on certain answers, this means that we aim at finding a query over O whose certain
answers w.r.t. Σ and D exactly capture the answers of qS w.r.t. D for every S-database
D. So, we are naturally led to the notion of perfect source-to-ontology rewriting. In
what follows, qS refers to a query over S, and qO to a query over O of the same arity.

Definition 1. qO is a perfect S-to-O Σ-rewriting of qS if for every S-database D,
M odD (Σ) 6= ∅ implies qSD = certD
                                 qO ,Σ .

   As noted in [6,10] and illustrated in the next example, a perfect source-to-ontology
rewriting of qS may not exist.

Example 2. Refer to Example 1, and consider the data service exprresed as the source
query qS (x) = {(x) | s1 (x)}. By inspecting the mappings, one can see that, since the
certain answers of Student include the values stored both in s1 and in s3 , such concept
is too general for exactly characterizing qS . On the other hand, it can also seen that both
ErasmusStudent and MathStudent are too specific, and therefore we can conclude that
no perfect S-to-O Σ-rewriting of qS exists.

    In order to cope with the situations illustrated in the example, we introduce the no-
tions of sound and complete source-to-ontology rewritings, which, intuitively, provide
sound and complete approximations of perfect rewritings, respectively.
Definition 2. qO is a sound (respectively, complete) S-to-O Σ-rewriting of qS if for
every S-database D, M odD (Σ) 6= ∅ implies cert D         D         D        D
                                                qO ,Σ ⊆ qS (resp., qS ⊆ cert qO ,Σ ).

Example 3. We refer to Example 2, and observe that {(x) | ErasmusStudent(x) ∧
MathStudent(x)} is a sound S-to-O Σ-rewriting of qS , whereas {(x) | Student(x) } is
a complete S-to-O Σ-rewriting of qS .

    Obviously, qO is a perfect S-to-O Σ-rewriting of qS if and only if qO is both a
sound, and a complete S-to-O Σ-rewriting of qS . There are also interesting relation-
ships between the notions of S-to-O Σ-rewritings introduced here and the usual notions
of rewritings studied in OBDA.
Proposition 1. qO is a complete S-to-O Σ-rewriting of qS if and only if qS is an O-to-
S Σ-rewriting of qO . If qS is a perfect O-to-S Σ-rewriting of qO , then qO is a perfect
S-to-O Σ-rewriting of qS .
     It is easy to see that different sound or complete source-to-ontology rewritings of
qS may exist, and therefore it is reasonable to look for the “best” approximations of qS ,
at least relative to a certain class of queries.

Definition 3. qO ∈ L is an L-maximally sound (respectively, L-minimally complete)
S-to-O Σ-rewriting of qS if qO is a sound (resp. complete) S-to-O Σ-rewriting of qS
and no q 0 ∈ L exists such that (i) q 0 is a sound (resp., complete) S-to-O Σ-rewriting
of qS , (ii) cert qO ,Σ v cert q0 ,Σ (resp., cert q0 ,Σ v cert qO ,Σ ), and (iii) there exists an
S-database D s.t. cert D                D                 D             D
                         qO ,Σ ⊂ cert q 0 ,Σ (resp., cert q 0 ,Σ ⊂ cert qO ,Σ ).

Example 4. We refer again to Example 2, and observe that while {(x) | Student(x) }
is the minimally complete S-to-O Σ-rewriting of qS in the class of UCQs, both {(x) |
ErasmusStudent(x)}, and {(x) | MathStudent(x)} are maximally sound S-to-O Σ-
rewritings of qS in the class of CQs, while {(x) | ErasmusStudent(x)∨MathStudent(x)}
is so in the class of UCQs.

   Given the general framework presented so far, it is natural to consider the following
two basic computational problems, for classes LS and LO of queries:

 – Verification: given Σ = hO, S, Mi, qS ∈ LS over S and qO ∈ LO over O of the
   same arity as qS , verify whether qO is a sound (resp., complete, perfect) S-to-O
   Σ-rewritings of qS .
 – Computation: given Σ = hO, S, Mi, and qS ∈ LS over S compute any LO -
   maximally sound (resp., LO -minimally complete, perfect) S-to-O Σ-rewriting of
   qS , if it exists.

    In the rest of this paper, if not otherwise stated, we refer to the most common setting
studied in OBDA, i.e., where (i) the ontology is expressed in DL-LiteR , (ii) S is a
relational database schema without integrity constraints, and (iii) both LO and LS
denote the class of UCQs. Interestingly, in this case, we have the following.
Proposition 2. If q1 and q2 are two UCQ-minimally complete, or UCQ-maximally
sound S-to-O Σ-rewritings of qS , then they are equivalent w.r.t. Σ.
Complete source-to-ontology rewritings
In this section, we study both the verification and the computation problem for complete
source-to-ontology rewritings.
Verification. Suppose we want to check whether qO is a complete S-to-O Σ-rewriting
of qS . Obviously, if qS is contained in PerfRefqO ,Σ , then for every S-database D consis-
tent with Σ, we have that qSD ⊆ cert D qO ,Σ and therefore the answer is positive. However,
if qS is not contained in PerfRefqO ,Σ , it might be that qO is still a complete S-to-O Σ-
rewriting of qS , in the case where the non-emptiness of qS in D reveals the presence of
inconsistencies. From this observation, we derive the following characterization.
Proposition 3. qO is a complete S-to-O Σ-rewriting of qS (t) if and only if qS v
PerfRefqO ,Σ ∨ PerfRefV t ,Σ .
                          O

   The following theorem characterizes the complexity of verification for complete
source-to-ontology rewritings.
Theorem 1. Verification for complete source-to-ontology rewritings is NP-complete.

Computation. Our algorithm for the computation of minimally complete source-to-
ontology rewritings is below.
Algorithm 1
Input: Σ = hO, S, Mi, qS (t) = qS1 (t) ∨ . . . ∨ qSn (t) over S
Output: qO (t) over O
begin return qO = {t | M(qS1 ) ∧ >(t) ∨ . . . ∨ M(qSn ) ∧ >(t)} end

    Intuitively, the algorithm computes the output query as union of CQs obtained by
simply applying the mapping M to each CQ qSi in qS , using > to bind the variables
that are not involved in the application of M to qSi .
Theorem 2. Algorithm 1 computes the UCQ-minimally complete S-to-O Σ-rewriting
of qS .
     The algorithm shows that the UCQ-minimally complete S-to-O Σ-rewriting of qS
always exists. Moreover, if qS is a CQ, then it is a CQ. Finally, we observe that the
complexity of Algorithm 1 does not depend on O, is in E XP T IME in σ(M), and in
 PT IME in σ(qS ), where we use σ(x) to denotes the size of x. It can be shown that
a PT IME algorithm for computing the UCQ-minimally complete source-to-ontology
rewriting would imply a PT IME algorithm for CQ containment. So, assuming PT IME
6= NP, computation cannot be solved in PT IME.

Sound source-to-ontology rewritings
We now turn to verification and computation of sound source-to-ontology rewritings.
Verification. Notice that, since for an S-database D consistent with Σ, PerfRefD   qO ,Σ
computes exactly cert D qO ,Σ , checking whether qO is a sound S-to-O Σ-rewriting of qS
means checking whether for all S-databases D, either M odD (Σ) = ∅ or PerfRefD  qO ,Σ ⊆
qSD . This observation leads to the following characterization.
Proposition 4. qO (t) is a sound S-to-O Σ-rewriting of qS if and only if PerfRefqO ,Σ v
qS ∨ PerfRefVOt ,Σ .

   The following theorem characterizes the complexity of the verification problem for
sound source-to-ontology rewritings.

Theorem 3. Verification for sound source-to-ontology rewritings is Π2p -complete.

Computation. We address the problem of computing UCQ-maximally sound source-
to-ontology rewritings. Our main result is that there are many cases where a UCQ-
maximally sound source-to-ontology rewriting of a query is not guaranteed to exist. To
illustrate the result, we introduce a specific setting, that we call restricted, obtained from
the general one by: (i) limiting the ontology language to DL-LiteRDFS , (ii) limiting the
mapping to pure GAV, and (iii) limiting qS to UCQJFEs. The following shows that,
surprisingly, as soon as we try to extend such setting, we lose the guarantee of the
existence of source-to-ontology rewritings that are maximally sound.

Theorem 4. UCQ-maximally sound source-to-ontology rewritings of a query qS may
not exist if we extend the restricted setting with each of the following features: (i) dis-
jointness axioms in the ontology; (ii) inclusion axioms with ∃R as right-hand side in
the ontology; (iii) LAV mapping assertions, even without joins involving existential vari-
ables in the right-hand side; (iv) non-pure GAV mapping assertions; (v) the fragment
of UCQ for qS beyond UCQJFE.

  Proof sketch. We present the proof for case 5. Consider the OBDA specification
Σ = hO, S, Mi, where O has no axiom, and M is the following pure GAV mapping:
{(x, y) | s1 (y, y) ∧ s3 (x)} → {(x, y) | R(x, y)}
{(x, y) | s1 (x, y)}          → {(x, y) | R(x, y)}
                                                                        0
and let qS be the query {() | s1 (x, y) ∧ s1 (y, z)}. Observe that qO      = {() | R(x, y) ∧
R(y, z)} is the complete S-to-O Σ-rewriting of qS , but is not sound, because the query
qs0 = {() | s1 (x, y) ∧ s1 (z, z) ∧ s3 (y)} is a disjunct of PerfRefqO
                                                                     0 ,Σ such that q
                                                                                      0
                                                                                     S 6v qS .
Conversely, one can verify that each of the following queries is a sound S-to-O Σ-
rewriting of qS :
 – q0 = {() | R(x, y) ∧ R(y, y)},
 – q1 = {() | R(x, y) ∧ R(y, z1 ) ∧ R(z1 , y)},
 – q2 = {() | R(x, y) ∧ R(y, z1 ) ∧ R(z1 , z2 ) ∧ R(z2 , y)},
 – ...
More precisely, if we define qn to be {() | R(x, y) ∧ R(y, z1 ) ∧ R(z1 , z2 ) ∧ . . . ∧
R(zn−1 , zn )∧R(zn , y)}, for n ≥ 2,, then it can be shown that every qn is a sound S-to-
O Σ-rewriting of qS , and for no pair (i, j), with i 6= j, i, j ≥ 0, cert qi ,Σ v cert qj ,Σ . It
follows that the infinite union q of q0 , q1 , and all qn ’s can be shown to be the maximally
sound S-to-O Σ-rewriting of qS in the class of positive queries.
    We observe that, despite its limitations, the expressive power of the restricted setting
is sufficient for several meaningful applications. Indeed, several popular ontologies are
expressible in DL-LiteRDFS , and the form of pure GAV mapping is exactly the one
originally defined in the literature of data integration. Moreover, UCQJFEs capture data
services expressible in the famous SPJ (Select, Project, Join) fragment of Relational
Algebra, with the only limitation of forbidding projection outside join, which makes
them suitable for all tasks related to source profiling. Interestingly, it can be shown that
in the restricted setting maximally sound source-to-ontology rewritings always exist.

Conclusion

We have presented a framework for semantically characterizing data services through
ontologies, and carried out a comprehensive analysis for the most common OBDA set-
ting. We plan to continue this work along several directions. For example, in the un-
restricted setting, we aim at (i) studying the problem of checking for the existence of
a UCQ-maximally sound source-to-ontology rewriting of a query, and computing it in
case it exists, and (ii) singling out the minimal class LO of queries that guarantees the
existence of an LO -maximally sound source-to-ontology rewritings. Furthermore, we
will extend our analysis to OBDA settings based on other DLs as ontology languages.

Acknowlegments. Work supported by MIUR under the SIR project “MODEUS” –
grant n. RBSI14TQHQ, and by Sapienza research project “PRE-O-PRE”.


References

 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley Publ. Co.,
    1995.
 2. F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The
    Description Logic Handbook: Theory, Implementation and Applications. Cambridge Uni-
    versity Press, 2003.
 3. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning
    and efficient query answering in description logics: The DL-Lite family. J. of Automated
    Reasoning, 39(3):385–429, 2007.
 4. D. Calvanese, G. De Giacomo, M. Lenzerini, R. Rosati, and G. Vetere. DL-Lite: Practical
    reasoning for rich DLs. In Proc. of DL 2004, volume 104 of CEUR, ceur-ws.org, 2004.
 5. M. J. Carey, N. Onose, and M. Petropoulos. Data services. Comm. of the ACM, 55(6):86–97,
    2012.
 6. G. Cima. Preliminary results on ontology-based open data publishing. In Proc. of DL 2017,
    volume 1879 of CEUR, ceur-ws.org, 2017.
 7. G. Cima, M. Lenzerini, and A. Poggi. Semantic technology for open data publishing. In
    Proc. of WIMS 2017, 2017.
 8. G. Cima, M. Lenzerini, and A. Poggi. Semantic characterization of data services through
    ontologies. In Proc. of IJCAI 2019, 2019.
 9. M. Lenzerini. Managing data through the lens of an ontology. AI Magazine, 39(2):65–74,
    2018.
10. C. Lutz, J. Marti, and L. Sabellek. Query expressibility and verification in ontology-based
    data access. In Proc. of KR 2018, pages 389–398, 2018.
11. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking
    data to ontologies. J. on Data Semantics, X:133–173, 2008.
12. G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, and M. Za-
    kharyaschev. Ontology-based data access: A survey. In Proceedings of the Twenty-Seventh
    International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018,
    Stockholm, Sweden., pages 5511–5519, 2018.
13. Z. Zheng, J. Zhu, and M. R. Lyu. Service-generated big data and big data-as-a-service: An
    overview. In Proc. of the 2013 IEEE Int. Conf. on Big Data, pages 403–410, 2013.