=Paper= {{Paper |id=Vol-1644/paper1 |storemode=property |title=OBDA Over Non-Relational Databases |pdfUrl=https://ceur-ws.org/Vol-1644/paper1.pdf |volume=Vol-1644 |authors=Elena Botoeva,Diego Calvanese,Benjamin Cogrel,Martin Rezk,Guohui Xiao |dblpUrl=https://dblp.org/rec/conf/amw/BotoevaCCRX16 }} ==OBDA Over Non-Relational Databases== https://ceur-ws.org/Vol-1644/paper1.pdf
               OBDA Over Non-Relational Databases?

    Elena Botoeva, Diego Calvanese, Benjamin Cogrel, Martin Rezk, and Guohui Xiao

            Free University of Bozen-Bolzano, Italy, lastname@inf.unibz.it


         Abstract. The database landscape has been significantly diversified during the
         last decade, resulting in the emergence of a variety of non-relational (also called
         NoSQL) databases, e.g., XML and JSON-document databases, key-value stores,
         and graph databases. To facilitate access to such databases and to enable data in-
         tegration of non-relational data sources, we generalize the well-known ontology-
         based data access (OBDA) framework so as to allow for querying arbitrary data-
         bases through a mediating ontology. We instantiate this framework to MongoDB,
         a popular JSON-document database, and implement a prototype extension of the
         virtual OBDA system Ontop for answering SPARQL queries over MongoDB.


1     Introduction
Accessing data using native query languages is getting a more and more involved task
for users, as databases (DBs) increase in complexity and heterogeneity. The Ontology-
Based Data Access (OBDA) paradigm [9] has emerged as a proposal to simplify this
kind of access, by allowing users to write high-level ontological queries, which in the
classical virtual approach are translated automatically into low-level queries that DB
engines can handle. This separation of concerns between the conceptual level and the
DB level has been proven successful in practice, notably when data sources have a com-
plex structure and end-users have domain but not necessarily data management exper-
tise [6,5,1]. The OBDA approach is implemented by connecting a DB to an ontology by
means of mappings, where traditionally the ontology is expressed in the OWL 2 QL pro-
file of the Web Ontology Language OWL 2 [7], the queries are formulated in SPARQL,
the Semantic Web query language, and the DB is assumed to be relational [4].
     As envisioned by Stonebraker and Cetintemel [11], a multitude of DB architec-
tures is required to satisfy the needs of a wide variety of modern applications. This has
been confirmed by the significant diversification of the DB landscape during the last
decade. Some of these architectural changes have been proposed within the scope of re-
lational DBs (e.g., column-oriented storage), while many have gone beyond them, caus-
ing the emergence of the so-called NoSQL (not only SQL) DBs. These non-relational
DBs usually adopt one of four main data models, namely the column-family, key-value,
document, and graph data models, and provide an (intimidating) number of query lan-
guages with varying querying capabilities. Notably, many of these new languages are
limited [8] and thus clients might need to set up compensating post-processing tech-
niques so as to satisfy advanced information needs. In this paper, to let users query
non-relational DBs at the conceptual level, we propose to extend the OBDA framework
to non-relational DBs, and then instantiate it to MongoDB, a popular document DB.
?
    This is an abridged version of [3].
2      Generalized OBDA framework

We consider fixed countably infinite sets C of DB values, and ED of elements built over
C, e.g., named tuples, trees, or XML documents. We assume to deal with a class D of
DBs, where each DB in D is a finite subset of ED , e.g., relational DBs, MongoDB, or
XML DBs. Moreover, we assume that D comes equipped with:

    – Suitable forms of constraints, which might express both information about the
      structure of the data in DBs of D, e.g., the schema information in relational DBs,
      and “constraints” in the usual sense of DBs, e.g., primary and foreign key con-
      straints for relational DBs. We call a collection of such constraints a D-schema.
    – A query language QD , such that, for each query q ∈ QD and for each instance
      D ∈ D, the answer ans(q, D) of q over D is defined, and is itself a DB in D.
    – A relational wrapper [[·]]D , which is a function transforming a QD -query q into a
      new query [[q]]D that takes a DB in D and returns a relation over C (i.e., a relation
      in first normal form)1 . The role of the wrapper is to present ans(q, D) as a relation,
      by actually computing a query that retrieves from D what can be considered as the
      relational representation of ans(q, D). In particular, when q is the identity query,
      [[q]]D is the query computing the relational view of a DB D ∈ D.

    Having these building blocks at hand, we now define D-mapping assertions and
their semantics. We start by introducing the notion of variable-to-RDF-term map,
which is a generalization of the RDF-term template in relational OBDA. We say that
f (x1 , . . . , xn ) is an RDF term constructor if it is a (partial) function f : Cn → I ∪ L,
where I is the set of IRIs and L is the set of RDF literals. Then, a variable-to-(RDF)-
term map κ (for variable ?X) has the form ?X 7→ f (x1 , . . . , xn ). In the following,
πx1 ,...,xn denotes the standard projection operator of relational algebra.

Definition 1. A D-mapping assertion m is an expression q K h where:
  – q is a QD -query, called source query;
  – h is an RDF triple pattern, called target, of the form (?X1 rdf:type A) or
     (?X1 P ?X2 ), where A is a class name, and P is a property name;
  – K is a set of variable-to-term maps, one for each variable ?Xi appearing in h.
The mapping assertion m is safe if for each ?Xi 7→ f (x1 , . . . , xn ) in K, we have that
the function πx1 ,...,xn ◦ [[q]]D is well-defined.
    A D-mapping M is a finite set of D-mapping assertions.

Notice that, the mapping assertion m is safe if and only if, for each variable-to-term
map ?Xi 7→ f (x1 , . . . , xn ) used by m, the wrapper [[·]]D , when applied to the source
query q of m, returns a query producing a relation whose attributes contain x1 , . . . , xn .

Definition 2. Let D be a class of DBs, m = q K h a D-mapping assertion, and
D ∈ D. The RDF graph m(D) generated by m from D is defined as follows:
    – When h = (?X1 rdf:type A), then
                 
        m(D) = (f (v) rdf:type A) | K = {?X1 7→ f (x)}, v ∈ πx ([[q]]D (D)) .
 1
     Discussing the specific form and properties of wrappers is outside the scope of this paper.
 – When h = (?X1 P ?X2 ), then
              
       m(D) = (f1 (v 1 ) P f2 (v 2 )) | K = {?X1 7→ f1 (x1 ), ?X2 7→ f2 (x2 )},
                                        v 1 ∈ πx1 ([[q]]D (D)), v 2 ∈ πx2 ([[q]]D (D)) .
                                                            S
For a D-mapping M, the RDF graph M(D) is defined as m∈M m(D).

     Now, an OBDA specification for D is a triple hT , M, Si, where T is an ontology, S
is a D-schema, and M is a D-mapping. An OBDA instance for D consists of an OBDA
specification hT , M, Si for D and an instance D ∈ D satisfying S. The semantics of
such an instance is derived naturally from the semantics of D-mapping assertions.


3   OBDA Framework over MongoDB
In this section we instantiate the generalized OBDA framework to MongoDB, a pop-
ular DB that stores collections of semi-structured JSON-style documents. A sample
document consisting of (possibly nested) key-value pairs and arrays is given below.
{ "_id": 4,
  "awards": [
    { "award": "Rosing Prize", "year": 1999, "by": "Norwegian Data Association" },
    { "award": "Turing Award", "year": 2001, "by": "ACM" },
    { "award": "IEEE John von Neumann Medal", "year": 2001, "by": "IEEE" } ],
  "birth": "1926-08-27",
  "contribs": ["OOP", "Simula"],
  "death": "2002-08-10",
  "name": {
      "first": "Kristen", "last": "Nygaard" }
}

     This instantiation relies on work in [2], where we obtain the following results:
  – We formalize a fragment of MongoDB aggregate queries that consists of match,
     unwind, project, group, and lookup stages, and that we call MUPGL.
  – We propose a notion of MongoDB type constraints. These constraints allow one
     to specify that certain paths (concatenations of keys) must point to an array (e.g.,
     awards), or an atomic value (e.g., name.last), or an object (e.g., name).
  – We define a relational view over MongoDB with respect to a set of type constraints.
  – We develop a translation from RA expressions (over the relational view) to MUPGL
     queries. This translation shows that full RA can be captured by MUPGL, while RA
     over a single collection can be captured by MUPG.
     We start by introducing a compact notation for MongoDB mapping assertions,
which we also use to specify type constraints. We consider two extensions of paths
called array paths. An array index path is a path where a key is followed by any com-
bination of keys and ‘#’, separated by dots. An array element path is the concatenation
of an array index path with ‘.[#]’. Intuitively, for (simple) paths p1 and p2 , any of the
array paths p1 .#.p2 , p1 .#, or p1 .[#] imply that p1 must point to an array. Moreover, the
path p1 .#.p2 (e.g., awards.#.year) is used to access the value of the path p2 inside the
array pointed to by p1 , while p1 .# (e.g., awards.#) is used to denote the indices and
p1 .[#] (e.g., contribs.[#]) to denote the single elements of such an array. Hence, the
presence of ‘#’ or ‘[#]’ requires that the path preceding it points to an array, whereas a
path that does not end with ‘#’ or ‘[#]’ must point to atomic values. We use extended
paths to refer both to normal paths and to array paths.
    The source queries we consider are a restricted form of MUP queries and can be rep-
resented as a pair (C, ϕ), where C is a collection name and ϕ is a criterion constructed
using extended paths. Let K be a set of variable-to-term maps ?X 7→ f (p1 , . . . , pn ),
where each pi is an extended path. Then, a MongoDB mapping assertion is an expres-
sion of the form (C, ϕ) K h, where the variables in h constitute the domain of K.
Given a MongoDB mapping M, we can extract from it the MongoDB schema SM (a
set of type constraints), and then define a relational wrapper [[·]]MongoDB for the source
queries in M (see [3] for more details).
    We built a prototype implementation for answering SPARQL queries over MongoDB
as an extension of the state-of-the-art OBDA system Ontop [4].
Architecture of Ontop. Query answering (QA) in Ontop under the OWL 2 QL
entailment regime involves an offline and an online stage:
               1                                                                             3                                     4
 SPARQL             SPARQL           SPARQL Q        2                       Rewritten           Unfolding          RA q1              Structural/semantic
 string             parsing                              Rewriting                     0
                                                                             SPARQL QT         w.r.t. mappings                            optimization
                                                                                     T-Mapping MT
    Ontology         Ontology        T        Ontology             Classified T      Mapping       Mapping M        Mapping             Mapping
    file              parsing               classification                         compilation       Schema S       analysis            file          RA q2
                                ii                           iii   OFFLINE                      iv                             i

    SPARQL            Post-              Native                                Native             RA-to-native       RA q3                  Normalization/
    result         processing                                Evaluation        queries          query translation                           Decomposition
                                         results         7
               8                                                                            6                                           5

The offline stage (highlighted in gray) takes as input the mapping and the ontology
files and produces three entities used by the online QA stage: the classified ontology,
the DB schema (extracted from the mapping file), and the T-mapping, constructed
by ‘compiling’ the classified ontology into the input mapping [10]. The online stage
handles individual SPARQL queries: 1 the input SPARQL query is parsed, 2 rewritten
w.r.t. the ontology, and 3 unfolded w.r.t. the T-mapping; 4 (the internal representation
of) the resulting RA query is simplified by applying structural and semantic optimiza-
tion techniques; 5 the RA query is normalized and (possibly) decomposed so that
it can be directly handled by the RA-to-native-query translator; 6 each normalized
RA (sub)query is translated into a native query, which is then 7 evaluated by the DB
engine; 8 the native results are post-processed into SPARQL results.
Implementation for MongoDB. We observe that steps ii – iv and 1 – 4 are indepen-
dent of the actual class D of DBs, while steps i and 5 – 8 require specific implementa-
tions according to D. Therefore, our prototype implements the latter five components.
The current implementation supports MongoDB 3.2 and is able to return sound and
complete answers to the subset of SPARQL queries that (i) correspond to BGPs with
filters consisting of comparisons, and (ii) can be translated into MUPG queries.


4       Conclusions

We are currently working on providing support for (most of) SPARQL 1.0. In the future
we want to explore the tradeoff between delegating all of query processing to MongoDB
vs. a fine-grained decomposition of the input query together with post-processing to
combine the results of the sub-queries. We will test our techniques on real-world large-
scale use-cases, so as to assess the practical feasibility of OBDA over MongoDB.
References
 1. N. Antonioli, F. Castanò, C. Civili, S. Coletta, S. Grossi, D. Lembo, M. Lenzerini, A. Poggi,
    D. F. Savo, and E. Virardi. Ontology-based data access: the experience at the Italian De-
    partment of Treasury. In Proc. of the Industrial Track of the 25th Int. Conf. on Advanced
    Information Systems Engineering (CAiSE), volume 1017 of CEUR Electronic Workshop Pro-
    ceedings, http://ceur-ws.org/, pages 9–16, 2013.
 2. E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, and G. Xiao. A formal presentation of
    MongoDB (Extended version). CoRR Technical Report abs/1603.09291, arXiv.org e-Print
    archive, 2016. Available at http://arxiv.org/abs/1603.09291.
 3. E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, and G. Xiao. OBDA beyond relational DBs:
    A study for MongoDB. In Proc. of the 29th Int. Workshop on Description Logics (DL), vol-
    ume 1577 of CEUR Electronic Workshop Proceedings, http://ceur-ws.org/, 2016.
 4. D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-
    Muro, and G. Xiao. Ontop: Answering SPARQL queries over relational databases. Semantic
    Web J., 2016. To appear.
 5. D. Calvanese, P. Liuzzo, A. Mosca, J. Remesal, M. Rezk, and G. Rull. Ontology-based
    data integration in EPNet: Production and distribution of food during the Roman Empire.
    Engineering Applications of Artificial Intelligence, 2016. To appear.
 6. M. Giese, A. Soylu, G. Vega-Gorgojo, A. Waaler, P. Haase, E. Jiménez-Ruiz, D. Lanti,
    M. Rezk, G. Xiao, Ö. L. Özçep, and R. Rosati. Optique: Zooming in on Big Data. IEEE
    Computer, 48(3):60–67, 2015.
 7. B. Motik, A. Fokoue, I. Horrocks, Z. Wu, C. Lutz, and B. Cuenca Grau. OWL Web Ontol-
    ogy Language profiles. W3C Recommendation, World Wide Web Consortium, Oct. 2009.
    Available at http://www.w3.org/TR/owl-profiles/.
 8. K. W. Ong, Y. Papakonstantinou, and R. Vernoux. The SQL++ query language: Config-
    urable, unifying and semi-structured. CoRR Technical Report abs/1405.3631, arXiv.org e-
    Print archive, 2014. Available at http://arxiv.org/abs/1405.3631.
 9. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking
    data to ontologies. J. on Data Semantics, X:133–173, 2008.
10. M. Rodriguez-Muro, R. Kontchakov, and M. Zakharyaschev. Ontology-based data access:
    Ontop of databases. In Proc. of the 12th Int. Semantic Web Conf. (ISWC), volume 8218 of
    Lecture Notes in Computer Science, pages 558–573. Springer, 2013.
11. M. Stonebraker and U. Cetintemel. “one size fits all”: An idea whose time has come and
    gone. In Proc. of the 21st IEEE Int. Conf. on Data Engineering (ICDE), pages 2–11, 2005.