=Paper=
{{Paper
|id=Vol-1644/paper1
|storemode=property
|title=OBDA Over Non-Relational Databases
|pdfUrl=https://ceur-ws.org/Vol-1644/paper1.pdf
|volume=Vol-1644
|authors=Elena Botoeva,Diego Calvanese,Benjamin Cogrel,Martin Rezk,Guohui Xiao
|dblpUrl=https://dblp.org/rec/conf/amw/BotoevaCCRX16
}}
==OBDA Over Non-Relational Databases==
OBDA Over Non-Relational Databases? Elena Botoeva, Diego Calvanese, Benjamin Cogrel, Martin Rezk, and Guohui Xiao Free University of Bozen-Bolzano, Italy, lastname@inf.unibz.it Abstract. The database landscape has been significantly diversified during the last decade, resulting in the emergence of a variety of non-relational (also called NoSQL) databases, e.g., XML and JSON-document databases, key-value stores, and graph databases. To facilitate access to such databases and to enable data in- tegration of non-relational data sources, we generalize the well-known ontology- based data access (OBDA) framework so as to allow for querying arbitrary data- bases through a mediating ontology. We instantiate this framework to MongoDB, a popular JSON-document database, and implement a prototype extension of the virtual OBDA system Ontop for answering SPARQL queries over MongoDB. 1 Introduction Accessing data using native query languages is getting a more and more involved task for users, as databases (DBs) increase in complexity and heterogeneity. The Ontology- Based Data Access (OBDA) paradigm [9] has emerged as a proposal to simplify this kind of access, by allowing users to write high-level ontological queries, which in the classical virtual approach are translated automatically into low-level queries that DB engines can handle. This separation of concerns between the conceptual level and the DB level has been proven successful in practice, notably when data sources have a com- plex structure and end-users have domain but not necessarily data management exper- tise [6,5,1]. The OBDA approach is implemented by connecting a DB to an ontology by means of mappings, where traditionally the ontology is expressed in the OWL 2 QL pro- file of the Web Ontology Language OWL 2 [7], the queries are formulated in SPARQL, the Semantic Web query language, and the DB is assumed to be relational [4]. As envisioned by Stonebraker and Cetintemel [11], a multitude of DB architec- tures is required to satisfy the needs of a wide variety of modern applications. This has been confirmed by the significant diversification of the DB landscape during the last decade. Some of these architectural changes have been proposed within the scope of re- lational DBs (e.g., column-oriented storage), while many have gone beyond them, caus- ing the emergence of the so-called NoSQL (not only SQL) DBs. These non-relational DBs usually adopt one of four main data models, namely the column-family, key-value, document, and graph data models, and provide an (intimidating) number of query lan- guages with varying querying capabilities. Notably, many of these new languages are limited [8] and thus clients might need to set up compensating post-processing tech- niques so as to satisfy advanced information needs. In this paper, to let users query non-relational DBs at the conceptual level, we propose to extend the OBDA framework to non-relational DBs, and then instantiate it to MongoDB, a popular document DB. ? This is an abridged version of [3]. 2 Generalized OBDA framework We consider fixed countably infinite sets C of DB values, and ED of elements built over C, e.g., named tuples, trees, or XML documents. We assume to deal with a class D of DBs, where each DB in D is a finite subset of ED , e.g., relational DBs, MongoDB, or XML DBs. Moreover, we assume that D comes equipped with: – Suitable forms of constraints, which might express both information about the structure of the data in DBs of D, e.g., the schema information in relational DBs, and “constraints” in the usual sense of DBs, e.g., primary and foreign key con- straints for relational DBs. We call a collection of such constraints a D-schema. – A query language QD , such that, for each query q ∈ QD and for each instance D ∈ D, the answer ans(q, D) of q over D is defined, and is itself a DB in D. – A relational wrapper [[·]]D , which is a function transforming a QD -query q into a new query [[q]]D that takes a DB in D and returns a relation over C (i.e., a relation in first normal form)1 . The role of the wrapper is to present ans(q, D) as a relation, by actually computing a query that retrieves from D what can be considered as the relational representation of ans(q, D). In particular, when q is the identity query, [[q]]D is the query computing the relational view of a DB D ∈ D. Having these building blocks at hand, we now define D-mapping assertions and their semantics. We start by introducing the notion of variable-to-RDF-term map, which is a generalization of the RDF-term template in relational OBDA. We say that f (x1 , . . . , xn ) is an RDF term constructor if it is a (partial) function f : Cn → I ∪ L, where I is the set of IRIs and L is the set of RDF literals. Then, a variable-to-(RDF)- term map κ (for variable ?X) has the form ?X 7→ f (x1 , . . . , xn ). In the following, πx1 ,...,xn denotes the standard projection operator of relational algebra. Definition 1. A D-mapping assertion m is an expression q K h where: – q is a QD -query, called source query; – h is an RDF triple pattern, called target, of the form (?X1 rdf:type A) or (?X1 P ?X2 ), where A is a class name, and P is a property name; – K is a set of variable-to-term maps, one for each variable ?Xi appearing in h. The mapping assertion m is safe if for each ?Xi 7→ f (x1 , . . . , xn ) in K, we have that the function πx1 ,...,xn ◦ [[q]]D is well-defined. A D-mapping M is a finite set of D-mapping assertions. Notice that, the mapping assertion m is safe if and only if, for each variable-to-term map ?Xi 7→ f (x1 , . . . , xn ) used by m, the wrapper [[·]]D , when applied to the source query q of m, returns a query producing a relation whose attributes contain x1 , . . . , xn . Definition 2. Let D be a class of DBs, m = q K h a D-mapping assertion, and D ∈ D. The RDF graph m(D) generated by m from D is defined as follows: – When h = (?X1 rdf:type A), then m(D) = (f (v) rdf:type A) | K = {?X1 7→ f (x)}, v ∈ πx ([[q]]D (D)) . 1 Discussing the specific form and properties of wrappers is outside the scope of this paper. – When h = (?X1 P ?X2 ), then m(D) = (f1 (v 1 ) P f2 (v 2 )) | K = {?X1 7→ f1 (x1 ), ?X2 7→ f2 (x2 )}, v 1 ∈ πx1 ([[q]]D (D)), v 2 ∈ πx2 ([[q]]D (D)) . S For a D-mapping M, the RDF graph M(D) is defined as m∈M m(D). Now, an OBDA specification for D is a triple hT , M, Si, where T is an ontology, S is a D-schema, and M is a D-mapping. An OBDA instance for D consists of an OBDA specification hT , M, Si for D and an instance D ∈ D satisfying S. The semantics of such an instance is derived naturally from the semantics of D-mapping assertions. 3 OBDA Framework over MongoDB In this section we instantiate the generalized OBDA framework to MongoDB, a pop- ular DB that stores collections of semi-structured JSON-style documents. A sample document consisting of (possibly nested) key-value pairs and arrays is given below. { "_id": 4, "awards": [ { "award": "Rosing Prize", "year": 1999, "by": "Norwegian Data Association" }, { "award": "Turing Award", "year": 2001, "by": "ACM" }, { "award": "IEEE John von Neumann Medal", "year": 2001, "by": "IEEE" } ], "birth": "1926-08-27", "contribs": ["OOP", "Simula"], "death": "2002-08-10", "name": { "first": "Kristen", "last": "Nygaard" } } This instantiation relies on work in [2], where we obtain the following results: – We formalize a fragment of MongoDB aggregate queries that consists of match, unwind, project, group, and lookup stages, and that we call MUPGL. – We propose a notion of MongoDB type constraints. These constraints allow one to specify that certain paths (concatenations of keys) must point to an array (e.g., awards), or an atomic value (e.g., name.last), or an object (e.g., name). – We define a relational view over MongoDB with respect to a set of type constraints. – We develop a translation from RA expressions (over the relational view) to MUPGL queries. This translation shows that full RA can be captured by MUPGL, while RA over a single collection can be captured by MUPG. We start by introducing a compact notation for MongoDB mapping assertions, which we also use to specify type constraints. We consider two extensions of paths called array paths. An array index path is a path where a key is followed by any com- bination of keys and ‘#’, separated by dots. An array element path is the concatenation of an array index path with ‘.[#]’. Intuitively, for (simple) paths p1 and p2 , any of the array paths p1 .#.p2 , p1 .#, or p1 .[#] imply that p1 must point to an array. Moreover, the path p1 .#.p2 (e.g., awards.#.year) is used to access the value of the path p2 inside the array pointed to by p1 , while p1 .# (e.g., awards.#) is used to denote the indices and p1 .[#] (e.g., contribs.[#]) to denote the single elements of such an array. Hence, the presence of ‘#’ or ‘[#]’ requires that the path preceding it points to an array, whereas a path that does not end with ‘#’ or ‘[#]’ must point to atomic values. We use extended paths to refer both to normal paths and to array paths. The source queries we consider are a restricted form of MUP queries and can be rep- resented as a pair (C, ϕ), where C is a collection name and ϕ is a criterion constructed using extended paths. Let K be a set of variable-to-term maps ?X 7→ f (p1 , . . . , pn ), where each pi is an extended path. Then, a MongoDB mapping assertion is an expres- sion of the form (C, ϕ) K h, where the variables in h constitute the domain of K. Given a MongoDB mapping M, we can extract from it the MongoDB schema SM (a set of type constraints), and then define a relational wrapper [[·]]MongoDB for the source queries in M (see [3] for more details). We built a prototype implementation for answering SPARQL queries over MongoDB as an extension of the state-of-the-art OBDA system Ontop [4]. Architecture of Ontop. Query answering (QA) in Ontop under the OWL 2 QL entailment regime involves an offline and an online stage: 1 3 4 SPARQL SPARQL SPARQL Q 2 Rewritten Unfolding RA q1 Structural/semantic string parsing Rewriting 0 SPARQL QT w.r.t. mappings optimization T-Mapping MT Ontology Ontology T Ontology Classified T Mapping Mapping M Mapping Mapping file parsing classification compilation Schema S analysis file RA q2 ii iii OFFLINE iv i SPARQL Post- Native Native RA-to-native RA q3 Normalization/ result processing Evaluation queries query translation Decomposition results 7 8 6 5 The offline stage (highlighted in gray) takes as input the mapping and the ontology files and produces three entities used by the online QA stage: the classified ontology, the DB schema (extracted from the mapping file), and the T-mapping, constructed by ‘compiling’ the classified ontology into the input mapping [10]. The online stage handles individual SPARQL queries: 1 the input SPARQL query is parsed, 2 rewritten w.r.t. the ontology, and 3 unfolded w.r.t. the T-mapping; 4 (the internal representation of) the resulting RA query is simplified by applying structural and semantic optimiza- tion techniques; 5 the RA query is normalized and (possibly) decomposed so that it can be directly handled by the RA-to-native-query translator; 6 each normalized RA (sub)query is translated into a native query, which is then 7 evaluated by the DB engine; 8 the native results are post-processed into SPARQL results. Implementation for MongoDB. We observe that steps ii – iv and 1 – 4 are indepen- dent of the actual class D of DBs, while steps i and 5 – 8 require specific implementa- tions according to D. Therefore, our prototype implements the latter five components. The current implementation supports MongoDB 3.2 and is able to return sound and complete answers to the subset of SPARQL queries that (i) correspond to BGPs with filters consisting of comparisons, and (ii) can be translated into MUPG queries. 4 Conclusions We are currently working on providing support for (most of) SPARQL 1.0. In the future we want to explore the tradeoff between delegating all of query processing to MongoDB vs. a fine-grained decomposition of the input query together with post-processing to combine the results of the sub-queries. We will test our techniques on real-world large- scale use-cases, so as to assess the practical feasibility of OBDA over MongoDB. References 1. N. Antonioli, F. Castanò, C. Civili, S. Coletta, S. Grossi, D. Lembo, M. Lenzerini, A. Poggi, D. F. Savo, and E. Virardi. Ontology-based data access: the experience at the Italian De- partment of Treasury. In Proc. of the Industrial Track of the 25th Int. Conf. on Advanced Information Systems Engineering (CAiSE), volume 1017 of CEUR Electronic Workshop Pro- ceedings, http://ceur-ws.org/, pages 9–16, 2013. 2. E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, and G. Xiao. A formal presentation of MongoDB (Extended version). CoRR Technical Report abs/1603.09291, arXiv.org e-Print archive, 2016. Available at http://arxiv.org/abs/1603.09291. 3. E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, and G. Xiao. OBDA beyond relational DBs: A study for MongoDB. In Proc. of the 29th Int. Workshop on Description Logics (DL), vol- ume 1577 of CEUR Electronic Workshop Proceedings, http://ceur-ws.org/, 2016. 4. D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez- Muro, and G. Xiao. Ontop: Answering SPARQL queries over relational databases. Semantic Web J., 2016. To appear. 5. D. Calvanese, P. Liuzzo, A. Mosca, J. Remesal, M. Rezk, and G. Rull. Ontology-based data integration in EPNet: Production and distribution of food during the Roman Empire. Engineering Applications of Artificial Intelligence, 2016. To appear. 6. M. Giese, A. Soylu, G. Vega-Gorgojo, A. Waaler, P. Haase, E. Jiménez-Ruiz, D. Lanti, M. Rezk, G. Xiao, Ö. L. Özçep, and R. Rosati. Optique: Zooming in on Big Data. IEEE Computer, 48(3):60–67, 2015. 7. B. Motik, A. Fokoue, I. Horrocks, Z. Wu, C. Lutz, and B. Cuenca Grau. OWL Web Ontol- ogy Language profiles. W3C Recommendation, World Wide Web Consortium, Oct. 2009. Available at http://www.w3.org/TR/owl-profiles/. 8. K. W. Ong, Y. Papakonstantinou, and R. Vernoux. The SQL++ query language: Config- urable, unifying and semi-structured. CoRR Technical Report abs/1405.3631, arXiv.org e- Print archive, 2014. Available at http://arxiv.org/abs/1405.3631. 9. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. on Data Semantics, X:133–173, 2008. 10. M. Rodriguez-Muro, R. Kontchakov, and M. Zakharyaschev. Ontology-based data access: Ontop of databases. In Proc. of the 12th Int. Semantic Web Conf. (ISWC), volume 8218 of Lecture Notes in Computer Science, pages 558–573. Springer, 2013. 11. M. Stonebraker and U. Cetintemel. “one size fits all”: An idea whose time has come and gone. In Proc. of the 21st IEEE Int. Conf. on Data Engineering (ICDE), pages 2–11, 2005.