Towards schema-independent querying on document data stores

Hamdi Ben Hamadou (hamdi.ben-hamadou@irit.fr), IRIT UMR 5505, Université de Toulouse, UT3, CNRS, Toulouse, France

Faiza Ghozzi (faiza.ghozzi@isims.usf.tn), Université de Sfax, ISIMS, MIRACL, Sfax, Tunisia

André Péninou (andre.peninou@irit.fr), UT2J, IRIT UMR 5505, Université de Toulouse, CNRS, Toulouse, France

Olivier Teste (olivier.teste@irit.fr), UT2J, IRIT UMR 5505, Université de Toulouse, CNRS, Toulouse, France

ISSN 1613-0073

The document is a pervasive semi-structured data model in today's Web and Internet of Things (IoT) applications, where data structures evolve rapidly over time. NoSQL document-oriented databases are well suited to loading and managing massive collections of heterogeneous documents efficiently, without any prior structural validation. However, this flexibility becomes a serious challenge when querying a heterogeneous collection of documents: users must reformulate the original query, or formulate new ones, whenever new structures arrive in the collection. In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We introduce a query enrichment mechanism that consults a pre-materialized dictionary defining all possible underlying document structures. We automate query enrichment via an algorithm that rewrites select and project operators to support multi-structured documents. To study the performance of our approach, we conduct experiments on a synthetic dataset. First results are promising when compared to the normal execution of queries on a homogeneous dataset.

INTRODUCTION

The popularity of NoSQL systems is growing in the database community thanks to their ability to store and query schema-free data in flexible and efficient ways [8,21]. The document data model is pervasive in most Web and Internet of Things (IoT) applications [13], and several database systems support this data model efficiently [1,4,5]. Furthermore, in such applications, the structures of documents representing the same entity are subject to change [7], so an application may face the problem of dealing with multi-structured data [2]. Because document stores do not provide native support for querying multi-structured data, formulating relevant queries requires precise knowledge of the data structures: users must manually include all possible navigational paths for the attributes of interest. Each structural change then requires users to reformulate the original query, which is a time-consuming and error-prone task. The challenge addressed in this paper is how to support querying under future structural heterogeneity without affecting the application code.

In the context of document-oriented databases, and due to the flexible nature of documents, it is possible to create a collection of documents describing a single entity with multiple structures. This characteristic points to several kinds of heterogeneity [20]. Structural heterogeneity refers to diverse representations of documents (e.g. nested or flat structures, nesting levels, etc.), as shown in Figure 2. Syntactic heterogeneity results from differences in the representation of data (e.g. "movie_title" or "movieTitle"). Semantic heterogeneity arises when the same field refers to distinct concepts in separate documents. This paper focuses on structural heterogeneity.

The problem of schema-independent querying is a hot topic in the study of document-oriented databases for both industry and academia [6,22,23]. Previous work addressed this issue with two approaches: (i) physical integration of data, by mapping the integrated document structures into a unified structure [22]; and (ii) virtual integration, by introducing a custom interface that exposes a new virtual schema to be learned by users while querying heterogeneous data [23]. The first approach modifies the underlying data structure, which is not possible when supporting legacy applications designed to run over the original data structures. Moreover, this approach requires defining a mapping for every original data structure. The second one requires more effort from the user to learn new global structures. It is time-consuming and error-prone when there is a need to query new documents with new structures, since all queries are subject to revision.

In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We propose a virtual integration that runs in a transparent way, hides the complexity of building the expected queries, and supports the evolution of structural heterogeneity. We always rewrite queries at execution time to guarantee that the latest document structures, as defined in the dictionary, are used.

The problem of structural heterogeneity refers to the possibility of finding different navigational paths that lead to the same attribute. The attributes are not located at the same position inside documents, and limited knowledge of navigational paths is insufficient to retrieve the required information. In Figure 1, the attribute name "country" alone is not sufficient, in documents describing films, to differentiate between "actors.country" and "director.country". Some sub-paths, such as "actors.country" and "director.country", may help to resolve the ambiguity wherever they occur in the document. Therefore, sub-paths may be used rather than attribute names. In all cases, we hypothesise that there exist navigational paths that differentiate the different entities contained in the document.

{ "movie_title": "Fast and Furious",
  "country": "USA",
  "actors": [ { "name": "Vin Diesel", "country": "USA" }, ... ],
  "director": { "name": "F. Gary Gray", "country": "USA" } }

Figure 1: Descriptive fields ambiguity

We introduce EasyQ, which stands for "Easy Query", as a tool to validate our approach. We focus on MongoDB for the implementation and evaluation. The primary contribution of our work is to reformulate users' original queries, which are formulated with only simple knowledge of the descriptive fields or sub-paths that contain the desired information. Users' queries are formulated in a schema-independent fashion, i.e. users can formulate an initial query based on a subset of the possible schemas without caring about all available structures. The query rewriting engine is responsible for transparently reformulating the initial query to match all existing schemas and return a relevant result. To deal with document schema heterogeneity, we define a dictionary that contains all possible paths for all existing fields. The query rewriting engine enriches the user query with all possible paths found in the dictionary for each field used in the user query. In this paper, we use the terms field, descriptive field and attribute interchangeably.

The rest of our paper is structured as follows. In Section 2, we illustrate the problem addressed in this paper. Section 3 reviews the most relevant works that provide support for querying multi-structured documents. Section 4 describes our approach in detail. Section 5 presents our first experiments and the performance of our approach while varying the size of the collection and the number of schemas per collection. In Section 6, we summarize our findings.

QUERYING DOCUMENT STORES WITH MULTIPLE SCHEMA ISSUES

As discussed earlier, querying multi-structured data is a complex task. The problem is which structure to use while formulating queries and how this choice affects the results. In the following, we present a simple illustrative example. Let $C = \{d_1, d_2, d_3, d_4\}$ be a set of four films, as documented in Figure 2. In this example we represent documents using JSON (JavaScript Object Notation). Most NoSQL systems support this notation for representing semi-structured data. A document $d_i$ is defined by a key-value pair $(i, v_i)$ where $i$ is the key and $v_i$ is the value described with JSON.

Let us consider that we want to retrieve information related to the available languages for each presented movie. We formulate a projection query with the fields "movie_title" and "language" using MongoDB syntax as follows: db.C.find({}, {"movie_title": 1, "language": 1}).

In this query, the field "movie_title" does not cause any difficulty since it is always at the same structural level in the four documents. Therefore, the query engine is able to locate all information related to the field "movie_title". However, the field "language" may cause some information loss since it is found at several positions across documents.

d1: { "movie_title": "Fast and furious", "year": 2017, "language": "English" },
d2: { "movie_title": "Titanic", "details": { "year": 1997, "language": "English" } },
d3: { "movie_title": "Despicable Me 3", "year": 2017 },
d4: { "movie_title": "The Hobbit", "versions": [ { "year": 2012, "language": "English" }, { "year": 2013, "language": "French" } ] }

Figure 2: Four documents of films collection

Thus, assuming that we formulate a query with limited knowledge, based only on the structure $s_2$ of $d_2$, we build a query with the fields "movie_title" and "details.language":

db.C.find({}, {"movie_title": 1, "details.language": 1})

Executing such a query in MongoDB leads to an incomplete result since the "details.language" field is not available in documents $d_1$, $d_3$, and $d_4$. The problem comes from the structural heterogeneity due to the different structural positions of the field "language", i.e. "language" in document $d_1$, "details.language" in document $d_2$, and "versions.language" in document $d_4$. Hence, we would have to include all these paths in the query, using specific and often complex syntax.

Moreover, we can try to formulate two different queries. The first one is formulated over the schema $s_1$ of document $d_1$ in order to retrieve the list of titles and the "language" of each film. We use the following MongoDB query:

db.C.find({}, {"movie_title": 1, "language": 1})

We then get the following result:

$C_1$ = [ {"movie_title": "Fast and furious", "language": "English"}, {"movie_title": "Titanic"}, {"movie_title": "Despicable Me 3"}, {"movie_title": "The Hobbit"} ]

We formulate the second query using the schema $s_4$ of document $d_4$:

db.C.find({}, {"movie_title": 1, "versions.language": 1})

We get the following result:

$C_2$ = [ {"movie_title": "Fast and furious"}, {"movie_title": "Titanic"}, {"movie_title": "Despicable Me 3"}, {"movie_title": "The Hobbit", "versions": [{"language": "English"}, {"language": "French"}]} ]

When executing both queries, the query engine returns two results. As expected, all possible information related to the field "movie_title" is returned for all documents, as it is located at the document root. For the first query, only the first document matches the field containing the "language" information. The second query succeeds in retrieving the "language" information only from the fourth document. The challenge is how to formulate a single query that retrieves all information related to the field "language" without any redundancy; here, for instance, the information related to the field "movie_title" is obtained twice, once in each of the resulting collections $C_1$ and $C_2$.

To solve this, we introduce a transparent way to build relevant schema-independent queries that bypass structural heterogeneity in document stores. Simple knowledge of the required attributes allows users to retrieve adequate documents regardless of the structural heterogeneity in the collection. This simplifies the task for end users and provides them with an efficient way to retrieve the information of interest. In the example above, we enrich the original user query by adding all possible navigational paths to retrieve relevant documents. For instance, we formulate the query db.C.find({}, {"movie_title": 1, "language": 1, "details.language": 1, "versions.language": 1}) in MongoDB syntax. It bypasses the structural heterogeneity in the current state of the collection and projects all desired values.
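The effect of enriching the projection can be illustrated outside MongoDB with a minimal Python sketch. The function `project` below is a hypothetical helper, not part of EasyQ or of any MongoDB driver; it only approximates MongoDB's inclusion projection on dotted paths (it ignores the `_id` field, for example) and is applied to the four documents of Figure 2.

```python
def project(doc, paths):
    """Keep only the fields of `doc` reachable through the dotted `paths`.

    Hypothetical helper: a rough approximation of an inclusion projection,
    written only to illustrate the example of Figure 2.
    """
    result = {}
    for key, value in doc.items():
        if key in paths:
            result[key] = value  # the field itself is requested
            continue
        # Paths that continue below this key, with the "key." prefix removed.
        sub_paths = [p[len(key) + 1:] for p in paths if p.startswith(key + ".")]
        if not sub_paths:
            continue
        if isinstance(value, dict):
            result[key] = project(value, sub_paths)
        elif isinstance(value, list):
            # Project every embedded document of the array.
            result[key] = [project(v, sub_paths) for v in value if isinstance(v, dict)]
    return result


# The four documents of Figure 2.
C = [
    {"movie_title": "Fast and furious", "year": 2017, "language": "English"},
    {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}},
    {"movie_title": "Despicable Me 3", "year": 2017},
    {"movie_title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]

single_path = ["movie_title", "language"]
enriched = ["movie_title", "language", "details.language", "versions.language"]

C1 = [project(d, single_path) for d in C]      # query written over s1 only
C_enriched = [project(d, enriched) for d in C]  # enriched query
```

With the single path, only the first document yields a language; with the enriched list of paths, every document that stores a language returns it, and each title appears once.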

STATE OF THE ART

The widespread use of semi-structured data has increased interest in solutions enabling queries over such data. We distinguish existing systems on the basis of the proposed querying approach: 1) schema-dependent querying, which requires knowledge of the schema in a similar way to conventional relational database systems; and 2) schema-independent querying, which does not need any prior schema knowledge from the user and is able to extract the schemas at querying time.

The first category of systems is designed to enable queries based on reliable knowledge of the schema, or of the navigational paths to the desired values when dealing with nested data. Such systems offer complex querying languages, such as regular expressions with XQuery or XPath [17] when dealing with XML data. XQuery works with the structure to retrieve precisely the desired results. However, if the user does not know the structure, it is impossible to write the relevant query. Moreover, a single query is generally not able to retrieve data when several schemas are to be considered simultaneously. The same considerations apply to JSONiq [9], the extension of XQuery designed to deal with large-scale data such as JSON data. Other systems, such as MongoDB [5], provide a JavaScript query API to build a query by specifying a document with properties expected to match the results. It offers a broad range of querying capabilities, in particular data processing pipelines. The API requires a complex syntax, and queries must explicitly include all the various schema structures within documents to access data. Otherwise, the query engine returns only the documents that match the supplied criteria, even if fields with the desired information exist under paths other than those given in the query. Another line of work, SQL++ [19], relies on the rich SQL querying interface. In this case, it is also mandatory to express all exact navigational paths in order to obtain the desired results.

The systems studied above are designed to support queries over semi-structured data with known schemas. To formulate queries, the user needs to know the exact underlying data structures. These systems neglect the fact that the user may have limited knowledge of the data structure and hence may be unable to formulate correct queries over a heterogeneous dataset.

To overcome these limitations, recent works enable schema-independent querying, the second category identified at the beginning of this section. Thus, the schema does not need to be known in advance at loading time. We classify the studied works according to two approaches: (i) physical integration, by refactoring the integrated data structures into a unified structure; and (ii) virtual integration, by introducing a custom interface and/or a new query language [23].

In the first direction, several works were designed to deal with semi-structured data. Those works share the idea of schema-on-read: there is no need to define schemas before loading data, and the implicit schema is inferred later from the stored datasets at query time. They expose to users a relational view over the data to help them build SQL queries. Sinew [22] is able to infer schemas from semi-structured data. It defines for the user a logical view of the inferred schema, and it flattens data into columns to be stored in a relational database management system (RDBMS). Drill [12] enables schema-independent querying via SQL over heterogeneous data without first defining a schema. It supports nested data. Tenzing [18] infers a relational schema from the underlying data but can only do so for flat structures that can be trivially mapped to a relational schema.

The previous solutions rely on heavy physical refactoring that requires flattening the underlying data structures into a relational format using complex encoding techniques. Hence, the refactoring requires additional resources, such as an external relational database, and extra effort to learn the unified inferred relational schema. Besides, some solutions do not support the flexible nature of semi-structured data [18]; for instance, they cannot handle nested data. Users of those systems have to learn new schemas every time the workload changes or new data arrives, because the relational view and the stored columns must be regenerated after every change.

Virtual integration has also received attention from researchers [14,23]. These works are inspired by the data lake approach [11], where data is collected in its original format for later use. We consider two major classes: (i) schema-oriented queries; and (ii) keyword querying.

Works from the first class infer the schema from a collection of data and offer users the possibility to query the inferred schema and to check whether a field or sub-schema exists, in order to guide them while developing their applications. In [23], the authors propose to summarize all the document schemas in a skeleton to discover the existence of fields or sub-schemas inside the collection. In [14], the authors suggest extracting all the schemas that are present in the collection to help end users be aware of the schemas and all fields in the integrated collection of documents. These solutions are limited to type and field identification and are not used to determine the different paths to access a field in the collection.

Keyword querying has been adopted in the context of XML [10,24]. The process of answering a keyword query on XML data starts by identifying the existence of the keywords within the documents (possibly through some free-text search). Such systems take as input the searched keywords and return the subset of the document that matches the query keywords. A score is computed based on the structure of sub-documents and, according to this score, the respective XML documents containing all the keywords are returned.

Works on keyword querying suggest performing a pairwise comparison or a binary search to identify the possible positions of the queried keywords. This concept is not well suited to a large number of documents with complex structures (different nested elements, numerous attributes, etc.).

Based on the state of the art, we build our approach on the idea of offering virtual integration to enable schema-independent querying via keywords based on attributes, while supporting native semi-structured features such as nested attributes and heterogeneous collections of documents.

Our work relates in some way to previous attempts at XML keyword querying [16]. The most important contribution of these earlier efforts is to spare users from learning complex underlying schemas as well as a complex query language to manage paths. We adopt the idea that the user may not be aware of all existing schemas and cannot manage overly complex queries, in order to enable schema-independent querying based only on knowledge of the field holding the desired information. The main difference between our work and keyword querying is that we require from the user some simple details about the queried data. For instance, if we execute the keyword query "English", it is possible to obtain as a result both "language": "English" and "movie_title": "Johnny English". With our approach, we specify whether we are interested in the field "language" = "English" or in the field "movie_title" containing "English".

QUERYING HETEROGENEOUS COLLECTION OF DOCUMENTS

In our proposal, we want to enable queries over multi-structured documents by automatically handling the underlying structural heterogeneity. Thus, our query rewriting engine will give transparent support for the heterogeneity on both stored and future new data.

Figure 3: An overview of EasyQ

To give an overview of our approach, let us consider the following selection query (selection operation is defined later in this section):

$\sigma_{\text{"year"} = 1997}(C)$

We refer to the collection presented in Figure 2, in which there exist three distinct navigational paths leading to the attribute "year", i.e. "year", "details.year" and "versions.year". In each document, at most one of these paths leads to the attribute "year". It is possible to express the selection predicate as a disjunction over these navigational paths:

$X = (\text{"year"} = 1997) \lor (\text{"details.year"} = 1997) \lor (\text{"versions.year"} = 1997)$

We rewrite the initial query into $\sigma_{X}(C)$. Two conditions have to be satisfied for a document to be selected: (i) at least one navigational path from the sub-conditions of $X$ exists inside the document, and (ii) at least one of these sub-conditions evaluates to true. Otherwise, $X$ evaluates to false and the document is not returned in the result.
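The evaluation of the disjunctive predicate can be sketched in a few lines of Python. The helpers `values_at`, `matches` and `select` are hypothetical names introduced for illustration only; `values_at` assumes that a dotted path transparently traverses arrays of embedded documents, which mirrors how "versions.year" is used above, and a predicate is represented as a list of (path, value) equality conditions.

```python
def values_at(value, path):
    """Return every value reachable from `value` through the dotted `path`.

    Assumption of this sketch: arrays are traversed element by element.
    """
    if isinstance(value, list):
        found = []
        for element in value:
            found.extend(values_at(element, path))
        return found
    if not path:
        return [value]
    if not isinstance(value, dict):
        return []
    head, _, rest = path.partition(".")
    if head not in value:
        return []  # condition (i) fails: the path does not exist here
    return values_at(value[head], rest)


def matches(doc, disjunction):
    """True if at least one (path, expected) sub-condition holds in `doc`."""
    return any(expected in values_at(doc, path) for path, expected in disjunction)


def select(collection, disjunction):
    return [doc for doc in collection if matches(doc, disjunction)]


# The four documents of Figure 2.
C = [
    {"movie_title": "Fast and furious", "year": 2017, "language": "English"},
    {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}},
    {"movie_title": "Despicable Me 3", "year": 2017},
    {"movie_title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]

initial = [("year", 1997)]
X = [("year", 1997), ("details.year", 1997), ("versions.year", 1997)]

initial_result = select(C, initial)  # no root-level "year" equals 1997
extended_result = select(C, X)       # finds the document with "details.year"
```

On this collection, the initial predicate selects nothing, whereas the predicate $X$ selects the document describing "Titanic".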

The challenges are how to enable schema-independent querying in transparent ways and how to support future new structures without revising the application code.

Figure 3 gives a high-level illustration of our query rewriting engine, called EasyQ. EasyQ is designed to be used early, in the data loading phase, to materialize a dictionary that tracks the different navigational paths for all attributes. EasyQ is also used at querying time to enrich the user query $Q$ and to bypass the structural heterogeneity. It takes as input the user query, formulated over final fields or sub-paths, and the desired collection. The query rewriting engine produces one extended query $Q_{ext}$ that is executed by the underlying document store. The result of this extended query is a collection containing the relevant information. An important consequence of this architecture is that the same user query, evaluated at different moments, is rewritten each time. So, if new documents with new structures have been inserted in the collection (or existing documents have been updated), these new structures are automatically handled and the results remain relevant to the query.

In the rest of this section, we describe the formal model of multi-structured documents, dictionary, and the querying operators across multi-structured documents. Finally, we formally define how we rewrite the queries.

Multi-structured data modeling

A document $d_i$ is defined as a (key, value) pair $d_i = (k_{d_i}, v_{d_i})$ where:

• $k_{d_i}$ is a key that identifies the document (by abuse of notation, we wrote $i$ for the key $k_{d_i}$ in Sections 1 and 2);

• $v_{d_i} = \{a_{d_i,1} : v_{d_i,1}, \ldots, a_{d_i,n} : v_{d_i,n}\}$ is the document value. The document value $v_{d_i}$ is defined as an object composed of a set of $(a_{d_i,j}, v_{d_i,j})$ pairs, where each $a_{d_i,j}$ is a string called attribute and each $v_{d_i,j}$ is a value that can be atomic (numeric, string, boolean, null) or complex (object, array). A value $v_{d_i,j}$ is defined below.

An atomic value is defined as follows, $\forall j \in [1..n]$:

• $v_{d_i,j} = n$ if $n \in N^*$, the set of numeric values (integer, float);

• $v_{d_i,j} = \text{"s"}$ if "s" is a string formulated in Unicode $A^*$;

• $v_{d_i,j} = b$ if $b \in B$, the set of booleans $\{true, false\}$;

• $v_{d_i,j} = \bot$ is a null value.

A complex value is defined as follows, $\forall j \in [1..n]$:

• $v_{d_i,j} = \{a_{d_i,j,1} : v_{d_i,j,1}, \ldots, a_{d_i,j,m} : v_{d_i,j,m}\}$ is an object value where $v_{d_i,j,k}$, $\forall k \in [1..m]$, are values, and $a_{d_i,j,k}$, $\forall k \in [1..m]$, are strings in $A^*$ called attributes. This is a recursive definition identical to that of the document value;

• $v_{d_i,j} = [v_{d_i,j,1}, \ldots, v_{d_i,j,l}]$ represents an array of values $v_{d_i,j,k}$, $\forall k \in [1..l]$, with $l = \|v_{d_i,j}\|$.

When a document value $v_{d_i,j}$ is an object or an array, its inner values $v_{d_i,j,k}$ can be complex values too, which allows different nesting levels. To cope with nested documents and navigate through schemas, we adopt the navigational path notation [3,15].

Definition 4.3 (Schema). The schema $s_{d_i}$ inferred from the document value $v_{d_i} = \{a_{d_i,1} : v_{d_i,1}, \ldots, a_{d_i,n} : v_{d_i,n}\}$ is defined as $s_{d_i} = \{p_1, \ldots, p_{m_i}\}$ where each $p_j$, $\forall j \in [1..m_i]$, is the path of an attribute of $v_{d_i}$, or a navigational path for nested values such as $v_{d_i,j,k}$. For multiple nesting levels, the navigational path is extracted recursively to find the path from the root to the final atomic value that can be found in the document hierarchy.

The schema $s_{d_i}$ of the value $v_{d_i}$ from document $d_i$ is formally defined as follows, for each pair $(a_{d_i,j}, v_{d_i,j})$:

• if $v_{d_i,j}$ is atomic, $s_{d_i} = s_{d_i} \cup \{a_{d_i,j}\}$;

• if $v_{d_i,j}$ is an object, $s_{d_i} = s_{d_i} \cup \{a_{d_i,j}\} \cup \bigcup_{p \in s_{d_i,j}} \{a_{d_i,j}.p\}$, where $s_{d_i,j}$ is the schema of $v_{d_i,j}$;

• if $v_{d_i,j}$ is an array, $s_{d_i} = s_{d_i} \cup \{a_{d_i,j}\} \cup \bigcup_{k=1}^{\|v_{d_i,j}\|} \left( \{a_{d_i,j}.k\} \cup \bigcup_{p \in s_{d_i,j,k}} \{a_{d_i,j}.k.p\} \right)$, where $s_{d_i,j,k}$ is the schema of the $k$-th value of the array $v_{d_i,j}$.
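The three cases of this definition translate directly into a recursive function. The following Python sketch is illustrative only: the name `schema_of` is hypothetical, documents are represented as Python dictionaries and lists, and array positions are numbered from 1, as in the paths "versions.1.year" used later in this section.

```python
def schema_of(value):
    """Return the set of navigational paths found inside `value`.

    Objects contribute their attributes, arrays contribute their positions
    (numbered from 1), and atomic values contribute nothing further.
    """
    paths = set()
    if isinstance(value, dict):
        for attribute, inner in value.items():
            paths.add(attribute)
            paths |= {f"{attribute}.{p}" for p in schema_of(inner)}
    elif isinstance(value, list):
        for k, element in enumerate(value, start=1):
            paths.add(str(k))
            paths |= {f"{k}.{p}" for p in schema_of(element)}
    return paths


# Documents d1, d2 and d4 of Figure 2.
d1 = {"movie_title": "Fast and furious", "year": 2017, "language": "English"}
d2 = {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}}
d4 = {"movie_title": "The Hobbit",
      "versions": [{"year": 2012, "language": "English"},
                   {"year": 2013, "language": "French"}]}

s_d1 = schema_of(d1)
s_d2 = schema_of(d2)
s_d4 = schema_of(d4)
```

For an attribute whose value is an array, the recursion yields the attribute itself, one path per position, and the paths nested below each position, which corresponds to the third case of the definition.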

Example. Let us consider the documents $d_1$ and $d_2$ of Figure 2. The underlying schemas of both documents are described as follows:

$s_{d_1} = \{\text{"movie\_title"}, \text{"year"}, \text{"language"}\}$

$s_{d_2} = \{\text{"movie\_title"}, \text{"details"}, \text{"details.year"}, \text{"details.language"}\}$

We notice that the attribute "details" of document $d_2$ is a complex one in which the attributes "year" and "language" are nested, which leads to two different navigational paths, "details.year" and "details.language".

Definition 4.4 (Collection Schema). The schema $S_C$ inferred from a collection $C$ is defined by

$S_C = \bigcup_{i=1}^{c} s_{d_i}$

where $c$ is the number of documents in $C$.

Definition 4.5 (Dictionary). The dictionary $dict_C$ of a collection $C$ is defined by

$dict_C = \{(p_k, \Delta_k) \mid p_k \in S_C\}$

where:

• $p_k \in S_C$ is a path for an attribute which is present in at least one document of the collection;

• $\Delta_k = \{p_{p_k,1}, \ldots, p_{p_k,q}\} \subseteq S_C$ is the set of navigational paths leading to $p_k$.

In the rest of this paper, we refer to any path $p_k$ as an attribute, and we use the terms dictionary paths and dictionary attributes accordingly.
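A dictionary in the sense of Definition 4.5 can be sketched in Python as follows. The function name `build_dictionary` is hypothetical, and the sketch makes one explicit assumption that the definition leaves informal: a path is taken to "lead to" $p_k$ when it is equal to $p_k$ or ends with "." followed by $p_k$. The document schemas are written out by hand from Figure 2, with array positions numbered from 1.

```python
def build_dictionary(collection_schema):
    """Map every path p_k of S_C to the set of paths of S_C leading to it.

    Assumption of this sketch: a path leads to p_k if it equals p_k or
    ends with "." + p_k.
    """
    dictionary = {}
    for p_k in collection_schema:
        suffix = "." + p_k
        dictionary[p_k] = {p for p in collection_schema
                           if p == p_k or p.endswith(suffix)}
    return dictionary


# Schemas of the four documents of Figure 2.
s_d1 = {"movie_title", "year", "language"}
s_d2 = {"movie_title", "details", "details.year", "details.language"}
s_d3 = {"movie_title", "year"}
s_d4 = {"movie_title", "versions",
        "versions.1", "versions.1.year", "versions.1.language",
        "versions.2", "versions.2.year", "versions.2.language"}

# Collection schema S_C: union of the document schemas (Definition 4.4).
S_C = s_d1 | s_d2 | s_d3 | s_d4

dict_C = build_dictionary(S_C)
```

Under this assumption, the entry for "year" gathers the root-level path together with the paths nested under "details" and "versions", while a sub-path such as "details.year" only leads to itself.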

Example. The dictionary $dict_C$ constructed from the collection $C$ is defined below; each dictionary entry $p_k$ refers to the set of all extracted navigational paths.

dict C = { (movie_title,

Querying multi-structured data

Querying multi-structured data is possible via a combination of unary operators. In this paper, we limit the querying process to the projection and selection operators, expressed by the native MongoDB operators "find" and "aggregate".

Minimal closed kernel of unary operators.

We define a minimal closed kernel of unary operators. We call $C_{in}$ the queried collection and $C_{out}$ the resulting collection.

Definition 4.6 (Projection). The project operator reduces the initial schemas of the documents of the collection to a finite subset of attributes:

$\pi_A(C_{in}) = C_{out}$

where $A \subseteq S_{C_{in}}$ is a subset of attributes from $S_{C_{in}}$, the schema of the input collection $C_{in}$.

Definition 4.7 (Selection). The select operator retrieves only the documents that match some predicate:

$\sigma_p(C_{in}) = C_{out}$

where $p$ refers to the predicate (or condition) of the selection operator. A simple predicate is expressed by $a_k\ \omega_k\ v_k$ where $a_k \in S_{C_{in}}$ is an attribute, $\omega_k \in \{=, >, <, \neq, \geq, \leq\}$ is a comparison operator, and $v_k$ is a value. Predicates can be combined with the operators from $\Omega = \{\lor, \land, \lnot\}$, which leads to a complex predicate.

We call $Norm_p$ the conjunctive normal form of the predicate $p$, defined as follows:

$Norm_p = \bigwedge_i \bigvee_j a_{i,j}\ \omega_{i,j}\ v_{i,j}$

We consider that all predicates in selection operators are in conjunctive normal form.

Definition 4.8 (Query). A query $Q$ is formulated by composing operators:

$Q = q_1 \circ \cdots \circ q_r (C)$

where $\forall i \in [1, r]$, $q_i \in \{\pi, \sigma\}$.

Example. Let us consider the collection presented in Figure 2.

$q_1$: $\sigma_{\text{"language"} = \text{"English"}}(C)$ = [ {"movie_title": "Fast and furious", "year": 2017, "language": "English"} ]

$q_2$: $\pi_{\text{"movie\_title"}, \text{"year"}}(C)$ = [ {"movie_title": "Fast and furious", "year": 2017}, {"movie_title": "Titanic"}, {"movie_title": "Despicable Me 3", "year": 2017}, {"movie_title": "The Hobbit"} ]

$q_3$: $\pi_{\text{"movie\_title"}, \text{"year"}}(\sigma_{\text{"language"} = \text{"English"}}(C))$ = [ {"movie_title": "Fast and furious", "year": 2017} ]

Here, the query q 3 is constructed by combining select and project operators.

Query extension for multi-structured data.

In this section, we introduce a new query extension algorithm that automatically enriches the user query. The native query engine of document-oriented stores such as MongoDB can efficiently execute our rewritten queries. It is then possible to retrieve all desired information regardless of the structural heterogeneity inside the collection.

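Algorithm 1: Automatic extension for the initial user query

input: $Q$
output: $Q_{ext}$
$Q_{ext} \leftarrow id$ // identity
foreach $q_i \in Q$ do
    switch $q_i$ do
        case $\pi_{A_i}$: // projection
            $A_{ext} \leftarrow \bigcup_{a_k \in A_i} \Delta_k$
            $Q_{ext} \leftarrow Q_{ext} \circ \pi_{A_{ext}}$
        end
        case $\sigma_p$: // selection
            $P_{ext} \leftarrow \bigwedge_i \bigvee_j \left( \bigvee_{a_k \in \Delta_{i,j}} a_k\ \omega_{i,j}\ v_{i,j} \right)$
            $Q_{ext} \leftarrow Q_{ext} \circ \sigma_{P_{ext}}$
        end
    end
end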

Our approach aims to enable transparent querying on a multi-structured collection of documents via automatic query rewriting. This process employs the materialized dictionary to enrich the original query by including the different navigational paths that lead to the desired attributes. Algorithm 1 describes the query extension process as follows:

• In the case of a projection operation, the algorithm extends the list of attributes $A_i$ by uniting the navigational paths $\Delta_k$ of each projected attribute $a_k$.

• In the case of a selection operation, the algorithm enriches the predicate $p$, expressed in conjunctive normal form, with the set of extended disjunctions built from the navigational paths $\Delta_{i,j}$ of each attribute $a_{i,j}$.
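The two cases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EasyQ implementation: a query is represented as a list of (operator, argument) pairs in the order in which they are applied, a projection argument is a list of attributes, a selection argument is a predicate in conjunctive normal form (a list of disjunctions of (attribute, operator, value) triples), and an attribute absent from the dictionary is left unchanged. The name `extend_query` is hypothetical; the dictionary entries are the ones used in the example of this section.

```python
def extend_query(query, dictionary):
    """Rewrite each operator of `query` using the paths of `dictionary`."""
    extended = []
    for kind, argument in query:
        if kind == "project":
            # A_ext: union of the navigational paths of each projected attribute.
            paths = []
            for attribute in argument:
                for path in sorted(dictionary.get(attribute, {attribute})):
                    if path not in paths:
                        paths.append(path)
            extended.append(("project", paths))
        elif kind == "select":
            # P_ext: each condition becomes a disjunction over all its paths.
            cnf = []
            for disjunction in argument:
                cnf.append([(path, op, value)
                            for attribute, op, value in disjunction
                            for path in sorted(dictionary.get(attribute, {attribute}))])
            extended.append(("select", cnf))
        else:
            raise ValueError(f"unsupported operator: {kind}")
    return extended


dictionary = {
    "movie_title": {"movie_title"},
    "year": {"year", "details.year", "versions.1.year", "versions.2.year"},
    "language": {"language", "details.language",
                 "versions.1.language", "versions.2.language"},
}

# q3: select the English films, then project their title and year.
q3 = [("select", [[("language", "=", "English")]]),
      ("project", ["movie_title", "year"])]

q3_ext = extend_query(q3, dictionary)
```

Because the dictionary is consulted on every call, running `extend_query` again after new structures have been added to the dictionary produces an updated extended query without any change to `q3`.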

Example. Let us consider the query $q_3$ from the previous example. First, the query rewriting engine starts by extending the project operator (case "projection" in Algorithm 1):

$\pi_{\text{"movie\_title"}, \text{"year"}}(C)$

For each projected field, the process consults the dictionary and extracts all the possible navigational paths. The dictionary entry for the field movie_title corresponds to:

(movie_title, {movie_title})

So, $A_{ext}$ = {movie_title}. The dictionary entry for the field year corresponds to:

(year, {year, details.year, versions.1.year, versions.2.year})

So, $A_{ext}$ = {movie_title, year, details.year, versions.1.year, versions.2.year}. The projection query is then rewritten as:

$\pi_{\text{"movie\_title"}, \text{"year"}, \text{"details.year"}, \text{"versions.1.year"}, \text{"versions.2.year"}}(C)$

Next, the process continues with the selection query (case "selection" in Algorithm 1):

$\sigma_{\text{"language"} = \text{"English"}}(C)$

The dictionary entry for the field "language" corresponds to:

(language, {language, details.language, versions.1.language, versions.2.language})

So, $P_{ext}$ = "language" = "English" ∨ "details.language" = "English" ∨ "versions.1.language" = "English" ∨ "versions.2.language" = "English".

The selection is then rewritten as:

$\sigma_{\text{"language"} = \text{"English"} \lor \text{"details.language"} = \text{"English"} \lor \text{"versions.1.language"} = \text{"English"} \lor \text{"versions.2.language"} = \text{"English"}}(C)$

Finally, the query rewriting engine generates the final query by combining the generated queries:

$\pi_{\text{"movie\_title"}, \text{"year"}, \text{"details.year"}, \text{"versions.1.year"}, \text{"versions.2.year"}}(\sigma_{\text{"language"} = \text{"English"} \lor \text{"details.language"} = \text{"English"} \lor \text{"versions.1.language"} = \text{"English"} \lor \text{"versions.2.language"} = \text{"English"}}(C))$

= [ {"movie_title": "Fast and furious", "year": 2017},
{"movie_title": "Titanic", "details": {"year": 1997}},
{"movie_title": "The Hobbit", "versions": [{"year": 2012}, {"year": 2013}]} ]

The query rewriting process adds complexity to the user's original queries.

EXPERIMENTS

In this section, we conduct a series of experiments to study the following points:

• What is the effect on the execution time of the rewritten queries when the size of the collection varies, and is this cost acceptable?
• Is the time to build the dictionary acceptable, and how does the size of the dictionary evolve with structural variability?

Next, we explain the experimental protocol, then we study the queries execution cost, and finally we evaluate the dictionary generation time and its size.

Experimental protocol

We run all the queries on synthetic datasets loaded into the document store MongoDB. In this section, we introduce the details of the experimental setup, the process of generating the synthetic datasets and the set of evaluation queries. Later on, we present the results of executing the evaluation set in three separate contexts. The goal is to compare the cost of executing the rewritten queries with (i) the cost of executing the original queries on homogeneous documents and (ii) the execution time of several distinct queries that we build manually for each schema. Then, we study the effects of heterogeneity on the dictionary in terms of size and construction time. Finally, we evaluate the scale of the heterogeneity and its impact on generating the rewritten queries.

We conducted our experiments on MongoDB v3.4, on a machine with an i5 3.4 GHz processor, 16 GB of RAM and 4 TB of storage space, running CentOS 7.

Dataset.

To study structural heterogeneity, we generate custom synthetic datasets. First, we collected a JSON collection of documents describing movies from IMDb¹. The original dataset contains only flat documents with 28 attributes each. We then reuse this flat collection to produce documents with structural heterogeneity. For each generated dataset, we can define several parameters, such as the number of schemas to produce in the collection and the percentage of documents that follow each generated schema. For each schema, we can adjust the number of grouping objects. By grouping object, we mean a compound field in which we nest a subset of the document attributes. To guarantee heterogeneity across documents, the grouping objects are unique to every schema; in other words, the same grouping object never appears in two structures. Only the original fields from the flat dataset are common to all documents. The values of those fields are randomly chosen from the original film collection. To add more complexity, we can set the nesting level used for each structure. For the rest of the experiments, we built our datasets according to the characteristics described in Table 1. We generate collections of 10, 25, 50 and 100 GB of data.
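The generation procedure described above can be sketched in Python as follows. This is a minimal illustration, not the generator used for the experiments: the schema encoding, the naming of intermediate objects and the uniform choice of a schema per document are all assumptions made for the example.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical schema description: (grouping object name -> nested
# attributes, nesting depth applied to every grouping object).
Schema = Tuple[Dict[str, List[str]], int]


def nest_document(flat_doc: dict, groups: Dict[str, List[str]],
                  depth: int = 1) -> dict:
    """Move the listed attributes under their grouping object."""
    doc = dict(flat_doc)  # leave the flat source document untouched
    for group_name, attributes in groups.items():
        node = {name: doc.pop(name) for name in attributes if name in doc}
        # Wrap the attributes in (depth - 1) extra intermediate objects;
        # the "_level" naming is purely illustrative.
        for level in range(depth - 1, 0, -1):
            node = {f"{group_name}_level{level}": node}
        doc[group_name] = node
    return doc


def generate_collection(flat_docs: List[dict], schemas: List[Schema],
                        seed: int = 0) -> List[dict]:
    """Restructure each flat document according to a randomly drawn schema."""
    rng = random.Random(seed)
    # Assumption: every schema is drawn with equal probability.
    return [nest_document(doc, *rng.choice(schemas)) for doc in flat_docs]


flat = {"movie_title": "T", "year": 2017, "language": "English"}
one_level = nest_document(flat, {"details": ["year", "language"]}, depth=1)
two_levels = nest_document(flat, {"info": ["year"]}, depth=2)
```

Attributes that are not assigned to any grouping object stay at the root of the document, which mirrors the statement that the original fields remain common to all documents.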

Setting | Value
# of schema | 10
# of grouping objects per schema | {5,

We generate a flat collection with the same leaf attributes and corresponding values as found in the heterogeneous datasets. This new collection gives us a proper environment in which to compare the execution time of the rewritten query on the heterogeneous datasets with the execution time of the original query on the homogeneous datasets. We therefore ensure that every query returns identical results on both the heterogeneous and the flat datasets. Identical results imply: (i) the same number of documents, and (ii) the same values for their attributes (leaf fields).

¹ imdb.com
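The two conditions for identical results can be checked mechanically. The sketch below compares two result sets by their leaf attribute names and values while ignoring the nesting path; the function names are hypothetical, and ignoring the store-generated "_id" field is an assumption of this example.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def leaf_pairs(node, name=None) -> Iterator[Tuple[str, object]]:
    """Yield (leaf attribute name, value) pairs, whatever the nesting."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_pairs(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_pairs(item, name)
    else:
        yield (name, node)


def signature(doc: dict, ignored: Tuple[str, ...] = ("_id",)) -> frozenset:
    """Path-independent fingerprint of a document (leaf values must be hashable)."""
    pairs = Counter(pair for pair in leaf_pairs(doc) if pair[0] not in ignored)
    return frozenset(pairs.items())


def same_results(flat_results: Iterable[dict],
                 hetero_results: Iterable[dict]) -> bool:
    """Check (i) the same number of documents and (ii) the same leaf values."""
    flat_list, hetero_list = list(flat_results), list(hetero_results)
    if len(flat_list) != len(hetero_list):
        return False
    return (Counter(signature(d) for d in flat_list)
            == Counter(signature(d) for d in hetero_list))


flat_hits = [{"_id": 1, "movie_title": "T", "year": 2017}]
nested_hits = [{"_id": 9, "movie_title": "T", "details": {"year": 2017}}]
equivalent = same_results(flat_hits, nested_hits)
```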

5.1.2 Queries. We build a synthetic set of queries based on the comparison operators supported by MongoDB. We employ the classical comparison operators, i.e., {<, >, ≤, ≥, =, ≠}, for numerical values, as well as the classical logical operators, i.e., {and, or}, between query predicates. We also employ a regular expression to deal with string values. We select 8 attributes of different types, located at different levels inside the documents of the heterogeneous datasets. Table 2 shows, for each attribute, its type and the selection operator that we use when formulating the synthetic queries. In addition, we present for each attribute the number of possible paths found in the synthetic heterogeneous collection, the different nesting levels and the selectivity of the predicate.

We formulate the following 6 queries:

• Q1 : p1 ∧ p2 • Q2 : p1 ∨ p2

For Q1 and Q2, the rewritten queries contain 15 predicates, whereas the original queries contain 2 predicates. These 15 predicates result from the 8 existing paths for DirectorName in p1 plus the 7 paths for Gross in p2, which are included in disjunctive form as described in the rewriting algorithm.

• Q3 : p1 ∧ p2 ∧ p5 ∧ p7 • Q4 : p1 ∨ p2 ∨ p5 ∨ p7

The rewritten versions of Q3 & Q4 contain 29 predicates unlike the original queries that contain 4 predicates.

• Q5 : p1 ∧ p2 ∧ p5 ∧ p7 ∧ p6 ∧ p3 ∧ p4 ∧ p8 • Q6 : p1 ∨ p2 ∨ p5 ∨ p7 ∨ p6 ∨ p3 ∨ p4 ∨ p8

Finally, the rewritten versions of the queries Q5&Q6 contain 57 predicates unlike the original queries that contain 8 predicates.

Table 3 presents, for each dataset, (i) the number of documents in the collection and (ii) the number of expected results for each executed query. We define three contexts in which we run the above-defined queries. For each context, we measure the average execution time after executing each query at least five times. The queries are executed in random order.
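The measurement protocol (at least five runs per query, random execution order, average time) can be sketched as a small Python harness. The function name and the representation of queries as zero-argument callables are assumptions made for this example; in a real setup each callable would execute one query against the document store and consume its results.

```python
import random
import time
from statistics import mean
from typing import Callable, Dict, Optional


def benchmark(queries: Dict[str, Callable[[], object]], runs: int = 5,
              seed: Optional[int] = None) -> Dict[str, float]:
    """Average wall-clock time per query over `runs` randomly ordered runs."""
    rng = random.Random(seed)
    # One entry per (query, run), shuffled so that no query benefits from
    # always being executed at the same position.
    schedule = [name for name in queries for _ in range(runs)]
    rng.shuffle(schedule)

    timings = {name: [] for name in queries}
    for name in schedule:
        start = time.perf_counter()
        queries[name]()
        timings[name].append(time.perf_counter() - start)
    return {name: mean(values) for name, values in timings.items()}


# Usage with placeholder workloads standing in for real queries.
calls = []
averages = benchmark(
    {"Q1": lambda: calls.append("Q1"), "Q2": lambda: calls.append("Q2")},
    runs=5,
    seed=42,
)
```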

Collection size | # of documents | Q1 | Q2 | Q3 | Q4 | Q5 | Q6
10 GB | 12 M | 520 K | 8,

In the following, we present the details of the three evaluation contexts:

• "Q_Base" is the initial user query (one of the queries defined above), executed over the homogeneous versions of the datasets. The purpose of this first context is to study the native behavior of the document store. We use it as the baseline for our experimentation.
• "Q_Rewritten" is the query "Q_Base" rewritten by our approach. It is executed over the heterogeneous versions of the datasets.
• "Q_Accumulated" is the set of equivalent queries formulated for each possible schema in the collection. In our case, it is made of 10 separate queries since we are dealing with collections having ten schemas. These queries are built "by hand", as any user would have to do without an assisting tool. We do not consider the time needed to merge the results of each query, since the goal is to compare the time to find the set of result documents. "Q_Accumulated" is executed over the heterogeneous versions of the datasets.
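The difference between the accumulated and rewritten contexts can be illustrated with a short Python sketch that builds both kinds of query as MongoDB-style filter documents. The per-schema path maps, the helper names and the restriction to equality predicates combined by conjunction are assumptions made for the example; they do not reproduce the hand-written queries used in the experiments.

```python
from typing import Dict, List

# Hypothetical description of two schemas: attribute -> path in that schema.
SCHEMA_PATHS: List[Dict[str, str]] = [
    {"year": "year", "language": "language"},
    {"year": "details.year", "language": "details.language"},
]


def accumulated_queries(predicates: Dict[str, object],
                        schema_paths: List[Dict[str, str]]) -> List[dict]:
    """One filter per schema, using that schema's own path for each attribute."""
    return [{schema[attribute]: value
             for attribute, value in predicates.items()}
            for schema in schema_paths]


def rewritten_query(predicates: Dict[str, object],
                    schema_paths: List[Dict[str, str]]) -> dict:
    """A single filter whose predicates each cover every known path."""
    clauses = []
    for attribute, value in predicates.items():
        paths = sorted({schema[attribute] for schema in schema_paths})
        clauses.append({"$or": [{path: value} for path in paths]})
    return {"$and": clauses}


user_predicates = {"language": "English", "year": 2017}
by_hand = accumulated_queries(user_predicates, SCHEMA_PATHS)
single = rewritten_query(user_predicates, SCHEMA_PATHS)
```

The accumulated list grows with the number of schemas and each of its filters triggers its own execution, whereas the rewritten filter is submitted once.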

Table 4 synthesizes our execution contexts.

As shown in Figure 4, our rewritten query, Q_Rewritten, outperforms the accumulated one, Q_Accumulated. The difference between the two execution scenarios comes from the ability of the rewritten query to automatically include all navigational paths extracted from the collection. Hence, the rewritten query is executed only once, whereas the accumulated query may require several passes through the stored collection, which demands more CPU load and more intensive disk I/O operations. We now study the efficiency of the rewritten query compared with the baseline query Q_Base. The overhead of our solution reaches up to two times (e.g., for the disjunctive form) the native execution of the baseline query on the homogeneous dataset. Moreover, the overall overhead does not exceed 1.5 times across the six queries. We believe that this overhead is acceptable since we avoid the cost of refactoring the underlying data structures. Unlike the baseline, our synthetic dataset contains different grouping objects with varying nesting levels. The rewritten query therefore includes several navigational paths, which are processed by the native query engine of MongoDB to find matches in each visited document of the collection.

5.2.2 Dictionary and query rewriting engine at scale. With this series of experiments, we try to push the dictionary and the query rewriting engine to their limits. To this end, we generated a heterogeneous synthetic collection of 1 GB. We use the primary 28 attributes from the IMDB flat films collection. The custom collections are generated so that each schema is composed of two grouping objects with no further nesting levels. We generated collections having 10, 100, 1k, 3k and 5k schemas. For this experiment, we keep using the query Q6 introduced earlier in this section.
We present the time needed to build the rewritten query in Table 5. The time to build the rewritten query is very low, less than two seconds. It is also possible to construct a dictionary over a heterogeneous collection of documents; here, our dictionary supports up to 5k distinct schemas. The resulting size of the materialized dictionary is very encouraging since it does not require significant storage space. Furthermore, we believe that the short time spent building the rewritten query represents another advantage of our solution. In each run of this series of experiments, we look up the distinct navigational paths for eight predicates. Each rewritten query is composed of numerous disjunctive forms for each predicate: we count 80 disjunctive forms for the dataset with 10 schemas, 800 with 100 schemas, 8k with 1k schemas, 24k with 3k schemas and 40k with 5k schemas. We believe that the dictionary and the query rewriting engine scale well when dealing with heterogeneous collections of documents having a large number of schemas.
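The reported numbers of disjunctive forms follow a simple product, which the short Python check below makes explicit. It assumes that each of the eight queried attributes has exactly one distinct path in every schema; this assumption is consistent with the reported counts but is not stated in the text, and the names used are hypothetical.

```python
PREDICATES_IN_Q6 = 8
SCHEMA_COUNTS = [10, 100, 1_000, 3_000, 5_000]


def disjunct_count(predicates: int, schemas: int) -> int:
    # Assumption: one distinct navigational path per attribute and per schema.
    return predicates * schemas


counts = {n: disjunct_count(PREDICATES_IN_Q6, n) for n in SCHEMA_COUNTS}
```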


Dictionary construction time

In this part, we study the dictionary construction process. EasyQ offers the possibility to build the dictionary over an existing dataset or during the data loading phase. The dictionary reflects the latest version of the data once all documents are inserted, so the query rewriting engine enriches the queries based on the up-to-date dictionary; if data loading is still in progress, the rewritten query may not take the most recent changes into account. In the following, we study both configurations. First, we evaluate the time required to build the dictionary over five pre-loaded collections of 100 GB having 2, 4, 6, 8 and 10 schemas respectively.
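Both configurations can rely on the same per-document update step, applied either in a pass over an existing collection or on each document as it is loaded. The Python sketch below is inferred from the dictionary entries listed in the running example and is not the EasyQ implementation; the function names, the 1-based numbering of array positions and the choice to skip purely numeric segments as dictionary keys are assumptions.

```python
from typing import Dict, Iterable, Iterator, Set


def iter_paths(node, prefix: str = "") -> Iterator[str]:
    """Yield every absolute path (intermediate and leaf) of a document."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_paths(value, path)
    elif isinstance(node, list):
        # Array positions are numbered from 1, as in the example paths
        # versions.1 and versions.2.
        for position, item in enumerate(node, start=1):
            path = f"{prefix}.{position}"
            yield path
            yield from iter_paths(item, path)


def update_dictionary(dictionary: Dict[str, Set[str]], document: dict) -> None:
    """Register the paths of one document; usable at load time."""
    for path in iter_paths(document):
        # Every absolute path is an entry that points to itself.
        dictionary.setdefault(path, set()).add(path)
        # The last segment (attribute name) collects all paths leading to it.
        name = path.rsplit(".", 1)[-1]
        if not name.isdigit():
            dictionary.setdefault(name, set()).add(path)


def build_dictionary(documents: Iterable[dict]) -> Dict[str, Set[str]]:
    """Batch construction over an already loaded collection."""
    dictionary: Dict[str, Set[str]] = {}
    for document in documents:
        update_dictionary(dictionary, document)
    return dictionary


sample = [
    {"movie_title": "A", "year": 2017, "language": "English"},
    {"movie_title": "B", "details": {"year": 2017, "language": "English"}},
    {"movie_title": "C", "versions": [{"year": 2017, "language": "English"},
                                      {"year": 2018, "language": "French"}]},
]
dictionary = build_dictionary(sample)
```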

The results in Table 6 show that the time needed to build the dictionary increases as the collections become more heterogeneous. In the case of the collection with 10 structures, the time does not exceed 40% when we compare it to a collection with 2 structures.

Afterwards, we evaluate the overhead caused by generating the dictionary at loading time. We generate five collections of 1 GB having the same structures as in the previous experiment (2, 4, 6, 8 and 10 schemas respectively). We present two measurements in Table 7. First, we measure the time to simply load each collection (without building the dictionary). Second, we measure the overall time to build the dictionary while loading the collection. In these experiments, the overhead does not exceed 0.5 times the time required to only load the data. The time grows linearly rather than exponentially as more heterogeneity is added, which is encouraging. Many factors may affect the query construction phase: the number of attributes and the nesting levels may increase or decrease the overhead. The advantage of our solution is that, once the data is loaded and the dictionary is built or updated, the rewritten query automatically takes all changes into account.
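The overhead figures can be recomputed from the two measurements reported in Table 7, as in the following Python sketch. The function name is hypothetical, and the use of integer division is an assumption: truncating the percentage, rather than rounding it, is what reproduces the reported values.

```python
# (number of schemas, load time in s, load + dictionary time in s), from Table 7.
TABLE_7 = [(2, 201, 269), (4, 205, 277), (6, 207, 285), (8, 208, 300), (10, 210, 309)]


def overhead_percent(load_s: int, load_and_dict_s: int) -> int:
    """Extra time spent on the dictionary, as a truncated percentage of load time."""
    return (load_and_dict_s - load_s) * 100 // load_s


overheads = {schemas: overhead_percent(load, both)
             for schemas, load, both in TABLE_7}
```

All five values stay below 50%, which matches the statement that the overhead does not exceed 0.5 times the plain loading time.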


CONCLUSION

NoSQL databases are often called schemaless because they may contain variable schemas among stored data. Nowadays, this variability is becoming common in many applications, for example web applications, social media applications or the Internet of Things. Nevertheless, such structural heterogeneity makes it very hard for users to formulate queries that return relevant and coherent results.

In this paper, to deal with structural heterogeneity, we propose a novel approach for querying heterogeneous documents describing a given entity over NoSQL document stores. The developed tool is called EasyQ. Our objective is to allow users to formulate their queries with minimal knowledge of the data schemas. EasyQ is based on two pillars. The first is a dictionary that contains all possible paths for any existing field. The second is a rewriting module that modifies the user query to match all field paths existing in the dictionary. Our approach is a syntactic manipulation of queries. It is therefore grounded on an important assumption: the collection describes homogeneous entities, i.e., a field has the same meaning in all document schemas. In case of ambiguity, the user should specify a sub-path to resolve it. If this assumption does not hold, users may face irrelevant or incoherent results. Nevertheless, this assumption may be acceptable in many applications, such as legacy collections, web applications or Internet of Things data.

In our first experiments, the evaluation consists in comparing the execution time of basic MongoDB queries with that of the rewritten queries produced by our approach. We conduct a set of tests by varying two primary parameters: the size of the dataset and the structural heterogeneity inside a collection (number of different schemas). Results show that the cost of executing the rewritten queries is higher than that of executing the basic user queries, but always less than twice as high. This overhead is due to the combination of multiple access paths to a queried field. Nevertheless, it is negligible when compared to the execution of separate "by hand" queries for each schema, and heterogeneity issues are managed automatically.

These first results are very encouraging and need to be strengthened. Short-term perspectives are to continue the evaluations and to identify the limitations regarding the number of paths and fields in the same query and regarding time cost. More experiments remain to be performed on larger datasets and on real-world datasets. Another perspective is to extend the current query capabilities to cover all the classical operators of query languages (contains, etc.). It is also necessary to deal with other querying operators, particularly the aggregation operator.

The first long-term perspective consists in studying the real-time building of the dictionary when integrating data, in order to take into account all possible operations: insert, but also delete and update. It is likely that the current simple structure of the dictionary will have to be transformed in depth to support more complex operations such as update and delete. The second long-term perspective consists in managing multi-store databases. The goal would be to extend the proposed approach to query data stored in different types of databases in a way that is independent of the various data schemas and stores. The final goal would be to query any data store "transparently", without being aware of the schemas or the real field names.

Definition 4.1 (Collection). A collection C is defined as a set of documents C = {d_1, ..., d_|C|}. Each document d_i is considered as a (key, value) pair where the value takes the form: v_i = {a_i,1 : v_i,1, ..., a_i,n : v_i,n}

Definition 4.2 (Document). A document d_i, ∀i ∈ [1, c], is defined as a (key, value) pair

Figure 4: Query rewriting evaluations

{movie_title}), (year, {year, details.year, versions.1.year, versions.2.year}), (language, {language, details.language, versions.1.language, versions.2.language}), (details, {details}), (details.year, {details.year}), (details.language, {details.language}), (versions, {versions}), (versions.1, {versions.1}), (versions.1.year, {versions.1.year}), (versions.1.language, {versions.1.language}), (versions.2, {versions.2}), (versions.2.year, {versions.2.year}), (versions.2.language, {versions.2.language})}

For example, the entry (year, {year, details.year, versions.1.year, versions.2.year}) gives all navigational paths leading to the attribute "year".

Table 1: Settings of the generated dataset

Table 3: Number of extracted documents per query

Table 4: Evaluation contexts

5.2 Queries evaluation

5.2.1 Queries execution time.

Table 5: Scale effects on query rewriting and dictionary size (columns: # of schemas, rewriting time, dictionary size)

We can again notice in Table 6 the negligible size of the generated dictionaries when compared to the 100 GB of the collection.

Table 6: Time to build the dictionary of pre-loaded data
# of schemas | 2 | 4 | 6 | 8 | 10
Required time (minutes) | 96 | 108 | 127 | 143 | 156
Size of the resulting dictionary (KB) | 4,154 | 9,458 | 13,587 | 17,478 | 22,997

Table 7: Study of the overhead added during load time
# of schemas | Load (s) | Load and dict. (s) | Overhead
2 | 201 | 269 | 33%
4 | 205 | 277 | 35%
6 | 207 | 285 | 37%
8 | 208 | 300 | 44%
10 | 210 | 309 | 47%
