=Paper=
{{Paper
|id=Vol-2062/paper1
|storemode=property
|title=Towards Schema-independent Querying on Document Data Stores
|pdfUrl=https://ceur-ws.org/Vol-2062/paper01.pdf
|volume=Vol-2062
|authors=Hamdi Ben Hamadou,Faiza Ghozzi,André Péninou,Olivier Teste
|dblpUrl=https://dblp.org/rec/conf/dolap/HamadouGPT18
}}
==Towards Schema-independent Querying on Document Data Stores==
Hamdi Ben Hamadou, Université de Toulouse, UT3, IRIT (CNRS/UMR 5505), Toulouse, France (hamdi.ben-hamadou@irit.fr)
Faiza Ghozzi, Université de Sfax, ISIMS, MIRACL, Sfax, Tunisia (faiza.ghozzi@isims.usf.tn)
André Péninou, Université de Toulouse, UT2J, IRIT (CNRS/UMR 5505), Toulouse, France (andre.peninou@irit.fr)
Olivier Teste, Université de Toulouse, UT2J, IRIT (CNRS/UMR 5505), Toulouse, France (olivier.teste@irit.fr)
ABSTRACT

Documents are a pervasive semi-structured data model in today's Web and Internet of Things (IoT) applications, where the data structure evolves rapidly over time. NoSQL document-oriented databases are well tailored to efficiently load and manage massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying a heterogeneous collection of documents: users must reformulate their original queries, or formulate new ones, whenever new structures arrive in the collection. In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We introduce a query enrichment mechanism that consults a pre-materialized dictionary defining all possible underlying document structures, and we automate the enrichment process via an algorithm that rewrites select and project operators to support multi-structured documents. To study the performance of our approach, we conduct experiments on a synthetic dataset. First results are promising when compared to the normal execution of queries on a homogeneous dataset.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

The popularity of NoSQL systems is growing in the database community thanks to their ability to store and query schema-free data in flexible and efficient ways [8, 21]. The document data model is pervasive in most Web and Internet of Things (IoT) applications [13], and several database systems support this data model efficiently [1, 4, 5]. Furthermore, in such applications, the structures of documents representing the same entity are subject to structural changes [7], so an application may face the problem of dealing with multi-structured data [2]. To formulate relevant queries, precise knowledge of the data structures is required, because document stores do not provide native support for querying multi-structured data: users must manually include all possible navigational paths for the attributes of interest. Structural changes thus force users to reformulate their original queries, which is a time-consuming and error-prone task. The challenge addressed in this paper is how to support querying over future structural heterogeneity without affecting the application code.

In the context of document-oriented databases, and due to the flexible nature of documents, it is possible to create a collection of documents describing a single entity with multiple structures. This characteristic points to several kinds of heterogeneity [20]. Structural heterogeneity refers to diverse representations of documents (e.g., nested or flat structures, nesting levels), as shown in Figure 2. Syntactic heterogeneity results from differences in the representation of data (e.g., "movie_title" vs. "movieTitle"). Semantic heterogeneity arises when the same fields rely on distinct concepts in separate documents. This paper focuses on structural heterogeneity.

The problem of schema-independent querying is a hot topic in the study of document-oriented databases, for both industry and academia [6, 22, 23]. Previous work resolved this issue with two approaches: (i) physical integration of data, mapping the integrated document structures into a unified structure [22], and (ii) virtual integration, introducing a custom interface that proposes a new virtual schema to be learned by the users while querying heterogeneous data [23]. The first approach modifies the underlying data structure, which is not possible while supporting legacy applications designed to run over the original data structures; moreover, it implies defining a mapping for every original data structure. The second approach requires more effort from the user, who must learn new global structures; it is time-consuming and error-prone, since all queries are subject to revision whenever new documents with new structures must be queried.

In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We propose a virtual integration that runs transparently, hides the complexity of building the expected queries, and supports the evolution of structural heterogeneity. Queries are rewritten at execution time, which guarantees that the latest document structures, as defined in the dictionary, are always used.

The problem of structural heterogeneity refers to the possibility of finding different navigational paths that lead to the same attribute. Attributes are not located at the same position across documents, and limited knowledge of the navigational paths is insufficient to retrieve the required information. In Figure 1, the attribute "country" in documents describing films is not sufficient to differentiate between "actors.country" and "director.country". Some sub-paths may help to resolve the ambiguity, such as "actors.country" and "director.country", anywhere in the document. Therefore, sub-paths may be used rather than attribute names. In all cases, we hypothesise that there exist navigational paths that differentiate the different entities contained in the document.
{
  "movie_title": "Fast and Furious",
  "country": "USA",
  "actors": [ { "name": "Vin Diesel",
                "country": "USA" }, ... ],
  "director": { "name": "F. Gary Gray",
                "country": "USA" }
}

Figure 1: Descriptive fields ambiguity

We introduce EasyQ, which stands for "Easy Query", as a tool to validate our approach. We give particular attention to MongoDB for the implementation and evaluation. The primary contribution of our work is to reformulate users' original queries, formulated with only simple knowledge of the descriptive fields or sub-paths that contain the desired information. Users' queries are formulated in a schema-independent fashion, i.e. users can formulate an initial query based on a subset of the possible schemas without caring about all available structures. The query rewriting engine is responsible for transparently reformulating the initial query to match all existing schemas and return a relevant result. To deal with document schema heterogeneity, we define a dictionary that contains all possible paths for all existing fields; the query rewriting engine enriches the user query with all paths found in the dictionary for each field used in the query. In this paper, we use the terms field, descriptive field and attribute interchangeably.

The rest of the paper is structured as follows. In Section 2, we illustrate the problem addressed by the paper. Section 3 reviews the most relevant works providing support for querying multi-structured documents. Section 4 describes our approach in detail. Section 5 presents our first experiments and the performance of our approach while varying the size of the collection and the number of schemas per collection. In Section 6, we summarize our findings.

2 QUERYING DOCUMENT STORES WITH MULTIPLE SCHEMA ISSUES

As discussed earlier, querying multi-structured data is a complex task. The problem is which structure to use while formulating queries, and how this choice affects the results. In the following, we present a simple illustrating example.

Let C = {d1, d2, d3, d4} be a set of four films as shown in Figure 2. In this example we represent documents using JSON (JavaScript Object Notation); most NoSQL systems support this notation for representing semi-structured data. A document di is defined by a key-value pair (i, vi) where i is the key and vi is the value described in JSON.

d1: {
  "movie_title": "Fast and furious",
  "year": 2017,
  "language": "English"
},
d2: {
  "movie_title": "Titanic",
  "details": {"year": 1997, "language": "English"}
},
d3: {
  "movie_title": "Despicable Me 3",
  "year": 2017
},
d4: {
  "movie_title": "The Hobbit",
  "versions": [{"year": 2012, "language": "English"},
               {"year": 2013, "language": "French"}]
}

Figure 2: Four documents of films collection

Let us consider that we want to retrieve the available languages of each movie. We formulate a projection query with the fields "movie_title" and "language" using MongoDB syntax as follows:

db.C.find({}, {"movie_title": 1, "language": 1})

In this query, the field "movie_title" does not cause any difficulty since it is always at the same structural level in the four documents; the query engine is able to locate all information related to it. However, the field "language" may cause some information loss since it is found at several positions across documents. Thus, assuming that we formulate a query with limited knowledge of the structure s2 from d2, we build a query with the fields "movie_title" and "details.language":

db.C.find({}, {"movie_title": 1, "details.language": 1})

Executing such a query in MongoDB leads to an incomplete result, since the field "details.language" is not available in documents d1, d3, and d4. The problem comes from the structural heterogeneity due to the different structural positions of the field "language", i.e. "language" in document d1, "details.language" in document d2, and "versions.language" in document d4. Hence, we would have to include all these paths in the query using specific and often complex syntax.

Moreover, we can try to formulate two different queries. The first one is formulated over schema s1 of document d1 in order to retrieve the list of titles and the "language" of each film. We use the following MongoDB query:

db.C.find({}, {"movie_title": 1, "language": 1})

We then get the following result:

C1 = [
  {"movie_title": "Fast and furious", "language": "English"},
  {"movie_title": "Titanic"},
  {"movie_title": "Despicable Me 3"},
  {"movie_title": "The Hobbit"}]

We formulate the second using the schema s4 of document d4:

db.C.find({}, {"movie_title": 1, "versions.language": 1})

We get the following result:

C2 = [
  {"movie_title": "Fast and furious"},
  {"movie_title": "Titanic"},
  {"movie_title": "Despicable Me 3"},
  {"movie_title": "The Hobbit", "versions":
    [{"language": "English"}, {"language": "French"}]}
]
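The partial results produced by the two single-path queries can be reproduced outside MongoDB with a small sketch. This is our own illustration, not part of the paper's implementation: plain Python dictionaries stand in for the stored collection, and the helper names (`get_path`, `project`) are ours, not MongoDB API calls. A projection over one fixed path only reaches the documents that happen to use that structure:

```python
# Four film documents with heterogeneous structures (cf. Figure 2).
films = [
    {"movie_title": "Fast and furious", "year": 2017, "language": "English"},
    {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}},
    {"movie_title": "Despicable Me 3", "year": 2017},
    {"movie_title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]

def get_path(doc, path):
    """Resolve a dotted navigational path, descending into arrays;
    returns None when the path does not exist in this document."""
    node = doc
    for step in path.split("."):
        if isinstance(node, list):
            values = [get_path(item, step) for item in node]
            node = [v for v in values if v is not None] or None
        elif isinstance(node, dict):
            node = node.get(step)
        else:
            return None
        if node is None:
            return None
    return node

def project(collection, paths):
    """Naive projection: keep only the requested paths, per document."""
    out = []
    for doc in collection:
        row = {p: get_path(doc, p) for p in paths}
        out.append({p: v for p, v in row.items() if v is not None})
    return out

# Query formulated over the flat structure of d1: misses d2 and d4.
c1 = project(films, ["movie_title", "language"])
# Query formulated over the nested structure of d4: misses d1 and d2.
c2 = project(films, ["movie_title", "versions.language"])
```

As in the collections C1 and C2 above, each query retrieves the "language" information from only one structural variant, while "movie_title" is returned by both.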
When executing both queries, the query engine returns two results. As expected, all information related to the field "movie_title" is returned for all documents, since it is located at the document root. For the first query, only the first document matches the field carrying the "language" information; the second query succeeds in retrieving the "language" information only from the fourth document. The challenge is how to formulate a single query that retrieves all information related to the field "language" without any redundancy; for instance, the same information related to the field "movie_title" is obtained twice, in the resulting collections C1 and C2.

To solve this, we introduce a transparent way to build relevant schema-independent queries that bypass the structural heterogeneity in document stores. Simple knowledge of the required attributes allows users to retrieve the adequate documents regardless of the structural heterogeneity in the collection. This simplifies the task for end-users and provides them with an efficient way to retrieve the information of interest. For the above example, we enrich the original user query by adding all possible navigational paths to retrieve the relevant documents. For instance, we formulate the query db.C.find({}, {"movie_title": 1, "language": 1, "details.language": 1, "versions.language": 1}) in MongoDB syntax. It bypasses the structural heterogeneity in the current state of the collection and projects all desired values.

3 STATE OF THE ART

The widespread use of semi-structured data has increased interest in building solutions enabling queries over such data. We distinguish existing systems on the basis of the proposed querying approach: 1) schema-dependent querying, which requires knowledge of the schema in a similar way to conventional relational database systems; 2) schema-independent querying, which does not need any prior schema knowledge from the user and is able to extract the schemas at query time.

The first category of systems is designed to enable queries based on reliable knowledge of the schema, or of the navigational paths to the desired values when dealing with nested data. Such systems offer complicated query languages, such as regular expressions with XQuery or XPath [17] when dealing with XML data. XQuery works with the structure to retrieve precisely the desired results; however, if the user does not know the structure, it is impossible to write the relevant query. Moreover, a single query is generally not able to retrieve data when several schemas are to be considered simultaneously. The same considerations apply to JSONiq [9], the extension of XQuery designed to deal with large-scale data such as JSON. Other systems, such as MongoDB [5], suggest a JavaScript query API to build a query by specifying a document with the properties expected to match the results. It offers a broad range of querying capabilities, in particular data processing pipelines. However, the API requires a complex syntax, and queries must explicitly include all the various schema structures within documents to access the data; otherwise, the query engine returns only the documents that match the supplied criteria, even if the fields with the desired information exist under paths other than those in the query. Another line of work, SQL++ [19], relies on the rich SQL querying interface; in this case it is also mandatory to express all exact navigational paths in order to obtain the desired results.

The above-studied systems are designed to support queries over semi-structured data with known schemas: to formulate queries, the user needs to know the exact underlying data structures. They also neglect the fact that the user may have limited knowledge of the data structure and hence may be unable to formulate correct queries over a heterogeneous dataset.

To overcome these limitations, recent works enable schema-independent querying, the second category underlined at the beginning of this section: the schema need not be known in advance at loading time. We classify the studied works according to two approaches: (i) physical integration, which refactors the integrated data structures into a unified structure; and (ii) virtual integration, which introduces a custom interface and/or a new query language [23].

In the first direction, several works deal with semi-structured data. These works share the idea of schema-on-read: there is no need to define schemas before loading the data; the implicit schema is inferred later, from the stored datasets, at query time. They expose a relational view over the data to help users build SQL queries. Sinew [22] is able to infer schemas from semi-structured data; it defines for the user a logical view over the inferred schema and flattens the data into columns stored in a relational database system (RDBMS). Drill [12] enables schema-independent querying via SQL over heterogeneous data without first defining a schema, and supports nested data. Tenzing [18] infers a relational schema from the underlying data, but can only do so for flat structures that can be trivially mapped to a relational schema.

The principle behind these solutions implies heavy physical refactoring that flattens the underlying data structures into a relational format using complex encoding techniques. Hence, the refactoring requires additional resources, such as an external relational database, and extra effort to learn the unified inferred relational schema. Besides, some solutions do not support the flexible nature of semi-structured data [18]; for instance, they cannot handle nested data. Users of those systems have to learn new schemas every time the workload changes or new data arrives, because the relational view and the stored columns must be re-generated after every change.

Virtual integration has also received attention from researchers [14, 23]. These works are inspired by the data lake approach [11], where data is collected in its original format for later use. We consider two major classes: i) schema-oriented queries; and ii) keyword querying.

Works from the first class infer the schema from a collection of data and offer users the possibility to query the inferred schema and to check whether a field or sub-schema exists, to guide them while developing their applications. In [23] the authors propose to summarize all the document schemas in a skeleton to discover the existence of fields or sub-schemas inside the collection. In [14] the authors suggest extracting all the schemas present in the collection to make final users aware of the schemas and all the fields in the integrated collection of documents. These solutions are limited to type and field identification and are not used to determine the different paths to access a field in the collection.

Keyword querying has been adopted in the context of XML [10, 24]. The process of answering a keyword query on XML data starts by identifying the occurrences of the keywords within the documents (possibly through some free-text search). They
take as input the searched keywords and return the subsets of the documents that match the query keywords. A score is computed based on the structure of the sub-documents and, according to this score, the respective XML documents containing all the keywords are returned.

Works in keyword querying suggest pairwise comparison or binary search to identify the possible positions of the queried keywords. This approach is not well tailored to large numbers of documents with complex structures (different nested elements, numerous attributes, etc.).

Building on the state of the art, we base our approach on the idea of offering virtual integration to enable schema-independent querying via keywords over the attributes, while supporting native semi-structured features such as nested attributes and heterogeneous collections of documents.

Our work relates in some way to previous attempts at XML keyword querying [16]. The most important contribution of these earlier efforts is to prevent users from having to learn complex underlying schemas, as well as a complex query language to manage paths. We adopt the idea that the user may not be aware of all existing schemas and cannot manage overly complex queries, in order to enable schema-independent querying based only on knowledge of the field carrying the desired information. The main difference between our work and keyword querying is that we require from the user some simple details about the queried data. For instance, if we execute the keyword query "English", it is possible to obtain as results "language": "English" but also "movie_title": "Johnny English". With our approach, we specify that we are interested in the field "language" = "English", or in the field "movie_title" containing "English".

4 QUERYING HETEROGENEOUS COLLECTIONS OF DOCUMENTS

In our proposal, we want to enable queries over multi-structured documents by automatically handling the underlying structural heterogeneity. Our query rewriting engine thus gives transparent support for the heterogeneity of both stored and future data.

Figure 3: An overview of EasyQ

To give an overview of our approach, let us consider the following selection query (the selection operation is defined later in this section):

σ("year"=1997)(C)

We refer to the collection presented in Figure 2, in which three distinct navigational paths lead to the attribute "year", i.e. "details.year", "versions.year" and "year". For each document, at least one path can lead to the attribute "year". It is possible to express the selection predicate in disjunctive form over the navigational paths:

X = ("year" = 1997) ∨ ("details.year" = 1997) ∨ ("versions.year" = 1997)

We rewrite the initial query into σ(X)(C). Two conditions have to be satisfied to select a document: (i) at least one navigational path from the sub-conditions of X exists inside the document, and (ii) the evaluation of at least one of these sub-conditions is true. Otherwise, X evaluates to false and the document is not returned in the result.

The challenges are how to enable schema-independent querying in a transparent way, and how to support future new structures without revising the application code.

Figure 3 gives a high-level illustration of our query rewriting engine, called EasyQ. EasyQ is used early, in the data loading phase, to materialize a dictionary that tracks the different navigational paths of all attributes. EasyQ is also used at querying time to enrich the user query Q and bypass the structural heterogeneity. It takes as input the user query, formulated over final fields or sub-paths, and the desired collection. The query rewriting engine produces one extended query Qext that is executed by the underlying document store. The result of this extended query is a collection containing the relevant information. An important consequence of this architecture is that the same user query, evaluated at different moments, is rewritten each time; so if new documents with new structures have been inserted into the collection (or existing documents have been updated), these new structures are automatically handled and the results remain relevant to the query.

In the rest of this section, we describe the formal model of multi-structured documents, the dictionary, and the querying operators over multi-structured documents. Finally, we formally define how we rewrite the queries.

4.1 Multi-structured data modeling

Definition 4.1 (Collection). A collection C is defined as a set of documents

C = {d1, . . . , d|C|}

Each document di is considered as a (key, value) pair where the value takes the form vi = {ai,1: vi,1, . . . , ai,n: vi,n}.

Definition 4.2 (Document). A document di, ∀i ∈ [1, c], is defined as a (key, value) pair

di = (kdi, vdi)

• kdi is a key that identifies the document (by abuse of notation, we wrote i for the key kdi in Sections 1 and 2);
• vdi = {adi,1: vdi,1, . . . , adi,n: vdi,n} is the document value. The document value vdi is defined as an object composed of a set of (adi,j, vdi,j) pairs, where each adi,j is a string called attribute and each vdi,j is a value that can be atomic (numeric, string, boolean, null) or complex (object, array). A value vdi,j is defined below.

An atomic value is defined as follows, ∀j ∈ [1..n]:
• vdi,j = n if n ∈ N∗, the set of numeric values (integer, float);
• vdi,j = "s" if "s" is a string formulated in Unicode, A∗;
• vdi,j = b if b ∈ B, the set of boolean values {true, false};
• vdi,j = ⊥ is a null value.

A complex value is defined as follows, ∀j ∈ [1..n]:
• vdi,j = {adi,j,1: vdi,j,1, . . . , adi,j,m: vdi,j,m} is an object value, where the vdi,j,k, ∀k ∈ [1..m], are values and the adi,j,k, ∀k ∈ [1..m], are strings in A∗ called attributes. This is a recursive definition, identical to the document value;
• vdi,j = [vdi,j,1, . . . , vdi,j,l] represents an array of values vdi,j,k, ∀k ∈ [1..l], with l = ∥vdi,j∥.

When document values vdi,j are objects or arrays, their inner values vdi,j,k can be complex values too, allowing different nesting levels. To cope with nested documents and navigate through schemas, we adopt the navigational path notations of [3, 15].

Definition 4.3 (Schema). The schema sdi, inferred from the document value vdi = {adi,1: vdi,1, . . . , adi,n: vdi,n}, is defined as

sdi = {p1, . . . , pmi}

where pj, ∀j ∈ [1..mi], is the path of an attribute of vdi, or a navigational path for nested values such as vdi,j,k. For multiple nesting levels, the navigational path is extracted recursively to find the path from the root to the final atomic value in the document hierarchy.

The schema sdi of value vdi from document di is formally defined as:
• if vdi,j is atomic, sdi = sdi ∪ {adi,j};
• if vdi,j is an object, sdi = sdi ∪ {adi,j} ∪ {∪p∈sdi,j adi,j.p}, where sdi,j is the schema of vdi,j;
• if vdi,j is an array, sdi = sdi ∪ {adi,j} ∪ ⋃k=1..∥vdi,j∥ ({adi,j.k} ∪ {∪p∈sdi,j,k adi,j.k.p}), where sdi,j,k is the schema of the k-th value of the array vdi,j.

Example. Let us consider the documents d1 and d2 of Figure 2. The underlying schemas of the two documents are:

svd1 = {"movie_title", "year", "language"}
svd2 = {"movie_title", "details", "details.year", "details.language"}

We notice that the attribute "details" of document d2 is a complex one, in which the attributes "year" and "language" are nested; this leads to the two navigational paths "details.year" and "details.language".

Definition 4.4 (Collection Schema). The schema SC inferred from a collection C is defined by

SC = ⋃i=1..c svdi

Definition 4.5 (Dictionary). The dictionary dictC of a collection C is defined by

∀pk ∈ SC, dictC = {(pk, △k)}

• pk ∈ SC is a path for an attribute which is present in at least one document of the collection;
• △k = {ppk,1, . . . , ppk,q} ⊆ SC is the set of navigational paths leading to pk.

For the rest of this paper, we will refer to any path pk equally as an attribute, and use dictionary paths and dictionary attributes accordingly.

Example. The dictionary dictC constructed from the collection C is defined below; each dictionary entry pk refers to the set of all extracted navigational paths.

dictC = {
  (movie_title, {movie_title}),
  (year, {year, details.year, versions.1.year, versions.2.year}),
  (language, {language, details.language, versions.1.language, versions.2.language}),
  (details, {details}),
  (details.year, {details.year}),
  (details.language, {details.language}),
  (versions, {versions}),
  (versions.1, {versions.1}),
  (versions.1.year, {versions.1.year}),
  (versions.1.language, {versions.1.language}),
  (versions.2, {versions.2}),
  (versions.2.year, {versions.2.year}),
  (versions.2.language, {versions.2.language})
}

For example, the entry (year, {year, details.year, versions.1.year, versions.2.year}) gives all navigational paths leading to the attribute "year".

4.2 Querying multi-structured data

Querying multi-structured data is possible via a combination of unary operators. In this paper, we limit the querying process to the projection and selection operators, expressed by the native MongoDB operators "find" and "aggregate".

4.2.1 Minimal closed kernel of unary operators. We define a minimal closed kernel of unary operators. We call Cin the queried collection and Cout the resulting collection.

Definition 4.6 (Projection). The project operator reduces the initial schemas of the documents of the collection to a finite subset of attributes:

πA(Cin) = Cout

where A ⊆ SCin is a subset of attributes from SCin (the schema of the input collection Cin).

Definition 4.7 (Selection). The select operator retrieves only the documents that match some predicate:

σp(Cin) = Cout

where p refers to the predicate (or condition) of the selection operator. A simple predicate is expressed as ak ωk vk, where ak ⊆ SCin is an attribute, ωk ∈ {=, >, <, ≠, ≥, ≤} is a comparison operator, and vk is a value. Predicates can be combined with the operators of Ω = {∨, ∧, ¬}, leading to a complex predicate.

We call Normp the normal conjunctive form of the predicate p, defined as follows:

Normp = ⋀i ⋁j (ai,j ωi,j vi,j)

We consider that all predicates in selection operators are in normal conjunctive form.

Definition 4.8 (Query). A query Q can be formulated by composing operators:

Q = q1 ◦ · · · ◦ qr(C)

where ∀i ∈ [1, r], qi ∈ {π, σ}.
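The dictionary of Definition 4.5 can be materialized with a short sketch. This is our own illustration under stated assumptions: the paper does not prescribe an implementation, and the suffix-based indexing (registering every absolute path under each of its dot-separated suffixes) is one possible realization that reproduces the dictC entries shown above, plus some extra suffix entries such as "1.year":

```python
def schema_paths(value, prefix=""):
    """Recursively extract all navigational paths of a document value
    (Definition 4.3); array elements are addressed by 1-based position."""
    paths = set()
    if isinstance(value, dict):
        for attr, sub in value.items():
            path = f"{prefix}.{attr}" if prefix else attr
            paths.add(path)
            paths |= schema_paths(sub, path)
    elif isinstance(value, list):
        for k, sub in enumerate(value, start=1):
            path = f"{prefix}.{k}"
            paths.add(path)
            paths |= schema_paths(sub, path)
    return paths

def build_dictionary(collection):
    """Index every absolute path under each of its suffixes, so a
    lookup by attribute name returns all paths leading to it."""
    dictionary = {}
    for doc in collection:
        for path in schema_paths(doc):
            steps = path.split(".")
            for i in range(len(steps)):
                suffix = ".".join(steps[i:])
                dictionary.setdefault(suffix, set()).add(path)
    return dictionary

# The films collection of Figure 2, as plain Python values.
films = [
    {"movie_title": "Fast and furious", "year": 2017, "language": "English"},
    {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}},
    {"movie_title": "Despicable Me 3", "year": 2017},
    {"movie_title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]
dictionary = build_dictionary(films)
```

On the collection of Figure 2, the entry for "year" is {year, details.year, versions.1.year, versions.2.year}, matching the dictC example.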
Example. Let us consider the collection presented in Figure Example. Let us consider the query q 3 from the previous
2. example. First, the query rewriting engine starts by extending
the project operator. (line “projection” in Algorithm 1)
q 1 : σ(“l anдuaдe”=“Enдl ish”) (C) = [ π(“movie_t itl e”,“year ”) (C)
{“movie_title” : “Fast and f urious”, “year ” : 2017,
“lanдuaдe” : “Enдlish”} For each projected field, the process consults the dictionary
] and extracts all the possible navigational paths. The dictionary
q 2 : π(“movie_t itl e”,“year ”) (C) = [ entry, for the field movie_title, corresponds to:
{“movie_title” : “Fast and furious”, “year” : 2017},
{“movie_title” : “Titanic”},
{“movie_title” : “Despicable Me 3”, “year” : 2017},
{“movie_title” : “The Hobbit”}
]
q3 : π(“movie_title”,“year”)(σ(“language”=“English”)(C)) = [
{“movie_title” : “Fast and furious”, “year” : 2017}
]
Here, the query q3 is constructed by combining the select and project operators.

4.2.2 Query extension for multi-structured data. In this section, we introduce a new query extension algorithm that automatically enriches the user query. The native query engine of document-oriented stores such as MongoDB can efficiently execute our rewritten queries. Then, it is possible to find all desired information regardless of the structural heterogeneity inside the collection.

Our approach aims to enable transparent querying on a multi-structured collection of documents via automatic query rewriting. This process employs the materialized dictionary to enrich the original query by including the different navigational paths that lead to the desired attributes. Algorithm 1 describes the query extension process as follows:
• In the case of a projection operation, the algorithm extends the list of attributes Ai by uniting the different navigational paths △k of each projected attribute ak.
• In the case of a selection operation, the algorithm enriches the predicate p, expressed in conjunctive normal form, with the set of extended disjunctions built from the navigational paths △i,j of each attribute ai,j.

Algorithm 1: Automatic extension of the initial user query
input : Q
output : Qext
Qext ← id // identity
foreach qi ∈ Q do
    switch qi do
        case πAi do // projection
            Aext ← ⋃ak∈Ai △k
            Qext ← Qext ∘ πAext
        end
        case σp do // selection
            Pext ← ⋀i ⋁j ⋁ak∈△i,j ak ϖi,j vi,j
            Qext ← Qext ∘ σPext
        end
    end
end

To rewrite q3, the process first deals with the projection (line "projection" in Algorithm 1). The dictionary entry for the field movie_title corresponds to:
(movie_title, {movie_title})
So, Aext = {movie_title}.
The dictionary entry for the field year corresponds to:
(year, {year, details.year, versions.1.year, versions.2.year})
So, Aext = {movie_title, year, details.year, versions.1.year, versions.2.year}.
The projection query is then rewritten as:
π(“movie_title”,“year”,“details.year”,“versions.1.year”,“versions.2.year”)(C)

Next, the process continues with the selection query (line "selection" in Algorithm 1):
σ(“language”=“English”)(C)
The dictionary entry for the field language corresponds to:
(language, {language, details.language, versions.1.language, versions.2.language})
So, Pext = {“language”=“English” ∨ “details.language”=“English” ∨ “versions.1.language”=“English” ∨ “versions.2.language”=“English”}.
The selection is then rewritten as:
σ(“language”=“English”∨“details.language”=“English”∨“versions.1.language”=“English”∨“versions.2.language”=“English”)(C)

Finally, the query rewriting engine generates the final query by combining the generated queries:
π(“movie_title”,“year”,“details.year”,“versions.1.year”,“versions.2.year”)(σ(“language”=“English”∨“details.language”=“English”∨“versions.1.language”=“English”∨“versions.2.language”=“English”)(C)) = [
{“movie_title” : “Fast and furious”, “year” : 2017},
{“movie_title” : “Titanic”, “details” : {“year” : 2017}},
{“movie_title” : “The Hobbit”, “versions” : [{“year” : 2017}]}
]
The query rewriting process injects additional complexity into the original user's queries.

5 EXPERIMENTS
In this section, we conduct a series of experiments to study the aforementioned points:
• What are the effects on the execution time of the rewritten queries when varying the size of the collection, and is this cost acceptable?
• Is the time to build the dictionary acceptable, and how does the size of the dictionary evolve with structural variability?
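The extension step described above can be sketched in pure Python. The dictionary below is the one from the running example; the function names (`extend_projection`, `extend_selection`) and the CNF encoding are illustrative assumptions, not part of EasyQ:

```python
# Sketch of Algorithm 1, assuming a materialized dictionary that maps
# each field name to the set of navigational paths leading to it.
# Names and data encoding are illustrative, not taken from EasyQ.

DICTIONARY = {
    "movie_title": {"movie_title"},
    "year": {"year", "details.year", "versions.1.year", "versions.2.year"},
    "language": {"language", "details.language",
                 "versions.1.language", "versions.2.language"},
}

def extend_projection(attributes):
    """Unite the navigational paths of every projected attribute (case pi_Ai)."""
    extended = set()
    for a in attributes:
        # Fields absent from the dictionary are kept as-is.
        extended |= DICTIONARY.get(a, {a})
    return extended

def extend_selection(cnf_predicate):
    """Extend a predicate in conjunctive normal form (case sigma_p).

    The predicate is encoded as a list of disjunctions, each disjunction
    being a list of (field, operator, value) atoms. Every atom is replaced
    by one atom per navigational path of its field.
    """
    return [
        [(path, op, value)
         for (field, op, value) in disjunction
         for path in sorted(DICTIONARY.get(field, {field}))]
        for disjunction in cnf_predicate
    ]
```

For q3, `extend_projection({"movie_title", "year"})` yields the five projected paths, and `extend_selection([[("language", "=", "English")]])` turns the single selection atom into the four-way disjunction shown above.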
Next, we explain the experimental protocol, then we study the query execution cost, and finally we evaluate the dictionary generation time and its size.

5.1 Experimental protocol
We choose to run all the queries on synthetic datasets loaded into the document store MongoDB. In this section, we introduce the details of the experimental setup, the process of generating the synthetic datasets, and the evaluation query set. Later on, we present the results of executing the evaluation set in three separate contexts. The goal is to compare the cost of executing the rewritten queries with: (i) the cost of executing the original queries on homogeneous documents, and (ii) the execution time of several distinct queries that we build manually based on each schema. Then, we study the effects of heterogeneity on the dictionary in terms of size and construction time. Finally, we evaluate the scale of the heterogeneity and its impact on generating the rewritten queries.

We conducted our experiments on MongoDB v3.4. We used an i5 3.4 GHz machine with 16 GB of RAM and 4 TB of storage space running CentOS 7.

5.1.1 Dataset. To study structural heterogeneity, we generate custom synthetic datasets. First, we collected a JSON collection of documents from imdb¹ that describe movies. The original dataset has only flat documents with 28 attributes in each document. Then, we reuse this flat collection to produce documents with structural heterogeneity. For each generated dataset, we can define several parameters, such as the number of schemas to produce in the collection and the percentage of presence of every generated schema. For each schema, we can adjust the number of grouping objects. By grouping object, we mean a compound field in which we nest a subset of the document attributes. To ensure heterogeneity within documents, the grouping objects are unique in every schema; in other words, the same grouping objects cannot be found in two structures. Only the original fields from the flat dataset are common to all documents. The values of those fields are randomly chosen from the original film collection. To add more complexity, we can set the nesting level used for each structure. For the rest of the experiments, we built our datasets based on the characteristics described in Table 1. We generate collections of 10, 25, 50 and 100 GB of data.

Setting | Value
# of schemas | 10
# of grouping objects per schema | {5,6,1,3,4,2,7,2,1,3}
Nesting levels per schema | {4,2,6,1,5,7,2,8,3,4}
Percentage of schema presence | 10%
# of attributes per schema | Random
# of attributes per grouping object | Random
Table 1: Settings of the generated dataset

We generate a flat collection with the same leaf attributes and their corresponding values as found in the heterogeneous datasets. This new collection gives us a proper environment to compare the execution time of the rewritten query on the heterogeneous datasets versus the execution time of the original query on the homogeneous datasets. Therefore, we ensure that every query returns identical results from both the heterogeneous and flat datasets. The same result implies: i) the same number of documents, and ii) the same values for their attributes (leaf fields).

5.1.2 Queries. We choose to build a synthetic set of queries based on the different comparison operators supported by MongoDB. We employed the classical comparison operators, i.e., {<, >, ≤, ≥, =, ≠}, for numerical values, as well as the classical logical operators, i.e., {and, or}, between query predicates. Also, we employed a regular expression to deal with string values. We select 8 attributes of different types and at different levels inside the documents of the heterogeneous datasets. Table 2 shows, for each attribute, its type and the selection operator that we use later when formulating the synthetic queries. In addition, we present for each attribute the number of possible paths found in the synthetic heterogeneous collection, the different nesting levels, and the selectivity of the predicate.

Predicate | Attribute | Type | Operator | Paths | Depths | Selectivity
p1 | DirectorName | String | Regex{^A} | 8 | {8,2,3,9,6,5,4,7} | 0.06%
p2 | Gross | Int | > 100 k | 7 | {7,8,2,3,9,6,4} | 66%
p3 | Language | String | = "English" | 7 | {7,8,3,9,6,5,4} | 0.018%
p4 | Imdb_score | Float | < 4.7 | 8 | {8,7,2,3,4,5,6,9} | 29%
p5 | Duration | Int | ≤ 200 | 7 | {7,8,2,3,6,5,4} | 77%
p6 | Country | String | ≠ Null | 6 | {7,2,3,9,5,4} | 100%
p7 | year | Int | < 1950 | 7 | {7,8,2,3,6,5,4} | 23%
p8 | FB_likes | Int | ≥ 500 | 7 | {6,2,3,8,5,4,3} | 83%
Table 2: Query predicates

We formulate the following 6 queries:
• Q1 : p1 ∧ p2
• Q2 : p1 ∨ p2
For queries Q1 and Q2, the rewritten queries contain 15 predicates, unlike the original queries that contain 2 predicates. The 15 predicates are due to the 8 existing paths for DirectorName in p1 plus the 7 paths for Gross in p2, which are included in a disjunctive form as described in the rewriting algorithm.
• Q3 : p1 ∧ p2 ∧ p5 ∧ p7
• Q4 : p1 ∨ p2 ∨ p5 ∨ p7
The rewritten versions of Q3 and Q4 contain 29 predicates, unlike the original queries that contain 4 predicates.
• Q5 : p1 ∧ p2 ∧ p5 ∧ p7 ∧ p6 ∧ p3 ∧ p4 ∧ p8
• Q6 : p1 ∨ p2 ∨ p5 ∨ p7 ∨ p6 ∨ p3 ∨ p4 ∨ p8
Finally, the rewritten versions of Q5 and Q6 contain 57 predicates, unlike the original queries that contain 8 predicates.

Table 3 presents for each dataset: i) the number of documents inside the collection, and ii) the number of expected results for each executed query.

Collection size | # of documents | Q1 | Q2 | Q3 | Q4 | Q5 | Q6
10 GB | 12 M | 520 K | 8.4 M | 87.2 K | 11.8 M | 340 | 12 M
25 GB | 30 M | 1.3 M | 21 M | 218 K | 29.5 M | 850 | 30 M
50 GB | 60 M | 2.6 M | 42 M | 436 K | 59 M | 1.7 K | 60 M
100 GB | 120 M | 5.2 M | 84 M | 872 K | 118 M | 3.4 K | 120 M
Table 3: Number of extracted documents per query

¹ imdb.com
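The sizes of the rewritten queries follow directly from the path counts in Table 2: each predicate pi expands into one comparison per path of its attribute, whatever the logical connective. This can be checked with a small sketch (the path counts are those reported in Table 2; the function name is ours):

```python
# Number of navigational paths per predicate, as reported in Table 2.
PATHS = {"p1": 8, "p2": 7, "p3": 7, "p4": 8, "p5": 7, "p6": 6, "p7": 7, "p8": 7}

def rewritten_size(predicates):
    """A rewritten query contains one comparison per path of each predicate,
    regardless of whether the predicates are combined with AND or OR."""
    return sum(PATHS[p] for p in predicates)

# Q1/Q2 use p1 and p2; Q3/Q4 add p5 and p7; Q5/Q6 use all eight predicates.
q1 = rewritten_size(["p1", "p2"])                    # 15 predicates
q3 = rewritten_size(["p1", "p2", "p5", "p7"])        # 29 predicates
q5 = rewritten_size(["p1", "p2", "p5", "p7",
                     "p6", "p3", "p4", "p8"])        # 57 predicates
```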
Figure 4: Query rewriting evaluations

Collection | Query | Usage
Homogeneous flat documents | QBase | Baseline user query
Heterogeneous documents | QRewritten | Our solution
Heterogeneous documents | QAccumulated | The sum of the ten separate queries, each one based on a specific schema. "By hand" query context, used as a maximum-cost line.
Table 4: Evaluation contexts

5.2 Query evaluation
5.2.1 Query execution time. We define three contexts in which we run the above-defined queries. For each context, we measure the average execution time after executing each query at least five times. The order of query execution is set randomly.

In the following, we present the details of the three evaluation contexts:
• We call “QBase” the query that refers to the initial user query (one of the above-defined queries); it is executed over the homogeneous versions of the datasets. The purpose of this first context is to study the native behavior of the document store. We use this first context as a baseline for our experimentation.
• “QRewritten” refers to the query “QBase” rewritten by our approach. It is executed over the heterogeneous versions of the datasets.
• “QAccumulated” refers to the set of equivalent queries formulated on each possible schema in the collection. In our case, it is made of 10 separate queries, since we are dealing with collections having ten schemas. These queries are built "by hand", as any user without an assisting tool would have done. We do not consider the time necessary to merge the results of each query, as the goal is to compare the time to find the set of result documents. “QAccumulated” is obviously executed over the heterogeneous versions of the datasets.
Table 4 synthesizes our execution contexts.

As shown in Figure 4, we can notice that our rewritten query, QRewritten, outperforms the accumulated one, QAccumulated. The difference between the two execution scenarios comes from the capability of our rewritten query to automatically include all navigational paths extracted from the collection. Hence, the query is executed only once, whereas the accumulated query may require several passes through the stored collection; this solution requires more CPU load and more intensive disk I/O operations.

We now move to study the efficiency of the rewritten query compared to the baseline query QBase. We can notice that the overhead of our solution is up to two times (e.g., for the disjunctive form) the native execution of the baseline query on the homogeneous dataset. Moreover, we record an overall overhead that does not exceed 1.5 times for all six queries. We believe that this overhead is acceptable, since we bypass the cost of refactoring the underlying data structures. Unlike the baseline, our synthetic dataset contains different grouping objects with varying nesting levels. Hence, the rewritten query includes several navigational paths, which are processed by the native query engine of MongoDB to find matches in each visited document of the collection.

5.2.2 Dictionary and query rewriting engine at scale. With this series of experiments, we try to push the dictionary and the query rewriting engine to their limits. To this end, we generated heterogeneous synthetic collections of 1 GB. We use the primary 28 attributes from the IMDB flat films collection. The custom collections are generated in such a way that each schema inside a document is composed of two grouping objects with no further nesting levels. We generated collections having 10, 100, 1k, 3k and 5k schemas. For this experiment, we keep using the query Q6 introduced earlier in this section.

# of schemas | Rewriting time | Dictionary size
10 | 0.0005 s | 40 KB
100 | 0.0025 s | 74 KB
1 K | 0.139 s | 2 MB
3 K | 0.6 s | 7.2 MB
5 K | 1.52 s | 12 MB
Table 5: Scale effects on query rewriting and dictionary size

We present the time needed to build the rewritten query in Table 5. It is notable that the time to build the rewritten query is very low: less than two seconds. Also, it is possible to construct a dictionary over a heterogeneous collection of documents; here, our dictionary can support up to 5k distinct schemas. The resulting size of the materialized dictionary is very encouraging, since it does not require significant storage space. Furthermore, we also believe that the time spent building the rewritten query is really interesting and represents another advantage of our solution. In this series of experiments, we each time try to find the distinct navigational paths for eight predicates.
Each rewritten query is composed of numerous disjunctive forms for each predicate. We notice 80 disjunctive forms when dealing with the dataset having 10 schemas, 800 with 100 schemas, 8k with 1k schemas, 24k with 3k schemas, and 40k with 5k schemas. We believe that the dictionary and the query rewriting engine scale well when dealing with heterogeneous collections of documents having a large number of schemas.

5.3 Dictionary construction time
In this part, we study the dictionary construction process. EasyQ offers the possibility to build the dictionary over an existing dataset or during the data loading phase. The dictionary contains the latest version of the data once all documents are inserted; the query rewriting engine then enriches the queries based on the new dictionary. Otherwise, if the data loading process is still in progress, it may not take the most recent changes into account. In the following, we study both configurations. First, we evaluate the time required to build the dictionary over five pre-loaded collections of 100 GB having 2, 4, 6, 8 and 10 schemas respectively.

We notice from the results in Table 6 that the time elapsed to build the dictionary increases when we deal with collections having more heterogeneity. In the case of the collection with 10 structures, the extra time does not exceed 40% compared to a collection with 2 structures. We can also notice in Table 6 the negligible size of the generated dictionaries compared to the 100 GB of the collection.

# of schemas | 2 | 4 | 6 | 8 | 10
Required time (minutes) | 96 | 108 | 127 | 143 | 156
Size of the resulting dictionary (KB) | 4,154 | 9,458 | 13,587 | 17,478 | 22,997
Table 6: Time to build the dictionary of pre-loaded data

Afterwards, we evaluate the overhead caused by generating the dictionary at loading time. We generate five collections of 1 GB having the same structures as in the last experiment (2, 4, 6, 8 and 10 schemas respectively). We present two measurements in Table 7. First, we measure the time to simply load each collection (without building the dictionary). Second, we measure the overall time to build the dictionary while loading the collection.

# of schemas | Load (s) | Load and dict. (s) | Overhead
2 | 201 s | 269 s | 33%
4 | 205 s | 277 s | 35%
6 | 207 s | 285 s | 37%
8 | 208 s | 300 s | 44%
10 | 210 s | 309 s | 47%
Table 7: Study of the overhead added during load time

In these experiments, we find that the overhead does not exceed 0.5 times the time required to only load the data. The evolution of the time when adding more heterogeneity is linear rather than exponential, which is encouraging. Many factors may affect the query construction phase: the number of attributes and the nesting levels may increase or decrease the overhead. The advantage of our solution is that once the data is loaded and the dictionary is built or updated, the rewritten query automatically takes all changes into account.

6 CONCLUSION
NoSQL databases are often called schemaless because they may contain variable schemas among the stored data. Nowadays, this variability is becoming a common standard in many applications, for example web applications, social media applications, or the Internet of Things world. Nevertheless, the existence of such structural heterogeneity makes it very hard for users to formulate queries that find relevant and coherent results.

In this paper, to deal with structural heterogeneity, we suggest a novel approach for querying heterogeneous documents describing a given entity over NoSQL document stores. The developed tool is called EasyQ. Our objective is to allow users to perform their queries using minimal knowledge about the data schemas. EasyQ is based on two pillars. The first one is a dictionary that contains all possible paths for any existing field. The second one is a rewriting module that modifies the user query to match all field paths existing in the dictionary. Our approach is a syntactic manipulation of queries. So, it is grounded on an important assumption: the collection describes homogeneous entities, i.e., a field has the same meaning in all document schemas. In case of ambiguity, the user should specify some sub-path in order to overcome the ambiguity. If this assumption is not guaranteed, users may face irrelevant or incoherent results. Nevertheless, this assumption may be acceptable in many applications, such as legacy collections, web applications, or Internet of Things data.

In our first experiments, the evaluation consists in comparing the execution time of basic MongoDB queries and of the rewritten queries proposed by our approach. We conduct a set of tests by changing two primary parameters: the size of the dataset and the structural heterogeneity inside a collection (number of different schemas). Results show that the cost of executing the rewritten queries proposed in this paper is higher than the execution of basic user queries, but always less than twice as high. The overhead added to the performance of our query is due to the combination of multiple access paths to a queried field. Nevertheless, this time overhead is negligible compared to the execution of separate "by hand" queries for each schema, while heterogeneity issues are automatically managed.

These first results are very encouraging for continuing this line of research, and they need to be strengthened. Short-term perspectives are to continue the evaluations and to identify the limitations regarding the number of paths and fields in the same query, as well as regarding the time cost. More experiments remain to be performed on larger datasets and on real-case datasets. Another perspective is to enhance the current query possibilities to include all the classical operators of query languages (contains, etc.). It is also necessary to deal with other querying operators, particularly the aggregation operator.

The first long-term perspective consists in studying the real-time building of the dictionary when integrating data, in order to take into account all possible operations: insert, but also delete and update. It is likely that the current simple structure of the dictionary will have to be transformed in depth to support such more complex updates. The second long-term perspective consists in managing multi-store databases. The goal would be to extend the proposed approach to query data stored in different types of databases in a way that is independent from the various data schemas and stores. The final goal would be: how to query "transparently" any data store while being unaware of the schemas or the real field names?
REFERENCES
[1] J Chris Anderson, Jan Lehnardt, and Noah Slater. 2010. CouchDB: The Definitive
Guide: Time to Relax. O'Reilly Media, Inc.
[2] Mohamed-Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio
Ghelli, and Carlo Sartiani. 2017. Schema inference for massive json datasets.
In Extending Database Technology (EDBT).
[3] Pierre Bourhis, Juan L Reutter, Fernando Suárez, and Domagoj Vrgoč. 2017.
JSON: data model, query languages and schema specification. In Proceedings
of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database
Systems. ACM, 123–135.
[4] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan
Tournier. 2015. Implementation of multidimensional databases in column-
oriented NoSQL systems. In East European Conference on Advances in Databases
and Information Systems. Springer, 79–91.
[5] Kristina Chodorow and Michael Dirolf. 2010. MongoDB: The Definitive Guide.
O'Reilly Media.
[6] Mohamed Lamine Chouder, Stefano Rizzi, and Rachid Chalal. 2017. Enabling
Self-Service BI on Document Stores.. In EDBT/ICDT Workshops.
[7] Alejandro Corbellini, Cristian Mateos, Alejandro Zunino, Daniela Godoy, and
Silvia Schiaffino. 2017. Persisting big-data: The NoSQL landscape. Information
Systems 63 (2017), 1–23.
[8] Avrilia Floratou, Nikhil Teletia, David J DeWitt, Jignesh M Patel, and Donghui
Zhang. 2012. Can the elephants handle the nosql onslaught? Proceedings of
the VLDB Endowment 5, 12 (2012), 1712–1723.
[9] Daniela Florescu and Ghislain Fourny. 2013. JSONiq: The history of a query
language. IEEE internet computing 17, 5 (2013), 86–90.
[10] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. 2003.
XRANK: Ranked keyword search over XML documents. In Proceedings of the
2003 ACM SIGMOD international conference on Management of data. ACM,
16–27.
[11] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An intelli-
gent data lake system. In Proceedings of the 2016 International Conference on
Management of Data. ACM, 2097–2100.
[12] Michael Hausenblas and Jacques Nadeau. 2013. Apache drill: interactive
ad-hoc analysis at scale. Big Data 1, 2 (2013), 100–104.
[13] Robin Hecht and Stefan Jablonski. 2011. NoSQL evaluation: A use case oriented
survey. In Cloud and Service Computing (CSC), 2011 International Conference
on. IEEE, 336–341.
[14] Victor Herrero, Alberto Abelló, and Oscar Romero. 2016. NOSQL design for
analytical workloads: variability matters. In Conceptual Modeling: 35th Inter-
national Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings
35. Springer, 50–64.
[15] Jan Hidders, Jan Paredaens, and Jan Van den Bussche. 2017. J-Logic: Logical
Foundations for JSON Querying. In Proceedings of the 36th ACM SIGMOD-
SIGACT-SIGAI Symposium on Principles of Database Systems. ACM, 137–149.
[16] Prashant R Lambole and Prashant N Chatur. 2017. A review on XML keyword
query processing. In Innovative Mechanisms for Industry Applications (ICIMIA),
2017 International Conference on. IEEE, 238–241.
[17] Yunyao Li, Cong Yu, and HV Jagadish. 2004. Schema-free xquery. In Proceed-
ings of the Thirtieth international conference on Very large data bases-Volume
30. VLDB Endowment, 72–83.
[18] Liang Lin, Vera Lychagina, Weiran Liu, Younghee Kwon, Sagar Mittal, and
Michael Wong. 2011. Tenzing: A SQL implementation on the MapReduce frame-
work. (2011).
[19] Kian Win Ong, Yannis Papakonstantinou, and Romain Vernoux. 2014. The
SQL++ query language: Configurable, unifying and semi-structured. arXiv
preprint arXiv:1405.3631 (2014).
[20] Pavel Shvaiko and Jérôme Euzenat. 2005. A survey of schema-based matching
approaches. Journal on data semantics IV (2005), 146–171.
[21] M. Stonebraker. 2012. New opportunities for New SQL. Commun. ACM 5, 11
(2012), 10–11.
[22] Daniel Tahara, Thaddeus Diamond, and Daniel J Abadi. 2014. Sinew: a SQL
system for multi-structured data. In Proceedings of the 2014 ACM SIGMOD
international conference on Management of data. ACM, 815–826.
[23] Lanjun Wang, Shuo Zhang, Juwei Shi, Limei Jiao, Oktie Hassanzadeh, Jia Zou,
and Chen Wangz. 2015. Schema management for document stores. Proceedings
of the VLDB Endowment 8, 9 (2015), 922–933.
[24] Rui Zhou, Chengfei Liu, and Jianxin Li. 2010. Fast ELCA computation for key-
word queries on XML data. In Proceedings of the 13th International Conference
on Extending Database Technology. ACM, 549–560.