<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards schema-independent querying on document data stores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hamdi Ben Hamadou</string-name>
          <email>hamdi.ben-hamadou@irit.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faiza Ghozzi</string-name>
          <email>faiza.ghozzi@isims.usf.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>André Péninou</string-name>
          <email>andre.peninou@irit.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Teste</string-name>
          <email>olivier.teste@irit.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Sfax</institution>
          ,
          <addr-line>ISIMS, MIRACL, Sfax</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Toulouse</institution>
          ,
          <addr-line>UT2J, IRIT</addr-line>
          ,
          <institution>(CNRS/UMR 5505)</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université de Toulouse</institution>
          ,
          <addr-line>UT2J, IRIT</addr-line>
          ,
          <institution>(CNRS/UMR 5505)</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université de Toulouse</institution>
          ,
          <addr-line>UT3, IRIT</addr-line>
          ,
          <institution>(CNRS/UMR 5505)</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Documents are a pervasive semi-structured data model in today's Web and Internet of Things (IoT) applications, where data structures evolve rapidly over time. NoSQL document-oriented databases are well tailored to efficiently load and manage massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying a heterogeneous collection of documents: users must reformulate the original query, or formulate new ones, whenever new structures arrive in the collection. In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We introduce a query enrichment mechanism that consults a pre-materialized dictionary defining all possible underlying document structures. We automate the process of query enrichment via an algorithm that rewrites select and project operators to support multi-structured documents. To study the performance of our approach, we conduct experiments on a synthetic dataset. First results are promising when compared to the normal execution of queries on a homogeneous dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The popularity of NoSQL systems is growing in the database
community thanks to their ability to store and query
schema-free data in flexible and efficient ways [
        <xref ref-type="bibr" rid="ref21 ref8">8, 21</xref>
        ]. The document data
model is pervasive in most Web and Internet of Things
(IoT) applications [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and several database systems support this
data model in an efficient way [
        <xref ref-type="bibr" rid="ref1 ref4 ref5">1, 4, 5</xref>
        ]. Furthermore, in such
applications, the structures of documents representing the same entity
are subject to structural changes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. An application may face the
problem of dealing with multi-structured data [
        <xref ref-type="bibr" rid="ref2">2</xref>
]. To formulate
relevant queries, users need precise knowledge of the data structures, because document stores do not provide native support for querying multi-structured data. Thus, users must manually include all possible navigational paths for the attributes of interest. Structural changes then require users to reformulate the original query, which is a time-consuming and error-prone task. The challenge addressed in this paper is how to support querying over future structural heterogeneity without affecting the application code.
      </p>
      <p>
In the context of document-oriented databases, and due to the flexible nature of documents, it is possible to create a collection of documents describing a single entity with multiple structures. This characteristic gives rise to several kinds of heterogeneity [
        <xref ref-type="bibr" rid="ref20">20</xref>
].
Structural heterogeneity refers to diverse representations of documents (e.g., nested or flat structures, nesting levels, etc.), as shown in Figure 2. Syntactic heterogeneity results from differences in the representation of data (e.g., “movie_title” vs. “movieTitle”). Semantic heterogeneity arises when the same fields refer to distinct concepts in separate documents. This paper focuses on structural heterogeneity.
      </p>
      <p>
        The problem of schema-independent querying is a hot topic
in the study of document-oriented databases for both industry
and academia [
        <xref ref-type="bibr" rid="ref22 ref23 ref6">6, 22, 23</xref>
]. Previous work in the literature addresses this issue with two approaches: (i) physical integration, which maps heterogeneous document structures into a unified structure [
        <xref ref-type="bibr" rid="ref22">22</xref>
] and (ii) virtual integration, which introduces a custom interface proposing a new virtual schema that users must learn in order to query heterogeneous data [
        <xref ref-type="bibr" rid="ref23">23</xref>
]. The first approach modifies the underlying data structure, which is not possible when supporting legacy applications designed to run over the original data structures. Moreover, this approach requires defining a mapping for every original data structure. The second approach requires more effort from the user to learn the new global structures; it is time-consuming and error-prone when new documents with new structures must be queried, since all queries are then subject to revision.
      </p>
<p>In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We propose a virtual integration that runs transparently, hides the complexity of building the expected queries, and supports the evolution of structural heterogeneity. We always rewrite queries at execution time to guarantee that the latest document structures, as defined in the dictionary, are used.</p>
<p>The problem of structural heterogeneity refers to the possibility of finding different navigational paths that lead to the same attribute. The attributes are not located at the same position inside documents, and limited knowledge of the navigational paths is insufficient to retrieve the required information. In Figure 1, the attribute “country” in documents describing films is ambiguous: it may refer to the film itself, to “actors.country”, or to “director.country”. Sub-paths such as “actors.country” and “director.country”, found anywhere in the document, help to resolve the ambiguity. Therefore, sub-paths may be used rather than attribute names alone. In all cases, we hypothesise that there exist navigational paths that differentiate the different entities contained in the document.
{
"movie_title": "Fast and Furious",
"country": "USA",
"actors": [ { "name": "Vin Diesel",
"country": "USA"
}, ... ],
"director": { "name": "F. Gary Gray",
"country": "USA"
}
}</p>
<p>We introduce EasyQ, which stands for “Easy Query,” as a tool to validate our approach. We give particular attention to MongoDB for the implementation and evaluation. The primary contribution of our work is to reformulate users' original queries, formulated with simple knowledge of the descriptive fields or sub-paths that contain the desired information. Users' queries are expressed in a schema-independent fashion, i.e., users can formulate an initial query based on a subset of the possible schemas without caring about all available structures. The query rewriting engine transparently reformulates the initial query to match all existing schemas and return a relevant result. To deal with document schema heterogeneity, we define a dictionary that contains all possible paths for all existing fields. The query rewriting engine enriches the user query with all possible paths found in the dictionary for each field used in the query. In this paper, we use the terms field, descriptive field, and attribute interchangeably.</p>
<p>The rest of this paper is structured as follows. Section 2 illustrates the problem addressed in this paper. Section 3 reviews the most relevant works providing support for querying multi-structured documents. Section 4 describes our approach in detail. Section 5 presents our first experiments and the performance of our approach while varying the size and the number of schemas per collection. Section 6 summarizes our findings.</p>
<p>QUERYING DOCUMENT STORES WITH MULTIPLE SCHEMA ISSUES</p>
      <p>As discussed earlier, querying multi-structured data is a complex task. The problem is which structure to use while formulating queries and how this choice affects the results. In the following, we present a simple illustrative example.</p>
<p>Let C = {d1, d2, d3, d4} be a set of four film documents, as shown in Figure 2. In this example, we represent documents using JSON (JavaScript Object Notation); most NoSQL systems support this notation for representing semi-structured data. A document di is defined by a key-value pair (i, vi) where i is the key and vi is the value described with JSON.</p>
<p>Let us consider that we want to retrieve the available languages of each movie. We formulate a projection query with the fields “movie_title” and “language”, using MongoDB syntax, as follows:
db.C.find({}, {“movie_title” : 1, “language” : 1})</p>
<p>In this query, the field “movie_title” does not cause any difficulty since it is always at the same structural level in the four documents. Therefore, the query engine is able to locate all information related to the field “movie_title”. However, the field “language” may cause some information loss since it is found at several positions across documents. Thus, assuming that we formulate a query with knowledge limited to the structure s2 of d2, we build a query with the fields “movie_title” and “details.language”:
db.C.find({}, {“movie_title” : 1, “details.language” : 1})
Executing this query in MongoDB leads to an incomplete result since the “details.language” field is not present in documents d1, d3, and d4. The problem comes from structural heterogeneity due to the different structural positions of the field “language”, i.e., “language” in document d1, “details.language” in document d2, and “versions.language” in document d4. Hence, we have to include all these paths in the query, using a specific and often complex syntax.</p>
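<p>The information loss described above can be reproduced with a small path-projection sketch (illustrative only: the documents and the projection helper are our own simplification in Python, not part of MongoDB):</p>

```python
def project(doc, paths):
    """Keep only the given dotted paths from a (possibly nested) document."""
    out = {}
    for path in paths:
        node = doc
        found = True
        for part in path.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                found = False
                break
        if found:
            out[path] = node
    return out

# Two documents storing "language" at different positions.
d1 = {"movie_title": "Fast and Furious", "language": "English"}
d2 = {"movie_title": "Titanic", "details": {"language": "French"}}

# Projecting on the single path "details.language" loses d1's language.
print(project(d1, ["movie_title", "details.language"]))
# {'movie_title': 'Fast and Furious'}
print(project(d2, ["movie_title", "details.language"]))
# {'movie_title': 'Titanic', 'details.language': 'French'}
```

<p>The first document's language is silently dropped, exactly as in the MongoDB run above.</p>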
<p>Moreover, we can try to formulate two different queries. The first is formulated over schema s1 of document d1 in order to retrieve the title and “language” of each film. We use the following MongoDB query:
db.C.find({}, {“movie_title” : 1, “language” : 1})
We then get the following result:
C1 = [
{“movie_title” : “Fast and Furious”, “language” : “English”},
{“movie_title” : “Titanic”},
{“movie_title” : “Despicable Me 3”},
{“movie_title” : “The Hobbit”}]
We formulate the second query using schema s4 of document d4:
db.C.find({}, {“movie_title” : 1, “versions.language” : 1})</p>
      <sec id="sec-1-1">
        <title>We get the following result:</title>
<p>C2 = [
{“movie_title” : “Fast and Furious”},
{“movie_title” : “Titanic”},
{“movie_title” : “Despicable Me 3”},
{“movie_title” : “The Hobbit”, “versions” :
[{“language” : “English”}, {“language” : “French”}]}]
When executing both queries, the query engine returns two results. As expected, all information related to the field “movie_title” is returned for all documents, as it is located at the document root. For the first query, only the first document matches the field containing the “language” information. The second query succeeds in retrieving the “language” information only from the fourth document. The challenge is how to formulate a single query that retrieves all information related to the field “language” without any redundancy; for instance, the same information related to the field “movie_title” is obtained twice, in both result collections C1 and C2.</p>
<p>To solve this, we introduce a transparent way to build relevant schema-independent queries that bypass structural heterogeneity in document stores. Simple knowledge of the required attributes allows users to retrieve the adequate documents regardless of the structural heterogeneity in the collection. This simplifies the task for end-users and provides them with an efficient way to retrieve the information of interest. In the above example, we enrich the original user query by adding all possible navigational paths to retrieve the relevant documents. For instance, we formulate the query db.C.find({}, {“movie_title” : 1, “language” : 1, “details.language” : 1, “versions.language” : 1}) in MongoDB syntax. It bypasses the structural heterogeneity in the current state of the collection and projects all desired values.</p>
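<p>The enrichment step can be sketched as a small helper that expands each user-supplied field into every known navigational path and emits the extended MongoDB projection document (a minimal Python sketch; the dictionary here is hand-coded for illustration):</p>

```python
def enrich_projection(fields, dictionary):
    """Build a MongoDB projection covering all known paths for each field."""
    projection = {}
    for field in fields:
        # Fall back to the field itself if it is unknown to the dictionary.
        for path in dictionary.get(field, [field]):
            projection[path] = 1
    return projection

# Hand-built dictionary for the example collection.
dict_c = {
    "movie_title": ["movie_title"],
    "language": ["language", "details.language", "versions.language"],
}

extended = enrich_projection(["movie_title", "language"], dict_c)
print(extended)
# {'movie_title': 1, 'language': 1, 'details.language': 1, 'versions.language': 1}
```

<p>The extended document would then be passed unchanged as the projection argument of db.C.find.</p>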
      </sec>
    </sec>
    <sec id="sec-2">
      <title>STATE OF THE ART</title>
<p>The widespread use of semi-structured data has increased interest in building solutions enabling queries over semi-structured data. We distinguish existing solutions on the basis of the proposed querying approach: 1) schema-dependent querying, which requires knowledge of the schema in a similar way as conventional relational database systems; 2) schema-independent querying, which does not need any prior schema knowledge from the user and is able to extract the schemas at querying time.</p>
      <p>
The first category of systems is designed to enable queries based on reliable knowledge of the schema, or of the navigational paths to the desired values when dealing with nested data. Such systems offer complicated query languages, such as regular expressions with XQuery or XPath [
        <xref ref-type="bibr" rid="ref17">17</xref>
] when dealing with XML data. XQuery works with the structure to retrieve precisely the desired results. However, if the user does not know the structure, it is impossible to write the relevant query. Moreover, a single query is generally not able to retrieve data when several schemas must be considered simultaneously. The same considerations apply to JSONiq [
        <xref ref-type="bibr" rid="ref9">9</xref>
], the extension of XQuery designed to deal with large-scale data such as JSON. Other systems provide a JavaScript query API, as in MongoDB [
        <xref ref-type="bibr" rid="ref5">5</xref>
], to build a query by specifying a document with the properties expected to match the results. It offers a broad range of querying capabilities, in particular data-processing pipelines. However, the API requires a complex syntax, and queries must explicitly include all the various schema structures within documents in order to access the data. Otherwise, the query engine returns only the documents matching the supplied criteria, even if fields with the desired information exist under paths other than those given in the query. Another kind of work is SQL++ [
        <xref ref-type="bibr" rid="ref19">19</xref>
], which relies on the rich SQL querying interface. In this case, it is also mandatory to express all exact navigational paths in order to obtain the desired results.
      </p>
<p>The above-studied systems are designed to support queries over semi-structured data with known schemas. To formulate queries, the user needs to know the exact underlying data structures. These systems neglect the fact that users may have limited knowledge of the data structure and hence may be unable to formulate correct queries over a heterogeneous dataset.</p>
      <p>
To overcome these limitations, recent works enable schema-independent querying: the second category outlined at the beginning of this section. Thus, the schema does not need to be known in advance at loading time. We classify the studied works according to two approaches: (i) physical integration, which refactors the integrated data structures into a unified structure; and (ii) virtual integration, which introduces a custom interface and/or a new query language
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
In the first direction, several works were designed to deal with semi-structured data. These works share the idea of schema-on-read: there is no need to define schemas before loading data; the implicit schema is inferred later, at query time, from the stored datasets. They expose a relational view over the data to help users build SQL queries. Sinew [
        <xref ref-type="bibr" rid="ref22">22</xref>
] is able to infer schemas from semi-structured data. It defines for the user a logical view of the inferred schema, and it flattens the data into columns stored in a relational database system (RDBMS). Drill [
        <xref ref-type="bibr" rid="ref12">12</xref>
] enables schema-independent querying via SQL over heterogeneous data without first defining a schema, and it supports nested data. Tenzing [
        <xref ref-type="bibr" rid="ref18">18</xref>
] infers a relational schema from the underlying data, but only for flat structures that can be trivially mapped to a relational schema.
      </p>
      <p>
The principle of the previous solutions implies heavy physical refactoring that flattens the underlying data structures into a relational format using complex encoding techniques. Hence, this refactoring requires additional resources, such as an external relational database, and extra effort to learn the unified inferred relational schema. Besides, some solutions do not support the flexible nature of semi-structured data [
        <xref ref-type="bibr" rid="ref18">18</xref>
]; for instance, they cannot handle nested data. Users of these systems have to learn new schemas every time the workload changes or new data arrives, because the relational view and the stored columns must be regenerated after every change.
      </p>
      <p>
Virtual integration has also received attention from researchers [
        <xref ref-type="bibr" rid="ref14 ref23">14, 23</xref>
        ].
These works are inspired by the data lake approach [
        <xref ref-type="bibr" rid="ref11">11</xref>
], where data is collected in its original format for later use. We consider two major classes: i) schema-oriented querying; and ii) keyword querying.
      </p>
      <p>
Works from the first class infer the schema from a collection of data and offer users the possibility to query the inferred schema and to check whether a field or sub-schema exists, in order to guide them while developing their applications. In [
        <xref ref-type="bibr" rid="ref23">23</xref>
], the authors propose summarizing all the document schemas under a skeleton to discover the existence of fields or sub-schemas inside the collection. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
], the authors suggest extracting all the schemas present in the collection to make end users aware of the schemas and of all the fields in the integrated collection of documents. These solutions are limited to type and field identification and cannot determine the different paths that access a field in the collection.
      </p>
      <p>
        Keyword querying has been adopted in the context of XML
[
        <xref ref-type="bibr" rid="ref10 ref24">10, 24</xref>
]. The process of answering a keyword query on XML data starts by identifying occurrences of the keywords within the documents (possibly through some free-text search). These approaches take the searched keywords as input and return the subset of each document that matches the query keywords. A score is computed based on the structure of the sub-documents and, according to this score, the XML documents containing all the keywords are returned.
      </p>
<p>Keyword-querying works perform a pairwise comparison or a binary search to identify the possible positions of the queried keywords. This approach is not well tailored to large numbers of documents with complex structures (different nested elements, numerous attributes, etc.).</p>
<p>Building on this state of the art, our approach offers virtual integration to enable schema-independent querying via attribute-based keywords, while supporting native semi-structured features such as nested attributes and heterogeneous collections of documents.</p>
      <p>
Our work relates in some way to previous attempts at XML keyword querying [
        <xref ref-type="bibr" rid="ref16">16</xref>
]. The most important contribution of these earlier efforts is to spare users from learning complex underlying schemas, as well as a complex query language to manage paths. We adopt the idea that the user may not be aware of all existing schemas and cannot manage overly complex queries, and we enable schema-independent querying based only on knowledge of the field holding the desired information. The main difference between our work and keyword querying is that we require from the user some simple details about the queried data. For instance, if we execute the keyword query “English”, we may obtain as results both “language” : “English” and “movie_title” : “Johnny English”. With our approach, we instead specify that we are interested in the field “language” = “English”, or in the field “movie_title” containing “English”.
      </p>
    </sec>
    <sec id="sec-3">
      <title>QUERYING HETEROGENEOUS COLLECTION OF DOCUMENTS</title>
<p>In our proposal, we want to enable queries over multi-structured documents by automatically handling the underlying structural heterogeneity. Thus, our query rewriting engine gives transparent support for heterogeneity in both stored and future data.</p>
      <p>To give an overview of our approach, let us consider the
following selection query (selection operation is defined later in
this section):</p>
<p>σ(“year”=1997)(C)
We refer to the collection presented in Figure 2, in which we notice that there exist three distinct navigational paths leading to the attribute “year”, i.e., “details.year”, “versions.year”, and “year”. For each document, at least one path leads to the attribute “year”. It is thus possible to express the selection predicate as a disjunction of navigational paths.</p>
<p>X = (“year” = 1997) ∨ (“details.year” = 1997) ∨ (“versions.year” = 1997)</p>
      <p>We rewrite the initial query as σ(X)(C). Two conditions must be satisfied to select a document: (i) at least one navigational path from the sub-conditions of X exists inside the document, and (ii) at least one of these sub-conditions evaluates to true. Otherwise, X is false and the document is not returned in the result.</p>
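<p>This rewriting of a single equality predicate into a MongoDB $or filter over all known paths can be sketched as follows (a minimal Python sketch; the dictionary is hand-coded for illustration):</p>

```python
def rewrite_select(field, value, dictionary):
    """Rewrite 'field = value' into a disjunction over all known paths."""
    paths = dictionary.get(field, [field])
    if len(paths) == 1:
        return {paths[0]: value}
    return {"$or": [{path: value} for path in paths]}

dict_c = {"year": ["year", "details.year", "versions.year"]}
print(rewrite_select("year", 1997, dict_c))
# {'$or': [{'year': 1997}, {'details.year': 1997}, {'versions.year': 1997}]}
```

<p>The resulting filter document is passed directly to the native find operator; a document satisfies it as soon as one of the sub-conditions holds.</p>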
<p>The challenges are how to enable schema-independent querying in a transparent way and how to support future new structures without revising the application code.</p>
<p>Figure 3 gives a high-level illustration of our query rewriting engine, called EasyQ. EasyQ is used early, in the data loading phase, to materialize a dictionary that tracks the different navigational paths for all attributes. EasyQ is also used at querying time to enrich the user query Q and to bypass the structural heterogeneity. It takes as input the user query, formulated over final fields or sub-paths, and the desired collection. The query rewriting engine produces one extended query Qext that is executed by the underlying document store. The result of this extended query is a collection containing the relevant information. An important consequence of this architecture is that the same user query, evaluated at different moments, is rewritten each time. So, if new documents with new structures have been inserted into the collection (or existing documents have been updated), these new structures are automatically handled and the results remain relevant to the query.</p>
<p>In the rest of this section, we describe the formal model of multi-structured documents, the dictionary, and the querying operators over multi-structured documents. Finally, we formally define how we rewrite queries.</p>
    </sec>
    <sec id="sec-4">
      <title>Multi-structured data modeling</title>
      <sec id="sec-4-1">
<title>Definition 4.1 (Collection).</title>
        <p>A collection C is defined as a set of documents
C = {d1, . . . , d|C| }</p>
        <p>Each document di is considered as a (key, value) pair where the value takes the form vi = {ai,1 : vi,1, . . . , ai,n : vi,n }</p>
      </sec>
      <sec id="sec-4-2">
        <title>Definition 4.2 (Document).</title>
<p>A document di, ∀i ∈ [1, c], is defined as a (key, value) pair
di = (kdi , vdi )
• kdi is a key that identifies the document (by abuse of notation, we denoted the key kdi as i in Sections 1 and 2);
• vdi = {adi,1 : vdi,1, . . . , adi,n : vdi,n } is the document value. The document value vdi is defined as an object composed of a set of (adi,j , vdi,j ) pairs, where each adi,j is a string called an attribute and each vdi,j is a value that can be atomic (numeric, string, boolean, null) or complex (object, array). A value vdi,j is defined below.</p>
<p>An atomic value is defined as follows, ∀j ∈ [1..n]:
• vdi,j = n if n ∈ N, the set of numeric values (integer, float);
• vdi,j = “s” if “s” is a string over the alphabet A∗ (Unicode);
• vdi,j = b if b ∈ B, the set of boolean values {true, false};
• vdi,j = ⊥ is a null value.</p>
        <sec id="sec-4-2-1">
          <title>A complex value is defined as follows</title>
<p>∀j ∈ [1..n]:
• vdi,j = {adi,j,1 : vdi,j,1, . . . , adi,j,m : vdi,j,m } is an object value, where vdi,j,k , ∀k ∈ [1..m], are values and adi,j,k , ∀k ∈ [1..m], are strings in A∗ called attributes. This is a recursive definition, identical to the document value;
• vdi,j = [vdi,j,1, . . . , vdi,j,l ] represents an array of values vdi,j,k , ∀k ∈ [1..l], with l = ∥vdi,j ∥.</p>
          <p>
If document values vdi,j are objects or arrays, their inner values vdi,j,k can be complex values too, allowing different nesting levels. To cope with nested documents and to navigate through schemas, we adopt the navigational path notations [
            <xref ref-type="bibr" rid="ref15 ref3">3, 15</xref>
            ].
          </p>
<p>Definition 4.3 (Schema). The schema sdi inferred from the document value vdi = {adi,1 : vdi,1, . . . , adi,n : vdi,n } is defined as</p>
          <p>sdi = {p1, . . . , pmi }
where pj , ∀j ∈ [1..mi ], is the path of an attribute of vdi , or a navigational path for nested values such as vdi,j,k . For multiple nesting levels, the navigational path is extracted recursively to find the path from the root to the final atomic value in the document hierarchy.</p>
<p>The schema sdi of value vdi from document di is formally defined as:
• if vdi,j is atomic, sdi = sdi ∪ {adi,j };
• if vdi,j is an object, sdi = sdi ∪ {adi,j } ∪ {∪p∈sdi,j adi,j .p}, where sdi,j is the schema of vdi,j ;
• if vdi,j is an array, sdi = sdi ∪ {adi,j } ∪ (∪k=1..∥vdi,j∥ ({adi,j .k } ∪ {∪p∈sdi,j,k adi,j .k .p})), where sdi,j,k is the schema of the kth value of the array vdi,j .</p>
<p>Example. Let us consider the documents d1 and d2 of Figure 2. The underlying schemas of the two documents are as follows:
sd1 = {“movie_title”, “year”, “language”}
sd2 = {“movie_title”, “details”, “details.year”, “details.language”}
We notice that the attribute “details” of document d2 is a complex one, in which the attributes “year” and “language” are nested, which leads to the two navigational paths “details.year” and “details.language”.</p>
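<p>Definition 4.3 can be implemented as a short recursive traversal; the Python sketch below follows the object and array rules above (array positions are numbered from 1, as in the dictionary example of Definition 4.5; the documents are simplified stand-ins for d1 and d2):</p>

```python
def schema(value, prefix=""):
    """Recursively collect every navigational path of a document value."""
    paths = set()
    if isinstance(value, dict):
        for attribute, inner in value.items():
            path = f"{prefix}.{attribute}" if prefix else attribute
            paths.add(path)
            paths |= schema(inner, path)  # recurse into nested values
    elif isinstance(value, list):
        for k, inner in enumerate(value, start=1):
            path = f"{prefix}.{k}"  # array positions become path segments
            paths.add(path)
            paths |= schema(inner, path)
    return paths  # atomic values contribute no further paths

vd1 = {"movie_title": "Fast and Furious", "year": 2001, "language": "English"}
vd2 = {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}}
print(sorted(schema(vd1)))
# ['language', 'movie_title', 'year']
print(sorted(schema(vd2)))
# ['details', 'details.language', 'details.year', 'movie_title']
```
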
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Definition 4.4 (Collection Schema).</title>
<p>The schema SC inferred from collection C is defined by
SC = ⋃i=1..c sdi</p>
        <p>Definition 4.5 (Dictionary). The dictionary dictC of a collection
C is defined by</p>
<p>dictC = {(pk , △k ), ∀pk ∈ SC } where:
• pk ∈ SC is the path of an attribute that is present in at least one document of the collection;
• △k = {ppk,1 , . . . , ppk,q } ⊆ SC is the set of navigational paths leading to pk .</p>
<p>In the rest of this paper, we refer to any path pk equally as an attribute, and we use the terms dictionary path and dictionary attribute accordingly.</p>
<p>Example. The dictionary dictC constructed from the collection C is given below; each dictionary entry pk refers to the set of all extracted navigational paths.</p>
        <p>dictC = {
(movie_title, {movie_title}),
(year, {year, details.year, versions.1.year, versions.2.year}),
(language, {language, details.language, versions.1.language, versions.2.language}),
(details, {details}),
(details.year, {details.year}),
(details.language, {details.language}),
(versions, {versions}),
(versions.1, {versions.1}),
(versions.1.year, {versions.1.year}),
(versions.1.language, {versions.1.language}),
(versions.2, {versions.2}),
(versions.2.year, {versions.2.year}),
(versions.2.language, {versions.2.language})
}
For example, the entry (year, {year, details.year, versions.1.year, versions.2.year}) gives all navigational paths leading to the attribute “year”.</p>
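<p>A dictionary of this form can be derived mechanically from the collection schema SC: every suffix of a path (excluding suffixes that begin with a bare array index) becomes an entry pointing to all full paths that end with it. A Python sketch under that assumption:</p>

```python
from collections import defaultdict

def build_dictionary(collection_schema):
    """Map each attribute or sub-path to every full path ending with it."""
    dictionary = defaultdict(set)
    for path in collection_schema:
        parts = path.split(".")
        for i in range(len(parts)):
            if parts[i].isdigit():
                continue  # skip suffixes starting with an array index
            dictionary[".".join(parts[i:])].add(path)
    return dict(dictionary)

s_c = {"movie_title", "year", "language",
       "details", "details.year", "details.language",
       "versions", "versions.1", "versions.1.year", "versions.1.language"}
d = build_dictionary(s_c)
print(sorted(d["year"]))
# ['details.year', 'versions.1.year', 'year']
```

<p>The entry for “year” collects the three full paths leading to it, matching the dictionary example above.</p>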
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Querying multi-structured data</title>
<p>Querying multi-structured data is possible via a combination of a set of unary operators. In this paper, we limit the querying process to the projection and selection operators, expressed by the native MongoDB operators “find” and “aggregate”.</p>
      <p>4.2.1 Minimal closed kernel of unary operators. We define a
minimal closed kernel of unary operators. We call Cin the queried
collection, and Cout the resulting collection.</p>
      <p>Definition 4.6 (Projection). The project operator reduces the
initial schemas of documents from the collection to a finite
subset of attributes:</p>
      <p>πA(Cin ) = Cout
where A ⊆ SCin is a subset of attributes from SCin (the schema
of the input collection Cin ).</p>
      <p>Definition 4.7 (Selection). The select operator retrieves
only the documents that match some predicates; we write
σp (Cin ) = Cout
where p refers to the predicate (or condition) of the selection
operator. A simple predicate is expressed as ak ωk vk where ak ∈
SCin is an attribute, ωk ∈ {= ; &gt; ; &lt; ; ≠ ; ≥ ; ≤ } is a comparison
operator, and vk is a value. Predicates can be combined with
the operators from Ω = { ∨, ∧, ¬}, which leads to a complex
predicate.</p>
      <p>We call Normp the normal conjunctive form of the predicate
p, defined as follows:</p>
      <p>Normp = ⋀i ⋁j ai,j ωi,j vi,j</p>
      <p>We consider that all predicates in selection operators are in
normal conjunctive form.</p>
      <sec id="sec-5-1">
        <title>Query composition</title>
        <sec id="sec-5-1-1">
          <title>Definition 4.8 (Query).</title>
          <p>A query Q can be formulated by composing operators:</p>
          <p>Q = q1 ◦ · · · ◦ qr (C)</p>
          <p>where ∀i ∈ [1, r ], qi ∈ {π , σ }.</p>
          <p>Example. Let us consider the collection presented in Figure 2.
Here, the query q3 = π(“movie_title”, “year”)(σ(“language”=“English”)(C))
is constructed by combining the select and project operators.</p>
          <p>4.2.2 Query extension for multi-structured data. In this
section, we introduce a new query extension algorithm that
automatically enriches the user query. The native query engine
of document-oriented stores such as MongoDB can efficiently
execute our rewritten queries. It thus becomes possible to retrieve
all desired information regardless of the structural heterogeneity
inside the collection.</p>
          <p>Algorithm 1: Automatic extension for the initial user query
input: Q
output: Qext
Qext ← id // identity
foreach qi ∈ Q do
    switch qi do
        case πAi do // projection
            Aext ← ⋃∀ak ∈Ai △k
            Qext ← Qext ◦ πAext
        end
        case σp do // selection
            Pext ← ⋀i ⋁j ⋁ak ∈△i,j ak ωi,j vi,j
            Qext ← Qext ◦ σPext
        end
    end
end</p>
          <p>Example. Let us consider the query q3 from the previous
example. First, the query rewriting engine starts by extending
the project operator (line “projection” in Algorithm 1):
π(“movie_title”,“year”)(C)</p>
          <p>For each projected field, the process consults the dictionary
and extracts all the possible navigational paths. The dictionary
entry for the field movie_title corresponds to:</p>
          <p>(movie_title, {movie_title})
So, Aext = {movie_title}
The dictionary entry for the field year corresponds to:
(year, {year, details.year, versions.1.year, versions.2.year})
So, Aext = {movie_title, year, details.year, versions.1.year,
versions.2.year}
The projection query is then rewritten as:
π(“movie_title”, “year”, “details.year”, “versions.1.year”,
“versions.2.year”)(C)</p>
          <p>Next, the process continues with the selection query (line
“selection” in Algorithm 1):</p>
          <p>σ(“language”=“English”)(C)</p>
          <p>The dictionary entry for the field “language” corresponds to:
(language, {language, details.language, versions.1.language,
versions.2.language})
So, Pext = {
“language” = “English”
∨ “details.language” = “English”
∨ “versions.1.language” = “English”
∨ “versions.2.language” = “English”}</p>
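          <p>The extended predicate Pext maps directly onto MongoDB's native “$or” operator. The following sketch (the helper name is ours; the paths are the dictionary entry for “language” from the running example) builds the corresponding filter document:

```python
# Sketch: render the extended predicate P_ext as a MongoDB filter
# document usable with find(). The paths are the dictionary entry for
# "language" in the running example; extend_equality is our own helper.
def extend_equality(paths, value):
    """One disjunct per navigational path: p1 = v or p2 = v or ..."""
    return {"$or": [{path: value} for path in paths]}

language_paths = ["language", "details.language",
                  "versions.1.language", "versions.2.language"]
filter_doc = extend_equality(language_paths, "English")
# With pymongo this would be executed as: collection.find(filter_doc)
```
</p>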
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Rewritten query</title>
        <p>The selection is then rewritten as:
σ(“language”=“English” ∨ “details.language”=“English” ∨
“versions.1.language”=“English” ∨ “versions.2.language”=“English”)(C)</p>
        <p>Finally, the query rewriting engine generates the final query
by combining the generated queries:</p>
        <p>π(“movie_title”,“year”,“details.year”,“versions.1.year”,“versions.2.year”)
(σ(“language”=“English” ∨ “details.language”=“English” ∨
“versions.1.language”=“English” ∨ “versions.2.language”=“English”)(C))
= [
{“movie_title” : “Fast and furious”, “year” : 2017},
{“movie_title” : “Titanic”, “details” : {“year” : 2017}},
{“movie_title” : “The Hobbit”, “versions” : [{“year” : 2017}]}
]</p>
        <p>The query rewriting process injects additional complexity into
the original user queries.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 EXPERIMENTS</title>
      <p>In this section, we conduct a series of experiments to study the
following points:
• What is the effect on the execution time of the rewritten
queries when varying the size of the collection, and is this
cost acceptable?
• Is the time to build the dictionary acceptable, and how does
the size of the dictionary evolve with structural variability?</p>
      <p>Our approach aims to enable transparent querying over a
multi-structured collection of documents via automatic query rewriting.
This process employs the materialized dictionary to enrich the
original query by including the different navigational paths that
lead to the desired attributes. Algorithm 1 describes the query
extension process as follows:
• In the case of a projection operation, the algorithm extends the
list of attributes Ai by uniting the different navigational paths
△k of each projected attribute ak .
• In the case of a selection operation, the algorithm enriches
the predicate p, expressed in normal conjunctive form,
with the set of extended disjunctions built from the
navigational paths △i, j of each attribute ai, j .</p>
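      <p>As a minimal sketch under our own query representation (a list of project/select steps; not the authors' code), the two cases of Algorithm 1 can be implemented as follows:

```python
# Sketch of Algorithm 1: extend every operator of a user query using
# the dictionary. A query is a list of ("project", attributes) and
# ("select", cnf) steps, where cnf is a list (conjunction) of lists
# (disjunctions) of (attribute, comparator, value) triples.
def extend_query(query, dictionary):
    extended = []
    for op, arg in query:
        if op == "project":
            attrs = set()
            for a in arg:                 # A_ext: union of the paths of each a_k
                attrs |= dictionary[a]
            extended.append(("project", sorted(attrs)))
        else:                             # selection: enrich every disjunct
            cnf = [[(path, comp, val)
                    for (a, comp, val) in disjunction
                    for path in sorted(dictionary[a])]
                   for disjunction in arg]
            extended.append(("select", cnf))
    return extended

# Tiny illustrative dictionary (a fragment of the running example).
dictC = {"movie_title": {"movie_title"},
         "year": {"year", "details.year"},
         "language": {"language", "details.language"}}
q3 = [("select", [[("language", "=", "English")]]),
      ("project", ["movie_title", "year"])]
q3_ext = extend_query(q3, dictC)
```
</p>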
      <p>Next, we explain the experimental protocol, then we study
the query execution cost, and finally we evaluate the dictionary
generation time and its size.</p>
    </sec>
    <sec id="sec-7">
      <title>5.1 Experimental protocol</title>
      <p>We run all the queries on synthetic datasets loaded
into the document store MongoDB. In this section, we introduce
the details of the experimental setup, the process of generating
the synthetic datasets, and the set of evaluation queries. Later on,
we present the results of executing the evaluation set in three
separate contexts. The goal is to compare the cost of executing
the rewritten queries against (i) the cost of executing the original queries
on homogeneous documents, and (ii) the execution time of several
distinct queries that we build manually based on each schema.
Then, we study the effects of the heterogeneity on the
dictionary in terms of size and construction time. Finally, we evaluate
the scale of the heterogeneity and its impact on generating the
rewritten queries.</p>
      <p>We conducted our experiments on MongoDB v3.4, running on
an Intel Core i5 3.4 GHz machine with 16 GB of RAM and 4 TB of
storage space under CentOS 7.</p>
      <p>5.1.1 Dataset. To study structural heterogeneity, we
generate custom synthetic datasets. First, we collected a JSON
collection of documents from IMDb that describe movies. The
original dataset contains only flat documents with 28 attributes in
each document. Then, we reuse this flat collection to produce
documents with structural heterogeneity. For each generated
dataset, we can define several parameters, such as the number
of schemas to produce in the collection and the percentage of
presence of every generated schema. For each schema, we can
adjust the number of grouping objects. By grouping
object, we mean a compound field in which we nest a subset of the document
attributes. To guarantee heterogeneity across documents, the
grouping objects are unique in every schema, so the same grouping
object never appears in two structures. Only the original fields from the flat dataset are common
to all documents. The values of those fields are randomly chosen
from the original film collection. To add more complexity, we can
set the nesting level used for each structure. For the rest of the
experiments, we built our dataset based on the characteristics
described in Table 1. We generate collections of 10, 25,
50 and 100 GB of data.</p>
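      <p>The grouping step of this generator can be sketched as follows (a simplified illustration under assumed helper and parameter names; the actual generator additionally controls schema percentages and nesting levels):

```python
import random

# Simplified sketch of the dataset generator: partition a flat
# document's attributes into schema-specific grouping objects, so that
# no two schemas share a compound field. Names like "group_1_0" are
# our own illustrative convention.
def heterogenize(flat_doc, n_groups, schema_id, rng):
    keys = sorted(flat_doc)
    rng.shuffle(keys)                     # random attribute-to-group split
    doc = {}
    for g in range(n_groups):
        nested = {k: flat_doc[k] for k in keys[g::n_groups]}
        doc["group_%d_%d" % (schema_id, g)] = nested  # unique per schema
    return doc

rng = random.Random(0)
flat = {"year": 2017, "language": "English", "gross": 100}
variant = heterogenize(flat, 2, 1, rng)
```

Every leaf attribute and value of the flat document is preserved; only the grouping structure changes between schemas.</p>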
      <p>We generate a flat collection with the same leaf attributes and their
corresponding values as found in the heterogeneous datasets.
This new collection gives us a proper environment in which
to compare the execution time of the rewritten query on the
heterogeneous datasets with the execution time of the original
query on homogeneous datasets. Therefore, we ensure that every
query returns identical results from both the heterogeneous and
flat datasets. The same result implies: i) the same number of
documents, and ii) the same values for their attributes (leaf
fields).</p>
      <p>5.1.2 Queries. We build a synthetic set of queries
based on the different comparison operators supported by
MongoDB. We employ the classical comparison operators, i.e. {&lt;
, &gt;, ≤, ≥, =, ≠}, for numerical values, as well as the classical logical
operators, i.e. {and, or }, between query predicates. We also
employ a regular expression to deal with string values. We select
8 attributes of different types and at different levels inside
the documents of the heterogeneous datasets. Table 2 shows,
for each attribute, its type and the selection operator that we
use later when formulating the synthetic queries. In addition,
we present for each attribute the number of possible paths as
found in the synthetic heterogeneous collection, the different
nesting levels, and the selectivity of the predicate.</p>
      <p>Predicate | Attribute    | Type   | Operator    | Paths | Depths            | Selectivity
p1        | DirectorName | String | Regex{^A}   | 8     | {8,2,3,9,6,5,4,7} | 0.06 %
p2        | Gross        | Int    | &gt; 100 k     | 7     | {7,8,2,3,9,6,4}   | 66 %
p3        | Language     | String | = "English" | 7     | {7,8,3,9,6,5,4}   | 0.018 %
p4        | Imdb_score   | Float  | &lt; 4.7       | 8     | {8,7,2,3,4,5,6,9} | 29 %
p5        | Duration     | Int    | ≤ 200       | 7     | {7,8,2,3,6,5,4}   | 77 %
p6        | Country      | String | ≠ Null      | 6     | {7,2,3,9,5,4}     | 100 %
p7        | year         | Int    | &lt; 1950      | 7     | {7,8,2,3,6,5,4}   | 23 %
p8        | FB_likes     | Int    | ≥ 500       | 7     | {6,2,3,8,5,4,3}   | 83 %</p>
      <p>Table 2: Query predicates</p>
      <sec id="sec-7-1">
        <title>Evaluation queries</title>
        <p>We formulate the following six queries:
• Q1 : p1 ∧ p2
• Q2 : p1 ∨ p2</p>
        <p>For the queries Q1 and Q2, the rewritten queries contain 15
predicates, unlike the original queries, which contain 2 predicates.
The 15 predicates are due to the 8 existing paths for DirectorName in
p1 plus the 7 paths for Gross in p2, which are included in disjunctive
form as described in the rewriting algorithm.</p>
        <p>• Q3 : p1 ∧ p2 ∧ p5 ∧ p7
• Q4 : p1 ∨ p2 ∨ p5 ∨ p7
The rewritten versions of Q3 and Q4 contain 29 predicates, unlike
the original queries, which contain 4 predicates.</p>
        <p>• Q5 : p1 ∧ p2 ∧ p5 ∧ p7 ∧ p6 ∧ p3 ∧ p4 ∧ p8
• Q6 : p1 ∨ p2 ∨ p5 ∨ p7 ∨ p6 ∨ p3 ∨ p4 ∨ p8
Finally, the rewritten versions of the queries Q5 and Q6 contain 57
predicates, unlike the original queries, which contain 8 predicates.</p>
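        <p>These predicate counts follow directly from the per-attribute path counts of Table 2, since each predicate expands into one disjunct per navigational path:

```python
# Check the rewritten-predicate counts against the path counts of
# Table 2: each predicate expands into one disjunct per path.
paths = {"p1": 8, "p2": 7, "p3": 7, "p4": 8,
         "p5": 7, "p6": 6, "p7": 7, "p8": 7}

def rewritten_size(predicates):
    return sum(paths[p] for p in predicates)

assert rewritten_size(["p1", "p2"]) == 15                 # Q1, Q2
assert rewritten_size(["p1", "p2", "p5", "p7"]) == 29     # Q3, Q4
assert rewritten_size(sorted(paths)) == 57                # Q5, Q6
```
</p>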
        <p>Table 3 presents for each dataset: i) the number of
documents inside the collection, and ii) the number of expected results
for each executed query.</p>
        <p>Collection size | # of documents | Q1    | Q2    | Q3     | Q4     | Q5    | Q6
10 GB           | 12 M           | 520 K | 8.4 M | 87.2 K | 11.8 M | 340   | 12 M
25 GB           | 30 M           | 1.3 M | 21 M  | 218 K  | 29.5 M | 850   | 30 M
50 GB           | 60 M           | 2.6 M | 42 M  | 436 K  | 59 M   | 1.7 K | 60 M
100 GB          | 120 M          | 5.2 M | 84 M  | 872 K  | 118 M  | 3.4 K | 120 M</p>
        <p>Table 3: Number of extracted documents per query</p>
        <p>Larger collections require more CPU load and more intensive
disk I/O operations. We now move on to study the efficiency of
the rewritten query compared to the baseline query QBase . We can
notice that the overhead of our solution is at most two times (e.g., for
the disjunctive form) when compared to the native execution of the
baseline query on the homogeneous dataset. Moreover, we observe an
overall overhead that does not exceed 1.5 times across all six queries.
We believe that this overhead is acceptable since we bypass the costs
needed for refactoring the underlying data structures. Unlike the
baseline, our synthetic dataset contains different grouping objects
with varying nesting levels. The rewritten query therefore includes
several navigational paths, which are processed by the
native query engine of MongoDB to find matches in each visited
document of the collection.</p>
        <p>5.2.2 Dictionary and query rewriting engine at scale. With
this series of experiments, we try to push the dictionary and the
query rewriting engine to their limits. To this end, we generated
heterogeneous synthetic collections of 1 GB. We use the primary
28 attributes from the IMDb flat films collection. The custom
collections are generated in such a way that each schema inside a
document is composed of two grouping objects with no further
nesting levels. We generated collections having 10, 100, 1 k, 3 k and
5 k schemas. For this experiment, we keep using the query
Q6 introduced earlier in this section.</p>
        <p># of schemas | Rewriting time | Dictionary size
10           | 0.0005 s       | 40 KB
100          | 0.0025 s       | 74 KB
1 K          | 0.139 s        | 2 MB
3 K          | 0.6 s          | 7.2 MB
5 K          | 1.52 s         | 12 MB</p>
        <p>Table 5: Scale effects on query rewriting and dictionary size</p>
        <p>We present the time needed to build the rewritten query in
Table 5. It is notable that the time to build the rewritten query is
very low, less than two seconds. It is also possible to construct
a dictionary over a heterogeneous collection of documents: here
our dictionary can support up to 5 k distinct schemas. The
resulting size of the materialized dictionary is very encouraging,
since it does not require significant storage space. Furthermore,
we believe that the time spent building the rewritten query
is very interesting and represents another advantage of our
solution. In this series of experiments, each time we find the
distinct navigational paths for eight predicates. Each rewritten
query is composed of numerous disjuncts for each
predicate: we count 80 disjuncts when dealing with the dataset
having 10 schemas, 800 with 100 schemas, 8 k with 1 k schemas,
24 k with 3 k schemas, and 40 k with 5 k schemas. We believe
that the dictionary and the query rewriting engine scale well when
dealing with heterogeneous collections of documents having a
large number of schemas.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>5.3 Dictionary construction time</title>
      <p>In this part, we focus on the study of the dictionary
construction process. EasyQ offers the possibility to build the
dictionary over an existing dataset or during the data loading phase.
The dictionary contains the latest version of the data once all
documents are inserted. The query rewriting engine then enriches the
queries based on the new dictionary; otherwise, if the process of
data loading is still in progress, it may not take into account the
most recent changes. In the following, we study both configurations.
First, we start with the evaluation of the time required to build the
dictionary over five pre-loaded collections of 100 GB having 2,
4, 6, 8 and 10 schemas respectively.</p>
      <p>We notice from the results in Table 6 that the time elapsed
to build the dictionary increases when we deal with
collections having more heterogeneity. In the case of the collection with
10 structures, the increase in time does not exceed 40% when compared
to the collection with 2 structures. We can also notice in
Table 6 the negligible size of the generated dictionaries when
compared to the 100 GB of the collection.</p>
      <p>Table 6: Dictionary construction time (minutes) and size of the
resulting dictionary (KB) per number of schemas</p>
      <p>Afterwards, we evaluate the overhead
caused by generating the dictionary at loading time. We
generate five collections of 1 GB having the same structures as in
the last experiment (2, 4, 6, 8 and 10 schemas respectively). We
present two measurements in Table 7. First, we measure the time
to simply load each collection (without building the dictionary).
Second, we measure the overall time to build the dictionary while
loading the collection.
# of schemas | Load (s) | Load and dict. (s) | Overhead
2            | 201 s    | 269 s              | 33%
4            | 205 s    | 277 s              | 35%
6            | 207 s    | 285 s              | 37%
8            | 208 s    | 300 s              | 44%
10           | 210 s    | 309 s              | 47%
Table 7: Study of the overhead added during load time
In these experiments, we find that the overhead does
not exceed 0.5 times the time required to only load the data. The evolution
of the time when adding more heterogeneity is linear rather than
exponential, which is encouraging. Many factors may affect the
query construction phase: the number of attributes and the nesting
levels may increase or decrease the overhead. The advantage of our
solution is that once the data is loaded and the dictionary is built
or updated, the rewritten query automatically takes all changes
into account.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>NoSQL databases are often called schemaless because they may
contain variable schemas among the stored data. Nowadays, this
variability is becoming a common standard in many applications,
for example web applications, social media applications, or the
internet of things. Nevertheless, the existence of such
structural heterogeneity makes it very hard for users to formulate
queries and find relevant and coherent results.</p>
      <p>In this paper, to deal with structural heterogeneity, we suggest
a novel approach for querying heterogeneous documents
describing a given entity in NoSQL document stores. The developed
tool is called EasyQ. Our objective is to allow users to formulate
their queries using minimal knowledge about the data schemas.
EasyQ is based on two pillars. The first one is a dictionary that
contains all possible paths for any existing field. The second one
is a rewriting module that modifies the user query to match all
field paths existing in the dictionary. Our approach is a
syntactic manipulation of queries, so it is grounded on an important
assumption: the collection describes homogeneous entities, i.e.,
a field has the same meaning in all document schemas. In case
of ambiguity, the user should specify some sub-path in order to
resolve it. If this assumption is not guaranteed,
users may be faced with irrelevant or incoherent results. Nevertheless,
this assumption is acceptable in many applications, such as
legacy collections, web applications, or internet of things data.</p>
      <p>In our first experiments, the evaluation consists in comparing
the execution time of basic MongoDB queries and of the rewritten
queries proposed by our approach. We conduct a set of tests by
changing two primary parameters: the size of the dataset and
the structural heterogeneity inside a collection (number of
different schemas). The results show that the cost of executing the rewritten
queries proposed in this paper is higher than the
execution of basic user queries, but always less than twice as high. The
overhead added to the performance of our query is due to the
combination of multiple access paths to a queried field.
Nevertheless, this time overhead is negligible when compared to the
execution of separate “by hand” queries for each schema, while
heterogeneity issues are automatically managed.</p>
      <p>These first results are very encouraging for continuing this line of
research, and they need to be strengthened. Short-term perspectives
are to continue the evaluations and to identify the limitations
regarding the number of paths and fields in the same query and
regarding the time cost. More experiments remain to be performed on
larger datasets and real-case datasets. Another perspective is to
enhance the current query capabilities by introducing all existing
classical operators of query languages (contains, etc.). It is also
necessary to deal with other querying operators, particularly the
aggregation operator.</p>
      <p>The first long-term perspective consists in studying the
real-time building of the dictionary when integrating data, in order to
take into account all possible operations: insert, but also delete and
update. It is likely that the current simple structure of the dictionary
will have to be transformed in depth to support more complex operations
such as updates and deletes. The second long-term
perspective consists in managing multi-store databases. The goal
would be to extend the proposed approach to query data stored
in different types of databases in a way that is independent of the
various data schemas and stores. The final goal would be: how can one
query “transparently” any data store while being unaware
of schemas or real field names?</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. Chris</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Lehnardt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Noah</given-names>
            <surname>Slater</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>CouchDB: The Definitive Guide: Time to Relax</article-title>
          .
          <source>O'Reilly Media, Inc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mohamed-Amine</given-names>
            <surname>Baazizi</surname>
          </string-name>
          , Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, and
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Sartiani</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Schema inference for massive json datasets</article-title>
          .
          <source>In Extending Database Technology (EDBT).</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Bourhis</surname>
          </string-name>
          , Juan L Reutter,
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Suárez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Domagoj</given-names>
            <surname>Vrgoč</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>JSON: data model, query languages and schema specification</article-title>
          .
          <source>In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. ACM</source>
          ,
          <fpage>123</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Max</given-names>
            <surname>Chevalier</surname>
          </string-name>
          , Mohammed El Malki, Arlind Kopliku, Olivier Teste, and
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Tournier</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Implementation of multidimensional databases in columnoriented NoSQL systems</article-title>
          .
          <source>In East European Conference on Advances in Databases and Information Systems</source>
          . Springer,
          <fpage>79</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Chodorow</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Dirolf</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <source>MongoDB: The Definitive Guide</source>
          . O'Reilly Media.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Mohamed Lamine</given-names>
            <surname>Chouder</surname>
          </string-name>
          , Stefano Rizzi, and
          <string-name>
            <given-names>Rachid</given-names>
            <surname>Chalal</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enabling Self-Service BI on Document Stores</article-title>
          .
          <source>In EDBT/ICDT Workshops.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Alejandro</given-names>
            <surname>Corbellini</surname>
          </string-name>
          , Cristian Mateos, Alejandro Zunino, Daniela Godoy, and
          <string-name>
            <given-names>Silvia</given-names>
            <surname>Schiafino</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Persisting big-data: The NoSQL landscape</article-title>
          .
          <source>Information Systems</source>
          <volume>63</volume>
          (
          <year>2017</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Avrilia</given-names>
            <surname>Floratou</surname>
          </string-name>
          , Nikhil Teletia, David J DeWitt, Jignesh M Patel, and
          <string-name>
            <given-names>Donghui</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Can the elephants handle the nosql onslaught?</article-title>
          <source>Proceedings of the VLDB Endowment 5</source>
          ,
          <issue>12</issue>
          (
          <year>2012</year>
          ),
          <fpage>1712</fpage>
          -
          <lpage>1723</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Daniela</given-names>
            <surname>Florescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ghislain</given-names>
            <surname>Fourny</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>JSONiq: The history of a query language</article-title>
          .
          <source>IEEE internet computing 17</source>
          ,
          <issue>5</issue>
          (
          <year>2013</year>
          ),
          <fpage>86</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Lin</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Feng</given-names>
            <surname>Shao</surname>
          </string-name>
          , Chavdar Botev, and
          <string-name>
            <given-names>Jayavel</given-names>
            <surname>Shanmugasundaram</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>XRANK: Ranked keyword search over XML documents</article-title>
          .
          <source>In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM</source>
          ,
          <fpage>16</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Rihan</given-names>
            <surname>Hai</surname>
          </string-name>
          , Sandra Geisler, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Quix</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Constance: An intelligent data lake system</article-title>
          .
          <source>In Proceedings of the 2016 International Conference on Management of Data. ACM</source>
          ,
          <fpage>2097</fpage>
          -
          <lpage>2100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Nadeau</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Apache drill: interactive ad-hoc analysis at scale</article-title>
          .
          <source>Big Data</source>
          <volume>1</volume>
          ,
          <issue>2</issue>
          (
          <year>2013</year>
          ),
          <fpage>100</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Hecht</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Jablonski</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>NoSQL evaluation: A use case oriented survey</article-title>
          .
          <source>In Cloud and Service Computing (CSC)</source>
          ,
          <source>2011 International Conference on. IEEE</source>
          ,
          <fpage>336</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Victor</given-names>
            <surname>Herrero</surname>
          </string-name>
          , Alberto Abelló, and
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Romero</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>NOSQL design for analytical workloads: variability matters</article-title>
          .
          <source>In Conceptual Modeling: 35th International Conference, ER</source>
          <year>2016</year>
          , Gifu, Japan,
          <source>November 14-17</source>
          ,
          <year>2016</year>
          , Proceedings 35. Springer,
          <fpage>50</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Hidders</surname>
          </string-name>
          , Jan Paredaens, and Jan Van den Bussche.
          <year>2017</year>
          .
          <article-title>J-Logic: Logical Foundations for JSON Querying</article-title>
          .
<source>In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. ACM</source>
          ,
<fpage>137</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
<string-name>
            <given-names>Prashant R</given-names>
            <surname>Lambole</surname>
          </string-name>
          and
          <string-name>
            <given-names>Prashant N</given-names>
            <surname>Chatur</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A review on XML keyword query processing</article-title>
          .
          <source>In Innovative Mechanisms for Industry Applications (ICIMIA)</source>
          ,
          <source>2017 International Conference on. IEEE</source>
          ,
          <fpage>238</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Yunyao</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cong</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>HV</given-names>
            <surname>Jagadish</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Schema-free xquery</article-title>
          .
          <source>In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment</source>
          ,
<fpage>72</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
<string-name>
            <given-names>Liang</given-names>
            <surname>Lin</surname>
          </string-name>
          , Vera Lychagina, Weiran Liu, Younghee Kwon, Sagar Mittal, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wong</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Tenzing a sql implementation on the mapreduce framework</article-title>
          . (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
<string-name>
            <given-names>Kian Win</given-names>
            <surname>Ong</surname>
          </string-name>
          , Yannis Papakonstantinou, and
          <string-name>
            <given-names>Romain</given-names>
            <surname>Vernoux</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The SQL++ query language: Configurable, unifying and semi-structured</article-title>
          .
          <source>arXiv preprint arXiv:1405.3631</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jérôme</given-names>
            <surname>Euzenat</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>A survey of schema-based matching approaches</article-title>
          .
          <source>Journal on data semantics IV</source>
          (
          <year>2005</year>
          ),
          <fpage>146</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>New opportunities for New SQL</article-title>
          .
<source>Commun. ACM 55</source>
          ,
          <issue>11</issue>
          (
          <year>2012</year>
          ),
          <fpage>10</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Tahara</surname>
          </string-name>
          , Thaddeus Diamond, and
<string-name>
            <given-names>Daniel J</given-names>
            <surname>Abadi</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Sinew: a SQL system for multi-structured data</article-title>
          .
          <source>In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM</source>
          ,
<fpage>815</fpage>
          -
          <lpage>826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
<string-name>
            <given-names>Lanjun</given-names>
            <surname>Wang</surname>
          </string-name>
          , Shuo Zhang, Juwei Shi, Limei Jiao, Oktie Hassanzadeh, Jia Zou, and
          <string-name>
            <given-names>Chen</given-names>
<surname>Wang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Schema management for document stores</article-title>
          .
          <source>Proceedings of the VLDB Endowment 8</source>
          ,
          <issue>9</issue>
          (
          <year>2015</year>
          ),
          <fpage>922</fpage>
          -
          <lpage>933</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
<string-name>
            <given-names>Rui</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Chengfei Liu, and
          <string-name>
            <given-names>Jianxin</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Fast ELCA computation for keyword queries on XML data</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Extending Database Technology. ACM</source>
          ,
<fpage>549</fpage>
          -
          <lpage>560</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>