=Paper= {{Paper |id=Vol-2062/paper1 |storemode=property |title=Towards Schema-independent Querying on Document Data Stores |pdfUrl=https://ceur-ws.org/Vol-2062/paper01.pdf |volume=Vol-2062 |authors=Hamdi Ben Hamadou,Faiza Ghozzi,André Péninou,Olivier Teste |dblpUrl=https://dblp.org/rec/conf/dolap/HamadouGPT18 }} ==Towards Schema-independent Querying on Document Data Stores== https://ceur-ws.org/Vol-2062/paper01.pdf
Towards schema-independent querying on document data stores

Hamdi Ben Hamadou
Université de Toulouse, UT3, IRIT (CNRS/UMR 5505), Toulouse, France
hamdi.ben-hamadou@irit.fr

Faiza Ghozzi
Université de Sfax, ISIMS, MIRACL, Sfax, Tunisia
faiza.ghozzi@isims.usf.tn

André Péninou
Université de Toulouse, UT2J, IRIT (CNRS/UMR 5505), Toulouse, France
andre.peninou@irit.fr

Olivier Teste
Université de Toulouse, UT2J, IRIT (CNRS/UMR 5505), Toulouse, France
olivier.teste@irit.fr

ABSTRACT

Documents are a pervasive semi-structured data model in today's Web and Internet of Things (IoT) applications, where data structures evolve rapidly over time. NoSQL document-oriented databases are well tailored to efficiently load and manage massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying a heterogeneous collection of documents: users must reformulate their original queries, or formulate new ones, whenever new structures arrive in the collection. In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We introduce a query enrichment mechanism that consults a pre-materialized dictionary defining all possible underlying document structures. We automate the query enrichment process with an algorithm that rewrites select and project operators to support multi-structured documents. To study the performance of our approach, we conduct experiments on a synthetic dataset. First results are promising when compared to the normal execution of queries on a homogeneous dataset.

1 INTRODUCTION

The popularity of NoSQL systems is growing in the database community thanks to their ability to store and query schema-free data in flexible and efficient ways [8, 21]. The document data model is pervasive in most Web and Internet of Things (IoT) applications [13], and several database systems support this data model efficiently [1, 4, 5]. Furthermore, in such applications, the structures of documents representing the same entity are subject to structural changes [7]. An application may therefore face the problem of dealing with multi-structured data [2]. To formulate relevant queries, precise knowledge of the data structures is needed, because document stores do not provide native support for querying multi-structured data. Thus, users must manually include all possible navigational paths for the attributes of interest to formulate a relevant query. Structural changes then force users to reformulate the original query, which is a time-consuming and error-prone task. The challenge addressed in this paper is how to support querying over future structural heterogeneity without affecting the application code.

In the context of document-oriented databases, and due to the flexible nature of documents, it is possible to create a collection of documents describing a single entity with multiple structures. This characteristic gives rise to several kinds of heterogeneity [20]. Structural heterogeneity refers to diverse representations of documents (e.g., nested or flat structures, different nesting levels), as shown in Figure 2. Syntactic heterogeneity results from differences in the representation of data (e.g., "movie_title" vs. "movieTitle"). Finally, semantic heterogeneity arises when the same field relies on distinct concepts in separate documents. This paper focuses on structural heterogeneity.

The problem of schema-independent querying is a hot topic in the study of document-oriented databases, for both industry and academia [6, 22, 23]. Previous work resolved this issue with two approaches: (i) physical integration of data, which maps the integrated document structures into a unified structure [22], and (ii) virtual integration, which introduces a custom interface proposing a new virtual schema to be learned by users while querying heterogeneous data [23]. The first approach modifies the underlying data structure, which is not possible when supporting legacy applications designed to run over the original data structures; moreover, it implies defining a mapping for every original data structure. The second approach requires more effort from users, who must learn new global structures; this is time-consuming and error-prone whenever new documents with new structures must be queried, since all queries are then subject to revision.

In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We propose a virtual integration that runs transparently, hides the complexity of building the expected queries, and supports the evolution of structural heterogeneity. The queries are always rewritten at execution time, to guarantee that the latest document structures, as defined in the dictionary, are used.

The problem of structural heterogeneity refers to the possibility of finding different navigational paths that lead to the same attribute. The attributes are not located at the same position inside documents, and a limited knowledge of navigational paths is insufficient to retrieve the required information. In Figure 1, the attribute "country" alone, in documents describing films, may not be sufficient to differentiate between "actors.country" and "director.country". Sub-paths such as "actors.country" and "director.country" may help to resolve the ambiguity, wherever they appear in the document. Therefore, sub-paths may be used rather than attribute names alone. In all cases, we hypothesise that there exist navigational paths that differentiate the different entities contained in the document.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
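As a concrete illustration of structural heterogeneity, the navigational paths of two film documents (shaped like those in Figure 2) can be enumerated with a short sketch. This is illustrative Python, not part of the paper's implementation, and the `paths` helper is our own naming:

```python
# Illustrative sketch: enumerate the dot-separated navigational paths of a
# JSON-like document, exposing that the same attribute ("language", "year")
# can live under different paths in structurally heterogeneous documents.

def paths(value, prefix=""):
    """Recursively collect the navigational paths of every attribute.
    Array elements are traversed; indices are omitted in this sketch."""
    result = []
    if isinstance(value, dict):
        for attr, v in value.items():
            p = f"{prefix}.{attr}" if prefix else attr
            result.append(p)
            result.extend(paths(v, p))
    elif isinstance(value, list):
        for item in value:
            result.extend(paths(item, prefix))
    return result

d1 = {"movie_title": "Fast and Furious", "year": 2017, "language": "English"}
d2 = {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}}

print(paths(d1))  # ['movie_title', 'year', 'language']
print(paths(d2))  # ['movie_title', 'details', 'details.year', 'details.language']
```

Both documents carry a language, but a query written against one path ("language") cannot see the other ("details.language"); this is the ambiguity the rest of the paper addresses.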
{
    "movie_title": "Fast and Furious",
    "country": "USA",
    "actors": [ { "name": "Vin Diesel",
                  "country": "USA"
                }, ... ],
    "director": { "name": "F. Gary Gray",
                  "country": "USA"
                }
}

Figure 1: Descriptive field ambiguity

d1: {
      "movie_title": "Fast and Furious",
      "year": 2017,
      "language": "English"
    },
d2: {
      "movie_title": "Titanic",
      "details": { "year": 1997, "language": "English" }
    },
d3: {
      "movie_title": "Despicable Me 3",
      "year": 2017
    },
d4: {
      "movie_title": "The Hobbit",
      "versions": [ { "year": 2012, "language": "English" },
                    { "year": 2013, "language": "French" } ]
    }

Figure 2: Four documents of a films collection

We introduce EasyQ, which stands for "Easy Query", as a tool to validate our approach. We pay particular attention to MongoDB for the implementation and evaluation. The primary contribution of our work is to reformulate users' original queries, formulated with only simple knowledge of the descriptive fields or sub-paths that contain the desired information. Users' queries are formulated in a schema-independent fashion, i.e., users can formulate an initial query based on a subset of the possible schemas without caring about all available structures. The query rewriting engine transparently reformulates the initial query to match all existing schemas, returning a relevant result. To deal with document schema heterogeneity, we define a dictionary that contains all possible paths for all existing fields. The query rewriting engine enriches the user query with all possible paths found in the dictionary for each field used in the query. In this paper, we use the terms field, descriptive field and attribute interchangeably.

The rest of the paper is structured as follows. Section 2 illustrates the problem addressed in this paper. Section 3 reviews the most relevant works providing support for querying multi-structured documents. Section 4 describes our approach in detail. Section 5 presents our first experiments and the performance of our approach while varying the size of the collection and the number of schemas per collection. Section 6 summarizes our findings.

2 QUERYING DOCUMENT STORES WITH MULTIPLE SCHEMA ISSUES

As discussed earlier, querying multi-structured data is a complex task. The problem is which structure to use while formulating queries, and how this choice affects the results. In the following, we present a simple illustrating example.

Let C = {d1, d2, d3, d4} be a set of four films, as shown in Figure 2. In this example we represent documents using JSON (JavaScript Object Notation); most NoSQL systems support this notation for representing semi-structured data. A document di is defined by a key-value pair (i, vi) where i is the key and vi is the value, described in JSON.

Let us consider that we want to retrieve the available languages of each movie. We formulate a projection query with the fields "movie_title" and "language", using MongoDB syntax, as follows:

    db.C.find({}, {"movie_title": 1, "language": 1})

In this query, the field "movie_title" does not cause any difficulty, since it is always at the same structural level in the four documents. Therefore, the query engine is able to locate all information related to the field "movie_title". However, the field "language" may cause some information loss, since it is found at several positions across documents. Thus, assuming that we formulate a query with limited knowledge of the structure s2 from d2, we build a query with the fields "movie_title" and "details.language":

    db.C.find({}, {"movie_title": 1, "details.language": 1})

Executing such a query in MongoDB leads to an incomplete result, since the field "details.language" is not present in documents d1, d3 and d4. The problem comes from structural heterogeneity, due to the different structural positions of the field "language": "language" in document d1, "details.language" in document d2, and "versions.language" in document d4. Hence, we would have to include all these paths in the query, using specific and often complex syntax.

Moreover, we can try to formulate two different queries. The first one is formulated over the schema s1 of document d1, in order to retrieve the list of titles and the "language" of each film. We use the following MongoDB query:

    db.C.find({}, {"movie_title": 1, "language": 1})

We then get the following result:

    C1 = [
      {"movie_title": "Fast and Furious", "language": "English"},
      {"movie_title": "Titanic"},
      {"movie_title": "Despicable Me 3"},
      {"movie_title": "The Hobbit"}]

We formulate the second query using the schema s4 of document d4:

    db.C.find({}, {"movie_title": 1, "versions.language": 1})

We get the following result:

    C2 = [
      {"movie_title": "Fast and Furious"},
      {"movie_title": "Titanic"},
      {"movie_title": "Despicable Me 3"},
      {"movie_title": "The Hobbit", "versions":
        [{"language": "English"}, {"language": "French"}]}
    ]
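The effect of each single-path projection on this collection can be reproduced with a small sketch. Plain Python stands in here for MongoDB's dot-path resolution; the `project` helper and the `films` mapping are our own illustrative names, not part of the paper's implementation:

```python
# Illustrative sketch: resolve a dot-separated path against each document,
# descending into nested objects and arrays, to show which documents each
# single-path query can actually answer.

def project(doc, path):
    """Return the list of values reachable via `path`; empty if the path
    does not exist in the document."""
    current = [doc]
    for attr in path.split("."):
        next_level = []
        for v in current:
            items = v if isinstance(v, list) else [v]
            for item in items:
                if isinstance(item, dict) and attr in item:
                    next_level.append(item[attr])
        current = next_level
    return current

films = {
    "d1": {"movie_title": "Fast and Furious", "year": 2017, "language": "English"},
    "d2": {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}},
    "d3": {"movie_title": "Despicable Me 3", "year": 2017},
    "d4": {"movie_title": "The Hobbit",
           "versions": [{"year": 2012, "language": "English"},
                        {"year": 2013, "language": "French"}]},
}

for path in ("language", "details.language", "versions.language"):
    hits = [k for k, d in films.items() if project(d, path)]
    print(path, "->", hits)
# language -> ['d1']
# details.language -> ['d2']
# versions.language -> ['d4']
```

Each path reaches a language in exactly one document, which is why no single-path query over this collection is complete.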
When executing both queries, the query engine returns two results. As expected, all information related to the field "movie_title" is returned for all documents, as it is located at the document root. For the first query, only the first document matches the field containing the "language" information. The second query succeeds in retrieving the "language" information only from the fourth document. The challenge is how to formulate a single query that retrieves all information related to the field "language" without any redundancy; for instance, the same information related to the field "movie_title" is obtained twice, in the resulting collections C1 and C2.

To solve this, we introduce a transparent way to build relevant schema-independent queries that bypass structural heterogeneity in document stores. Simple knowledge of the required attributes allows users to retrieve the adequate documents regardless of the structural heterogeneity in the collection. This simplifies the task for end-users and provides them with an efficient way to retrieve the information of interest. In the example above, we enrich the original user query by adding all possible navigational paths to retrieve the relevant documents. For instance, we formulate the query

    db.C.find({}, {"movie_title": 1, "language": 1, "details.language": 1, "versions.language": 1})

in MongoDB syntax. It bypasses the structural heterogeneity in the current state of the collection and projects all desired values.

3 STATE OF THE ART

The widespread use of semi-structured data has increased interest in building solutions enabling queries over semi-structured data. We distinguish existing solutions on the basis of the proposed querying approach: 1) schema-dependent querying, which requires knowledge of the schema in a similar way to conventional relational database systems; 2) schema-independent querying, which does not need any prior schema knowledge from the user and is able to extract the schemas at query time.

The first category of systems is designed to enable queries based on reliable knowledge of the schema, or of the navigational paths to the desired values when dealing with nested data. Such systems offer complicated query languages, such as regular expressions with XQuery or XPath [17] when dealing with XML data. XQuery works with the structure to retrieve precisely the desired results. However, if the user does not know the structure, it is impossible to write the relevant query. Moreover, a single query is generally not able to retrieve data when several schemas have to be considered simultaneously. The same considerations hold for JSONiq [9], the extension of XQuery designed to deal with large-scale data such as JSON. Other systems offer a JavaScript query API, as in the case of MongoDB [5], to build a query by specifying a document with the properties expected to match the results. It offers a broad range of querying capabilities, in particular data processing pipelines. However, the API requires a complex syntax, and queries must explicitly include all the various schema structures within documents to access the data; otherwise, the query engine returns only the documents that match the supplied criteria, even if fields with the desired information exist under paths other than those in the query. Another line of work, SQL++ [19], relies on the rich SQL querying interface. In this case, it is also mandatory to express all exact navigational paths in order to obtain the desired results.

The above systems are designed to support queries over semi-structured data with known schemas. To formulate queries, the user needs to know the exact underlying data structures. They also neglect the fact that a user may have limited knowledge of the data structure, and hence may be unable to formulate correct queries over a heterogeneous dataset.

To overcome these limitations, recent works enable schema-independent querying: the second category underlined at the beginning of this section. The schema does not have to be known in advance at loading time. We classify the studied works into two approaches: (i) physical integration, which refactors the integrated data structures into a unified structure; and (ii) virtual integration, which introduces a custom interface and/or a new query language [23].

In the first direction, several works were designed to deal with semi-structured data. These works share the idea of schema-on-read: there is no need to define schemas before loading data; the implicit schema is inferred later, from the stored datasets, at query time. They expose a relational view over the data to help users build SQL queries. Sinew [22] is able to infer schemas from semi-structured data. It defines for the user a logical view of the inferred schema, and it flattens the data into columns to be stored in a relational database system (RDBMS). Drill [12] enables schema-independent querying via SQL over heterogeneous data without first defining a schema, and supports nested data. Tenzing [18] infers a relational schema from the underlying data, but can only do so for flat structures that can be trivially mapped to a relational schema.

The principle of these solutions implies a heavy physical refactorization, which requires flattening the underlying data structures into a relational format using complex encoding techniques. Hence, the refactorization requires additional resources, such as an external relational database, and extra effort to learn the unified inferred relational schema. Besides, some solutions do not support the flexible nature of semi-structured data [18]; for instance, they cannot handle nested data. A user dealing with those systems has to learn new schemas every time the workload changes or new data arrives, because the relational view and the stored columns must be regenerated after every change.

Virtual integration has also attracted attention from researchers [14, 23]. These works are inspired by the data lake approach [11], where data is collected in its original format for later use. We consider two major classes: i) schema-oriented querying; and ii) keyword querying.

Works from the first class infer the schema from a collection of data and offer users the possibility of querying the inferred schema and checking whether a field or sub-schema exists, to guide them while developing their applications. In [23], the authors propose to summarize all the document schemas under a skeleton, to discover the existence of fields or sub-schemas inside the collection. In [14], the authors suggest extracting all the schemas present in the collection, to make final users aware of the schemas and all the fields in the integrated collection of documents. These solutions are limited to type and field identification, and are not used to determine the different paths that give access to a field in the collection.

Keyword querying has been adopted in the context of XML [10, 24]. The process of answering a keyword query on XML data starts by identifying the occurrences of the keywords within the documents (possibly through some free-text search). They
take as input the searched keywords and return the subset of the document that matches the query keywords. A score is computed based on the structure of the sub-documents and, according to this score, the respective XML documents containing all the keywords are returned.

Works on keyword querying suggest a pairwise comparison or a binary search to identify the possible positions of the queried keywords. This concept is not well tailored to a large number of documents with complex structures (different nested elements, numerous attributes, etc.).

Based on this state of the art, we build our approach on the idea of offering virtual integration to enable schema-independent querying, using keywords based on the attributes, while supporting native semi-structured features such as nested attributes and heterogeneous collections of documents.

Our work relates in some ways to previous attempts at XML keyword querying [16]. The most important contribution of these earlier efforts is to free users from learning complex underlying schemas, as well as a complex query language to manage paths. We adopt the idea that the user may not be aware of all existing schemas and cannot manage overly complex queries, in order to enable schema-independent querying based only on knowledge of the field holding the desired information. The main difference between our work and keyword querying is that we require from the user some simple details about the queried data. For instance, if we execute the keyword query "English", it is possible to obtain as results both "language": "English" and "movie_title": "Johnny English". With our approach, we specify that we are interested in the field "language" = "English", or in the field "movie_title" containing "English".

4 QUERYING A HETEROGENEOUS COLLECTION OF DOCUMENTS

In our proposal, we want to enable queries over multi-structured documents by automatically handling the underlying structural heterogeneity. Thus, our query rewriting engine gives transparent support for the heterogeneity of both stored and future data.

Figure 3: An overview of EasyQ

To give an overview of our approach, let us consider the following selection query (the selection operation is defined later in this section):

    σ("year"=1997)(C)

We refer to the collection presented in Figure 2, in which we notice that there exist three distinct navigational paths leading to the attribute "year", i.e., "details.year", "versions.year" and "year". For each document, at least one path can lead to the attribute "year". It is possible to express the selection predicate in disjunctive form over the navigational paths:

    X = ("year" = 1997) ∨ ("details.year" = 1997) ∨ ("versions.year" = 1997)

We rewrite the initial query into σ(X)(C). Two conditions have to be satisfied to select a document: (i) at least one navigational path from the sub-conditions of X exists inside the document, and (ii) the evaluation of at least one of these sub-conditions is true. Otherwise, X is false and the document is not returned in the result.

The challenges are how to enable schema-independent querying in a transparent way, and how to support future new structures without revising the application code.

Figure 3 gives a high-level illustration of our query rewriting engine, called EasyQ. EasyQ is used early, in the data loading phase, to materialize a dictionary that tracks the different navigational paths of all attributes. EasyQ is also used at query time, to enrich the user query Q and bypass the structural heterogeneity. It takes as input the user query, formulated over final fields or sub-paths, and the desired collection. The query rewriting engine produces one extended query Qext that is executed by the underlying document store. The result of this extended query is a collection containing the relevant information. An important consequence of this architecture is that the same user query, evaluated at different moments, is rewritten each time. So, if new documents with new structures have been inserted into the collection (or existing documents have been updated), these new structures are automatically handled and the results remain relevant to the query.

In the rest of this section, we describe the formal model of multi-structured documents, the dictionary, and the querying operators over multi-structured documents. Finally, we formally define how we rewrite the queries.

4.1 Multi-structured data modeling

Definition 4.1 (Collection). A collection C is defined as a set of documents

    C = {d1, ..., d|C|}

Each document di is considered as a (key, value) pair, where the value takes the form vi = {ai,1 : vi,1, ..., ai,n : vi,n}.

Definition 4.2 (Document). A document di, ∀i ∈ [1, c], is defined as a (key, value) pair

    di = (kdi, vdi)

    • kdi is a key that identifies the document (by abuse of notation, we denoted the key kdi as i in Sections 1 and 2);
    • vdi = {adi,1 : vdi,1, ..., adi,n : vdi,n} is the document value. The document value vdi is defined as an object composed of a set of (adi,j, vdi,j) pairs, where each adi,j is a string called an attribute and each vdi,j is a value that can be atomic (numeric, string, boolean, null) or complex (object, array). A value vdi,j is defined below.

An atomic value is defined as follows, ∀j ∈ [1..n]:

    • vdi,j = n if n ∈ N∗, the set of numeric values (integer, float);
    • vdi,j = "s" if "s" is a string formulated in Unicode, A∗;
    • vdi,j = b if b ∈ B, the set of boolean values {true, false};
    • vdi,j = ⊥ is the null value.

A complex value is defined as follows, ∀j ∈ [1..n]:
    • vdi,j = {adi,j,1 : vdi,j,1, ..., adi,j,m : vdi,j,m} is an object value, where the vdi,j,k, ∀k ∈ [1..m], are values and the adi,j,k, ∀k ∈ [1..m], are strings in A∗ called attributes. This is a recursive definition, identical to the document value;
    • vdi,j = [vdi,j,1, ..., vdi,j,l] represents an array of values vdi,j,k, ∀k ∈ [1..l], with l = ∥vdi,j∥.

In the case of document values vdi,j that are objects or arrays, their inner values vdi,j,k can be complex values too, allowing different nesting levels. To cope with nested documents and to navigate through schemas, we adopt the navigational path notations of [3, 15].

Definition 4.3 (Schema). The schema sdi, inferred from the document value vdi = {adi,1 : vdi,1, ..., adi,n : vdi,n}, is defined as

    sdi = {p1, ..., pmi}

where pj, ∀j ∈ [1..mi], is the path of an attribute of vdi, or the navigational path of a nested value such as vdi,j,k. For multiple nesting levels, the navigational path is extracted recursively to find the path from the root to the final atomic value that can be

    dictC = {
      (movie_title, {movie_title}),
      (year, {year, details.year, versions.1.year, versions.2.year}),
      (language, {language, details.language,
          versions.1.language, versions.2.language}),
      (details, {details}),
      (details.year, {details.year}),
      (details.language, {details.language}),
      (versions, {versions}),
      (versions.1, {versions.1}),
      (versions.1.year, {versions.1.year}),
      (versions.1.language, {versions.1.language}),
      (versions.2, {versions.2}),
      (versions.2.year, {versions.2.year}),
      (versions.2.language, {versions.2.language})
    }

For example, the entry (year, {year, details.year, versions.1.year, versions.2.year}) gives all the navigational paths leading to the attribute "year".
found in the document hierarchy.
   A schema svdi of value vdi from document di is formally                        4.2    Querying multi-structured data
defined as:                                                                       Querying multi-structured data is possible via a combination of
     • if vdi , j is atomic, sdi = sdi ∪ {ai, j };                                a set of unary operators. In this paper, we limit the querying
     • if vdi , j is object, sdi = sdi ∪ {adi , j } ∪ {∪p ∈sdi , j adi , j .p}    process to projection and selection operators expressed by native
       where sdi , j is the schema of vdi , j ;                                   MongoDB operators “f ind” and “aддreдate”.
                                                           ∥vd , j ∥
     • if vdi , j is an array, sdi = sdi ∪ {adi , j } ∪j=1 i                         4.2.1 Minimal closed kernel of unary operators. We define a
                                                                                minimal closed kernel of unary operators. We call Cin the queried
         { adi , j .k} ∪ {∪p ∈sdi , j,k adi , j .k.p} where sdi , j,k is the      collection, and Cout the resulting collection.
        schema of the k t h value from the array vdi , j ;                           Definition 4.6 (Projection). The project operator helps to reduce
                                                                                  initial schemas of documents from the collection to a finite sub-
   Example. Let us consider the documents d 1 and d 2 of Figure 2
                                                                                  set of attributes as;
. The underlying schema for both documents is described as
follows:                                                                                                    πA (Cin ) = Cout
   svd1 = {“movie_title”, “year ”, “lanдuaдe”}
                                                                                     where A ⊆ Sin is a sub-set of attributes from SCin ( the schema
  svd2 = {“movie_title”, “details”, “details.year ”, “details.lanдuaдe”}
                                                                                  of the input collection Cin
   We notice that the attribute “details” from document d 2 is                      Definition 4.7 (Selection). The select operator runs to retrieve
a complex one in which are nested the attributes “year ” and                      only documents that match some predicates; we call
“lanдuaдe” which leads to have two different navigational paths
                                                                                                             σp (Cin ) = Cout
“details.year ” and “details.lanдuaдe”.
                                                                                     where p refers to the predicate (or condition) for the selection
   Definition 4.4 (Collection Schema). The schema SC is inferred
                                                                                  operator. A simple predicate is expressed by ak ωk vk where ak ⊆
from collection C is defined by
                                                                                  SCin is an attribute, ωk ∈ {= ; > ; < ; , ; ≥ ; ≤ } is a comparison
                                  c
                                 Ø                                                operator, and vki is a value. It is possible to combine predicates by
                            SC =    sv d i
                                                                                  these operator from Ω = { ∨, ∧, ¬} and this leads to a complex
                                     i=1
                                                                                  predicate.
   Definition 4.5 (Dictionary). The dictionary dictC of a collection
C is defined by                                                                      We call Normp the normal conjunctive form of the predicates
                                                                                  p defined as follows:
                    ∀pk ∈ SC , dictC = {(pk , △k )}
                                                                                                              Û Ü                            
     • pk ∈ SC is a path for an attribute which is present at least                                Normp =               ai ,j ϖ i ,j vi, j
       in one document of the collection;                                                                      i     j
     • △k = {ppk, 1 , . . . , ppk,q } ⊆ SC is a set of navigational
                                                                                    We consider that all predicates in selection operators as in
       paths leading to pk ;
                                                                                  normal conjunctive form.
   For the rest of this paper, we will call equally any path pk as
                                                                                    Definition 4.8 (Query). A query Q can be formulated by com-
attribute. We will use dictionary paths and dictionary attributes
                                                                                  posing operators.
accordingly.
   Example. The dictionary dictC constructed from the collec-
tion C is defined below, each dictionary entry pk refers to the set                                       Q = q 1 ◦ · · · ◦ qr (C)
of all extracted navigational paths.                                              where ∀i ∈ [1, r ] qi ∈ {π , σ }
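To make Definitions 4.3–4.5 concrete, the path extraction and dictionary construction can be sketched in Python (a minimal illustration under our own naming, not the EasyQ implementation; the sample documents in the test mimic Figure 2). Array positions are numbered from 1, following the paper's "versions.1" notation, and each entry Δ_k collects every schema path whose suffix is p_k:

```python
def paths(value, prefix=""):
    """Recursively collect the navigational paths of a document value
    (Definition 4.3): objects contribute attribute paths, arrays
    contribute 1-based positional paths, atomic values add nothing."""
    found = set()
    if isinstance(value, dict):                    # object value
        for attr, sub in value.items():
            path = f"{prefix}.{attr}" if prefix else attr
            found.add(path)
            found |= paths(sub, path)
    elif isinstance(value, list):                  # array value, 1-based as in "versions.1"
        for k, sub in enumerate(value, start=1):
            path = f"{prefix}.{k}"
            found.add(path)
            found |= paths(sub, path)
    return found

def build_dictionary(collection):
    """dict_C: map each path p_k of the collection schema S_C to the set
    of navigational paths leading to it (Definitions 4.4 and 4.5)."""
    schema = set()                                 # S_C, union of document schemas
    for document in collection:
        schema |= paths(document)
    return {pk: {p for p in schema if p == pk or p.endswith("." + pk)}
            for pk in schema}
```

For two documents shaped like d_1 and d_2 of Figure 2, build_dictionary produces the entry (year, {year, details.year}), and adding a document with a versions array extends it to the four-path entry shown in the dictionary example.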
Example. Let us consider the collection presented in Figure 2.

q1: σ("language"="English")(C) = [
  {"movie_title": "Fast and furious", "year": 2017, "language": "English"}
]

q2: π("movie_title","year")(C) = [
  {"movie_title": "Fast and furious", "year": 2017},
  {"movie_title": "Titanic"},
  {"movie_title": "Despicable Me 3", "year": 2017},
  {"movie_title": "The Hobbit"}
]

q3: π("movie_title","year")(σ("language"="English")(C)) = [
  {"movie_title": "Fast and furious", "year": 2017}
]

Here, the query q3 is constructed by combining the select and project operators.

4.2.2 Query extension for multi-structured data. In this section, we introduce a new query extension algorithm that automatically enriches the user query. The native query engine of document-oriented stores such as MongoDB can efficiently execute our rewritten queries, making it possible to find all desired information regardless of the structural heterogeneity inside the collection.

Algorithm 1: Automatic extension of the initial user query
  input:  Q
  output: Q_ext
  Q_ext ← id                                    // identity
  foreach q_i ∈ Q do
    switch q_i do
      case π_Ai:                                // projection
        A_ext ← ∪_{a_k ∈ A_i} Δ_k
        Q_ext ← Q_ext ∘ π_A_ext
      case σ_p:                                 // selection
        P_ext ← ∧_i ∨_j ∨_{a_k ∈ Δ_i,j} (a_k ω_i,j v_i,j)
        Q_ext ← Q_ext ∘ σ_P_ext
    end
  end

Our approach aims to enable transparent querying over a multi-structured collection of documents via automatic query rewriting. This process employs the materialized dictionary to enrich the original query with the different navigational paths that lead to the desired attributes. Algorithm 1 describes the query extension process as follows:

• In the case of a projection operation, the algorithm extends the list of attributes A_i by uniting the navigational paths Δ_k of each projected attribute a_k.
• In the case of a selection operation, the algorithm enriches the predicate p, expressed in conjunctive normal form, with the extended disjunctions built from the navigational paths Δ_i,j of each attribute a_i,j.

Example. Let us consider the query q3 from the previous example. First, the query rewriting engine extends the project operator (case "projection" in Algorithm 1):

π("movie_title","year")(C)

For each projected field, the process consults the dictionary and extracts all possible navigational paths. The dictionary entry for the field movie_title corresponds to:

(movie_title, {movie_title})

So, A_ext = {movie_title}.

The dictionary entry for the field year corresponds to:

(year, {year, details.year, versions.1.year, versions.2.year})

So, A_ext = {movie_title, year, details.year, versions.1.year, versions.2.year}.

The projection query is then rewritten as:

π("movie_title","year","details.year","versions.1.year","versions.2.year")(C)

Next, the process continues with the selection query (case "selection" in Algorithm 1):

σ("language"="English")(C)

The dictionary entry for the field "language" corresponds to:

(language, {language, details.language, versions.1.language, versions.2.language})

So, P_ext = {"language"="English" ∨ "details.language"="English" ∨ "versions.1.language"="English" ∨ "versions.2.language"="English"}.

The selection is then rewritten as:

σ("language"="English" ∨ "details.language"="English" ∨ "versions.1.language"="English" ∨ "versions.2.language"="English")(C)

Finally, the query rewriting engine generates the final query by combining the generated queries:

π("movie_title","year","details.year","versions.1.year","versions.2.year")
(σ("language"="English" ∨ "details.language"="English" ∨ "versions.1.language"="English" ∨ "versions.2.language"="English")(C))
= [
  {"movie_title": "Fast and furious", "year": 2017},
  {"movie_title": "Titanic", "details": {"year": 2017}},
  {"movie_title": "The Hobbit", "versions": [{"year": 2017}]}
]

The query rewriting process injects additional complexity into the original user's queries.

5 EXPERIMENTS
In this section, we conduct a series of experiments to study the following points:

• What is the effect of query rewriting on execution time while varying the size of the collection, and is this cost acceptable?
• Is the time to build the dictionary acceptable, and how does the size of the dictionary grow with structural variability?
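As a concrete rendering of the rewriting exercised by the experiments below, the two cases of Algorithm 1 can be sketched as plain Python functions that produce a MongoDB-style projection set and filter document (an illustrative sketch, not the EasyQ implementation; function names are ours and only simple equality predicates are shown):

```python
def extend_projection(attributes, dictionary):
    """Projection case of Algorithm 1: A_ext is the union of the
    navigational paths of every projected attribute."""
    a_ext = set()
    for ak in attributes:
        a_ext |= dictionary.get(ak, {ak})      # unknown attributes kept as-is
    return a_ext

def extend_selection(ak, value, dictionary):
    """Selection case of Algorithm 1 for one simple equality predicate:
    rewrite ak = value into a disjunction over all paths leading to ak,
    expressed as a MongoDB filter document using $or."""
    clauses = [{path: value} for path in sorted(dictionary.get(ak, {ak}))]
    return clauses[0] if len(clauses) == 1 else {"$or": clauses}
```

With the dictionary of the running example, extend_selection("language", "English", dict_C) yields {"$or": [{"details.language": "English"}, {"language": "English"}, {"versions.1.language": "English"}, {"versions.2.language": "English"}]}, a filter that MongoDB's find can evaluate natively.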
Next, we explain the experimental protocol, then we study the query execution cost, and finally we evaluate the dictionary generation time and its size.

5.1 Experimental protocol
We choose to run all queries on synthetic datasets loaded into the MongoDB document store. In this section, we introduce the details of the experimental setup, the process of generating the synthetic datasets, and the evaluation query set. Later on, we present the results of executing the evaluation set in three separate contexts. The goal is to compare the cost of executing the rewritten queries against: (i) the cost of executing the original queries on homogeneous documents, and (ii) the execution time of several distinct queries built manually for each schema. Then, we study the effects of heterogeneity on the dictionary in terms of size and construction time. Finally, we evaluate the scale of heterogeneity and its impact on generating the rewritten queries.

We conducted our experiments on MongoDB v3.4, on an i5 3.4 GHz machine with 16 GB of RAM and 4 TB of storage, running CentOS 7.

5.1.1 Dataset. To study structural heterogeneity, we generate custom synthetic datasets. First, we collected a JSON collection of documents from imdb¹ that describe movies. The original dataset has only flat documents with 28 attributes per document. Then, we reuse this flat collection to produce documents with structural heterogeneity. For each generated dataset, we can define several parameters, such as the number of schemas to produce in the collection and the percentage of presence of every generated schema. For each schema, we can adjust the number of grouping objects. By grouping object, we mean a compound field in which we nest a subset of the document attributes. To ensure heterogeneity within documents, the grouping objects are unique in every schema; in other words, the same grouping objects cannot be found inside two structures. Only the original fields from the flat dataset are common to all documents. The values of those fields are randomly chosen from the original film collection. To add more complexity, we can set the nesting level used for each structure. For the rest of the experiments, we built our datasets based on the characteristics described in Table 1. We generate collections of 10, 25, 50 and 100 GB of data.

Setting                              | Value
# of schemas                         | 10
# of grouping objects per schema     | {5,6,1,3,4,2,7,2,1,3}
Nesting levels per schema            | {4,2,6,1,5,7,2,8,3,4}
Percentage of schema presence        | 10%
# of attributes per schema           | Random
# of attributes per grouping object  | Random
Table 1: Settings of the generated dataset

We generate a flat collection with the same leaf attributes and their corresponding values as found in the heterogeneous datasets. This new collection provides a proper environment to compare the execution time of the rewritten query on the heterogeneous datasets against the execution time of the original query on the homogeneous dataset. Therefore, we ensure that every query returns identical results on both the heterogeneous and flat datasets. The same result implies: i) the same number of documents, and ii) the same values for their attributes (leaf fields).

5.1.2 Queries. We choose to build a synthetic set of queries based on the different comparison operators supported by MongoDB. We employ the classical comparison operators, i.e., {<, >, ≤, ≥, =, ≠}, for numerical values, as well as the classical logical operators, i.e., {and, or}, between query predicates. We also employ a regular expression to deal with string values. We select 8 attributes of different types and at different levels inside the documents of the heterogeneous datasets. Table 2 shows, for each attribute, its type and the selection operator used later while formulating the synthetic queries. In addition, we present for each attribute the number of possible paths found in the synthetic heterogeneous collection, the different nesting levels, and the selectivity of the predicate.

Predicate | Attribute    | Type   | Operator    | Paths | Depths            | Selectivity
p1        | DirectorName | String | Regex{^A}   | 8     | {8,2,3,9,6,5,4,7} | 0.06%
p2        | Gross        | Int    | > 100 k     | 7     | {7,8,2,3,9,6,4}   | 66%
p3        | Language     | String | = "English" | 7     | {7,8,3,9,6,5,4}   | 0.018%
p4        | Imdb_score   | Float  | < 4.7       | 8     | {8,7,2,3,4,5,6,9} | 29%
p5        | Duration     | Int    | ≤ 200       | 7     | {7,8,2,3,6,5,4}   | 77%
p6        | Country      | String | ≠ Null      | 6     | {7,2,3,9,5,4}     | 100%
p7        | year         | Int    | < 1950      | 7     | {7,8,2,3,6,5,4}   | 23%
p8        | FB_likes     | Int    | ≥ 500       | 7     | {6,2,3,8,5,4,3}   | 83%
Table 2: Query predicates

We formulate the following 6 queries:

• Q1: p1 ∧ p2
• Q2: p1 ∨ p2

With queries Q1 and Q2, the rewritten queries contain 15 predicates, unlike the original queries that contain 2 predicates. The 15 predicates are due to the 8 existing paths for DirectorName in p1 plus the 7 paths for Gross in p2, included in disjunctive form as described in the rewriting algorithm.

• Q3: p1 ∧ p2 ∧ p5 ∧ p7
• Q4: p1 ∨ p2 ∨ p5 ∨ p7

The rewritten versions of Q3 and Q4 contain 29 predicates, unlike the original queries that contain 4 predicates.

• Q5: p1 ∧ p2 ∧ p5 ∧ p7 ∧ p6 ∧ p3 ∧ p4 ∧ p8
• Q6: p1 ∨ p2 ∨ p5 ∨ p7 ∨ p6 ∨ p3 ∨ p4 ∨ p8

Finally, the rewritten versions of Q5 and Q6 contain 57 predicates, unlike the original queries that contain 8 predicates.

Table 3 presents, for each dataset: i) the number of documents inside the collection, and ii) the number of expected results for each executed query.

Collection size in GB | # of documents | Q1    | Q2   | Q3     | Q4     | Q5    | Q6
10 GB                 | 12 M           | 520 K | 8.4 M | 87.2 K | 11.8 M | 340   | 12 M
25 GB                 | 30 M           | 1.3 M | 21 M  | 218 K  | 29.5 M | 850   | 30 M
50 GB                 | 60 M           | 2.6 M | 42 M  | 436 K  | 59 M   | 1.7 K | 60 M
100 GB                | 120 M          | 5.2 M | 84 M  | 872 K  | 118 M  | 3.4 K | 120 M
Table 3: Number of extracted documents per query

¹ imdb.com
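The predicate counts quoted above follow directly from the per-attribute path counts of Table 2: a rewritten query carries one disjunct per navigational path of each of its predicates. A small sanity check (our own illustration):

```python
# Number of navigational paths per predicate, taken from Table 2.
paths_per_predicate = {"p1": 8, "p2": 7, "p3": 7, "p4": 8,
                       "p5": 7, "p6": 6, "p7": 7, "p8": 7}

def rewritten_size(predicates):
    """Total number of simple predicates in the rewritten query:
    one disjunct per navigational path of each original predicate."""
    return sum(paths_per_predicate[p] for p in predicates)
```

rewritten_size(["p1", "p2"]) returns 15, matching the figure given for Q1 and Q2, and the full set of eight predicates gives 57, as stated for Q5 and Q6.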
Figure 4: Query rewriting evaluations

Collection                 | Query                                      | Usage
Homogeneous flat documents | Q_Base (user query)                        | Baseline
Heterogeneous documents    | Q_Rewritten                                | Our solution
Heterogeneous documents    | Q_Accumulated: the sum of ten separate     | "By hand" query context;
                           | queries, each based on a specific schema   | used as a maximum cost line
Table 4: Evaluation contexts

5.2 Queries evaluation
5.2.1 Queries execution time. We define three contexts in which we run the above-defined queries. For each context, we measure the average execution time after executing each query at least five times. The order of query execution is random.

In the following, we present the details of the three evaluation contexts:

• We call Q_Base the query that refers to the initial user query (one of the above-defined queries); it is executed over the homogeneous versions of the datasets. The purpose of this first context is to study the native behavior of the document store, and we use it as a baseline for our experimentation.
• Q_Rewritten refers to the query Q_Base rewritten by our approach. It is executed over the heterogeneous versions of the datasets.
• Q_Accumulated refers to the set of equivalent queries formulated on each possible schema of the collection. In our case, it is made of 10 separate queries, since we are dealing with collections having ten schemas. These queries are built "by hand", as any user without an assisting tool would have done. We do not consider the time necessary to merge the results of each query, as the goal is to compare the time to find the set of result documents. Q_Accumulated is obviously executed over the heterogeneous versions of the datasets.

Table 4 synthesizes our execution contexts.

As shown in Figure 4, our rewritten query Q_Rewritten outperforms the accumulated one, Q_Accumulated. The difference between the two execution scenarios comes from the ability of our rewritten query to automatically include all navigational paths extracted from the collection. Hence, the query is executed only once, whereas the accumulated query may require several passes through the stored collection; this solution requires more CPU load and more intensive disk I/O operations.

We now study the efficiency of the rewritten query compared to the baseline query Q_Base. We notice that the overhead of our solution is at most two times (e.g., for the disjunctive form) compared to the native execution of the baseline query on the homogeneous dataset. Moreover, the overall overhead does not exceed 1.5 times across all six queries. We believe that this overhead is acceptable, since we bypass the costs of refactoring the underlying data structures. Unlike the baseline, our synthetic dataset contains different grouping objects with varying nesting levels; the rewritten query therefore includes several navigational paths, which are processed by the native query engine of MongoDB to find matches in each visited document of the collection.

5.2.2 Dictionary and query rewriting engine at scale. With this series of experiments, we try to push the dictionary and the query rewriting engine to their limits. To this end, we generated heterogeneous synthetic collections of 1 GB. We use the primary 28 attributes of the IMDB flat films collection. The custom collections are generated in such a way that each schema is composed of two grouping objects with no further nesting levels. We generated collections having 10, 100, 1k, 3k and 5k schemas. For this experiment, we keep using the query Q6 introduced earlier in this section.

# of schemas | Rewriting time | Dictionary size
10           | 0.0005 s       | 40 KB
100          | 0.0025 s       | 74 KB
1 K          | 0.139 s        | 2 MB
3 K          | 0.6 s          | 7.2 MB
5 K          | 1.52 s         | 12 MB
Table 5: Scale effects on query rewriting and dictionary size

We present the time needed to build the rewritten query in Table 5. It is notable that this time is very low, less than two seconds. It is also possible to construct a dictionary over a heterogeneous collection of documents; here, our dictionary supports up to 5k distinct schemas. The resulting size of the materialized dictionary is very encouraging, since it does not require significant storage space. Furthermore, we believe that the low query rewriting time represents another advantage of our solution. In this series of experiments, we each time look for the distinct navigational paths of eight predicates. Each rewritten query is composed of numerous disjuncts per predicate: 80 disjuncts for the dataset with 10 schemas, 800 with 100 schemas, 8k with 1k schemas, 24k with 3k schemas and 40k with 5k schemas. We believe that the dictionary and the query rewriting engine scale well when dealing with heterogeneous collections of documents having a large number of schemas.

5.3 Dictionary construction time
In this part, we focus on the dictionary construction process. EasyQ offers the possibility to build the dictionary over an existing dataset or during the data loading phase. In the first case, the dictionary reflects the latest version of the data once all documents are inserted, so the query rewriting engine enriches queries based on an up-to-date dictionary; in the second case, while data loading is in progress, the rewriting may not take into account the most recent changes. In the following, we study both configurations. First, we evaluate the time required to build the dictionary over five pre-loaded collections of 100 GB having 2, 4, 6, 8 and 10 schemas respectively.

We notice from the results in Table 6 that the time elapsed to build the dictionary increases as collections become more heterogeneous. For the collection with 10 structures, the increase does not exceed 40% compared to the collection with 2 structures. We can also notice in Table 6 the negligible size of the generated dictionaries compared to the 100 GB of the collection.

# of schemas                          | 2     | 4     | 6      | 8      | 10
Required time (minutes)               | 96    | 108   | 127    | 143    | 156
Size of the resulting dictionary (KB) | 4,154 | 9,458 | 13,587 | 17,478 | 22,997
Table 6: Time to build the dictionary of pre-loaded data

6 CONCLUSION
NoSQL databases are often called schemaless because they may contain variable schemas among stored data. Nowadays, this variability is becoming a common standard in many applications, for example web applications, social media applications, or the internet of things. Nevertheless, the existence of such structural heterogeneity makes it very hard for users to formulate queries that find relevant and coherent results.

In this paper, to deal with structural heterogeneity, we propose a novel approach for querying heterogeneous documents describing a given entity over NoSQL document stores. The developed tool is called EasyQ. Our objective is to allow users to formulate their queries with minimal knowledge of the data schemas. EasyQ is based on two pillars. The first is a dictionary that contains all possible paths for any existing field. The second is a rewriting module that modifies the user query to match all field paths existing in the dictionary. Our approach is a syntactic manipulation of queries. It is therefore grounded on an important assumption: the collection describes homogeneous entities, i.e., a field has the same meaning in all document schemas. In case of ambiguity, the user should specify some sub-path in order to overcome it. If this assumption is not guaranteed, users may face irrelevant or incoherent results. Nevertheless, this assumption may be acceptable in many applications, such as legacy collections, web applications, or internet of things data.

In our first experiments, the evaluation consists in comparing the execution time cost of basic MongoDB queries and the rewritten queries proposed by our approach. We conduct a set of tests by changing two primary parameters: the size of the dataset and the structural heterogeneity inside a collection (number of different schemas). Results show that the cost of executing the rewritten queries proposed in this paper is higher when compared to the execution of basic user queries, but always less than twice. The
                                                                      overhead added to the performance of our query is due to the
                                                                      combination of multiple access paths to a queried field. Never-
   Afterwards, we give the interest to evaluate the overhead          theless, this time overhead is neglectful when compared to the
that causes the generation of the dictionary at loading time. We      execution of separated “by hand” queries for each schema while
generate five collections of 1GB having the same structures from      heterogeneity issues are automatically managed.
the last experiment (2, 4, 6, 8 and 10 schemas respectively). We         These first results are very encouraging to continue this re-
present two measurements in Table 7. First, we measure the time       search way and need to be strengthened. Short term perspectives
to simply load each collection (without the dictionary building).     are to continue evaluations and to identify the limitations re-
Second, we measure the overall time to build the dictionary while     garding the number of paths and fields in the same query and
loading the collection.                                               regarding time cost. More experiments still to be performed on
                                                                      larger datasets and real case datasets. Another perspective is to
 #of schemas Load (s) Load and dict. (s) Overhead                     enhance the current queries possibilities to introduce all existing
 2            201s      269s             33%                          classical operators of query languages (contains, etc.). It is also
 4            205s      277s             35%                          necessary to deal with other querying operators, particularly the
 6            207s      285s             37%                          aggregation operator.
 8            208s      300s             44%                             The first long-term perspective consists in studying the real-
 10           210s      309s             47%                          time building of the dictionary when integrating data in order to
 Table 7: Study of the overhead added during load time                take into account all possible queries: insert but also delete and
                                                                      update. It’s likely the current simple structure of the dictionary
                                                                      will be transformed in depth to support more complex updates
   In this experiments, we find that the overhead measure does        such as update and delete operations. The second long-term
not exceed 0.5 the time required to only load data. The evolution     perspective consists in managing multi-store databases. The goal
of the time while adding more heterogeneity is linear and not         would be to extend the proposed approach to query data stored
exponential which is encouraging. Many factors may affects the        in different types of databases in a way independent from the
query construction phase. The number of attributes, the nesting       various data schemas and stores. The final goal would be: how to
levels may increase or decrease the overhead. The advantage our       query “transparently” any data store meanwhile being unaware
solutions is once the data is loaded and the dictionary is built      of schemas or real fields names?
or updated, the rewritten query takes automatically all changes
into account.
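For concreteness, the two pillars described above can be sketched as follows. This is an illustrative sketch, not EasyQ's actual implementation: the document contents, function names, and the `$or`-based rewriting strategy are assumptions made for the example.

```python
# Illustrative sketch (not EasyQ's actual code) of the two pillars:
# a dictionary mapping each field name to all of its access paths,
# and a rewriting step that expands a user query into a MongoDB-style
# $or filter over those paths. All document contents are hypothetical.

def collect_paths(doc, prefix="", dictionary=None):
    """Record, for every leaf field, the dotted paths that reach it."""
    if dictionary is None:
        dictionary = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            collect_paths(value, path, dictionary)
        else:
            dictionary.setdefault(key, set()).add(path)
    return dictionary

def build_dictionary(documents):
    """Build the path dictionary over a whole heterogeneous collection."""
    dictionary = {}
    for doc in documents:
        collect_paths(doc, "", dictionary)
    return dictionary

def rewrite(query, dictionary):
    """Expand {field: value} into a filter matching every known path."""
    clauses = []
    for field, value in query.items():
        paths = sorted(dictionary.get(field, {field}))
        clauses.append({"$or": [{path: value} for path in paths]})
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

# Two schemas for the same entity: 'name' is nested in one document
# and flat in the other.
docs = [
    {"title": "Paper A", "venue": {"name": "DOLAP"}},
    {"title": "Paper B", "name": "EDBT"},
]
dictionary = build_dictionary(docs)
rewritten = rewrite({"name": "DOLAP"}, dictionary)
# rewritten == {"$or": [{"name": "DOLAP"}, {"venue.name": "DOLAP"}]}
```

The rewritten filter can be passed unchanged to a MongoDB `find`, which is what keeps the approach a purely syntactic manipulation of queries.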
REFERENCES
 [1] J. Chris Anderson, Jan Lehnardt, and Noah Slater. 2010. CouchDB: The Definitive
     Guide: Time to Relax. O'Reilly Media, Inc.
 [2] Mohamed-Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio
     Ghelli, and Carlo Sartiani. 2017. Schema inference for massive JSON datasets.
     In Extending Database Technology (EDBT).
 [3] Pierre Bourhis, Juan L Reutter, Fernando Suárez, and Domagoj Vrgoč. 2017.
     JSON: data model, query languages and schema specification. In Proceedings
     of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database
     Systems. ACM, 123–135.
 [4] Max Chevalier, Mohammed El Malki, Arlind Kopliku, Olivier Teste, and Ronan
     Tournier. 2015. Implementation of multidimensional databases in column-
     oriented NoSQL systems. In East European Conference on Advances in Databases
     and Information Systems. Springer, 79–91.
 [5] Kristina Chodorow and Michael Dirolf. 2010. MongoDB: The Definitive Guide.
     O'Reilly Media.
 [6] Mohamed Lamine Chouder, Stefano Rizzi, and Rachid Chalal. 2017. Enabling
     Self-Service BI on Document Stores.. In EDBT/ICDT Workshops.
 [7] Alejandro Corbellini, Cristian Mateos, Alejandro Zunino, Daniela Godoy, and
     Silvia Schiaffino. 2017. Persisting big-data: The NoSQL landscape. Information
     Systems 63 (2017), 1–23.
 [8] Avrilia Floratou, Nikhil Teletia, David J DeWitt, Jignesh M Patel, and Donghui
     Zhang. 2012. Can the elephants handle the NoSQL onslaught? Proceedings of
     the VLDB Endowment 5, 12 (2012), 1712–1723.
 [9] Daniela Florescu and Ghislain Fourny. 2013. JSONiq: The history of a query
     language. IEEE internet computing 17, 5 (2013), 86–90.
[10] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. 2003.
     XRANK: Ranked keyword search over XML documents. In Proceedings of the
     2003 ACM SIGMOD international conference on Management of data. ACM,
     16–27.
[11] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An intelli-
     gent data lake system. In Proceedings of the 2016 International Conference on
     Management of Data. ACM, 2097–2100.
[12] Michael Hausenblas and Jacques Nadeau. 2013. Apache drill: interactive
     ad-hoc analysis at scale. Big Data 1, 2 (2013), 100–104.
[13] Robin Hecht and Stefan Jablonski. 2011. NoSQL evaluation: A use case oriented
     survey. In Cloud and Service Computing (CSC), 2011 International Conference
     on. IEEE, 336–341.
[14] Victor Herrero, Alberto Abelló, and Oscar Romero. 2016. NOSQL design for
     analytical workloads: variability matters. In Conceptual Modeling: 35th Inter-
     national Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings
     35. Springer, 50–64.
[15] Jan Hidders, Jan Paredaens, and Jan Van den Bussche. 2017. J-Logic: Logical
     Foundations for JSON Querying. In Proceedings of the 36th ACM SIGMOD-
     SIGACT-SIGAI Symposium on Principles of Database Systems. ACM, 137–149.
[16] Prashant R Lambole and Prashant N Chatur. 2017. A review on XML keyword
     query processing. In Innovative Mechanisms for Industry Applications (ICIMIA),
     2017 International Conference on. IEEE, 238–241.
[17] Yunyao Li, Cong Yu, and HV Jagadish. 2004. Schema-free xquery. In Proceed-
     ings of the Thirtieth international conference on Very large data bases-Volume
     30. VLDB Endowment, 72–83.
[18] Liang Lin, Vera Lychagina, Weiran Liu, Younghee Kwon, Sagar Mittal, and
     Michael Wong. 2011. Tenzing: A SQL implementation on the MapReduce
     framework. (2011).
[19] Kian Win Ong, Yannis Papakonstantinou, and Romain Vernoux. 2014. The
     SQL++ query language: Configurable, unifying and semi-structured. arXiv
     preprint arXiv:1405.3631 (2014).
[20] Pavel Shvaiko and Jérôme Euzenat. 2005. A survey of schema-based matching
     approaches. Journal on data semantics IV (2005), 146–171.
[21] M. Stonebraker. 2012. New opportunities for New SQL. Commun. ACM 55, 11
     (2012), 10–11.
[22] Daniel Tahara, Thaddeus Diamond, and Daniel J Abadi. 2014. Sinew: a SQL
     system for multi-structured data. In Proceedings of the 2014 ACM SIGMOD
     international conference on Management of data. ACM, 815–826.
[23] Lanjun Wang, Shuo Zhang, Juwei Shi, Limei Jiao, Oktie Hassanzadeh, Jia Zou,
     and Chen Wangz. 2015. Schema management for document stores. Proceedings
     of the VLDB Endowment 8, 9 (2015), 922–933.
[24] Rui Zhou, Chengfei Liu, and Jianxin Li. 2010. Fast ELCA computation for key-
     word queries on XML data. In Proceedings of the 13th International Conference
     on Extending Database Technology. ACM, 549–560.