<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards schema-independent querying on document data stores</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hamdi</forename><forename type="middle">Ben</forename><surname>Hamadou</surname></persName>
							<email>hamdi.ben-hamadou@irit.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory" key="lab1">IRIT</orgName>
								<orgName type="laboratory" key="lab2">UMR 5505</orgName>
								<orgName type="institution" key="instit1">Université de Toulouse</orgName>
								<orgName type="institution" key="instit2">UT3</orgName>
								<orgName type="institution" key="instit3">CNRS</orgName>
								<address>
									<settlement>Toulouse</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Faiza</forename><surname>Ghozzi</surname></persName>
							<email>faiza.ghozzi@isims.usf.tn</email>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">Université de Sfax</orgName>
								<orgName type="institution" key="instit2">ISIMS</orgName>
								<address>
									<settlement>Sfax</settlement>
									<region>MIRACL</region>
									<country key="TN">Tunisia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">André</forename><surname>Péninou</surname></persName>
							<email>andre.peninou@irit.fr</email>
							<affiliation key="aff2">
								<orgName type="laboratory" key="lab1">UT2J</orgName>
								<orgName type="laboratory" key="lab2">IRIT</orgName>
								<orgName type="laboratory" key="lab3">UMR 5505</orgName>
								<orgName type="institution" key="instit1">Université de Toulouse</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<address>
									<settlement>Toulouse</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olivier</forename><surname>Teste</surname></persName>
							<email>olivier.teste@irit.fr</email>
							<affiliation key="aff3">
								<orgName type="laboratory" key="lab1">UT2J</orgName>
								<orgName type="laboratory" key="lab2">IRIT</orgName>
								<orgName type="laboratory" key="lab3">UMR 5505</orgName>
								<orgName type="institution" key="instit1">Université de Toulouse</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<address>
									<settlement>Toulouse</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards schema-independent querying on document data stores</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9B546076C85CEADD202721545C88709E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The document is a pervasive semi-structured data model in today's Web and Internet of Things (IoT) applications, where data structures evolve rapidly over time. NoSQL document-oriented databases are well suited to efficiently loading and managing massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying a heterogeneous collection of documents: users must reformulate their original queries, or formulate new ones, whenever new structures arrive in the collection. In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We introduce a query enrichment mechanism that consults a pre-materialized dictionary defining all possible underlying document structures. We automate the process of query enrichment via an algorithm that rewrites select and project operators to support multi-structured documents. To study the performance of our approach, we conduct experiments on a synthetic dataset. First results are promising when compared to the normal execution of queries on a homogeneous dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The popularity of NoSQL systems is growing in the database community thanks to their ability to store and query schema-free data in flexible and efficient ways <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b21">21]</ref>. The document data model is pervasive in most Web and Internet of Things (IoT) applications <ref type="bibr" target="#b12">[13]</ref>, and several database systems support this data model efficiently <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>. Furthermore, in such applications, the structures of documents representing the same entity are subject to change <ref type="bibr" target="#b6">[7]</ref>. An application may therefore face the problem of dealing with multi-structured data <ref type="bibr" target="#b1">[2]</ref>. Formulating relevant queries requires precise knowledge of the data structures, because document stores do not provide native support for querying multi-structured data. Thus, users must manually include all possible navigational paths for the attributes of interest in order to formulate a relevant query. Structural changes require users to reformulate their original queries, which is a time-consuming and error-prone task. The challenge addressed in this paper is how to support querying in the presence of future structural heterogeneity without affecting the application code.</p><p>In the context of document-oriented databases, and due to the flexible nature of documents, it is possible to create a collection of documents describing a single entity with multiple structures. This characteristic gives rise to several kinds of heterogeneity <ref type="bibr" target="#b20">[20]</ref>. Structural heterogeneity refers to diverse representations of documents (e.g., nested or flat structures, different nesting levels) as shown in Figure <ref type="figure">2</ref>. 
Syntactic heterogeneity results from differences in the representation of data (e.g., "movie_title" or "movieTitle"). Semantic heterogeneity arises when the same field refers to distinct concepts in separate documents. This paper focuses on structural heterogeneity.</p><p>Schema-independent querying is an active topic in the study of document-oriented databases in both industry and academia <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b22">22,</ref><ref type="bibr" target="#b23">23]</ref>. Previous work addresses this issue with one of two approaches: (i) physical integration of data, which maps the integrated document structures into a unified structure <ref type="bibr" target="#b22">[22]</ref>; and (ii) virtual integration, which introduces a custom interface exposing a new virtual schema that users must learn in order to query heterogeneous data <ref type="bibr" target="#b23">[23]</ref>. The first approach modifies the underlying data structure, which is not possible when supporting legacy applications designed to run over the original data structures. Moreover, it requires defining a mapping for every original data structure. The second approach requires additional effort from users, who must learn new global structures. It is time-consuming and error-prone whenever new documents with new structures must be queried, since all queries are then subject to revision.</p><p>In this paper, we propose a novel approach to build schema-independent queries designed for querying multi-structured documents. We propose a virtual integration that runs transparently, hides the complexity of building the expected queries, and supports the evolution of structural heterogeneity. 
We always rewrite queries at execution time to guarantee that the latest document structures, as defined in the dictionary, are used.</p><p>Structural heterogeneity refers to the possibility of finding different navigational paths that lead to the same attribute. Attributes are not located at the same position inside documents, and limited knowledge of navigational paths is insufficient to retrieve the required information. In Figure <ref type="figure">1</ref>, the attribute name "country" alone is not sufficient to distinguish between "actors.country" and "director.country" in documents describing films. Sub-paths such as "actors.country" and "director.country", wherever they occur in the document, help to resolve the ambiguity. Therefore, sub-paths may be used rather than attribute names. In all cases, we hypothesise that there exist navigational paths that differentiate the entities contained in the document.</p><p>{ "movie_title":"Fast and Furious", "country": "USA", "actors": [ { "name": "Vin Diesel", "country": "USA" }, ... ], "director" : { "name": "F. Gray Gray", "country": "USA" } }</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 1: Descriptive fields ambiguity</head><p>We introduce EasyQ, which stands for "Easy Query," as a tool to validate our approach. We pay particular attention to MongoDB for the implementation and evaluation. The primary contribution of our work is to reformulate users' original queries, which are written with only simple knowledge of the descriptive fields or sub-paths that contain the desired information. User queries are formulated in a schema-independent fashion, i.e., users can formulate an initial query based on a subset of the possible schemas without caring about all available structures. The query rewriting engine is responsible for transparently reformulating the initial query so that it matches all existing schemas and returns a relevant result. To deal with document schema heterogeneity, we define a dictionary that contains all possible paths for all existing fields. The query rewriting engine enriches the user query with all possible paths found in the dictionary for each field used in the user query. In this paper, we use the terms field, descriptive field and attribute interchangeably.</p><p>The rest of the paper is structured as follows. Section 2 illustrates the problem addressed in this paper. Section 3 reviews the most relevant works that provide support for querying multi-structured documents. Section 4 describes our approach in detail. Section 5 presents our first experiments and the performance of our approach while varying the size and the number of schemas per collection. Section 6 summarizes our findings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">QUERYING DOCUMENT STORES WITH MULTIPLE SCHEMA ISSUES</head><p>As discussed earlier, querying multi-structured data is a complex task. The problem is which structure to use when formulating queries and how this choice affects the results. In the following, we present a simple illustrative example. Let $C = \{d_1, d_2, d_3, d_4\}$ be a set of four films as documented in Figure <ref type="figure">2</ref>. In this example we represent documents using JSON (JavaScript Object Notation). Most NoSQL systems support this notation for representing semi-structured data. A document $d_i$ is defined by a key-value pair $(i, v_i)$ where $i$ is the key and $v_i$ is the value described with JSON.</p><p>Let us consider that we want to retrieve the available languages for each movie. We formulate a projection query with the fields "movie_title" and "language" using MongoDB syntax as follows: db.C.find({}, {"movie_title": 1, "language": 1}).</p><p>In this query, the field "movie_title" does not cause any difficulty since it is always at the same structural level in the four documents. Therefore, the query engine is able to locate all information related to the field "movie_title". However, the field "language" may cause some information loss since it is found at several positions across documents.</p><p>d1: { "movie_title":"Fast and furious", "year":2017, "language":"English" }, d2: { "movie_title": "Titanic", "details": {"year":1997,"language":"English"} }, d3: { "movie_title": "Despicable Me 3", "year":2017 }, d4: { "movie_title": "The Hobbit", "versions": [{"year":2012, "language":"English"}, {"year":2013, "language":"French"}] }</p><p>Figure <ref type="figure">2</ref>: Four documents of the films collection</p><p>Thus, assuming that we formulate a query with limited knowledge of the structure $s_2$ from $d_2$, we build a query with the fields "movie_title" and "details.language": 
db.C.find({}, {"movie_title": 1, "details.language": 1}). Executing such a query in MongoDB leads to an incomplete result since the "details.language" field is not available in documents $d_1$, $d_3$, and $d_4$. The problem comes from structural heterogeneity, i.e., the different structural positions of the field "language": "language" in document $d_1$, "details.language" in document $d_2$, and "versions.language" in document $d_4$. Hence, we would have to include all these paths in the query using specific and often complex syntax.</p><p>Alternatively, we can try to formulate two different queries. The first one is formulated over the schema $s_1$ of the document $d_1$ in order to retrieve the title and "language" of each film. We use the following MongoDB query: db.C.find({}, {"movie_title": 1, "language": 1}). We then get the following result:</p><formula xml:id="formula_0">C_1 = [ {"movie_title": "Fast and furious", "language": "English"}, {"movie_title": "Titanic"}, {"movie_title": "Despicable Me 3"}, {"movie_title": "The Hobbit"} ]</formula><p>We formulate the second query using the schema $s_4$ of document $d_4$:</p><p>db.C.find({}, {"movie_title": 1, "versions.language": 1})</p><p>We get the following result:</p><formula xml:id="formula_1">C_2 = [ {"movie_title": "Fast and furious"}, {"movie_title": "Titanic"}, {"movie_title": "Despicable Me 3"}, {"movie_title": "The Hobbit", "versions": [{"language": "English"}, {"language": "French"}]} ]</formula><p>When executing both queries, the query engine returns two results. As expected, the information related to the field "movie_title" is returned for all documents since it is located at the document root. The first query retrieves "language" information only from the first document, and the second query only from the fourth document. 
The challenge is how to formulate a single query that retrieves all information related to the field "language" without any redundancy. For instance, the same information related to the field "movie_title" is obtained twice in the resulting collections</p><formula xml:id="formula_2">C_1 \text{ and } C_2.</formula><p>To solve this, we introduce a transparent way to build relevant schema-independent queries that bypass structural heterogeneity in document stores. Simple knowledge of the required attributes allows users to retrieve adequate documents regardless of the structural heterogeneity in the collection. This simplifies the task for end-users and gives them an efficient way to retrieve the information of interest. In the case of the example above, we enrich the original user query by adding all possible navigational paths to retrieve relevant documents. For instance, we formulate the query db.C.find({}, {"movie_title": 1, "language": 1, "details.language": 1, "versions.language": 1}) in MongoDB syntax. It bypasses the structural heterogeneity in the current state of the collection and projects all desired values.</p></div>
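<div xmlns="http://www.tei-c.org/ns/1.0"><p>The enrichment of a projection query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name enrich_projection and the hard-coded table KNOWN_PATHS are hypothetical and are not part of EasyQ. The resulting dictionary has the shape of the projection document passed as the second argument of a MongoDB find call.</p>
<p><code lang="python"><![CDATA[
```python
# Hypothetical path table (an assumption for this sketch): for each field name,
# every path under which the field occurs in the collection C of Figure 2.
KNOWN_PATHS = {
    "movie_title": ["movie_title"],
    "language": ["language", "details.language", "versions.language"],
}


def enrich_projection(fields, known_paths):
    """Return a MongoDB-style projection document covering all known paths."""
    projection = {}
    for field in fields:
        # Fall back to the field itself when no alternative path is known.
        for path in known_paths.get(field, [field]):
            projection[path] = 1
    return projection


enriched = enrich_projection(["movie_title", "language"], KNOWN_PATHS)
print(enriched)
```
]]></code></p>
<p>The user only names "movie_title" and "language"; the three paths of "language" are added by the lookup, so the user query itself does not change when a new path is registered in the table.</p></div>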
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">STATE OF THE ART</head><p>The widespread use of semi-structured data has increased interest in solutions that enable queries over such data. We distinguish existing systems on the basis of their querying approach: 1) schema-dependent querying, which requires knowledge of the schema in a similar way to conventional relational database systems; and 2) schema-independent querying, which does not need any prior schema knowledge from the user and is able to extract the schemas at query time.</p><p>The first category of systems is designed to enable queries based on reliable knowledge of the schema or of the navigational paths to the desired values when dealing with nested data. Such systems offer complex query languages, such as XQuery or XPath with regular expressions <ref type="bibr" target="#b17">[17]</ref> when dealing with XML data. XQuery works with the structure to retrieve precisely the desired results. However, if the user does not know the structure, it is impossible to write the relevant query. Moreover, a single query is generally unable to retrieve data when several schemas must be considered simultaneously. The same considerations apply to JSONiq <ref type="bibr" target="#b8">[9]</ref>, an extension of XQuery designed to deal with large-scale data such as JSON data. Other systems, such as MongoDB <ref type="bibr" target="#b4">[5]</ref>, offer a JavaScript query API in which a query is built by specifying a document whose properties are expected to match the results. This API offers a broad range of querying capabilities, in particular data processing pipelines. However, it requires a complex syntax, and queries must explicitly include all the various schema structures within documents to access data. 
Otherwise, the query engine returns only the documents that match the supplied criteria, even if the fields with the desired information exist under paths other than those given in the query. Another line of work, SQL++ <ref type="bibr" target="#b19">[19]</ref>, relies on the rich SQL querying interface. In this case too, all exact navigational paths must be expressed in order to obtain the desired results.</p><p>The systems discussed above are designed to support queries over semi-structured data with known schemas. To formulate queries, users need to know the exact underlying data structures. These systems neglect the fact that users may have limited knowledge of the data structure and hence may be unable to formulate correct queries over a heterogeneous dataset.</p><p>To overcome these limitations, recent work enables schema-independent querying, the second category identified at the beginning of this section. The schema thus does not need to be known in advance at loading time. We classify the studied works according to two approaches: (i) physical integration, which refactors the integrated data structures into a unified structure; and (ii) virtual integration, which introduces a custom interface and/or a new query language <ref type="bibr" target="#b23">[23]</ref>.</p><p>In the first direction, several works were designed to deal with semi-structured data. These works share the idea of schema-on-read: there is no need to define schemas before loading data, since the implicit schema is inferred later from the stored datasets at query time. They expose a relational view over the data to help users build SQL queries. Sinew <ref type="bibr" target="#b22">[22]</ref> is able to infer schemas from semi-structured data. It defines a logical view over the inferred schema for the user, and it flattens data into columns to be stored in a relational database system (RDBMS). 
Drill <ref type="bibr" target="#b11">[12]</ref> enables schema-independent querying via SQL over heterogeneous data without first defining a schema, and it supports nested data. Tenzing <ref type="bibr" target="#b18">[18]</ref> infers a relational schema from the underlying data, but can only do so for flat structures that can be trivially mapped to a relational schema.</p><p>These solutions rely on heavy physical refactoring that flattens the underlying data structures into a relational format using complex encoding techniques. The refactoring therefore requires additional resources, such as an external relational database, and extra effort to learn the unified inferred relational schema. Besides, some solutions do not support the flexible nature of semi-structured data <ref type="bibr" target="#b18">[18]</ref>; for instance, they cannot handle nested data. Users of those systems have to learn new schemas every time the workload changes or new data arrives, because the relational view and the stored columns must be regenerated after every change.</p><p>Virtual integration has also received attention from researchers <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b23">23]</ref>. These works are inspired by the data lake approach <ref type="bibr" target="#b10">[11]</ref>, in which data is collected in its original format for later use. We consider two major classes: i) schema-oriented queries; and ii) keyword querying.</p><p>Works from the first class infer the schema from a collection of data and let users query the inferred schema and check whether a field or sub-schema exists, to guide them while developing their applications. In <ref type="bibr" target="#b23">[23]</ref>, the authors propose to summarize all the document schemas in a skeleton in order to discover the existence of fields or sub-schemas inside the collection. 
In <ref type="bibr" target="#b13">[14]</ref>, the authors suggest extracting all the schemas present in the collection to make end users aware of the schemas and of all fields in the integrated collection of documents. These solutions are limited to type and field identification and are not used to determine the different paths that give access to a field in the collection.</p><p>Keyword querying has been adopted in the context of XML <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b24">24]</ref>. Answering a keyword query on XML data starts by identifying the occurrences of the keywords within the documents (possibly through some free-text search). Such systems take the searched keywords as input and return the subset of each document that matches them. A score is computed based on the structure of the sub-documents, and according to this score, the XML documents containing all the keywords are returned.</p><p>Works on keyword querying rely on pairwise comparison or binary search to identify the possible positions of the queried keywords. This is not well suited to a large number of documents with complex structures (different nested elements, numerous attributes, etc.).</p><p>Building on the state of the art, we design our approach around virtual integration, enabling schema-independent querying through attribute-based keywords while supporting native semi-structured features such as nested attributes and heterogeneous collections of documents.</p><p>Our work relates to previous attempts at XML keyword querying <ref type="bibr" target="#b15">[16]</ref>. The most important contribution of these earlier efforts is to spare users from learning complex underlying schemas as well as a complex query language to manage paths. 
We adopt the idea that users may not be aware of all existing schemas and cannot manage overly complex queries, and we therefore enable schema-independent querying based only on knowledge of the field that holds the desired information. The main difference between our work and keyword querying is that we require some simple details about the queried data from the user. For instance, a keyword query "English" may return both "language": "English" and "movie_title": "Johnny English". With our approach, the user specifies whether the condition is that the field "language" equals "English" or that the field "movie_title" contains "English".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">QUERYING HETEROGENEOUS COLLECTION OF DOCUMENTS</head><p>In our proposal, we want to enable queries over multi-structured documents by automatically handling the underlying structural heterogeneity. Thus, our query rewriting engine will give transparent support for the heterogeneity on both stored and future new data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 3: An overview of EasyQ</head><p>To give an overview of our approach, let us consider the following selection query (the selection operation is defined later in this section):</p><formula xml:id="formula_3">\sigma_{("year" = 1997)}(C)</formula><p>We refer to the collection presented in Figure <ref type="figure">2</ref>, in which there exist three distinct navigational paths leading to the attribute "year", i.e., "details.year", "versions.year" and "year". For each document, at least one path can lead to the attribute "year". It is possible to express the selection predicate as a disjunction over navigational paths: $X = ("year" = 1997) \lor ("details.year" = 1997) \lor ("versions.year" = 1997)$</p><p>We rewrite the initial query into $\sigma_{(X)}(C)$. Two conditions have to be satisfied to select a document: (i) at least one navigational path from the sub-conditions of $X$ exists inside the document, and (ii) at least one of these sub-conditions evaluates to true. Otherwise, $X$ is equal to false and the document is not returned in the result.</p><p>The challenges are how to enable schema-independent querying transparently and how to support future new structures without revising the application code.</p><p>Figure <ref type="figure">3</ref> gives a high-level illustration of our query rewriting engine called EasyQ. EasyQ is designed to be used early, in the data loading phase, to materialize a dictionary that tracks the different navigational paths for all attributes. EasyQ is also used at query time to enrich the user query $Q$ and to bypass the structural heterogeneity. It takes as input the user query, formulated over final fields or sub-paths, and the desired collection. The query rewriting engine produces one extended query $Q_{ext}$ that is executed by the underlying document store. The result of this extended query is a collection containing relevant information. 
An important consequence of this architecture is that the same user query, evaluated at different moments, is rewritten each time. So, if new documents with new structures have been inserted in the collection (or existing documents have been updated), these new structures are automatically handled and the results remain relevant to the query.</p><p>In the rest of this section, we describe the formal model of multi-structured documents, the dictionary, and the querying operators across multi-structured documents. Finally, we formally define how we rewrite queries.</p></div>
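<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a hedged illustration of the selection rewriting outlined in this overview, the following Python sketch expands one simple predicate into a disjunction over a list of navigational paths and expresses it as a MongoDB filter document. The function name rewrite_selection and the table MONGO_OPS are assumptions made for this example; only the operators $or, $eq, $gt, $lt, $gte and $lte are taken from the MongoDB query language. The list of paths is given by hand here, whereas EasyQ obtains it from its dictionary.</p>
<p><code lang="python"><![CDATA[
```python
# Mapping from comparison symbols to MongoDB query operators
# (the table itself is an assumption of this sketch).
MONGO_OPS = {"=": "$eq", ">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}


def rewrite_selection(operator, value, paths):
    """Expand one simple predicate into a disjunction over all given paths."""
    clauses = [{path: {MONGO_OPS[operator]: value}} for path in paths]
    # A single path needs no disjunction.
    return clauses[0] if len(clauses) == 1 else {"$or": clauses}


# The three paths leading to "year" in the collection of Figure 2.
year_paths = ["year", "details.year", "versions.year"]
extended_filter = rewrite_selection("=", 1997, year_paths)
print(extended_filter)
```
]]></code></p>
<p>The filter mirrors the predicate X: a document is selected as soon as one of the three sub-conditions holds, and a sub-condition on a path that is absent from a document simply does not match.</p></div>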
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Multi-structured data modeling</head><formula xml:id="formula_4">d_i = (k_{d_i}, v_{d_i})</formula><p>• $k_{d_i}$ is a key that identifies the document (by abuse of notation, we denoted the key $k_{d_i}$ by $i$ in Sections 1 and 2);</p><formula xml:id="formula_5">• v_{d_i} = \{a_{d_i,1} : v_{d_i,1}, \ldots, a_{d_i,n} : v_{d_i,n}\} \text{ is the document value.}</formula><p>The document value $v_{d_i}$ is defined as an object composed of a set of $(a_{d_i,j}, v_{d_i,j})$ pairs, where each $a_{d_i,j}$ is a string called an attribute and each $v_{d_i,j}$ is a value that can be atomic (numeric, string, boolean, null) or complex (object, array). A value $v_{d_i,j}$ is defined below.</p><p>An atomic value is defined as follows, $\forall j \in [1..n]$:</p><formula xml:id="formula_6">• v_{d_i,j} = n \text{ if } n \in \mathbb{N}^*, \text{ the set of numeric values (integer, float)}; • v_{d_i,j} = "s" \text{ if } "s" \text{ is a string formulated in Unicode } A^*; • v_{d_i,j} = b \text{ if } b \in \mathbb{B}, \text{ the set of booleans } \{true, false\}; • v_{d_i,j} = \bot \text{ is a null value};</formula><p>A complex value is defined as follows, $\forall j \in [1..n]$:</p><formula xml:id="formula_7">• v_{d_i,j} = \{a_{d_i,j,1} : v_{d_i,j,1}, \ldots, a_{d_i,j,m} : v_{d_i,j,m}\} \text{ is an object value where } v_{d_i,j,k}, \forall k \in [1..m], \text{ are values, and}</formula><formula xml:id="formula_8">a_{d_i,j,k}, \forall k \in [1..m],</formula><p>are strings in $A^*$ called attributes. This is a recursive definition identical to that of the document value;</p><formula xml:id="formula_9">• v_{d_i,j} = [v_{d_i,j,1}, \ldots, v_{d_i,j,l}] \text{ represents an array of values } v_{d_i,j,k}, \forall k \in [1..l], \; l = \| v_{d_i,j} \|;</formula><p>When document values $v_{d_i,j}$ are objects or arrays, their inner values $v_{d_i,j,k}$ can be complex values too, which allows different nesting levels. 
To cope with nested documents and navigate through schemas, we adopt the navigational path notation <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>Definition 4.3 (Schema). The schema, called $s_{d_i}$, inferred from the document value</p><formula xml:id="formula_10">v_{d_i} = \{a_{d_i,1} : v_{d_i,1}, \ldots, a_{d_i,n} : v_{d_i,n}\}</formula><p>is defined as $s_{d_i} = \{p_1, \ldots, p_{m_i}\}$ where $p_j$, $\forall j \in [1..m_i]$, is the path of an attribute of $v_{d_i}$, or a navigational path for nested values such as $v_{d_i,j,k}$. For multiple nesting levels, the navigational path is extracted recursively to find the path from the root to the final atomic value that can be found in the document hierarchy.</p><p>A schema $s_{v_{d_i}}$ of value $v_{d_i}$ from document $d_i$ is formally defined as:</p><formula xml:id="formula_11">• \text{if } v_{d_i,j} \text{ is atomic, } s_{d_i} = s_{d_i} \cup \{a_{d_i,j}\}; • \text{if } v_{d_i,j} \text{ is an object, } s_{d_i} = s_{d_i} \cup \{a_{d_i,j}\} \cup \{\cup_{p \in s_{d_i,j}} a_{d_i,j}.p\} \text{ where } s_{d_i,j} \text{ is the schema of } v_{d_i,j}; • \text{if } v_{d_i,j} \text{ is an array, } s_{d_i} = s_{d_i} \cup \{a_{d_i,j}\} \cup \bigcup_{k=1}^{\|v_{d_i,j}\|} \left( \{a_{d_i,j}.k\} \cup \{\cup_{p \in s_{d_i,j,k}} a_{d_i,j}.k.p\} \right)</formula><p>where $s_{d_i,j,k}$ is the schema of the $k^{th}$ value of the array $v_{d_i,j}$.</p><p>Example. Let us consider the documents $d_1$ and $d_2$ of Figure <ref type="figure">2</ref>. The underlying schema for both documents is described as follows:</p><p>$s_{v_{d_1}}$ = {"movie_title", "year", "language"}; $s_{v_{d_2}}$ = {"movie_title", "details", "details.year", "details.language"}</p><p>We notice that the attribute "details" from document $d_2$ is a complex one in which the attributes "year" and "language" are nested, which leads to two different navigational paths, "details.year" and "details.language". Definition 4.4 (Collection Schema). The schema $S_C$ inferred from the collection $C$ is defined by</p><formula xml:id="formula_12">S_C = \bigcup_{i=1}^{c} s_{v_{d_i}}</formula><p>Definition 4.5 (Dictionary). 
The dictionary $dict_C$ of a collection $C$ is defined by</p><formula xml:id="formula_13">\forall p_k \in S_C, \; dict_C = \{(p_k, \triangle_k)\}</formula><p>• $p_k \in S_C$ is a path for an attribute which is present in at least one document of the collection;</p><formula xml:id="formula_14">• \triangle_k = \{p_{p_k,1}, \ldots, p_{p_k,q}\} \subseteq S_C \text{ is a set of navigational paths leading to } p_k;</formula><p>In the rest of this paper, we refer to any path $p_k$ as an attribute, and we use the terms dictionary paths and dictionary attributes accordingly.</p><p>Example. The dictionary $dict_C$ constructed from the collection $C$ is defined below; each dictionary entry $p_k$ refers to the set of all extracted navigational paths. </p><formula xml:id="formula_15">dict_C = { (movie_title,</formula></div>
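<div xmlns="http://www.tei-c.org/ns/1.0"><p>The schema and dictionary definitions of this subsection can be sketched in Python as follows. The function names extract_schema and build_dictionary are hypothetical. Two points are assumptions of this sketch rather than statements from the text: array positions are numbered from 1, following the array rule of the schema definition, and "a path leading to $p_k$" is read as "a path of the collection schema that is equal to $p_k$ or ends with $p_k$". The output is what this sketch computes on the four documents of Figure 2 and is not a reconstruction of the example dictionary above.</p>
<p><code lang="python"><![CDATA[
```python
def extract_schema(value, prefix=""):
    """Collect the navigational paths of a document value (after Definition 4.3)."""
    paths = set()
    if isinstance(value, dict):
        for attribute, inner in value.items():
            path = f"{prefix}.{attribute}" if prefix else attribute
            paths.add(path)
            paths |= extract_schema(inner, path)
    elif isinstance(value, list):
        # Array positions are numbered from 1 (assumption, see lead-in).
        for k, inner in enumerate(value, start=1):
            path = f"{prefix}.{k}"
            paths.add(path)
            paths |= extract_schema(inner, path)
    # Atomic values contribute no further paths.
    return paths


def build_dictionary(collection):
    """Map every path of the collection schema to the paths that end with it."""
    collection_schema = set()
    for document in collection:
        collection_schema |= extract_schema(document)
    return {
        p: {q for q in collection_schema if q == p or q.endswith("." + p)}
        for p in collection_schema
    }


# The four documents of Figure 2.
C = [
    {"movie_title": "Fast and furious", "year": 2017, "language": "English"},
    {"movie_title": "Titanic", "details": {"year": 1997, "language": "English"}},
    {"movie_title": "Despicable Me 3", "year": 2017},
    {"movie_title": "The Hobbit",
     "versions": [{"year": 2012, "language": "English"},
                  {"year": 2013, "language": "French"}]},
]

dictionary = build_dictionary(C)
print(sorted(dictionary["language"]))
```
]]></code></p>
<p>With this reading, looking up "language" returns every absolute path under which the attribute occurs, which is the information the query rewriting step needs.</p></div>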
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Querying multi-structured data</head><p>Querying multi-structured data is possible via a combination of unary operators. In this paper, we limit the querying process to the projection and selection operators, expressed by the native MongoDB operators "find" and "aggregate".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Minimal closed kernel of unary operators.</head><p>We define a minimal closed kernel of unary operators. We call $C_{in}$ the queried collection and $C_{out}$ the resulting collection.</p><p>Definition 4.6 (Projection). The project operator reduces the initial schemas of the documents of the collection to a finite subset of attributes:</p><formula xml:id="formula_16">\pi_A(C_{in}) = C_{out}</formula><p>where $A \subseteq S_{C_{in}}$ is a subset of attributes from $S_{C_{in}}$, the schema of the input collection $C_{in}$.</p><p>Definition 4.7 (Selection). The select operator retrieves only the documents that match a predicate:</p><formula xml:id="formula_17">\sigma_p(C_{in}) = C_{out}</formula><p>where $p$ is the predicate (or condition) of the selection operator. A simple predicate is expressed as $a_k \, \omega_k \, v_k$, where $a_k \in S_{C_{in}}$ is an attribute, $\omega_k \in \{=, &gt;, &lt;, \neq, \geq, \leq\}$ is a comparison operator, and $v_k$ is a value. Predicates can be combined with the operators of $\Omega = \{\vee, \wedge, \neg\}$, which yields a complex predicate.</p><p>We call $Norm_p$ the conjunctive normal form of the predicate $p$, defined as follows:</p><formula xml:id="formula_18">Norm_p = \bigwedge_i \bigvee_j a_{i,j} \, \omega_{i,j} \, v_{i,j}</formula><p>We assume that all predicates in selection operators are in conjunctive normal form.</p><p>Definition 4.8 (Query). A query $Q$ is formulated by composing operators:</p><formula xml:id="formula_19">Q = q_1 \circ \cdots \circ q_r (C)</formula><p>where $\forall i \in [1, r]$, $q_i \in \{\pi, \sigma\}$.</p><p>Example. Let us consider the collection presented in Figure <ref type="figure">2</ref>. 
$q_1 : \sigma_{\text{"language"="English"}}(C)$ = [ {"movie_title": "Fast and furious", "year": 2017, "language": "English"} ]</p><p>$q_2 : \pi_{\text{"movie\_title","year"}}(C)$ = [ {"movie_title": "Fast and furious", "year": 2017}, {"movie_title": "Titanic"}, {"movie_title": "Despicable Me 3", "year": 2017}, {"movie_title": "The Hobbit"} ]</p><p>$q_3 : \pi_{\text{"movie\_title","year"}}(\sigma_{\text{"language"="English"}}(C))$ = [ {"movie_title": "Fast and furious", "year": 2017} ]</p><p>Here, the query $q_3$ is constructed by combining the select and project operators.</p></div>
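<div xmlns="http://www.tei-c.org/ns/1.0"><p>The behaviour of the two operators on the example can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the operators are applied to root-level attributes of in-memory documents, the function names are hypothetical, and the sample collection is a guess at the content of Figure 2 (in particular, the nested values and the language of "Despicable Me 3" are made up).</p><p>
```python
# Hypothetical stand-in for the collection of Figure 2; nested values are assumed.
COLLECTION = [
    {"movie_title": "Fast and furious", "year": 2017, "language": "English"},
    {"movie_title": "Titanic", "details": {"year": 2017, "language": "English"}},
    {"movie_title": "Despicable Me 3", "year": 2017, "language": "French"},
    {"movie_title": "The Hobbit", "versions": [{"year": 2017, "language": "English"}]},
]


def select_eq(collection, attribute, value):
    """Selection: keep documents whose root-level attribute equals value."""
    return [doc for doc in collection if doc.get(attribute) == value]


def project(collection, attributes):
    """Projection: keep only the listed root-level attributes that are present."""
    return [{a: doc[a] for a in attributes if a in doc} for doc in collection]


Q1 = select_eq(COLLECTION, "language", "English")
Q2 = project(COLLECTION, ["movie_title", "year"])
Q3 = project(Q1, ["movie_title", "year"])
```
</p><p>Because both operators only look at root-level attributes, Q1 and Q3 miss "Titanic" and "The Hobbit", whose language is nested. This is the limitation that motivates the query extension.</p></div>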
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Query extension for multi-structured data.</head><p>In this section, we introduce a new query extension algorithm that automatically enriches the user query. The native query engine of document-oriented stores such as MongoDB can efficiently execute our rewritten queries. It is thus possible to retrieve all the desired information regardless of the structural heterogeneity inside the collection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1: Automatic extension for the initial user query</head><formula xml:id="formula_20">input: $Q$
output: $Q_{ext}$
$Q_{ext} \leftarrow id$ // identity
foreach $q_i \in Q$ do
  switch $q_i$ do
    case $\pi_{A_i}$: // projection
    do
      $A_{ext} \leftarrow \bigcup_{\forall a_k \in A_i} \triangle_k$
      $Q_{ext} \leftarrow Q_{ext} \circ \pi_{A_{ext}}$
    end
    case $\sigma_p$: // selection
    do
      $P_{ext} \leftarrow \bigwedge_i \bigvee_j \left( \bigvee_{\forall a_k \in \triangle_{i,j}} a_k \, \omega_{i,j} \, v_{i,j} \right)$
      $Q_{ext} \leftarrow Q_{ext} \circ \sigma_{P_{ext}}$
    end
  end
end</formula><p>Our approach aims to enable transparent querying on a multi-structured collection of documents via automatic query rewriting. This process employs the materialized dictionary to enrich the original query by including the different navigational paths that lead to the desired attributes. Algorithm 1 describes the query extension process as follows:</p><p>• In the case of a projection, the algorithm extends the list of attributes $A_i$ with the union of the navigational paths $\triangle_k$ of each projected attribute $a_k$. • In the case of a selection, the algorithm enriches the predicate $p$, expressed in conjunctive normal form, with the set of extended disjunctions built from the navigational paths $\triangle_{i,j}$ of each attribute $a_{i,j}$.</p><p>Example. Let us consider the query $q_3$ from the previous example. First, the query rewriting engine extends the project operator (line "projection" in Algorithm 1): $\pi_{\text{"movie\_title","year"}}(C)$</p><p>For each projected field, the process consults the dictionary and extracts all the possible navigational paths. 
The dictionary entry for the field movie_title is:</p><p>(movie_title, {movie_title}). So, $A_{ext}$ = {movie_title}. The dictionary entry for the field year is:</p><p>(year, {year, details.year, versions.1.year, versions.2.year}). So, $A_{ext}$ = {movie_title, year, details.year, versions.1.year, versions.2.year}. The projection query is then rewritten as: $\pi_{\text{"movie\_title","year","details.year","versions.1.year","versions.2.year"}}(C)$</p><p>Next, the process continues with the selection query (line "selection" in Algorithm 1): $\sigma_{\text{"language"="English"}}(C)$</p><p>The dictionary entry for the field "language" is: (language, {language, details.language, versions.1.language, versions.2.language}). So, $P_{ext}$ = "language" = "English" ∨ "details.language" = "English" ∨ "versions.1.language" = "English" ∨ "versions.2.language" = "English".</p><p>The selection is then rewritten as: $\sigma_{\text{"language"="English"} \vee \text{"details.language"="English"} \vee \text{"versions.1.language"="English"} \vee \text{"versions.2.language"="English"}}(C)$</p><p>Finally, the query rewriting engine generates the final query by combining the generated queries: $\pi_{\text{"movie\_title","year","details.year","versions.1.year","versions.2.year"}}(\sigma_{\text{"language"="English"} \vee \text{"details.language"="English"} \vee \text{"versions.1.language"="English"} \vee \text{"versions.2.language"="English"}}(C))$</p><formula xml:id="formula_21">= [ {"movie_title": "Fast and furious", "year": 2017}, {"movie_title": "Titanic", "details": {"year": 2017}}, {"movie_title": "The Hobbit", "versions": [ {"year": 2017} ]} ]</formula><p>The query rewriting process adds complexity to the original user queries.</p></div>
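<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two cases of Algorithm 1 can be sketched in Python as follows. This is an illustration, not the EasyQ code: the function names, the tuple encoding of predicates and the operator names ("eq", "gt", and so on) are assumptions. The last helper renders the extended predicate as a MongoDB-style filter document using the standard $and, $or and comparison operators; translating positional paths such as versions.1.year to MongoDB's own array addressing is deliberately left out.</p><p>
```python
# Dictionary entries copied from the running example.
DICTIONARY = {
    "movie_title": {"movie_title"},
    "year": {"year", "details.year", "versions.1.year", "versions.2.year"},
    "language": {"language", "details.language",
                 "versions.1.language", "versions.2.language"},
}


def extend_projection(attributes, dictionary):
    """A_ext: union of the navigational paths of every projected attribute."""
    extended = set()
    for attribute in attributes:
        # Attributes missing from the dictionary are kept unchanged (assumption).
        extended |= dictionary.get(attribute, {attribute})
    return extended


def extend_selection(cnf, dictionary):
    """P_ext: each literal (a, op, v) becomes one literal per path leading to a."""
    extended = []
    for clause in cnf:  # outer list: conjunction of clauses
        new_clause = []
        for attribute, op, value in clause:  # inner list: disjunction of literals
            for path in sorted(dictionary.get(attribute, {attribute})):
                new_clause.append((path, op, value))
        extended.append(new_clause)
    return extended


def to_mongo_filter(cnf):
    """Render a CNF predicate as a MongoDB-style filter document."""
    clauses = []
    for clause in cnf:
        literals = [{path: {"$" + op: value}} for path, op, value in clause]
        clauses.append({"$or": literals})
    return {"$and": clauses}


A_EXT = extend_projection(["movie_title", "year"], DICTIONARY)
P_EXT = extend_selection([[("language", "eq", "English")]], DICTIONARY)
FILTER = to_mongo_filter(P_EXT)
```
</p><p>For the query $q_3$, A_EXT holds the five projected paths and P_EXT holds a single clause with four literals, which mirrors the rewritten projection and selection of the example.</p></div>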
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTS</head><p>In this section, we conduct a series of experiments to study the following points:</p><p>• What are the effects on the execution time of the rewritten queries when varying the size of the collection, and is this cost acceptable? • Is the time to build the dictionary acceptable, and how does the size of the dictionary evolve with structural variability?</p><p>Next, we explain the experimental protocol, then we study the query execution cost, and finally we evaluate the dictionary generation time and its size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experimental protocol</head><p>We run all the queries on synthetic datasets loaded into the document store MongoDB. In this section, we introduce the details of the experimental setup, the process of generating the synthetic datasets, and the set of evaluation queries. Later on, we present the results of executing the evaluation set in three separate contexts. The goal is to compare the cost of executing the rewritten queries with (i) the cost of executing the original queries on homogeneous documents, and (ii) the execution time of several distinct queries that we build manually based on each schema. Then, we study the effects of the heterogeneity on the dictionary in terms of size and construction time. Finally, we evaluate the scale of the heterogeneity and its impact on generating the rewritten queries.</p><p>We conducted our experiments on MongoDB v3.4. We used an i5 3.4 GHz machine with 16 GB of RAM and 4 TB of storage space, running CentOS 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1">Dataset.</head><p>To study the structural heterogeneity, we generate custom synthetic datasets. First, we collected a JSON collection of documents from IMDb (imdb.com) that describe movies. The original dataset has only flat documents with 28 attributes in each document. Then, we reuse this flat collection to produce documents with structural heterogeneity. For each generated dataset, we can define several parameters, such as the number of schemas to produce in the collection and the percentage of documents following each generated schema. For each schema, we can adjust the number of grouping objects. By grouping object, we mean a compound field in which we nest a subset of the document attributes. To ensure heterogeneity within documents, the grouping objects are unique to every schema; in other words, the same grouping object cannot be found in two structures. Only the original fields from the flat dataset are common to all documents. The values of those fields are randomly chosen from the original film collection. To add more complexity, we can set the nesting level used for each structure. For the rest of the experiments, we built our dataset based on the characteristics described in Table <ref type="table" target="#tab_1">1</ref>. We generate collections of 10, 25, 50 and 100 GB of data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Table 1 settings (only partially recoverable from the source): # of schemas: 10; # of grouping objects per schema: {5, ... (truncated in the source)</p><p>We generate a flat collection with the same leaf attributes and corresponding values as found in the heterogeneous datasets. This new collection gives us a proper environment to compare the execution time of the rewritten query on the heterogeneous datasets with the execution time of the original query on the homogeneous datasets. Therefore, we ensure that every query returns identical results from both the heterogeneous and the flat datasets. Identical results imply: i) the same number of documents, and ii) the same values for their attributes (leaf fields).</p><p>5.1.2 Queries. We build a synthetic set of queries based on the different comparison operators supported by MongoDB. We employ the classical comparison operators, i.e., {&lt;, &gt;, ≤, ≥, =, ≠}, for numerical values, as well as the classical logical operators, i.e., {and, or}, between query predicates. We also employ a regular expression to deal with string values. We select 8 attributes of different types, located at different levels inside the documents of the heterogeneous datasets. Table <ref type="table">2</ref> shows, for each attribute, its type and the selection operator that we use later when formulating the synthetic queries. In addition, we present for each attribute the number of possible paths found in the synthetic heterogeneous collection, the different nesting levels, and the selectivity of the predicate. </p><p>We formulate the following 6 queries:</p><formula xml:id="formula_24">• Q1 : p1 ∧ p2 • Q2 : p1 ∨ p2</formula><p>For the queries Q1 and Q2, the rewritten queries contain 15 predicates, whereas the original queries contain 2 predicates. 
The 15 predicates are due to the 8 existing paths for Director Name in p1 plus the 7 paths for Gross in p2, which are combined in disjunctive form as described in the rewriting algorithm.</p><formula xml:id="formula_25">• Q3 : p1 ∧ p2 ∧ p5 ∧ p7 • Q4 : p1 ∨ p2 ∨ p5 ∨ p7</formula><p>The rewritten versions of Q3 and Q4 contain 29 predicates, whereas the original queries contain 4 predicates.</p><formula xml:id="formula_26">• Q5 : p1 ∧ p2 ∧ p5 ∧ p7 ∧ p6 ∧ p3 ∧ p4 ∧ p8 • Q6 : p1 ∨ p2 ∨ p5 ∨ p7 ∨ p6 ∨ p3 ∨ p4 ∨ p8</formula><p>Finally, the rewritten versions of Q5 and Q6 contain 57 predicates, whereas the original queries contain 8 predicates.</p><p>Table <ref type="table" target="#tab_4">3</ref> presents, for each dataset: i) the number of documents inside the collection, and ii) the number of expected results for each executed query.</p><p>We define three contexts in which we run the above-defined queries. For each context, we measure the average execution time after executing each query at least five times. The order of query execution is random.</p><formula xml:id="formula_27">Table 3 (header and first row only; truncated in the source): Collection size in GB | # of documents | Q1 | Q2 | Q3 | Q4 | Q5 | Q6; 10 GB | 12 M | 520 K | 8,</formula><p>In the following, we present the details of the three evaluation contexts:</p><p>• We call $Q_{Base}$ the query that refers to the initial user query (one of the above-defined queries); it is executed over the homogeneous versions of the datasets. The purpose of this first context is to study the native behavior of the document store. We use this first context as a baseline for our experimentation. • $Q_{Rewritten}$ refers to the query $Q_{Base}$ rewritten by our approach. It is executed over the heterogeneous versions of the datasets. • $Q_{Accumulated}$ refers to the set of equivalent queries formulated on each possible schema of the collection.</p><p>In our case, it is made of 10 separate queries since we are dealing with collections having ten schemas. 
These queries are built "by hand", as any user would have to do without an assisting tool. We do not consider the time necessary to merge the results of each query, as the goal is to compare the time needed to find the set of result documents. $Q_{Accumulated}$ is executed over the heterogeneous versions of the datasets.</p><p>Table <ref type="table" target="#tab_5">4</ref> summarizes our execution contexts.</p><p>5.2 Queries evaluation</p><p>5.2.1 Queries execution time. As shown in Figure <ref type="figure" target="#fig_1">4</ref>, our rewritten query, $Q_{Rewritten}$, outperforms the accumulated one, $Q_{Accumulated}$. The difference between the two execution scenarios comes from the ability of our rewritten query to automatically include all the navigational paths extracted from the collection. Hence, the rewritten query is executed only once, whereas the accumulated query may require several passes through the stored collection, which implies more CPU load and more intensive disk I/O. We now study the efficiency of the rewritten query when compared to the baseline query $Q_{Base}$. The overhead of our solution is up to a factor of two (e.g., for queries in disjunctive form) when compared to the native execution of the baseline query on the homogeneous dataset. Moreover, the overall overhead does not exceed a factor of 1.5 across the six queries. We believe that this overhead is acceptable since we avoid the cost of refactoring the underlying data structures. Unlike the baseline, our synthetic dataset contains different grouping objects with varying nesting levels. The rewritten query therefore includes several navigational paths, which are processed by the native query engine of MongoDB to find matches in each visited document of the collection.</p><p>5.2.2 Dictionary and query rewriting engine at scale. With this series of experiments, we try to push the dictionary and the query rewriting engine to their limits. 
To this end, we generated a heterogeneous synthetic collection of 1 GB. We use the primary 28 attributes from the IMDb flat film collection. The custom collections are generated such that each schema is composed of two grouping objects with no further nesting levels. We generated collections having 10, 100, 1k, 3k and 5k schemas. For this experiment, we keep using the query Q6 introduced earlier in this section. We present the time needed to build the rewritten query in Table <ref type="table" target="#tab_6">5</ref>. The time to build the rewritten query is low, less than two seconds. Also, it is possible to construct a dictionary over a heterogeneous collection of documents; here, our dictionary supports up to 5k distinct schemas. The resulting size of the materialized dictionary is encouraging since it does not require significant storage space. Furthermore, the short time needed to build the rewritten query is another advantage of our solution. In this series of experiments, each time we look for the distinct navigational paths of eight predicates. Each rewritten query is composed of numerous disjunctive forms for each predicate. We observe 80 disjunctive forms for the dataset with 10 schemas, 800 with 100 schemas, 8k with 1k schemas, 24k with 3k schemas and 40k with 5k schemas. We believe that the dictionary and the query rewriting engine scale well when dealing with heterogeneous collections of documents having a large number of schemas.</p></div>
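<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reported numbers of disjunctive forms follow a simple product, which the Python sketch below reproduces. It assumes that each of the eight queried attributes has exactly one navigational path per schema, an assumption that is consistent with the reported counts but is not stated explicitly; the names are hypothetical.</p><p>
```python
PREDICATES_IN_Q6 = 8
SCHEMA_COUNTS = [10, 100, 1000, 3000, 5000]


def rewritten_disjuncts(num_predicates, paths_per_attribute):
    """Number of elementary predicates once every attribute is expanded."""
    return num_predicates * paths_per_attribute


# Assumption: one path per attribute and per schema.
COUNTS = {n: rewritten_disjuncts(PREDICATES_IN_Q6, n) for n in SCHEMA_COUNTS}
```
</p><p>Under this assumption, the size of the rewritten query grows linearly with the number of schemas.</p></div>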
<div xmlns="http://www.tei-c.org/ns/1.0"><head># of schemas rewriting time</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Dictionary construction time</head><p>In this part, we study the dictionary construction process. EasyQ offers the possibility to build the dictionary over an existing dataset or during the data loading phase. The dictionary reflects the latest version of the data once all documents are inserted, so the query rewriting engine enriches the queries based on the up-to-date dictionary; if data loading is still in progress, the rewriting may not take the most recent changes into account. In the following, we study both configurations. First, we evaluate the time required to build the dictionary over five pre-loaded collections of 100 GB having 2, 4, 6, 8 and 10 schemas respectively.</p><p>We notice from the results in Table <ref type="table" target="#tab_7">6</ref> that the time elapsed to build the dictionary increases when we deal with collections having more heterogeneity. In the case of the collection with 10 structures, the time does not exceed 40% when we compare it to a collection with 2 structures. We can again notice in Table 6 the negligible size of the generated dictionaries when compared to the 100 GB of the collection.</p><p>Afterwards, we evaluate the overhead caused by generating the dictionary at loading time. We generate five collections of 1 GB having the same structures as in the last experiment (2, 4, 6, 8 and 10 schemas respectively). We present two measurements in Table <ref type="table" target="#tab_9">7</ref>. First, we measure the time to simply load each collection (without building the dictionary). Second, we measure the overall time to build the dictionary while loading the collection. In this experiment, we find that the overhead does not exceed 0.5 times the time required to only load the data. The evolution of the time when adding more heterogeneity is linear and not exponential, which is encouraging. Many factors may affect the query construction phase. 
The number of attributes and the nesting levels may increase or decrease the overhead. The advantage of our solution is that, once the data is loaded and the dictionary is built or updated, the rewritten query automatically takes all changes into account.</p></div>
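<div xmlns="http://www.tei-c.org/ns/1.0"><p>The overhead column of Table 7 can be recomputed from the two measured times. The Python sketch below uses the values of Table 7; truncating the ratio to a whole percentage is an assumption, chosen because it reproduces the reported figures, and the names are hypothetical.</p><p>
```python
# Number of schemas mapped to (load time in s, load and dictionary time in s).
TABLE_7 = {
    2: (201, 269),
    4: (205, 277),
    6: (207, 285),
    8: (208, 300),
    10: (210, 309),
}


def overhead_percent(load_s, load_and_dict_s):
    """Relative overhead of dictionary building, truncated to a whole percent."""
    return (load_and_dict_s - load_s) * 100 // load_s


OVERHEADS = {n: overhead_percent(*times) for n, times in TABLE_7.items()}
```
</p><p>The largest value, obtained for 10 schemas, stays below 50% of the plain loading time.</p></div>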
<div xmlns="http://www.tei-c.org/ns/1.0"><head>#of schemas</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>NoSQL databases are often called schemaless because they may contain variable schemas among stored data. Nowadays, this variability is becoming common in many applications, for example web applications, social media applications or the internet of things. Nevertheless, such structural heterogeneity makes it very hard for users to formulate queries that return relevant and coherent results.</p><p>In this paper, to deal with structural heterogeneity, we propose a novel approach for querying heterogeneous documents describing a given entity over NoSQL document stores. The developed tool is called EasyQ. Our objective is to allow users to perform their queries with minimal knowledge about data schemas. EasyQ is based on two pillars. The first one is a dictionary that contains all possible paths for any existing field. The second one is a rewriting module that modifies the user query to match all the field paths existing in the dictionary. Our approach is a syntactic manipulation of queries, so it is grounded on an important assumption: the collection describes homogeneous entities, i.e., a field has the same meaning in all document schemas. In case of ambiguity, the user should specify some sub-path in order to resolve it. If this assumption is not guaranteed, users may face irrelevant or incoherent results. Nevertheless, this assumption may be acceptable in many applications, such as legacy collections, web applications or internet of things data.</p><p>In our first experiments, the evaluation consists of comparing the execution time of basic MongoDB queries and of the rewritten queries proposed by our approach. We conduct a set of tests by changing two primary parameters: the size of the dataset and the structural heterogeneity inside a collection (number of different schemas). 
Results show that the cost of executing the rewritten queries proposed in this paper is higher than the cost of executing basic user queries, but always less than twice as high. The overhead added to the performance of our query is due to the combination of multiple access paths to a queried field. Nevertheless, this time overhead is negligible when compared to the execution of separate "by hand" queries for each schema, while heterogeneity issues are automatically managed.</p><p>These first results are encouraging for continuing this line of research, and they need to be strengthened. Short-term perspectives are to continue the evaluations and to identify the limitations regarding the number of paths and fields in the same query and regarding time cost. More experiments are still to be performed on larger datasets and on real-world datasets. Another perspective is to extend the current query capabilities to cover all the classical operators of query languages (contains, etc.). It is also necessary to deal with other querying operators, particularly the aggregation operator.</p><p>The first long-term perspective consists of studying the real-time building of the dictionary when integrating data, in order to take into account all possible operations: insert, but also delete and update. It is likely that the current simple structure of the dictionary will have to be transformed in depth to support more complex operations such as update and delete. The second long-term perspective consists of managing multi-store databases. The goal would be to extend the proposed approach to query data stored in different types of databases, independently of the various data schemas and stores. The final goal would be to query any data store "transparently", while being unaware of schemas or real field names.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Definition 4 . 1 (</head><label>41</label><figDesc>Collection). 
A collection $C$ is defined as a set of documents $C = \{d_1, \ldots, d_{|C|}\}$. Each document $d_i$ is considered as a (key, value) pair where the value takes the form $v_i = \{a_{i,1} : v_{i,1}, \ldots, a_{i,n} : v_{i,n}\}$. Definition 4.2 (Document). A document $d_i$, $\forall i \in [1, c]$, is defined as a (key, value) pair</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Query rewriting evaluations</figDesc><graphic coords="8,53.99,83.68,160.94,90.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>{movie_title}), (year, {year, details.year, versions.1.year, versions.2.year}),</figDesc><table><row><cell>(language, {language, details.language,</cell></row><row><cell>versions.1.language, versions.2.language}),</cell></row><row><cell>(details, {details}),</cell></row><row><cell>(details.year, {details.year}),</cell></row><row><cell>(details.language, {details.language}),</cell></row><row><cell>(versions, {versions}),</cell></row><row><cell>(versions.1, {versions.1}),</cell></row><row><cell>(versions.1.year, {versions.1.year}),</cell></row><row><cell>(versions.1.language, {versions.1.language}),</cell></row><row><cell>(versions.2, {versions.2}),</cell></row><row><cell>(versions.2.year, {versions.2.year}),</cell></row><row><cell>(versions.2.language, {versions.2.language})</cell></row><row><cell>}</cell></row><row><cell>For example, the entry</cell></row><row><cell>(year, {year, details.year, versions.1.year,</cell></row><row><cell>versions.2.year}) gives all navigational paths leading to</cell></row><row><cell>the attribute "year".</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 :</head><label>1</label><figDesc>Settings of the generated dataset</figDesc><table><row><cell>3}</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 :</head><label>3</label><figDesc>Number of extracted documents per query</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4 :</head><label>4</label><figDesc>Evaluations context 5.2 Queries evaluation 5.2.1 Queries execution time.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5 :</head><label>5</label><figDesc>Scale effects on query rewriting and dictionary size</figDesc><table><row><cell>Dictionary size</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6</head><label>6</label><figDesc>the negligible size of the generated dictionaries when compared to the 100 GB of the collection.</figDesc><table><row><cell># of schema</cell><cell>2</cell><cell>4</cell><cell>6</cell><cell>8</cell><cell>10</cell></row><row><cell>Required time (minutes)</cell><cell>96</cell><cell>108</cell><cell>127</cell><cell>143</cell><cell>156</cell></row><row><cell>Size of the resulting</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>dictionary (KB)</cell><cell cols="5">4,154 9,458 13,587 17,478 22,997</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 6 :</head><label>6</label><figDesc>Time to build the dictionary of pre-loaded data</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 7 :</head><label>7</label><figDesc>Study of the overhead added during load time</figDesc><table><row><cell></cell><cell cols="3">Load (s) Load and dict. (s) Overhead</cell></row><row><cell>2</cell><cell>201s</cell><cell>269s</cell><cell>33%</cell></row><row><cell>4</cell><cell>205s</cell><cell>277s</cell><cell>35%</cell></row><row><cell>6</cell><cell>207s</cell><cell>285s</cell><cell>37%</cell></row><row><cell>8</cell><cell>208s</cell><cell>300s</cell><cell>44%</cell></row><row><cell>10</cell><cell>210s</cell><cell>309s</cell><cell>47%</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Chris</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jan</forename><surname>Lehnardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noah</forename><surname>Slater</surname></persName>
		</author>
		<title level="m">CouchDB: The Definitive Guide: Time to Relax</title>
				<imprint>
			<publisher>O&apos;Reilly Media, Inc</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Schema inference for massive json datasets</title>
		<author>
			<persName><forename type="first">Mohamed-Amine</forename><surname>Baazizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Houssem</forename><surname>Ben Lahmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Colazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Ghelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Sartiani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Extending Database Technology (EDBT)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">JSON: data model, query languages and schema specification</title>
		<author>
			<persName><forename type="first">Pierre</forename><surname>Bourhis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Juan</forename><forename type="middle">L</forename><surname>Reutter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><surname>Suárez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Domagoj</forename><surname>Vrgoč</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems</title>
				<meeting>the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="123" to="135" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Implementation of multidimensional databases in column-oriented NoSQL systems</title>
		<author>
			<persName><forename type="first">Max</forename><surname>Chevalier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohammed</forename><forename type="middle">El</forename><surname>Malki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arlind</forename><surname>Kopliku</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olivier</forename><surname>Teste</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ronan</forename><surname>Tournier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">East European Conference on Advances in Databases and Information Systems</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="79" to="91" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Chodorow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Dirolf</surname></persName>
		</author>
		<title level="m">MongoDB: The Definitive Guide</title>
				<imprint>
			<publisher>O&apos;Reilly Media</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Enabling Self-Service BI on Document Stores</title>
		<author>
			<persName><forename type="first">Mohamed</forename><surname>Lamine Chouder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefano</forename><surname>Rizzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rachid</forename><surname>Chalal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT/ICDT Workshops</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Persisting big-data: The NoSQL landscape</title>
		<author>
			<persName><forename type="first">Alejandro</forename><surname>Corbellini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cristian</forename><surname>Mateos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alejandro</forename><surname>Zunino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniela</forename><surname>Godoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Silvia</forename><surname>Schiaffino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Systems</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="page" from="1" to="23" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Can the elephants handle the NoSQL onslaught?</title>
		<author>
			<persName><forename type="first">Avrilia</forename><surname>Floratou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikhil</forename><surname>Teletia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">J</forename><surname>DeWitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jignesh</forename><forename type="middle">M</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Donghui</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="1712" to="1723" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">JSONiq: The history of a query language</title>
		<author>
			<persName><forename type="first">Daniela</forename><surname>Florescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ghislain</forename><surname>Fourny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE internet computing</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="86" to="90" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">XRANK: Ranked keyword search over XML documents</title>
		<author>
			<persName><forename type="first">Lin</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Feng</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chavdar</forename><surname>Botev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jayavel</forename><surname>Shanmugasundaram</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2003 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2003 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="16" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Constance: An intelligent data lake system</title>
		<author>
			<persName><forename type="first">Rihan</forename><surname>Hai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sandra</forename><surname>Geisler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christoph</forename><surname>Quix</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 International Conference on Management of Data</title>
				<meeting>the 2016 International Conference on Management of Data</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2097" to="2100" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Apache drill: interactive ad-hoc analysis at scale</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Hausenblas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jacques</forename><surname>Nadeau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Big Data</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="100" to="104" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">NoSQL evaluation: A use case oriented survey</title>
		<author>
			<persName><forename type="first">Robin</forename><surname>Hecht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Jablonski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Cloud and Service Computing (CSC), 2011 International Conference on</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="336" to="341" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">NOSQL design for analytical workloads: variability matters</title>
		<author>
			<persName><forename type="first">Victor</forename><surname>Herrero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alberto</forename><surname>Abelló</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oscar</forename><surname>Romero</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conceptual Modeling: 35th International Conference, ER 2016</title>
				<meeting><address><addrLine>Gifu, Japan</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016-11-14">November 14-17, 2016</date>
			<biblScope unit="page" from="50" to="64" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">J-Logic: Logical Foundations for JSON Querying</title>
		<author>
			<persName><forename type="first">Jan</forename><surname>Hidders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jan</forename><surname>Paredaens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jan</forename><surname>Van Den Bussche</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems</title>
				<meeting>the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="137" to="149" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A review on XML keyword query processing</title>
		<author>
			<persName><forename type="first">Prashant</forename><forename type="middle">R</forename><surname>Lambole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prashant</forename><forename type="middle">N</forename><surname>Chatur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Innovative Mechanisms for Industry Applications (ICIMIA), 2017 International Conference on</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="238" to="241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m">International Conference on. IEEE</title>
				<imprint>
			<biblScope unit="page" from="238" to="241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Schema-free xquery</title>
		<author>
			<persName><forename type="first">Yunyao</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cong</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">V</forename><surname>Jagadish</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirtieth international conference on Very large data bases-Volume 30</title>
				<meeting>the Thirtieth international conference on Very large data bases-Volume 30</meeting>
		<imprint>
			<publisher>VLDB Endowment</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="72" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Tenzing: A SQL implementation on the MapReduce framework</title>
		<author>
			<persName><forename type="first">Liang</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vera</forename><surname>Lychagina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Weiran</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Younghee</forename><surname>Kwon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sagar</forename><surname>Mittal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Wong</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">The SQL++ query language: Configurable, unifying and semi-structured</title>
		<author>
			<persName><forename type="first">Kian</forename><forename type="middle">Win</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yannis</forename><surname>Papakonstantinou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Romain</forename><surname>Vernoux</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1405.3631</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A survey of schema-based matching approaches</title>
		<author>
			<persName><forename type="first">Pavel</forename><surname>Shvaiko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jérôme</forename><surname>Euzenat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal on data semantics IV</title>
		<imprint>
			<biblScope unit="page" from="146" to="171" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">New opportunities for New SQL</title>
		<author>
			<persName><forename type="first">M</forename><surname>Stonebraker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="10" to="11" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Sinew: a SQL system for multi-structured data</title>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Tahara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thaddeus</forename><surname>Diamond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">J</forename><surname>Abadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2014 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="815" to="826" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Schema management for document stores</title>
		<author>
			<persName><forename type="first">Lanjun</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shuo</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Juwei</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Limei</forename><surname>Jiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oktie</forename><surname>Hassanzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jia</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chen</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="922" to="933" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Fast ELCA computation for keyword queries on XML data</title>
		<author>
			<persName><forename type="first">Rui</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chengfei</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianxin</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 13th International Conference on Extending Database Technology</title>
				<meeting>the 13th International Conference on Extending Database Technology</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="549" to="560" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
