<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Variety-Aware OLAP of Document-Oriented Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Enrico Gallinucci</string-name>
          <email>enrico.gallinucci2@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Golfarelli</string-name>
          <email>matteo.golfarelli@unibo.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Rizzi</string-name>
          <email>stefano.rizzi@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISI-Univ. of Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DISI-Univ. of Bologna</institution>
          ,
          <addr-line>Cesena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>DISI-Univ. of Bologna</institution>
          ,
          <addr-line>Cesena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Schemaless databases, and document-oriented databases in particular, are preferred to relational ones for storing heterogeneous data with variable schemas and structural forms. However, the absence of a unique schema adds complexity to analytical applications, in which a single analysis often involves large sets of data with different schemas. In this paper we propose an original approach to OLAP on collections stored in document-oriented databases. The basic idea is to stop fighting against schema variety and welcome it as an inherent source of information wealth in schemaless sources. Our approach builds on four stages: schema extraction, schema integration, FD enrichment, and querying; these stages are discussed in detail in the paper. To make users aware of the impact of schema variety, we propose a set of indicators related, for instance, to query completeness and precision.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Recent years have witnessed an erosion of the relational
DBMS predominance to the benefit of DBMSs based on
alternative representation models (e.g., document-oriented
and graph-based) which adopt a schemaless representation
for data. Schemaless databases are preferred to relational
ones for storing heterogeneous data with variable schemas
and structural forms; typical schema variants within a
collection consist of missing or additional attributes, of
different names or types for an attribute, and of
different structures for instances [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The absence of a unique
schema grants flexibility to operational applications but
adds complexity to analytical applications, in which a
single analysis often involves large sets of data with different
schemas. Dealing with this complexity while adopting a
classical data warehouse design approach would require
a notable effort to understand the rules that drove the
use of alternative schemas, plus an integration activity to
identify a common schema to be adopted for analysis —
which is quite hard when no documentation is available.
Furthermore, since new schema variants often appear,
a continuous evolution of both the ETL process and the cube
schemas would be needed.
      </p>
      <p>In this paper we propose an original approach to
multidimensional querying and OLAP on schemaless sources,
in particular on collections stored in document-oriented
databases (DODs) such as MongoDB. The basic idea is to
stop fighting against data heterogeneity and schema variety,
and welcome them as an inherent source of information wealth
in schemaless sources. So, instead of trying to hide this
variety, we show it to users (basically, data scientists and</p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH OVERVIEW</title>
      <p>Figure 1 gives an overview of the approach: in blue, the
different stages of the approach; on the right, the
metadata produced/consumed by each stage (local schemas,
global schema and mappings, dependency graph). Remarkably, all
schema-related concepts are stored as metadata, so no
transformation has to be done on source data. User
interaction is required at most stages. Although the picture
suggests a sequential execution of the stages, it simply
outlines the ordering for the first iteration. In the scenario that
we envision, the user starts by analyzing the first results
provided by the system, then iteratively injects additional
knowledge into the different stages to refine the metadata
and improve the querying effectiveness. We now provide a
short description of each stage; a deeper discussion will be
provided in the following sections.</p>
      <p>Schema extraction (Section 3). The goal of this stage
is to identify the set of distinct local schemas that occur
inside a collection of documents. To this end we provide
a tree-like definition for schemas which models arrays by
considering the union of the schemas of their elements.
This is a completely automatic stage which requires no
interaction with the user.</p>
      <p>Schema integration (Section 4). At this stage we rely
on inter-schema mappings and schema integration
techniques to determine a (tree-like) global schema that gives
the user a single and comprehensive description of the
contents of the collection. In principle, this stage could be
completely automated. In practice, the best results can be
obtained through a semi-automatic approach, that allows
users to manually validate/refine the mappings proposed
by the system. As of now, we rely on the user to
manually provide inter-schema mappings, from which the global
schema is derived.</p>
      <p>FD enrichment (Section 5). Traditional OLAP
analyses are carried out on multidimensional cubes. To enable
the OLAP experience in our setting, a multidimensional
representation of the collection must be derived from the
global schema. In particular, we introduce the notion of
dependency graph, i.e., a graph that provides a
multidimensional view of the global schema in terms of the functional
dependencies (FDs) between its attributes. Some FDs can
be inferred from the structure of the schema, others by
analyzing data; given the expected schema variety, we
specifically look for approximate FDs.</p>
      <p>
        Querying (Section 6). The last stage consists in
delivering the OLAP experience to the user by enabling the
formulation of multidimensional queries on the dependency
graph and their execution on the collection. First of all,
each formulated query is validated against the requirements
of well-formedness proposed in the literature [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Then,
the query is translated to the query language of the DOD
and reformulated into multiple queries, one for each
local schema in the collection; the results presented to the
user are obtained by merging the results of the single local
queries. To make the user aware of the impact of schema
variety in terms of quality and reliability of the results, we
show her a set of indicators related to query completeness
and precision.
      </p>
      <p>The motivating example that we use across the paper
is based on a real-world collection of workout sessions
obtained from a worldwide company selling fitness
equipment. Figure 2 shows a sample document in the collection:</p>
      <p>[ { "_id" : ObjectId("54a4332f44cfc02424f961d4"),
    "User" : { "FullName" : "John Smith", "Age" : 42 },
    "StartedOn" : ISODate("2017-06-15T10:20:44.000Z"),
    "Facility" : { "Name" : "PureGym Piccadilly", "Chain" : "PureGym" },
    "SessionType" : "RunningProgram",
    "DurationMins" : 90,
    "Exercises" :
      [ { "Type" : "Leg press",
          "ExCalories" : 28,
          "Sets" : [ { "Reps" : 14, "Weight" : 60 }, . . . ] },
        { "Type" : "Tapis roulant" },
        . . . ] },
  . . . ]</p>
      <p>The document is organized according to three nesting levels:
(1) The first level contains information about the user,
the facility in which the session took place, the date,
and the total duration in minutes.
(2) The Exercises array contains an object for every
exercise carried out during the session, with information
on the type of exercise and the total calories.
(3) The Sets array contains an object for every set that
the exercise was split into. For example, the "leg
press" exercise has been done in multiple sets, the
first of which comprises 14 repetitions with a weight
of 60 kilograms, for a total of 28 calories.</p>
    </sec>
    <sec id="sec-3">
      <title>SCHEMA EXTRACTION</title>
      <p>The goal of this stage is to introduce a notion of (local)
schema for a document, to be used in the integration stage
to determine a (global) schema for a collection and then,
in the FD enrichment stage, to derive an OLAP-compliant
representation of the collection itself.</p>
      <p>The notion of a document is the central concept of a
DOD, and it encapsulates and encodes its data in some
standard format. The most widely adopted format is
currently JSON, which we will use as a reference in this work.</p>
      <p>Definition 3.1 (Document and Collection). A document
d is a JSON object. An object is formed by a set of key/value
pairs (aka fields); a key is a string, while a value can be either
a primitive value (i.e., a number, a string, or a Boolean),
an array of values, an object, or null. A collection C is an
array of documents.</p>
      <p>Example 3.2. Figure 2 shows a document excerpted from
the WorkoutSession collection; it contains numbers (e.g.,
Age), strings (e.g., Chain), objects (e.g., User), and arrays
(e.g., Exercises). Conceptually, a session is done by a user
at a facility; it includes a list of exercises, each possibly
comprising several sets. □
Since there is no explicit representation of schemas in
documents, multiple definitions of schema are possible for
collections and documents, with different
levels of conciseness and precision. The main difference in
these definitions lies in how they cope with inter-document
variety and intra-document variety.</p>
      <p>
        ∙ Inter-document variety impacts on the definition
of the schema for a collection, as it concerns the
presence of documents with different fields. This
issue is usually dealt with in one of two ways:
either by defining the schema of the collection as the
union/intersection [
        <xref ref-type="bibr" rid="ref1 ref24">1, 24</xref>
        ] of the most frequent fields,
or by keeping track of every different schema [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
Our work mixes the above mentioned approaches in
that it builds a global schema starting from local
schemas.
∙ Intra-document variety impacts on the definition of
the schema for a document, and is mainly related to
the presence in a document of a heterogeneous array.
For instance, an array of objects can mix objects
with different fields (e.g., the first object of the
Exercises array in Figure 2 contains fields that are
missing from the second one). In this work we adopt
a simple representation that, like in [
        <xref ref-type="bibr" rid="ref1 ref15">1, 15</xref>
        ], considers
the union of the values contained in the array.
      </p>
      <p>We start by giving a “structural” definition of a schema as
a tree, then we reuse it to define the schema of a document
and, in Section 4, the schema of a collection.</p>
      <p>Definition 3.3 (Schema). A schema is a directed tree
S = (F, A) where F is a set of fields and A is a set of
arcs representing the relationships between arrays and the
contained fields. In particular,
(1) F = F_arr ∪ F_prim, where F_arr is a set of array fields
(including the root r of S) and F_prim is a set of primitive
fields;
(2) A includes arcs from fields in F_arr to fields in F_arr ∪
F_prim.</p>
      <p>Each field f ∈ F has a name, name(f), a unique pathname
(obtained by concatenating the names of the fields along
the path from r to f, with the exclusion of r), and a
type, type(f) (type(f) ∈ {number, string, Boolean} for all
f ∈ F_prim, type(f) = array for all f ∈ F_arr). Given a field
f ≠ r, we denote with arr(f) the array a ∈ F_arr such that
(a, f) ∈ A.</p>
      <p>To define the schema of a specific document we need
to add identifiers to arrays. We denote with id(a) the
primitive field that identifies an object within array a.
Documents always contain an identifier, id(r) = _id.
Conversely, array objects may not contain such a field, but they
can still be univocally identified by their positional index
within the array. Therefore, given array a, id(a) can be
recursively defined as the concatenation of id(arr(a)) and
the positional index within a; it is name(id(a)) = _id and
type(id(a)) = string.</p>
      <p>Definition 3.4 (Schema of a Document). Given a
document d ∈ C, the schema of d is the schema schema(d) =
(F_arr ∪ F_prim, A) such that
(1) F_arr includes a field for each array in d, labelled
with the corresponding key and type, plus a root r
labelled with the name of C and with type array;
(2) F_prim includes (i) a field for each primitive in d,
and (ii) a field for each id(a) with a ∈ F_arr, a ≠ r;
every field is labelled with its corresponding key and
type (keys of primitives within an object field are
"flattened", i.e., prefixed with the object's key);
(3) A includes (i) an arc (r, f) for each field f such that
name(f) appears as a key in the root level of d, and
(ii) an arc (a, f) if name(f) appears as a key in an
object of array a.</p>
      <p>[Figure 3: (a) the local schema of the document in Figure 2; (b) the global schema of the WS collection, with root-level fields _id, User.FullName, User.FirstName, User.LastName, User.Age, StartedOn, Facility.Name, Facility.Chain, Facility.City, SessionType, DurationMins and nested arrays Exercises (Exercises._id, Type, ExCalories) and Sets (Sets._id, Reps, Weight, SetCalories); (c) another local schema, with root-level fields _id, FirstName, LastName, Date, Gym.Name, Gym.City, SessionType, DurationSecs and a nested array Series (Series._id, ExType, Reps, Weight, SeriesCalories).]</p>
      <p>Example 3.5. Figure 3.a shows the schema of the
document represented in Figure 2, part of the WorkoutSession
collection (from now on abbreviated as WS). Each array is
represented as a box, with its child primitives listed below
(numeric primitives are in italics). Object fields are
prefixed with the object key (e.g., Facility.Chain). The vertical
lines between boxes represent inter-array arcs, with the
root WS on top. It is arr(Exercises.Type) = Exercises and
id(Exercises) = Exercises._id. □</p>
      <p>Given a collection C, we denote with schemas(C) the set of
distinct schemas of the documents in C (where two fields
in the schemas of two documents are considered equal if
they have the same pathname and the same type):
schemas(C) = ⋃_{d ∈ C} {schema(d)}
Given s ∈ schemas(C), we denote with C_s the set of documents
d ∈ C such that schema(d) = s.</p>
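      <p>To make the extraction concrete, the following is a minimal sketch in JavaScript (assuming plain JSON input; it is simplified with respect to Definition 3.4, as it omits the root and the array identifiers, and the function names are ours, not those of the prototype) of how a local schema can be derived as a set of pathname/type pairs, with arrays modeled as the union of the schemas of their elements.</p>
      <p>// Minimal sketch: derive a document's local schema as a set of
// "pathname:type" strings (plain JSON input assumed).
function extractSchema(value, path, fields) {
  if (Array.isArray(value)) {
    fields.add(path + ":array");
    for (const elem of value) extractSchema(elem, path, fields);   // union of element schemas
  } else if (value !== nullary(value)) {
  } else if (value !== null) {
    fields.add(path + ":" + typeof value);                         // number, string, boolean
  }
}
// Distinct local schemas of a collection of documents
function localSchemas(docs) {
  const distinct = new Set();
  for (const d of docs) {
    const fields = new Set();
    extractSchema(d, "", fields);
    distinct.add([...fields].sort().join("|"));
  }
  return [...distinct];
}</p>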
    </sec>
    <sec id="sec-4">
      <title>SCHEMA INTEGRATION</title>
      <p>
        The goal of this stage is to integrate the distinct local
schemas extracted from C to obtain a single and
comprehensive view of the collection, i.e., a global schema, and
its mappings with each local schema. The global schema
can be incrementally built using one of the methodologies
discussed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; for instance, adopting a ladder
integration strategy, by (i) taking one local schema as the global
schema; (ii) iteratively taking each other local schema,
finding its mappings onto the global schema, and updating
the global schema accordingly. However, notice that some
mappings may be missed by adopting a purely incremental
strategy (i.e., a second iteration on the local schemas may
be required). A survey of the techniques that can be used
for finding mappings is provided in [
        <xref ref-type="bibr" rid="ref18 ref3">3, 18</xref>
        ].
      </p>
      <p>A mapping is defined as follows:</p>
      <p>Definition 4.1 (Mapping). Given two schemas s and s′,
a mapping from s to s′ can be either
∙ an array mapping with form ⟨a, a′⟩, where a is an
array field of s and a′ is an array field of s′;
∙ a primitive mapping with form ⟨P, P′, t⟩, where P is a
set of primitive fields of s, P′ is a set of primitive fields
of s′, and t is a transcoding function,
t : dom(P) → dom(P′).</p>
      <p>The definition of the global schema for a collection is
based on the inter-schema mappings determined.</p>
      <p>Definition 4.2 (Global Schema). Given a collection C and
the corresponding set of schemas schemas(C) = {s_1, . . . , s_n}, the
global schema of C is a schema G(C) = (F, A) where
(1) for every s ∈ schemas(C) there is a mapping ⟨r_s, r⟩ between
the roots of s and G(C);
(2) every field f in each s is involved in at least one
mapping onto the fields of G(C);
(3) every field f in G(C) is involved in at least one
mapping with some s.</p>
      <p>Example 4.3. Figure 3 shows two sample schemas from
the WS collection (a and c) and the corresponding global
schema G(C) (b); mappings are represented with dotted
lines. An example of array mapping from local schema (c)
to global schema (b) is
⟨Series, Exercises.Sets⟩
Examples of primitive mappings are
⟨{Date}, {StartedOn}, t1⟩
⟨{FirstName, LastName}, {User.FullName}, t2⟩
⟨{Series.ExType}, {Exercises.Type}, t1⟩
where t1 is the identity function while t2 is a function that
concatenates two strings. □</p>
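      <p>For illustration, the two transcoding functions of Example 4.3 could be sketched in JavaScript as follows (t1 and t2 as named above; the signatures are our assumption).</p>
      <p>// t1: identity transcoding, e.g., Date to StartedOn, Series.ExType to Exercises.Type
const t1 = (v) => v;
// t2: concatenates two strings, e.g., {FirstName, LastName} to User.FullName
const t2 = (firstName, lastName) => firstName + " " + lastName;
// Usage: t2("John", "Smith") returns "John Smith", the value expected for User.FullName</p>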
      <p>A transcoding function transforms values of a set of
fields into values of another set of fields; it is needed for
each primitive mapping to enable query reformulation in
presence of selection predicates as well as to enable the
results obtained from all documents to be integrated (see
Appendix). On the other hand, array mappings are not
associated to a transcoding function because arrays are
just containers and do not have values themselves.</p>
      <p>Due to the already mentioned inter-document variety, a
field f of the global schema may not be available in every
local schema (e.g., Facility.Chain is absent in the second
schema in Figure 3); therefore we need a measure of the
support of f with respect to the different schemas in collection
C. Intuitively, given the nested structure of documents,
the support of f could be defined as the percentage of
times that f occurs among the objects of arr(f). However,
since f may occur at different depths in
different documents (e.g., if f = Exercises.ExCalories in the
global schema, arr(f) is Exercises in the schema of Figure
3.a and Exercises.Sets in the schema of Figure 3.c), this
measure must be computed locally to each schema and
then aggregated to get a global measure. Thus, we define
the global support of f as the weighted average of the local
supports calculated on the distinct schemas.</p>
      <p>Definition 4.4 (Local Support of a Field). Given a
document schema s = (F, A), the local support of a field f ∈ F
is recursively defined as:
lsup(f, s) = 1, if f ≡ r;
lsup(f, s) = perc(f) · lsup(arr(f), s), otherwise;
where perc(f) is the percentage of objects of arr(f) which
include f.</p>
      <p>Note that the support of f is weighted on the support
of its array arr(f); this is because, for instance, f may
occur in every object of arr(f) but arr(f) may be missing
for some object of arr(arr(f)). As a result, it is always
lsup(f, s) ≤ lsup(arr(f), s).</p>
      <p>Definition 4.5 (Global Support of a Field). Given a
collection C and the set of distinct schemas schemas(C), the
global support of a field f ∈ F is:
gsup(f) = ∑_{s ∈ schemas(C)} lsup(f, s) · |C_s| / |C|
where |C_s| is the number of documents with schema s and
|C| is the overall number of documents.</p>
      <p>Example 4.6. In our working example, let the collection
have 100 documents (i.e., |C| = 100) evenly distributed
between s_1 and s_2 (i.e., |C_{s_1}| = |C_{s_2}| = 50). Let f =
Facility.City occur 40 times in C_{s_1} and 20 times in C_{s_2}; then,
lsup(f, s_1) = 40/50 · 1 = 0.8, lsup(f, s_2) = 20/50 · 1 = 0.4,
and gsup(f) = 0.8 · 0.5 + 0.4 · 0.5 = 0.6. □</p>
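      <p>As a quick sketch (JavaScript; the input structure is ours), the global support of Definition 4.5 can be computed from the per-schema local supports and document counts of Example 4.6.</p>
      <p>// Sketch: global support as the weighted average of local supports (Definition 4.5).
// One entry per local schema, with its document count and the local support
// of the field under consideration (values from Example 4.6).
const schemas = [
  { docs: 50, lsup: 40 / 50 },   // s1: Facility.City occurs in 40 of 50 documents
  { docs: 50, lsup: 20 / 50 }    // s2: Facility.City occurs in 20 of 50 documents
];
const total = schemas.reduce((n, s) => n + s.docs, 0);
const gsup = schemas.reduce((g, s) => g + s.lsup * (s.docs / total), 0);
console.log(gsup);               // 0.6, as in Example 4.6</p>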
    </sec>
    <sec id="sec-5">
      <title>FD ENRICHMENT</title>
      <p>The goal of this stage is to propose a multidimensional view
of the global schema to enable OLAP analyses. The main
informative gap to be filled to this end is the identification
of hierarchies, which in turn relies on the identification of
FDs between fields in the global schema.</p>
      <p>While in relational databases FDs are represented at the
schema level by means of primary and referential integrity
constraints, the same is not true in DODs. Yet, identifiers
are present in DODs: each collection has its (explicit) _id
field and, as discussed in Section 3, every nested object
has its own (implicit) identifier (i.e., id(a) with a ∈ F_arr).
The presence of these identifiers implies the existence of
some FDs, that we call intensional as they can be derived
from the global schema, without looking at the data. In
particular, given global schema G(C) = (F_arr ∪ F_prim, A)
and an array a ∈ F_arr, we can infer that:
∙ id(a) → f for every f ∈ F_prim such that arr(f) = a,
i.e., the identifier of a determines the value of every
primitive in a (e.g., _id → SessionType);
∙ if a ≠ r, then id(a) → id(arr(a)), i.e., the
identifier of a determines the identifier of arr(a) (e.g.,
Exercises._id → _id); this is trivial, since id(arr(a)) is
part of id(a).</p>
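      <p>A minimal sketch (JavaScript; the field representation with path, type and parent array is our own, not the prototype's) of how the intensional FDs can be enumerated from a schema follows.</p>
      <p>// Sketch: derive intensional FDs from a schema whose fields are given as
// { path, type, arr }, where arr is the pathname of the containing array
// (null for root-level fields) and the identifier of an array a is written a._id.
function intensionalFDs(fields) {
  const idOf = (arr) => (arr === null ? "_id" : arr + "._id");
  const fds = [];
  for (const f of fields) {
    if (f.type === "array") {
      fds.push([idOf(f.path), idOf(f.arr)]);   // id(a) determines id(arr(a))
    } else {
      fds.push([idOf(f.arr), f.path]);         // id(a) determines every primitive f in a
    }
  }
  return fds;
}
// For { path: "Exercises.Type", type: "string", arr: "Exercises" } this yields
// ["Exercises._id", "Exercises.Type"], i.e., Exercises._id determines Exercises.Type.</p>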
      <p>
        In practice, additional FDs can exist between primitive
nodes, though they cannot be inferred from the schema;
so, they can only be found by checking the data. More
precisely, since DODs may contain incomplete and faulty
data, we have to look for approximate FDs (AFDs), i.e.,
FDs that “mostly” hold on data —like done for instance
in [
        <xref ref-type="bibr" rid="ref10 ref23 ref6">6, 10, 23</xref>
        ].
      </p>
      <p>
        Definition 5.1 (Approximate Functional Dependency).
Given two fields f and f′, let acc(f, f′) ∈ [0..1] denote the
ratio between the number of unique values of f and the
number of unique values of (f, f′). We will say that AFD
f ⇝ f′ holds if acc(f, f′) ≥ ε, where ε is a user-defined
threshold [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
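      <p>In a MongoDB setting, acc(f, f′) can be estimated with two count-distinct aggregations; a minimal sketch (shell JavaScript, with field names from the working example) follows.</p>
      <p>// Sketch: estimate acc(f, f') as (distinct values of f) / (distinct pairs (f, f')),
// here for f = Facility.Name and f' = Facility.Chain on the WS collection.
function countDistinct(coll, groupKey) {
  return coll.aggregate([
    { $group: { _id: groupKey } },
    { $count: "n" }
  ]).toArray()[0].n;
}
const uniqueF  = countDistinct(db.WS, { f: "$Facility.Name" });
const uniqueFF = countDistinct(db.WS, { f: "$Facility.Name", g: "$Facility.Chain" });
const acc = uniqueF / uniqueFF;
// The AFD Facility.Name ⇝ Facility.Chain is accepted if acc exceeds the chosen threshold.</p>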
      <p>
        To detect AFDs and create hierarchies accordingly, some
approaches that were recently devised in the literature
(e.g., [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) can be reused, possibly coupled with
traditional approaches to multidimensional modeling based
on FDs (e.g., [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]). Interestingly, in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] the number of checks
to be made for AFD detection is effectively reduced thanks
to the intensional FDs provided by the global schema.
Note that, differently from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in our approach we consider
inter-document variety, so the queries that check for AFDs
must be reformulated from the global schema on each local
schema. How this can be done is discussed in the Appendix.
      </p>
      <p>Definition 5.2 (Dependency Graph). Given the global
schema G(C) = (F, A) and an (acyclic) set of (A)FDs Γ, the
dependency graph is a couple ℳ = (F_prim, ⪰) where F_prim
is the set of primitive nodes in F and ⪰ is a roll-up partial
order over F_prim derived from Γ. In particular, f ⪰ f′ (i.e.,
f is a predecessor of f′ in ⪰) if either f ⇝ f′ ∈ Γ or
f → f′ ∈ Γ.</p>
      <p>The differences between a dependency graph and the
global schema it is derived from are that
(1) the global schema is a tree, while the dependency graph is
a DAG;
(2) arrays are not present in the dependency graph, but
their id's are;
(3) arcs express (A)FDs in the dependency graph, and
syntactical containment in the global schema;
(4) differently from the global schema, the dependency
graph can include arcs between primitive fields.</p>
      <p>Example 5.3. Figure 4 shows the dependency graph
for our working example. Each primitive field is
represented as a circle whose color is representative of the field's
global support (the lighter the tone, the lower the
support). Identifiers (e.g., _id) are shown in bold. Directed
arrows are representative of the (A)FDs in Γ; for instance,
it is _id → Facility.Name (FDs are shown in black) and
Facility.Name ⇝ Facility.Chain (AFDs are shown in gray).
Note that, in this case, the dependency graph is a tree,
because in the global schema of Figure 3.b arrays are nested
within each other. A different situation is the one shown
in Figure 5, where the collection includes documents with
two arrays at the same level, so the dependency graph is
not a tree. □</p>
      <p>[Figure 4: dependency graph for the working example. Nodes are the primitive fields of the global schema (identifiers such as _id in bold); black arcs are FDs (acc = 1), gray arcs are AFDs (acc &lt; 1); node shading distinguishes levels with supp = 1 from levels with supp &lt; 1.]</p>
    </sec>
    <sec id="sec-5b">
      <title>QUERYING</title>
      <p>Definition 6.1 (OLAP query). Given dependency graph
ℳ = (F_prim, ⪰), an OLAP query on ℳ is a quadruple q =
⟨G, P, m, op⟩ where:
∙ G is the query group-by set, i.e., a non-empty set of
fields in F_prim such that for all couples f, f′ in G
it is f ̸⪰ f′;
∙ P is an (optional) selection predicate; it is a
conjunction of Boolean predicates, each involving a field in
F_prim;
∙ m ∈ F_prim is the query measure, i.e., the numerical
field to be aggregated;
∙ op is the operator to be used for aggregation (e.g.,
avg, sum);
∙ there exists in ℳ one single field f such that f ⪰ f′
for all other fields f′ mentioned in q (either in G, P, or
m).</p>
      <p>Algorithm 1 Validity check of an OLAP query
Input: a dependency graph ℳ = (F_prim, ⪰), an OLAP query q = ⟨G, P, m, op⟩
Output: a validity status
1: warn ← false
2: for each g ∈ G do
3:   if id(arr(m)) ̸⪰ id(arr(g)) then
4:     warn ← true ◁ Disjointness failed
5:   if gsup(g) &lt; 1 then
6:     warn ← true ◁ Completeness failed
7: if warn then
8:   return "warning"
9: else
10:   return "valid"</p>
      <p>We will refer to all the fields in G and P as the query
levels. Field f is called the fact of q (denoted fact(q)) and
corresponds to the coarsest granularity of ℳ on which
q can be formulated. An example of a case in which a
fact cannot be determined is the one in Figure 5, with
G = {Classes.Name, Exercises.Type}.</p>
      <p>Example 6.2. The following query, q1, measures the
average amount of weight lifted by elderly athletes per city
and type of exercise:
q1 = ⟨ {Facility.City, Exercises.Type},
User.Age ≥ 60, Exercises.Sets.Weight, avg ⟩
It is fact(q1) = Exercises.Sets._id. □</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] the authors outline the constraints that must hold
for an OLAP query to be considered well-formed, namely,
the base integrity constraint (stating that the levels in the
group-by set must be functionally independent of each
other) and the summarization integrity constraint [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
which in turn requires disjointness (the measure instances
to be aggregated are partitioned by the group-by instances),
completeness (the union of these partitions constitutes the
entire set), and compatibility (the aggregation operator
chosen for each measure is compatible with the type of
that measure). Remarkably, Definition 6.1 already ensures
that queries meet the base integrity constraint (because the
query group-by set cannot include fields related by (A)FDs).
As to the summarization integrity constraint, since the goal
of our approach is to enable an immediate querying of data
with no cleaning beforehand, we adopt a “soft” approach
to avoid being too restrictive. So, after each query has been
formulated by the user, it undergoes a check (sketched in
Algorithm 1) that can possibly return some warnings to
inform the user of potentially incorrect results. Specifically,
the disjointness constraint ensures that the granularity of
the measure is not coarser than the one of the group-by set
levels (line 3); if this is false, the same instance of m will
be double counted for multiple instances of the group-by
set [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The completeness constraint ensures that the
levels in the group-by set have full global support (line 5);
this constraint is easily violated, as it clashes with the
schemaless property of DODs. Finally, the compatibility
constraint is not considered at all, since its verification
would require properly categorizing measures (i.e., flow,
stock, and value-per-unit) and levels (i.e., temporal and
non-temporal), but this information can hardly be inferred
from the schema or even provided by the user [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
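      <p>For illustration, the check of Algorithm 1 can be sketched in JavaScript as follows (the dependency-graph encoding and the helper names are ours, not the prototype's).</p>
      <p>// Sketch of Algorithm 1. The graph encoding is an assumption of ours:
// graph.order   : pairs [f1, f2] meaning f1 ⪰ f2 (arcs of the roll-up order);
// graph.idOfArr : for each field, the identifier of its containing array;
// graph.gsup    : the global support of each field.
function rollsUpTo(graph, from, to) {          // true if "from ⪰ to" holds (reachability)
  if (from === to) return true;
  return graph.order.some(([a, b]) => (a === from ? rollsUpTo(graph, b, to) : false));
}
function checkQuery(graph, query) {            // query = { groupBy: [...], measure: "..." }
  let warning = false;
  for (const level of query.groupBy) {
    if (!rollsUpTo(graph, graph.idOfArr[query.measure], graph.idOfArr[level]))
      warning = true;                          // disjointness failed (line 3)
    if (graph.gsup[level] &lt; 1)
      warning = true;                          // completeness failed (line 5)
  }
  return warning ? "warning" : "valid";
}</p>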
      <p>Example 6.3. Query q1 passes the validity check of
Algorithm 1 with a completeness warning, because
gsup(Facility.City) &lt; 1. On the other hand, q1 meets
the disjointness constraint because</p>
      <p>id(arr(Facility.City)) = _id
id(arr(Exercises.Type)) = Exercises._id
id(arr(Exercises.Sets.Weight)) = Exercises.Sets._id
Exercises.Sets._id ⪰ _id
Exercises.Sets._id ⪰ Exercises._id
□</p>
      <p>
        As previously mentioned, a query fails the completeness
constraint if one or more levels in the group-by set do
not have full support. This issue is strictly related to the
one of incomplete hierarchies in data warehouse design.
The related work proposes three alternative strategies to
replace missing values in a hierarchy level l: balancing
by exclusion (i.e., replacing all missing values with a
single value "Other"), downward balancing (replacing them with
values from the closest level l′ such that l′ ⪰ l), and
upward balancing (replacing them with values from the closest
level l′ such that l ⪰ l′) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Whereas these strategies are originally
meant to be applied when populating a data warehouse
from an operational source, they can be directly
applied at query time, e.g., by using the $ifNull operator
in MongoDB, which allows a missing value in a field
to be replaced with a custom value or with the value of another
field, as sketched below. Thus, when a query fails the completeness constraint,
we ask the user to indicate the desired strategy to replace
missing values in the levels without full support.
      </p>
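      <p>For instance, balancing by exclusion and downward balancing on the Facility.City level could be expressed with the following $project expressions (a sketch; the downward variant is the one used in Example 6.4).</p>
      <p>// Balancing by exclusion: missing Facility.City values are replaced by "Other"
{ $project: { "Facility.City": { $ifNull: ["$Facility.City", "Other"] } } }
// Downward balancing: missing Facility.City values are taken from the closest
// finer level, here Facility.Name (as in Example 6.4)
{ $project: { "Facility.City": { $ifNull: ["$Facility.City", "$Facility.Name"] } } }</p>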
    </sec>
    <sec id="sec-6">
      <title>Query Execution</title>
      <p>Once a query has been formulated by the user on the
dependency graph corresponding to the global schema, it
has to be reformulated on each local schema to effectively
cope with inter-document variety. How this can be done
is discussed in the Appendix. In the remainder of this
section we explain how, after reformulation, each single
query obtained can be translated to MongoDB.</p>
      <p>OLAP queries are translated to MongoDB according
to its aggregation framework, which allows a multi-stage
pipeline of transformations to be declared and carried out on
the documents of a collection. The most important stages
are: $match (to apply selection predicates), $project (to
apply transformations to the single fields), $unwind (to
unfold an array by creating a different document for every
object inside the array), and $group (to group the documents
and calculate aggregated values).</p>
      <p>Given a query q = ⟨G, P, m, op⟩ on ℳ and the global schema
G(C) = (F, A), the translation of q into the MongoDB
language is done as follows:
(1) For every array a in G(C), a ≠ r, for which there is a
field f mentioned in q such that fact(q) ⪰ id(a) ⪰ f,
an $unwind stage is defined; the order of these stages
reflects the order of the arrays in G(C), beginning
from the one closest to r.
(2) If P ≠ ∅, a $match stage is defined listing every
selection predicate.
(3) A $project stage is defined to keep only the fields
that are required for the following stages, i.e., m
and every group-by level. If there is one (or more)
incomplete level l ∈ G (i.e., such that gsup(l) &lt;
1), the replacement of the missing values of l is
done at this stage, in accordance with the balancing
strategy chosen by the user. Additionally, a new
field named balanced is added and valued true if
any of the projected fields has been affected by the
balancing strategy, false otherwise.
(4) A $group stage is defined including the fields that
identify a group (i.e., every level l ∈ G plus the
balanced field), the measure m to be aggregated,
and its aggregation operator op. Additionally, two
new measures named count and count-m are added to
count, respectively, the number of aggregated objects
and the number of aggregated objects that actually
contain a value for m.</p>
      <p>The query-independent fields balanced, count, and count-m
are needed to calculate the indicators of the query, which
will be discussed in Section 6.3.</p>
      <p>Example 6.4. The MongoDB query obtained from q1
considering a downward balancing strategy is the following.
db.WS.aggregate([
  { $unwind: "$Exercises" },
  { $unwind: "$Exercises.Sets" },
  { $match: { "User.Age": { $gte: 60 } } },
  { $project: {
      "Facility.City": { $ifNull: ["$Facility.City", "$Facility.Name"] },
      "Exercises.Type": 1,
      "Exercises.Sets.Weight": 1,
      "balanced": { $cond: ["$Facility.City", false, true] }
  } },
  { $group: {
      "_id": {
        "FacilityCity": "$Facility.City",
        "ExercisesType": "$Exercises.Type",
        "balanced": "$balanced"
      },
      "ExercisesSetsWeight": { $avg: "$Exercises.Sets.Weight" },
      "count": { $sum: 1 },
      "count-m": { $sum: { $cond: ["$Exercises.Sets.Weight", 1, 0] } }
  } }
])
□</p>
    </sec>
    <sec id="sec-7">
      <title>Query Evaluation and Evolution</title>
      <p>In our schemaless scenario, the evaluation of the query
results cannot be separated from the evaluation of the query
itself. In particular, it is important to understand the
coverage of the query with respect to the collection (which
may be influenced by the support of the fields, the quality of
the mappings, and the selectivity of the selection predicate),
as well as the reliability of the results. For these reasons,
we introduce some indicators to evaluate the quality of
an OLAP query after it has been executed. Let R be the
set of distinct groups returned by query q; each group
g ∈ R includes |g| objects (measured by the count field of
Section 6.2), of which |g_m| (measured by the count-m
field) have a value for m.</p>
      <p>Selectivity. This indicator measures the selectivity of
the selection predicates in q:
sel(q) = (∑_{g ∈ R} |g|) / |fact(q)|
where |fact(q)| denotes the overall number of instances of
fact(q) in the collection.</p>
      <p>Completeness. This indicator is built on the concept
of completeness previously introduced. The idea is to show
the percentage of the queried objects that have not been
affected by the balancing strategies (which step in when
the value of a level is null or does not exist):
compl(q) = (∑_{g ∈ R : ¬bal(g)} |g|) / (∑_{g ∈ R} |g|)
where bal(g) is true if g has been balanced, false
otherwise (as stated by the balanced field introduced in
Section 6.2).</p>
      <p>Group precision. While the absence of full support on
levels can be overcome by the balancing strategies, nothing
can be done when it involves the query measure. In this
case, the precision of the aggregated value returned for
each group is determined by the percentage of aggregated
objects that actually contain a value for the measure. Thus,
the precision of a group g is
prec(g) = |g_m| / |g|</p>
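      <p>A sketch (JavaScript) of how these indicators could be computed from the groups returned by the translated query of Section 6.2 follows; the overall number of fact instances (factInstances) is assumed to be available.</p>
      <p>// Sketch: selectivity, completeness and per-group precision from the $group output;
// each group g carries the count, count-m and balanced fields of Section 6.2.
function indicators(groups, factInstances) {
  const totalObjects = groups.reduce((n, g) => n + g.count, 0);
  const unbalanced = groups.filter((g) => !g._id.balanced)
                           .reduce((n, g) => n + g.count, 0);
  return {
    selectivity: totalObjects / factInstances,              // sel(q)
    completeness: unbalanced / totalObjects,                // compl(q)
    precision: groups.map((g) => g["count-m"] / g.count)    // prec(g), one value per group
  };
}</p>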
      <p>Consistently with an OLAP scenario, a query can evolve
into another with the application of an OLAP operation;
the resulting sequence of queries is called an OLAP session.
In particular, the permitted operations are the following
ones.</p>
      <p>∙ The replacement of the query measure with a
different one, or the selection of a different aggregation
operator. If a new measure is chosen, a new validity
check is required to verify whether the disjointness
requirement still holds.
∙ The addition/removal/modification of a selection
predicate. This operation has no impact on the
validity of the query.
∙ The roll-up (or drill-down) of one of the group-by
levels, which leads to replacing a level l with a level
l′ such that l ⪰ l′ (or l′ ⪰ l).</p>
      <p>Roll-ups and drill-downs imply a navigation of the
dependency graph along the relationships between l and l′, which
represent (A)FDs. From a multidimensional standpoint,
the navigation of an AFD with accuracy lower than 1 leads
to a violation of the roll-up semantics, i.e., the results of
the second query will not be a correct composition (or
decomposition) of the results of the first query. This happens
because the FD is not strictly true in some cases, which
compromises the correctness of the aggregation. Thus, we
evaluate the impact of these operations by means of another
indicator:</p>
      <p>Accuracy. This indicator quantifies the accuracy of the
aggregated results of a query during an OLAP session with
respect to the results obtained from the previous query.
Given query q, let q′ be the query resulting from a roll-up
(or drill-down) of q from level l to l′, and let Γ′ ⊆ Γ be
the set of AFDs in the path between l and l′. Then, the
accuracy of q′ with respect to q is
acc(q′, q) = 1 − ∏_{φ ∈ Γ′} (1 − acc(φ))</p>
    </sec>
    <sec id="sec-8">
      <title>RELATED LITERATURE</title>
      <p>The rise of NoSQL stores has captured a lot of interest
from the research community, which has proposed a
variety of approaches to deal with the schemaless feature. In
particular, most of the recent works focus on the widely
adopted JSON format and on key/value repositories in
general.</p>
      <p>
        A first distinction lies in how each work approaches the
problem of schema discovery. Some works aim at
providing a comprehensive view of the schema variety in JSON
documents; e.g., [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] proposes a reverse engineering
process to derive a versioned schema model, where multiple
versions of the same field are created for every intensional
variation detected in the collection. Other works provide
a more concise representation that tends to hide schema
variety. For instance, [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] couples a clustering technique
with schema matching techniques to identify a skeleton
containing the smallest set of core fields, while [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] adopts
regular expressions to model the variability of a field type.
Our work is closer to the latter group, although our global
schema captures the entire variety of fields and enables
the user to choose the fields to focus on, while assisting
her with quality indicators of the final queries. Several free
tools have also been released to perform schema detection
on different platforms (MongoDB, ElasticSearch,
Couchbase, Apache Drill), although they are mostly limited to
collecting the union of the fields. In a previous work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] we
followed a different approach and devised a schema profiling
algorithm that explains the schema variety in a collection
in terms of the extensional values found in the documents
(e.g., it could find that different schemas depend on the
different values for SessionType).
      </p>
      <p>
        The most distinguishing feature of our approach is the
definition of a multidimensional representation of the schema
in order to enable OLAP analyses directly on the DOD.
From this point of view, a work closely related to ours is
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which proposes a schema-on-read approach for OLAP
queries over DODs. This is done by building a
multidimensional schema from the union of fields found in the
collection; then, the OLAP experience is proposed at query
time, where suggestions for roll-up and drill-down
operations are provided given the query formulated by the user.
Differently from our approach, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] exclusively focuses on
the multidimensional representation of JSON data and
overlooks the schemaless property of DODs: in particular,
inter-document variety is considered only in terms of fields
with varying support (thus no schema integration is
performed), and no support is given to the user to evaluate
the coverage and accuracy of queries. Also, AFD detection
is carried out on demand only after the user has written a
query, thus it only impacts the OLAP experience. Another
similar work is [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which proposes a MapReduce-based
algorithm to compute OLAP cubes on columnar stores.
The approach is meant to work on a data warehouse (i.e.,
a database already comprising facts and dimensions);
besides, it is limited to the computation of the cubes, while
the OLAP querying aspect is mentioned as future work.
Also [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] aims at delivering the OLAP experience, but its
operational data source is a graph-based database, whose
data model is entirely different from the one of DODs.
Finally, [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] builds on [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] to propose a complete
architecture that ingests NoSQL data and provides schema-on-read
functionalities, but without mentioning multidimensional
enrichment and OLAP analyses.
      </p>
      <p>
        Since schema variety in a collection often consists of
different representations of the same data (e.g., due to schema
evolution or to the ingestion of data from different sources),
the problem of schema discovery is often coupled with
schema matching algorithms. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] provides a
comprehensive summary of the different techniques envisioned for
generic schema matching (which ranges from the relational
world to ontologies and XML documents); it is mentioned
as a baseline reference in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], while [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] starts from there to
define its own algorithm for schema matching on NoSQL
stores based on subtree matching. In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] a tool is
presented to automatically identify evolution in the schema
of instances in NoSQL databases: once a schema change is
detected, the tool either updates the database instances to
enforce schema consistency or provides a code to deal with
this issue on the application side. This structured approach
differs from our schema-on-read scenario, which
transparently handles schema differences and avoids updating the
original data.
      </p>
      <p>
        Several works have focused on bringing NoSQL back to
the relational world. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] discusses an approach to provide
schema-on-read capabilities for flexible schema data stored
on RDBMSs; this is done by mapping the document
structure on different tables and by providing a data guide as
the union of every possible field at any level. Differently
from our approach, no advanced schema matching
mechanism is provided. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes an algorithm to provide a
generic relational encoding of arbitrary JSON documents;
in particular, documents are stored in ternary relations
that contain rows for every key in every document (i.e.,
each row stores the document id, the key name, and the
key value). A more sophisticated algorithm is proposed in
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where normalized relational schemas are automatically
generated from NoSQL stores. It relies on AFD detection
to build relationships between entities and it provides its
own schema matching algorithm. Based on this approach, a
vision for a new paradigm called adaptive schema databases
has been proposed in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]; it is a conceptual framework that
devises global schemas as time-evolving and user-dependent
relational views that are mapped to local schemas via
probabilistic mappings —whereas mappings are deterministic
in our approach.
      </p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS</title>
      <p>
        In this paper we have presented an original approach to
OLAP on DODs. Our current implementation relies on a
prototype that separately handles the different stages. In
particular, we use a customized version of the free tool
variety.js for schema extraction on MongoDB collections; we
rely on the BIN framework [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to handle schema mappings
and query reformulation (see Appendix); AFD detection
is carried out by a simple Javascript algorithm, which
determines the presence of AFDs between couples of fields
by means of count distinct queries, adopting a smart
exploration strategy that reduces the search space like in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ];
finally, OLAP queries are manually formulated. Our
reference real-world collection is stored on a single machine (i7
CPU, 32GB RAM) and contains 5M workout sessions with
6 different local schemas (mostly due to missing attributes),
35M exercises and 85M sets. Table 1 shows the execution
times for the schema extraction phase. Times are
consistent with those of related approaches that perform schema
extraction on JSON datasets, such as [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; also, we note that
time increases linearly with the size of the database. Given
the low number of schemas, mappings have been manually
defined. A sample OLAP query q that groups documents
by Facility.Chain (global support 0.38) to obtain the
average amount of Exercises.ExCalories (global support 0.69)
returns the following indicator values: sel(q) = 1 (as there
is no filtering), compl(q) = 0.33 (lower than the group-by
set support), and an average prec(g) of 0.99. The time for
executing q on MongoDB is about 3 minutes. Drilling down
to Facility.Name (global support 1) increases compl(q) to 1,
while the accuracy of the new query with respect to q is
0.98.
      </p>
      <p>
        As future work, we plan to build a fully-functioning
implementation, as well as to thoroughly evaluate the
performance and scalability of the approach. Also, we plan to
switch from a single machine to a multi-node cluster and
to consider schema profiling techniques [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to enhance the
support given to the user at query time.
      </p>
    </sec>
    <sec id="sec-10">
      <title>APPENDIX</title>
      <p>
        Not only do inter-schema mappings enable the definition of
a global schema, they also allow a MongoDB query
formulated on the global schema to be reformulated on each
local schema, which is necessary in two situations: (i) when
the collection is queried to detect AFDs (Section 5) and
(ii) when the user issues an OLAP query on the collection
(Section 6). The query reformulation algorithm we adopt
here is the one proposed by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] in the context of business
intelligence networks (BINs); it enables the rewriting of
a query from a source multidimensional schema to a
target multidimensional schema and has been proved to be
complete and provide all certain answers to the query. In
this section we discuss why that algorithm can be reused
to safely rewrite queries in both situations (i) and (ii). To
this end we need to prove that the data schemas, the
inter-schema mappings, and the queries which we consider in
our work are a particular case of those used as a reference
in the BIN context.
      </p>
      <p>Data schema. The reference schema in the BIN context is
a classical multidimensional schema featuring a fact, a set
of hierarchies (each made of levels), and a set of measures
(each coupled with an aggregation operator). The
dependency graph of Definition 5.2 can be thought of as a sort
of “multi-fact” multidimensional schema with no explicit
distinction between levels and measures. However, when an
OLAP query is formulated as in Definition 6.1, exactly one
fact is implicitly determined, group-by levels are explicitly
distinguished from measures, and an aggregation operator
is coupled to each measure. So, from the data schema point
of view, there is no difference between the context of BINs
and the one of this paper.</p>
      <p>
        Mappings. The primitive mappings of Definition 4.1 can be
expressed, according to the BIN terminology, using either
same or equi-level predicates. same predicates are used for
measures, and can be annotated with an expression; since
in Definition 6.1 measures are required to be numerical,
the associated transcodings must be translatable into an
expression. equi-level predicates are used for levels, and can
be directly annotated with a transcoding. Remarkably, in
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] these two types of mappings are called exact since they
enable non-approximate query reformulations. Note that
array mappings are not used for query reformulation but
only for determining the global schema, so they are not
considered here.
      </p>
      <p>Queries. An OLAP query (Definition 6.1) has a group-by
set, a (conjunctive) selection predicate, and a measure. A
BIN query has a group-by set, a (conjunctive) selection
predicate, and an expression involving one or more
measures. By simply picking a single measure and the identity
expression, situation (ii) is addressed. As to situation (i),
i.e., querying aimed at checking AFDs, we remark that the
query for checking AFD f ⇝ f′ can be expressed as a BIN
query with group-by set {f, f′} and a dummy measure, on
whose result a simple COUNT DISTINCT is then executed.</p>
      <p>Based on the considerations above, we can state that
an OLAP query on the global schema can be correctly
reformulated into a set of local queries, one for each local
schema. Then, each local query is separately executed
on the DOD; specifically, each query must target only
the documents that belong to a specific local schema s.
This is done in two steps. First, the information about
which document has which schema (obtained in the schema
extraction stage) is stored in a different collection (called
WorkoutSession-schemas in our example) in the following
form: a document is created for every schema s ∈ schemas(C),
containing an array ids with the _id of every document
d ∈ C_s. Then, the query on schema s is executed by joining
it with the list of identifiers in WorkoutSession-schemas.
Finally, a post-processing activity is required to integrate
the results coming from the different local queries.</p>
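      <p>For instance, a local query could be restricted to the documents of a given schema by first fetching the corresponding ids and then adding a $match stage (a sketch; the collection name follows the example above, while the key used to identify the schema document is an assumption of ours).</p>
      <p>// Sketch: restrict the pipeline of a local query to the documents of schema s1,
// whose _id values are listed in the WorkoutSession-schemas collection.
const schemaDoc = db["WorkoutSession-schemas"].findOne({ schema: "s1" });  // "schema" key is an assumption
const pipeline = [
  { $match: { _id: { $in: schemaDoc.ids } } },   // keep only documents with schema s1
  // ... the stages of the reformulated local query follow (unwind, match, project, group)
];
db.WS.aggregate(pipeline);</p>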
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Amine</surname>
          </string-name>
          <string-name>
            <surname>Baazizi</surname>
          </string-name>
          , Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, and
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Sartiani</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Schema Inference for Massive JSON Datasets</article-title>
          .
          <source>In Proc. EDBT</source>
          . Venice, Italy,
          <fpage>222</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Batini</surname>
          </string-name>
          , Maurizio Lenzerini, and
          <string-name>
            <given-names>Shamkant</given-names>
            <surname>Navathe</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>A Comparative Analysis of Methodologies for Database Schema Integration</article-title>
          .
          <source>Comput. Surveys</source>
          <volume>18</volume>
          ,
          <issue>4</issue>
          (
          <year>1986</year>
          ),
          <fpage>323</fpage>
          -
          <lpage>364</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Philip</surname>
            <given-names>A Bernstein</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Jayant</given-names>
            <surname>Madhavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Generic schema matching, ten years later</article-title>
          .
          <source>Proc. VLDB Endowment 4</source>
          ,
          <issue>11</issue>
          (
          <year>2011</year>
          ),
          <fpage>695</fpage>
          -
          <lpage>701</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Arnaud</given-names>
            <surname>Castelltort</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anne</given-names>
            <surname>Laurent</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>NoSQL Graphbased OLAP Analysis</article-title>
          .
          <source>In Proc. KDIR</source>
          . Rome, Italy,
          <fpage>217</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Craig</given-names>
            <surname>Chasseur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yinan</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <surname>Jignesh</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Patel</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Enabling JSON Document Stores in Relational Systems</article-title>
          .
          <source>In Proc. WebDB</source>
          . New York, USA,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Lamine</surname>
          </string-name>
          <string-name>
            <surname>Chouder</surname>
          </string-name>
          , Stefano Rizzi, and
          <string-name>
            <given-names>Rachid</given-names>
            <surname>Chalal</surname>
          </string-name>
          . In press.
          <source>EXODuS: Exploratory OLAP over Document Stores. Inf. Syst</source>
          . (In press).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Dehdouh</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Building OLAP Cubes from Columnar NoSQL Data Warehouses</article-title>
          .
          <source>In Proc. MEDI</source>
          . Almería, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Michael</given-names>
            <surname>DiScala and Daniel J. Abadi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data</article-title>
          .
          <source>In Proc. SIGMOD</source>
          . San Francisco, USA,
          <fpage>295</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Enrico</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          , Matteo Golfarelli, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Schema Profiling of Document Stores</article-title>
          .
          <source>In Proc. SEBD</source>
          . Squillace Lido, Italy,
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          , Simone Graziani, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Starry Vault: Automating Multidimensional Modeling from Data Vaults</article-title>
          .
          <source>In Proc. ADBIS</source>
          .
          <fpage>137</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          , Federica Mandreoli, Wilma Penzo, Stefano Rizzi, and
          <string-name>
            <given-names>Elisa</given-names>
            <surname>Turricchia</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>OLAP query reformulation in peer-to-peer data warehousing</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>37</volume>
          ,
          <issue>5</issue>
          (
          <year>2012</year>
          ),
          <fpage>393</fpage>
          -
          <lpage>411</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Data warehouse design: Modern principles and methodologies</article-title>
          .
          <source>McGraw-Hill, Inc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Rihan</given-names>
            <surname>Hai</surname>
          </string-name>
          , Sandra Geisler, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Quix</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Constance: An Intelligent Data Lake System</article-title>
          .
          <source>In Proc. SIGMOD</source>
          . San Francisco, USA,
          <fpage>2097</fpage>
          -
          <lpage>2100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Ihab F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          , Volker Markl,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Haas</surname>
          </string-name>
          , Paul Brown, and
          <string-name>
            <given-names>Ashraf</given-names>
            <surname>Aboulnaga</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies</article-title>
          .
          <source>In Proc. SIGMOD</source>
          .
          <fpage>647</fpage>
          -
          <lpage>658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Javier Luis</given-names>
            <surname>Cánovas Izquierdo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Cabot</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Discovering implicit schemas in JSON data</article-title>
          .
          <source>In Proc. ICWE</source>
          .
          <fpage>68</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Hans-Joachim</given-names>
            <surname>Lenz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arie</given-names>
            <surname>Shoshani</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Summarizability in OLAP and Statistical Data Bases</article-title>
          .
          <source>In Proc. Ninth International Conference on Scientific and Statistical Database Management.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Zhen Hua</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dieter</given-names>
            <surname>Gawlick</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Management of Flexible Schema Data in RDBMSs - Opportunities and Limitations for NoSQL</article-title>
          .
          <source>In Proc. CIDR</source>
          . Asilomar, USA.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          and
          <string-name>
            <given-names>Philip A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>A survey of approaches to automatic schema matching</article-title>
          .
          <source>VLDB J</source>
          .
          <volume>10</volume>
          ,
          <issue>4</issue>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Romero</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Abelló</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Multidimensional Design by Examples</article-title>
          .
          <source>In Proc. DaWaK</source>
          . Krakow, Poland,
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Diego</given-names>
            <surname>Sevilla Ruiz</surname>
          </string-name>
          , Severino Feliciano Morales, and
          <string-name>
            <given-names>Jesús</given-names>
            <surname>García Molina</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Inferring Versioned Schemas from NoSQL Databases and Its Applications</article-title>
          .
          <source>In Proc. ER</source>
          .
          <fpage>467</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Stefanie</given-names>
            <surname>Scherzinger</surname>
          </string-name>
          , Eduardo Cunha de Almeida, Thomas Cerqueus, Leandro Batista de Almeida, and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Holanda</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Finding and Fixing Type Mismatches in the Evolution of Object-NoSQL Mappings</article-title>
          .
          <source>In Proc. Workshops EDBT/ICDT.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>William</given-names>
            <surname>Spoth</surname>
          </string-name>
          , Bahareh Sadat Arab, Eric S. Chan, Dieter Gawlick, Adel Ghoneimy, Boris Glavic, Beda Christoph Hammerschmidt, Oliver Kennedy,
          <string-name>
            <given-names>Seokki</given-names>
            <surname>Lee</surname>
          </string-name>
          , Zhen Hua Liu, Xing Niu, and
          <string-name>
            <given-names>Ying</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Adaptive Schema Databases</article-title>
          .
          <source>In Proc. CIDR</source>
          . Chaminade, USA.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Boris</given-names>
            <surname>Vrdoljak</surname>
          </string-name>
          , Marko Banek, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Designing Web Warehouses from XML Schemas</article-title>
          .
          <source>In Proc. DaWaK</source>
          .
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Lanjun</given-names>
            <surname>Wang</surname>
          </string-name>
          , Shuo Zhang, Juwei Shi, Limei Jiao, Oktie Hassanzadeh, Jia Zou, and
          <string-name>
            <given-names>Chen</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Schema management for document stores</article-title>
          .
          <source>Proc. VLDB Endowment 8</source>
          ,
          <issue>9</issue>
          (
          <year>2015</year>
          ),
          <fpage>922</fpage>
          -
          <lpage>933</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>