=Paper=
{{Paper
|id=Vol-2062/paper2
|storemode=property
|title=Variety-Aware OLAP of Document-Oriented Databases
|pdfUrl=https://ceur-ws.org/Vol-2062/paper02.pdf
|volume=Vol-2062
|authors=Enrico Gallinucci,Matteo Golfarelli,Stefano Rizzi
|dblpUrl=https://dblp.org/rec/conf/dolap/GallinucciGR18
}}
==Variety-Aware OLAP of Document-Oriented Databases==
Enrico Gallinucci (DISI — Univ. of Bologna, Cesena, Italy), enrico.gallinucci2@unibo.it
Matteo Golfarelli (DISI — Univ. of Bologna, Cesena, Italy), matteo.golfarelli@unibo.it
Stefano Rizzi (DISI — Univ. of Bologna, Bologna, Italy), stefano.rizzi@unibo.it
ABSTRACT

Schemaless databases, and document-oriented databases in particular, are preferred to relational ones for storing heterogeneous data with variable schemas and structural forms. However, the absence of a unique schema adds complexity to analytical applications, in which a single analysis often involves large sets of data with different schemas. In this paper we propose an original approach to OLAP on collections stored in document-oriented databases. The basic idea is to stop fighting against schema variety and welcome it as an inherent source of information wealth in schemaless sources. Our approach builds on four stages: schema extraction, schema integration, FD enrichment, and querying; these stages are discussed in detail in the paper. To make users aware of the impact of schema variety, we propose a set of indicators related for instance to query completeness and precision.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

Recent years have witnessed an erosion of the relational DBMS predominance to the benefit of DBMSs based on alternative representation models (e.g., document-oriented and graph-based) which adopt a schemaless representation for data. Schemaless databases are preferred to relational ones for storing heterogeneous data with variable schemas and structural forms; typical schema variants within a collection consist in missing or additional attributes, in different names or types for an attribute, and in different structures for instances [9]. The absence of a unique schema grants flexibility to operational applications but adds complexity to analytical applications, in which a single analysis often involves large sets of data with different schemas. Dealing with this complexity while adopting a classical data warehouse design approach would require a notable effort to understand the rules that drove the use of alternative schemas, plus an integration activity to identify a common schema to be adopted for analysis — which is quite hard when no documentation is available. Furthermore, since new schema variations are often introduced, a continuous evolution of both the ETL process and the cube schemas would be needed.

In this paper we propose an original approach to multidimensional querying and OLAP on schemaless sources, in particular on collections stored in document-oriented databases (DODs) such as MongoDB. The basic idea is to stop fighting against data heterogeneity and schema variety, and welcome them as an inherent source of information wealth in schemaless sources. So, instead of trying to hide this variety, we show it to users (basically, data scientists and data enthusiasts), making them aware of its impact, e.g., in terms of completeness and precision. Specifically, the distinguishing features of our approach are as follows.

∙ To the best of our knowledge, this is the first approach to propose a form of approximated OLAP analyses on document-oriented databases that embraces and exploits the inherent variety of documents.
∙ Multidimensional querying and OLAP are carried out directly on the data source, without materializing any cube or data warehouse.
∙ We adopt an inclusive solution to integration, i.e., the user can include a concept in a query even if it is present in a subset of documents only. We cover both inter-schema and intra-schema variety; specifically, we cope with missing attributes, different levels of detail in instances, and different attribute naming.
∙ Our approach to the reformulation of multidimensional queries on heterogeneous documents is grounded on a formal approach [11], which ensures its correctness and completeness.
∙ We propose a set of indicators to make the user aware of the level of completeness and precision of the query result.

Remarkably, this is not yet another paper on multidimensional modeling from non-traditional data sources. Indeed, our goal is not to design a single “sharp” schema where source attributes are either included or absent, but rather to enable OLAP querying on some sort of “soft” schema where each source attribute is present to some extent.

The paper outline is as follows. After giving an overview of our approach in Section 2, in Sections 3, 4, 5, and 6 we describe its four stages, namely, schema extraction, schema integration, FD enrichment, and querying. Then, in Section 7 we discuss the related literature, and finally in Section 8 we draw the conclusions. An appendix completes the paper by discussing the correctness of our query reformulation framework.

2 APPROACH OVERVIEW

Figure 1 gives an overview of the approach: in blue, the different stages of the approach; on the right, the metadata produced/consumed by each stage. Remarkably, all schema-related concepts are stored as metadata, so no transformation has to be done on source data. User interaction is required at most stages. Although the picture suggests a sequential execution of the stages, it simply outlines the ordering for the first iteration. In the scenario that we envision, the user starts by analyzing the first results provided by the system, then iteratively injects additional knowledge into the different stages to refine the metadata and improve the querying effectiveness. We now provide a short description of each stage; a deeper discussion will be provided in the following sections.
[Figure: the four stages — schema extraction, schema integration, FD enrichment, querying — drawn as a bottom-up pipeline over the collection; each stage produces/consumes metadata (local schemas; global schema and mappings; dependency graph), with a legend distinguishing data, metadata, and control flows.]
Figure 1: Approach overview

[ { "_id" : ObjectId("54a4332f44cfc02424f961d4"),
    "User" :
      { "FullName" : "John Smith",
        "Age" : 42 },
    "StartedOn" : ISODate("2017-06-15T10:20:44.000Z"),
    "Facility" :
      { "Name" : "PureGym Piccadilly",
        "Chain" : "PureGym" },
    "SessionType" : "RunningProgram",
    "DurationMins" : 90,
    "Exercises" :
      [ { "Type" : "Leg press",
          "ExCalories" : 28,
          "Sets" :
            [ { "Reps" : 14,
                "Weight" : 60 },
              ...
            ] },
        { "Type" : "Tapis roulant" },
        ...
      ]
  },
  ...
]
Figure 2: An excerpt of the WorkoutSession collection

Schema extraction (Section 3). The goal of this stage is to identify the set of distinct local schemas that occur inside a collection of documents. To this end we provide a tree-like definition for schemas which models arrays by considering the union of the schemas of their elements. This is a completely automatic stage which requires no interaction with the user.
Schema integration (Section 4). At this stage we rely on inter-schema mappings and schema integration techniques to determine a (tree-like) global schema that gives the user a single and comprehensive description of the contents of the collection. In principle, this stage could be completely automated. In practice, the best results can be obtained through a semi-automatic approach that allows users to manually validate/refine the mappings proposed by the system. As of now, we rely on the user to manually provide inter-schema mappings, from which the global schema is derived.

FD enrichment (Section 5). Traditional OLAP analyses are carried out on multidimensional cubes. To enable the OLAP experience in our setting, a multidimensional representation of the collection must be derived from the global schema. In particular, we introduce the notion of dependency graph, i.e., a graph that provides a multidimensional view of the global schema in terms of the functional dependencies (FDs) between its attributes. Some FDs can be inferred from the structure of the schema, others by analyzing data; given the expected schema variety, we specifically look for approximate FDs.

Querying (Section 6). The last stage consists in delivering the OLAP experience to the user by enabling the formulation of multidimensional queries on the dependency graph and their execution on the collection. First of all, each formulated query is validated against the requirements of well-formedness proposed in the literature [19]. Then, the query is translated to the query language of the DOD and reformulated into multiple queries, one for each local schema in the collection; the results presented to the user are obtained by merging the results of the single local queries. To make the user aware of the impact of schema variety in terms of quality and reliability of the results, we show her a set of indicators related to query completeness and precision.

The motivating example that we use across the paper is based on a real-world collection of workout sessions, obtained from a worldwide company selling fitness equipment. Figure 2 shows a sample document in the collection, organized according to three nesting levels:

(1) The first level contains information about the user, including the facility in which the session took place, the date, and the total duration in minutes.
(2) The Exercises array contains an object for every exercise carried out during the session, with information on the type of exercise and the total calories.
(3) The Sets array contains an object for every set that the exercise was split into. For example, the “leg press” exercise has been done in multiple sets, the first of which comprises 14 repetitions with a weight of 60 kilograms, for a total of 28 calories.

3 SCHEMA EXTRACTION

The goal of this stage is to introduce a notion of (local) schema for a document, to be used in the integration stage to determine a (global) schema for a collection and then, in the FD enrichment stage, to derive an OLAP-compliant representation of the collection itself.

The notion of document is the central concept of a DOD: a document encapsulates and encodes data in some standard format. The most widely adopted format is currently JSON, which we will use as a reference in this work.

Definition 3.1 (Document and Collection). A document 𝑑 is a JSON object. An object is formed by a set of key/value pairs (aka fields); a key is a string, while a value can be either a primitive value (i.e., a number, a string, or a Boolean), an array of values, an object, or null. A collection 𝐷 is an array of documents.

Example 3.2. Figure 2 shows a document excerpted from the WorkoutSession collection; it contains numbers (e.g., Age), strings (e.g., Chain), objects (e.g., User), and arrays (e.g., Exercises). Conceptually, a session is done by a user at a facility; it includes a list of exercises, each possibly comprising several sets. □
Since there is no explicit representation of schemas in documents, multiple definitions are possible for the schemas of collections and documents — with different levels of conciseness and precision. The main difference in these definitions lies in how they cope with inter-document variety and intra-document variety.

∙ Inter-document variety impacts on the definition of the schema for a collection, as it concerns the presence of documents with different fields. This issue is usually dealt with in one of two ways: either by defining the schema of the collection as the union/intersection [1, 24] of the most frequent fields, or by keeping track of every different schema [20]. Our work mixes the above-mentioned approaches in that it builds a global schema starting from local schemas.
∙ Intra-document variety impacts on the definition of the schema for a document, and is mainly related to the presence in a document of a heterogeneous array. For instance, an array of objects can mix objects with different fields (e.g., the first object of the Exercises array in Figure 2 contains fields that are missing from the second one). In this work we adopt a simple representation that, like in [1, 15], considers the union of the values contained in the array.

We start by giving a “structural” definition of a schema as a tree, then we reuse it to define the schema of a document and, in Section 4, the schema of a collection.

Definition 3.3 (Schema). A schema is a directed tree 𝑠 = (𝐹, 𝐴) where 𝐹 is a set of fields and 𝐴 is a set of arcs representing the relationships between arrays and the contained fields. In particular,

(1) 𝐹 = 𝐹^arr ∪ 𝐹^prim, where 𝐹^arr is a set of array fields (including the root 𝑟 of 𝑠) and 𝐹^prim is a set of primitive fields;
(2) 𝐴 includes arcs from fields in 𝐹^arr to fields in 𝐹^arr ∪ 𝐹^prim.

Each field 𝑓 ∈ 𝐹 has a name, 𝑘𝑒𝑦(𝑓); a unique pathname, obtained by concatenating the names of the fields along the path from 𝑟 to 𝑓, with the exclusion of 𝑟; and a type, 𝑡𝑦𝑝𝑒(𝑓), with 𝑡𝑦𝑝𝑒(𝑓) ∈ {number, string, Boolean} for all 𝑓 ∈ 𝐹^prim and 𝑡𝑦𝑝𝑒(𝑓) = array for all 𝑓 ∈ 𝐹^arr. Given a field 𝑓 ≠ 𝑟, we denote with 𝑎𝑟𝑟(𝑓) the array 𝑎 ∈ 𝐹^arr such that (𝑎, 𝑓) ∈ 𝐴.

To define the schema of a specific document we need to add identifiers to arrays. We denote with 𝑖𝑑(𝑎) the primitive field that identifies an object within array 𝑎. Documents always contain an identifier, 𝑖𝑑(𝑟) = _id. Conversely, array objects may not contain such a field, but they can still be univocally identified by their positional index within the array. Therefore, given array 𝑎, 𝑖𝑑(𝑎) can be recursively defined as the concatenation of 𝑖𝑑(𝑎𝑟𝑟(𝑎)) and the positional index within 𝑎; it is 𝑘𝑒𝑦(𝑖𝑑(𝑎)) = _id and 𝑡𝑦𝑝𝑒(𝑖𝑑(𝑎)) = string.

Definition 3.4 (Schema of a Document). Given document 𝑑 ∈ 𝐷, the schema of 𝑑 is the schema 𝑠(𝑑) = (𝐹^arr ∪ 𝐹^prim, 𝐴) such that

(1) 𝐹^arr includes a field for each array in 𝑑, labelled with the corresponding key and type, plus a root 𝑟 labelled with the name of 𝐷 and with type array;
(2) 𝐹^prim includes (i) a field for each primitive in 𝑑, and (ii) a field for each 𝑖𝑑(𝑎) with 𝑎 ∈ 𝐹^arr, 𝑎 ≠ 𝑟; every field is labelled with its corresponding key and type (keys of primitives within an object field are “flattened”, i.e., prefixed with the object’s key);
(3) 𝐴 includes (i) an arc (𝑟, 𝑓) for each field 𝑓 such that 𝑘𝑒𝑦(𝑓) appears as a key in the root level of 𝑑, and (ii) an arc (𝑎, 𝑓) iff 𝑘𝑒𝑦(𝑓) appears as a key in an object of array 𝑎.

[Figure: three tree-shaped schemas drawn side by side as boxes of fields connected by inter-array arcs, with the mappings between them shown as dotted lines.]
Figure 3: The schema of the JSON document in Figure 2 (a), another schema of the same collection (c), and the global schema (b)

Example 3.5. Figure 3.a shows the schema of the document represented in Figure 2, part of the WorkoutSession collection (from now on, abbreviated in WS). Each array is represented as a box, with its child primitives listed below (numeric primitives are in italics). Object fields are prefixed with the object key (e.g., Facility.Chain). The vertical lines between boxes represent inter-array arcs, with the root WS on top. It is 𝑎𝑟𝑟(Exercises.Type) = Exercises and 𝑖𝑑(Exercises) = Exercises._id. □

Given collection 𝐷, we denote with 𝑆(𝐷) the set of distinct schemas of the documents in 𝐷 (where two fields in the schemas of two documents are considered equal if they have the same pathname and the same type):

    𝑆(𝐷) = ⋃_{𝑑∈𝐷} 𝑠(𝑑)

Given 𝑠 ∈ 𝑆(𝐷), we denote with 𝐷𝑠 the set of documents 𝑑 ∈ 𝐷 such that 𝑠(𝑑) = 𝑠.
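For concreteness, the extraction of 𝑠(𝑑) can be sketched in a few lines of JavaScript. The following is an illustrative sketch only, not the code of our prototype (which relies on a customized version of variety.js, see Section 8): it emits the schema of one document as a set of pathname:type pairs, flattening object keys into their parent's level and taking the union of the schemas of array elements, as per Definitions 3.3 and 3.4. BSON types such as ObjectId and ISODate, as well as null values, would need dedicated cases.

    // Illustrative sketch, not the prototype's code: compute s(d) as a
    // set of "pathname:type" strings.
    function schemaOf(value, path, acc) {
      if (Array.isArray(value)) {
        acc.add(path + ":array");
        acc.add(path + "._id:string");      // implicit array identifier (Section 3)
        for (const elem of value)
          schemaOf(elem, path, acc);        // union of the elements' schemas
      } else if (value !== null && typeof value === "object") {
        for (const key of Object.keys(value))
          schemaOf(value[key], path === "" ? key : path + "." + key, acc);  // flattening
      } else {
        acc.add(path + ":" + typeof value); // number, string, boolean
      }
      return acc;
    }
    // e.g., schemaOf(doc, "", new Set()) on the document of Figure 2 yields
    // "User.FullName:string", "Exercises:array", "Exercises.Type:string", ...

Collecting the distinct sets produced for all documents yields 𝑆(𝐷).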
4 SCHEMA INTEGRATION

The goal of this stage is to integrate the distinct local schemas extracted from 𝐷 to obtain a single and comprehensive view of the collection, i.e., a global schema, and its mappings with each local schema. The global schema can be incrementally built using one of the methodologies discussed in [2]; for instance, adopting a ladder integration strategy, by (i) taking one local schema as the global schema, and (ii) iteratively taking each other local schema, finding its mappings onto the global schema, and updating the global schema accordingly. However, notice that some mappings may be missed by adopting a purely incremental strategy (i.e., a second iteration on the local schemas may be required). A survey of the techniques that can be used for finding mappings is provided in [3, 18].

A mapping is defined as follows:

Definition 4.1 (Mapping). Given two schemas 𝑠𝑖 and 𝑠𝑗, a mapping from 𝑠𝑖 to 𝑠𝑗 can be either

∙ an array mapping with form ⟨𝑎, 𝑎′⟩, where 𝑎 ∈ 𝐹𝑖^arr and 𝑎′ ∈ 𝐹𝑗^arr; or
∙ a primitive mapping with form ⟨𝑃, 𝑃′, 𝜑⟩, where 𝑃 ⊆ 𝐹𝑖^prim, 𝑃′ ⊆ 𝐹𝑗^prim, and 𝜑 is a transcoding function, 𝜑 : 𝐷𝑜𝑚(𝑃) → 𝐷𝑜𝑚(𝑃′).

The definition of the global schema for a collection is based on the inter-schema mappings determined.

Definition 4.2 (Global Schema). Given collection 𝐷 and the corresponding set of schemas 𝑆(𝐷) = {𝑠1, . . . , 𝑠𝑛}, the global schema of 𝐷 is a schema 𝑔(𝐷) = (𝐹, 𝐴) where

(1) for every 𝑠𝑖 ∈ 𝑆(𝐷) there is a mapping ⟨𝑟𝑖, 𝑟⟩ between the roots of 𝑠𝑖 and 𝑔(𝐷);
(2) every field 𝑓 in each 𝑠𝑖 is involved in at least one mapping onto the fields of 𝑔(𝐷);
(3) every field 𝑓 in 𝑔(𝐷) is involved in at least one mapping with some 𝑠𝑖.

Example 4.3. Figure 3 shows two sample schemas from the WS collection (a and c) and the corresponding global schema 𝑔(𝐷) (b); mappings are represented with dotted lines. An example of array mapping from local schema (c) to global schema (b) is

    ⟨Series, Exercises.Sets⟩

Examples of primitive mappings are

    ⟨{Date}, {StartedOn}, 𝜑1⟩
    ⟨{FirstName, LastName}, {User.FullName}, 𝜑2⟩
    ⟨{Series.ExType}, {Exercises.Type}, 𝜑1⟩

where 𝜑1 is the identity function while 𝜑2 is a function that concatenates two strings. □
A transcoding function transforms values of a set of fields into values of another set of fields; it is needed for each primitive mapping to enable query reformulation in the presence of selection predicates, as well as to enable the results obtained from all documents to be integrated (see Appendix). On the other hand, array mappings are not associated to a transcoding function because arrays are just containers and do not have values themselves.
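As an illustration (the encoding below is ours; the mappings are those of Example 4.3), primitive mappings with executable transcodings could be represented as follows:

    // Illustrative sketch: the primitive mappings of Example 4.3, with
    // executable transcoding functions (phi1 = identity, phi2 = concatenation;
    // the separator used by phi2 is an assumption).
    const primitiveMappings = [
      { from: ["Date"], to: ["StartedOn"],
        phi: (date) => date },
      { from: ["FirstName", "LastName"], to: ["User.FullName"],
        phi: (first, last) => first + " " + last },
      { from: ["Series.ExType"], to: ["Exercises.Type"],
        phi: (exType) => exType }
    ];
    // An array mapping is just a pair, e.g. { from: "Series", to: "Exercises.Sets" },
    // with no transcoding attached.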
Due to the already mentioned inter-document variety, a field 𝑓 of the global schema may not be available in every local schema (e.g., Facility.Chain is absent in the second schema in Figure 3); therefore we need a measure of the support of 𝑓 with respect to the different schemas in collection 𝐷. Intuitively, given the nested structure of documents, the support of 𝑓 could be defined as the percentage of times that 𝑓 occurs among the objects of 𝑎𝑟𝑟(𝑓). However, due to the fact that 𝑓 may occur at different depths in different documents (e.g., if 𝑓 = Exercises.ExCalories in the global schema, 𝑎𝑟𝑟(𝑓) is Exercises in the schema of Figure 3.a and Exercises.Sets in the schema of Figure 3.c), this measure must be computed locally to each schema and then aggregated to get a global measure. Thus, we define the global support of 𝑓 as the weighted average of the local supports calculated on the distinct schemas.

Definition 4.4 (Local Support of a Field). Given a document schema 𝑠 = (𝐹, 𝐴), the local support of a field 𝑓 ∈ 𝐹 is recursively defined as:

    𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠) = 1                                        if 𝑓 ≡ 𝑟
    𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠) = 𝑝𝑒𝑟𝑐(𝑓) · 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑎𝑟𝑟(𝑓), 𝑠)            otherwise

where 𝑝𝑒𝑟𝑐(𝑓) is the percentage of objects of 𝑎𝑟𝑟(𝑓), over the documents in 𝐷𝑠, which include 𝑓.

Note that the support of 𝑓 is weighted on the support of its array 𝑎𝑟𝑟(𝑓); this is because, for instance, 𝑓 may occur in every object of 𝑎𝑟𝑟(𝑓) but 𝑎𝑟𝑟(𝑓) may be missing for some object of 𝑎𝑟𝑟(𝑎𝑟𝑟(𝑓)). As a result, it is always 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠) ≤ 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑎𝑟𝑟(𝑓), 𝑠).

Definition 4.5 (Global Support of a Field). Given collection 𝐷 and the set of distinct schemas 𝑆(𝐷), the global support of a field 𝑓 ∈ 𝐹 is:

    𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓) = Σ_{𝑠∈𝑆(𝐷)} 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠) · |𝐷𝑠| / |𝐷|

where |𝐷𝑠| is the number of documents with schema 𝑠 and |𝐷| is the overall number of documents.

Example 4.6. In our working example, let the collection have 100 documents (i.e., |𝐷| = 100) evenly distributed between 𝑠1 and 𝑠2 (i.e., |𝐷𝑠1| = |𝐷𝑠2| = 50). Let 𝑓 = Facility.City occur 40 times in 𝑠1 and 20 times in 𝑠2; then 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠1) = (40/50) · 1 = 0.8, 𝑙𝑜𝑐𝑆𝑢𝑝𝑝(𝑓, 𝑠2) = (20/50) · 1 = 0.4, and 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓) = 0.8 · 0.5 + 0.4 · 0.5 = 0.6. □
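The two definitions translate directly into code. A minimal sketch (ours; the field names and the data layout — root, arr, perc, docCount per schema — are assumptions, not part of the approach):

    // Illustrative sketch: local support by recursion on arr(f), global
    // support as a weighted average over the distinct schemas
    // (Definitions 4.4 and 4.5). Fields absent from a schema are taken
    // to have perc = 0.
    function locSupp(f, s) {
      if (f === s.root) return 1;
      return (s.perc[f] || 0) * locSupp(s.arr[f], s);
    }
    function gloSupp(f, schemas, totalDocs) {
      return schemas.reduce(
        (sum, s) => sum + locSupp(f, s) * s.docCount / totalDocs, 0);
    }
    // Example 4.6: two schemas with 50 documents each and perc(Facility.City)
    // of 40/50 and 20/50 yield gloSupp = 0.8 * 0.5 + 0.4 * 0.5 = 0.6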
5 FD ENRICHMENT

The goal of this stage is to propose a multidimensional view of the global schema to enable OLAP analyses. The main informative gap to be filled to this end is the identification of hierarchies, which in turn relies on the identification of FDs between fields in the global schema.

While in relational databases FDs are represented at the schema level by means of primary and referential integrity constraints, the same is not true in DODs. Yet, identifiers are present in DODs: each collection has its (explicit) _id field and, as discussed in Section 3, every nested object has its own (implicit) identifier (i.e., 𝑖𝑑(𝑎) with 𝑎 ∈ 𝐹^arr). The presence of these identifiers implies the existence of some FDs, that we call intensional as they can be derived from the global schema, without looking at the data. In particular, given global schema 𝑔(𝐷) = (𝐹^arr ∪ 𝐹^prim, 𝐴) and array 𝑎 ∈ 𝐹^arr, we can infer that:

∙ 𝑖𝑑(𝑎) → 𝑓 for every 𝑓 ∈ 𝐹^prim such that 𝑎𝑟𝑟(𝑓) = 𝑎, i.e., the identifier of 𝑎 determines the value of every primitive in 𝑎 (e.g., _id → SessionType);
∙ if 𝑎 ≠ 𝑟, then 𝑖𝑑(𝑎) → 𝑖𝑑(𝑎𝑟𝑟(𝑎)), i.e., the identifier of 𝑎 determines the identifier of 𝑎𝑟𝑟(𝑎) (e.g., Exercises._id → _id); this is trivial, since 𝑖𝑑(𝑎𝑟𝑟(𝑎)) is part of 𝑖𝑑(𝑎).
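Enumerating these intensional FDs is purely mechanical; a small sketch (ours; g.arrays, g.prims, g.arr, g.id, and g.root are assumed accessors on the global schema):

    // Illustrative sketch: derive the intensional FDs of a global schema g.
    function intensionalFDs(g) {
      const fds = [];
      for (const a of g.arrays) {
        for (const f of g.prims)
          if (g.arr[f] === a) fds.push([g.id[a], f]);   // id(a) -> f
        if (a !== g.root)
          fds.push([g.id[a], g.id[g.arr[a]]]);          // id(a) -> id(arr(a))
      }
      return fds; // e.g. ["_id", "SessionType"], ["Exercises._id", "_id"], ...
    }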
In practice, additional FDs can exist between primitive nodes, though they cannot be inferred from the schema; so, they can only be found by checking the data. More precisely, since DODs may contain incomplete and faulty data, we have to look for approximate FDs (AFDs), i.e., FDs that “mostly” hold on data — like done for instance in [6, 10, 23].

Definition 5.1 (Approximate Functional Dependency). Given two fields 𝑓 and 𝑓′, let 𝑎𝑐𝑐(𝑓, 𝑓′) ∈ [0..1] denote the ratio between the number of unique values of 𝑓 and the number of unique values of (𝑓, 𝑓′). We will say that AFD 𝑓 ⇝ 𝑓′ holds if 𝑎𝑐𝑐(𝑓, 𝑓′) ≥ 𝜖, where 𝜖 is a user-defined threshold [14].

To detect AFDs and create hierarchies accordingly, some approaches recently devised in the literature (e.g., [10] and [6]) can be reused, possibly coupled with traditional approaches to multidimensional modeling based on FDs (e.g., [23]). Interestingly, in [6] the number of checks to be made for AFD detection is effectively reduced thanks to the intensional FDs provided by the global schema. Note that, differently from [6], in our approach we consider inter-document variety, so the queries that check for AFDs must be reformulated from the global schema on each local schema. How this can be done is discussed in the Appendix.
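On a single local schema, the acc ratio of Definition 5.1 boils down to two count-distinct aggregations. A mongosh-style sketch follows (ours; the prototype of Section 8 also uses count-distinct queries, but its exact code is not given in the paper — fields nested in arrays would additionally require $unwind stages):

    // Illustrative sketch: estimate acc(f, f') on one collection/schema.
    function accAFD(coll, f, fPrime) {
      const distinctF = coll.aggregate([
        { $group: { "_id": "$" + f } },
        { $count: "n" }
      ]).toArray()[0].n;
      const distinctPair = coll.aggregate([
        { $group: { "_id": { l: "$" + f, r: "$" + fPrime } } },
        { $count: "n" }
      ]).toArray()[0].n;
      return distinctF / distinctPair;
    }
    // AFD f ~> f' is accepted if accAFD(db.WS, f, fPrime) >= epsilon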
Definition 5.2 (Dependency Graph). Given the global schema 𝑔(𝐷) = (𝐹, 𝐴) and an (acyclic) set of (A)FDs Γ, the dependency graph is a couple ℳ = (𝐹^prim, ⪰) where 𝐹^prim is the set of primitive nodes in 𝐹 and ⪰ is a roll-up partial order of 𝐹^prim derived from Γ. In particular, 𝑓𝑗 ⪰ 𝑓𝑘 (i.e., 𝑓𝑗 is a predecessor of 𝑓𝑘 in ⪰) if either 𝑓𝑗 ⇝ 𝑓𝑘 ∈ Γ or 𝑓𝑗 → 𝑓𝑘 ∈ Γ.

The differences between a dependency graph and the global schema it is derived from are that

(1) the global schema is a tree, while the dependency graph is a DAG;
(2) arrays are not present in the dependency graph, but their id’s are;
(3) arcs express (A)FDs in the dependency graph, and syntactical containment in the global schema;
(4) differently from the global schema, the dependency graph can include arcs between primitive fields.

[Figure: the dependency graph of the working example; nodes are primitive fields shaded by global support (a lighter tone for supp < 1), identifiers are in bold, FDs (acc = 1) are black arrows, AFDs (acc < 1) are gray arrows.]
Figure 4: Dependency graph for the global schema in Figure 3.b

Example 5.3. Figure 4 shows the dependency graph for our working example. Each primitive field is represented as a circle whose color is representative of the field’s global support (the lighter the tone, the lower the support). Identifiers (e.g., _id) are shown in bold. Directed arrows are representative of the (A)FDs in Γ; for instance, it is _id → Facility.Name (FDs are shown in black) and Facility.Name ⇝ Facility.Chain (AFDs are shown in gray). Note that, in this case, the dependency graph is a tree, because in the global schema of Figure 3.b arrays are nested within each other. A different situation is the one shown in Figure 5, where the collection includes documents with two arrays at the same level, so the dependency graph is not a tree. □

[Figure: on the left, an excerpt of a dependency graph in which Exercises.Sets._id and Classes._id both roll up to _id; on the right, a matching document with two sibling arrays, Exercises and Classes.]
Figure 5: Excerpt of the dependency graph (left) in presence of alternative documents (right)

6 QUERYING

In this section we describe the final querying stage. We start by providing the definition of an OLAP query and discussing its correctness from a multidimensional standpoint (Section 6.1). Then, we discuss the execution of a query, which mainly involves its translation into the MongoDB language and its reformulation from the global schema to the local schemas (Section 6.2). Finally, we introduce a set of indicators to evaluate a query in the context of an OLAP session (Section 6.3).

6.1 Query Formulation

First of all, we define an OLAP query as follows.

Definition 6.1 (OLAP query). Given dependency graph ℳ = (𝐹^prim, ⪰), an OLAP query on ℳ is a quadruple 𝑞 = ⟨𝐺, 𝑝, 𝑚, 𝜙⟩ where:

∙ 𝐺 is the query group-by set, i.e., a non-empty set of fields in 𝐹^prim such that for all couples 𝑓𝑗, 𝑓𝑘 in 𝐺 it is 𝑓𝑗 ̸⪰ 𝑓𝑘;
∙ 𝑝 is an (optional) selection predicate; it is a conjunction of Boolean predicates, each involving a field in 𝐹^prim;
∙ 𝑚 ∈ 𝐹^prim is the query measure, i.e., the numerical field to be aggregated;
∙ 𝜙 is the operator to be used for aggregation (e.g., avg, sum);
∙ there exists in ℳ one single field 𝑓 such that 𝑓 ⪰ 𝑓′ for all other fields 𝑓′ mentioned in 𝑞 (either in 𝐺, 𝑝, or 𝑚).

We will refer to all the fields in 𝐺 and 𝑝 as the query levels. Field 𝑓 is called the fact of 𝑞 (denoted 𝑓𝑎𝑐𝑡(𝑞)) and corresponds to the coarsest granularity of ℳ on which 𝑞 can be formulated. An example of a case in which a fact cannot be determined is the one in Figure 5, with 𝐺 = {Classes.Name, Exercises.Type}.

Example 6.2. The following query, 𝑞1, measures the average amount of weight lifted by elderly athletes per city and type of exercise:

    𝑞1 = ⟨ {Facility.City, Exercises.Type}, User.Age ≥ 60, Exercises.Sets.Weight, avg ⟩

It is 𝑓𝑎𝑐𝑡(𝑞1) = Exercises.Sets._id. □

In [19] the authors outline the constraints that must hold for an OLAP query to be considered well-formed, namely, the base integrity constraint (stating that the levels in the group-by set must be functionally independent of each other) and the summarization integrity constraint [16], which in turn requires disjointness (the measure instances to be aggregated are partitioned by the group-by instances), completeness (the union of these partitions constitutes the entire set), and compatibility (the aggregation operator chosen for each measure is compatible with the type of that measure). Remarkably, Definition 6.1 already ensures that queries meet the base integrity constraint (because the query group-by set cannot include fields related by (A)FDs). As to the summarization integrity constraint, since the goal of our approach is to enable an immediate querying of data with no cleaning beforehand, we adopt a “soft” approach to avoid being too restrictive. So, after each query has been formulated by the user, it undergoes a check (sketched in Algorithm 1) that can possibly return some warnings to inform the user of potentially incorrect results. Specifically, the disjointness constraint ensures that the granularity of the measure is not coarser than the one of the group-by set levels (line 3); if this is false, the same instance of 𝑚 will be double counted for multiple instances of the group-by set [19]. The completeness constraint ensures that the levels in the group-by set have full global support (line 5); this constraint is easily contradicted, as it clashes with the schemaless property of DODs. Finally, the compatibility constraint is not considered at all, since its verification would require to properly categorize measures (i.e., flow, stock, and value-per-unit) and levels (i.e., temporal and non-temporal), but this information can hardly be inferred from the schema or even provided by the user [6].

Algorithm 1 Validity check of an OLAP query
Input: a dependency graph ℳ = (𝐹^prim, ⪰), an OLAP query 𝑞 = ⟨𝐺, 𝑝, 𝑚, 𝜙⟩
Output: a validity status
 1: 𝑤𝑎𝑟𝑛 ← false
 2: for each 𝑓 ∈ 𝐺 do
 3:     if 𝑖𝑑(𝑎𝑟𝑟(𝑚)) ̸⪰ 𝑖𝑑(𝑎𝑟𝑟(𝑓)) then
 4:         𝑤𝑎𝑟𝑛 ← true                    ◁ Disjointness failed
 5:     if 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓) < 1 then
 6:         𝑤𝑎𝑟𝑛 ← true                    ◁ Completeness failed
 7: if 𝑤𝑎𝑟𝑛 then
 8:     return “warning”
 9: else
10:     return “valid”

Example 6.3. Query 𝑞1 passes the validity check of Algorithm 1 with a completeness warning, because 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(Facility.City) < 1. On the other hand, 𝑞1 meets the disjointness constraint because

    𝑖𝑑(𝑎𝑟𝑟(Facility.City)) = _id
    𝑖𝑑(𝑎𝑟𝑟(Exercises.Type)) = Exercises._id
    𝑖𝑑(𝑎𝑟𝑟(Exercises.Sets.Weight)) = Exercises.Sets._id
    Exercises.Sets._id ⪰ _id
    Exercises.Sets._id ⪰ Exercises._id

□
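Algorithm 1 is immediate to implement once the roll-up order ⪰ is materialized; a sketch (ours; rollsUpTo, idOfArr, and gloSupp are assumed helpers on the dependency graph, and the query encoding is illustrative):

    // Illustrative sketch: the validity check of Algorithm 1.
    function checkQuery(q, graph) {
      let warn = false;
      for (const f of q.groupBy) {
        if (!graph.rollsUpTo(graph.idOfArr(q.measure), graph.idOfArr(f)))
          warn = true;                               // disjointness failed
        if (graph.gloSupp(f) < 1)
          warn = true;                               // completeness failed
      }
      return warn ? "warning" : "valid";
    }
    const q1 = { groupBy: ["Facility.City", "Exercises.Type"],
                 predicate: { "User.Age": { $gte: 60 } },
                 measure: "Exercises.Sets.Weight", aggOp: "avg" };
    // checkQuery(q1, graph) returns "warning", since gloSupp(Facility.City) < 1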
As previously mentioned, a query fails the completeness constraint if one or more levels in the group-by set do not have full support. This issue is strictly related to the one of incomplete hierarchies in data warehouse design. The related work proposes three alternative strategies to replace missing values in a hierarchy level 𝑙𝑗: balancing by exclusion (i.e., replacing all missing values with a single value “Other”), downward balancing (replacing with values from the closest level 𝑙𝑘 such that 𝑙𝑘 ⪰ 𝑙𝑗), and upward balancing (replacing with values from the closest level 𝑙𝑘 such that 𝑙𝑗 ⪰ 𝑙𝑘) [12]. Whereas they are originally meant to be applied when populating a data warehouse from an operational source, these strategies can be directly applied at query time, e.g., by using the $ifNull operator in MongoDB, which allows to replace a missing value in a field with a custom value or with the value of another field. Thus, when a query fails the completeness constraint, we ask the user to indicate the desired strategy to replace missing values in the levels without full support.

6.2 Query Execution

Once a query has been formulated by the user on the dependency graph corresponding to the global schema, it has to be reformulated on each local schema to effectively cope with inter-document variety. How this can be done is discussed in the Appendix. In the remainder of this subsection we explain how, after reformulation, each single query obtained can be translated to MongoDB.

OLAP queries are translated to MongoDB according to its aggregation framework, which allows to declare a multi-stage pipeline of transformations to be carried out on the documents of a collection. The most important stages are: $match (to apply selection predicates), $project (to apply transformations to the single fields), $unwind (to unfold an array by creating a different document for every object inside the array), and $group (to group the documents and calculate aggregated values).

Given query 𝑞 = ⟨𝐺, 𝑝, 𝑚, 𝜙⟩ on ℳ and global schema 𝑔(𝐷) = (𝐹, 𝐴), the translation of 𝑞 into the MongoDB language is done as follows:

(1) For every array 𝑎 in 𝑔(𝐷), 𝑎 ≠ 𝑟, for which there is a field 𝑓 mentioned in 𝑞 such that 𝑓𝑎𝑐𝑡(𝑞) ⪰ 𝑖𝑑(𝑎) ⪰ 𝑓, an $unwind stage is defined; the order of these stages reflects the order of the arrays in 𝑔(𝐷), beginning from the one closest to 𝑟.
(2) If 𝑝 ≠ ∅, a $match stage is defined listing every selection predicate.
(3) A $project stage is defined to keep only the fields that are required for the following stages, i.e., 𝑚 and every group-by level. If there is one (or more) incomplete level 𝑓 ∈ 𝐺 (i.e., such that 𝑔𝑙𝑜𝑆𝑢𝑝𝑝(𝑓) < 1), the replacement of the missing values of 𝑓 is done at this stage, in accordance with the balancing strategy chosen by the user. Additionally, a new field named balanced is added and valued true if any of the projected fields has been affected by the balancing strategy, false otherwise.
(4) A $group stage is defined including the fields that identify a group (i.e., every level 𝑓 ∈ 𝐺 plus the balanced field), the measure 𝑚 to be aggregated, and its aggregation operator 𝜙. Additionally, two new measures named count and count-m are added to count, respectively, the number of aggregated objects and the number of aggregated objects that actually contain a value for 𝑚.

The query-independent fields balanced, count, and count-m are needed to calculate the indicators of the query, which will be discussed in Section 6.3.
Example 6.4. The MongoDB query obtained from 𝑞1 considering a downward balancing strategy is the following; missing values of Facility.City are replaced with the corresponding Facility.Name.

db.WS.aggregate([
  { $unwind: "$Exercises" },
  { $unwind: "$Exercises.Sets" },
  { $match: { "User.Age": { $gte: 60 } } },
  { $project: {
      "Facility.City": {
        $ifNull: ["$Facility.City", "$Facility.Name"]
      },
      "Exercises.Type": 1,
      "Exercises.Sets.Weight": 1,
      "balanced": {
        $cond: ["$Facility.City", false, true]
      }
  } },
  { $group: {
      "_id": {
        "FacilityCity": "$Facility.City",
        "ExercisesType": "$Exercises.Type",
        "balanced": "$balanced"
      },
      "Exercises.Sets.Weight": {
        $avg: "$Exercises.Sets.Weight"
      },
      "count": { $sum: 1 },
      "count-m": { $sum: {
        $cond: ["$Exercises.Sets.Weight", 1, 0]
      } }
  } }
])

□
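For concreteness, each group returned by the pipeline above is a document of the following shape (the values shown here are illustrative, not taken from the paper):

    { "_id" : { "FacilityCity" : "London",
                "ExercisesType" : "Leg press",
                "balanced" : false },
      "Exercises.Sets.Weight" : 61.4,    // average weight for the group
      "count" : 250,                     // number of aggregated objects, |e|
      "count-m" : 242 }                  // objects with a value for the measure, |e|_m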
6.3 Query Evaluation and Evolution

In our schemaless scenario, the evaluation of the query results cannot be separated from the evaluation of the query itself. In particular, it is important to understand the coverage of the query with respect to the collection (which may be influenced by the support of the fields, the quality of the mappings, and the selectivity of the selection predicate), as well as the reliability of the results. For these reasons, we introduce some indicators to evaluate the quality of an OLAP query after it has been executed. Let 𝐸 be the set of distinct groups returned by query 𝑞; each group 𝑒 ∈ 𝐸 includes |𝑒| objects (measured by the count field as of Section 6.2), of which |𝑒|_𝑚 (measured by the count-m field) have a value for 𝑚.

Selectivity. This indicator measures the selectivity of the selection predicates in 𝑞:

    𝑠𝑒𝑙(𝑞) = Σ_{𝑒∈𝐸} |𝑒| / |𝑓𝑎𝑐𝑡(𝑞)|

Completeness. This indicator is built on the concept of completeness previously introduced. The idea is to show the percentage of the queried objects that have not been affected by the balancing strategies (which step in when the value of a level is null or does not exist):

    𝑐𝑜𝑚𝑝𝑙(𝑞) = Σ_{𝑒∈𝐸, ¬𝑏𝑎𝑙𝑎𝑛𝑐𝑒𝑑(𝑒)} |𝑒| / Σ_{𝑒∈𝐸} |𝑒|

where 𝑏𝑎𝑙𝑎𝑛𝑐𝑒𝑑(𝑒) is true if 𝑒 has been balanced, false otherwise (as stated by the balanced field introduced in Section 6.2).

Group precision. While the absence of full support on levels can be overcome by the balancing strategies, nothing can be done when it involves the query measure. In this case, the precision of the aggregated value returned for each group is determined by the percentage of aggregated objects that actually contain a value for the measure. Thus, the precision of a group 𝑒 is

    𝑝𝑟𝑒𝑐(𝑒) = |𝑒|_𝑚 / |𝑒|
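Given the set of groups returned by a query (shaped as in Section 6.2), the three indicators reduce to a few sums; a sketch (ours), where factSize stands for |𝑓𝑎𝑐𝑡(𝑞)|:

    // Illustrative sketch: compute sel, compl, and per-group prec from the
    // count, count-m, and balanced fields added at translation time.
    function indicators(groups, factSize) {
      const total = groups.reduce((s, e) => s + e["count"], 0);
      const unbalanced = groups.filter(e => !e["_id"].balanced)
                               .reduce((s, e) => s + e["count"], 0);
      return {
        sel: total / factSize,
        compl: unbalanced / total,
        prec: groups.map(e => e["count-m"] / e["count"])  // one value per group
      };
    }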
Consistently with an OLAP scenario, a query can evolve into another with the application of an OLAP operation; the resulting sequence of queries is called an OLAP session. In particular, the permitted operations are the following ones.

∙ The replacement of the query measure with a different one, or the selection of a different aggregation operator. If a new measure is chosen, a new validity check is required to verify whether the disjointness requirement still holds.
∙ The addition/removal/modification of a selection predicate. This operation has no impact on the validity of the query.
∙ The roll-up (or drill-down) of one of the group-by levels, which leads to replacing a level 𝑓 with a level 𝑓′ such that 𝑓 ⪰ 𝑓′ (or 𝑓′ ⪰ 𝑓).

Roll-ups and drill-downs imply a navigation of the dependency graph on the relationships between 𝑓 and 𝑓′, which represent (A)FDs. From a multidimensional standpoint, the navigation of an AFD with accuracy lower than 1 leads to a violation of the roll-up semantics, i.e., the results of the second query will not be a correct composition (or decomposition) of the results of the first query. This happens because the FD is not strictly true in some cases, which compromises the correctness of the aggregation. Thus, we evaluate the impact of these operations by means of another indicator:

Accuracy. This indicator quantifies the accuracy of the aggregated results of a query during an OLAP session with respect to the results obtained from the previous query. Given query 𝑞, let 𝑞′ be the query resulting from a roll-up (or drill-down) of 𝑞 from level 𝑓 to 𝑓′, and let Γ′ ⊆ Γ be the set of AFDs in the path between 𝑓 and 𝑓′. Then, the accuracy of 𝑞′ with respect to 𝑞 is

    𝑎𝑐𝑐(𝑞′, 𝑞) = Π_{𝛾∈Γ′} 𝑎𝑐𝑐(𝛾)

(so that navigating only exact FDs yields accuracy 1, and each navigated AFD lowers the accuracy by its own 𝑎𝑐𝑐 ratio).

7 RELATED LITERATURE

The rise of NoSQL stores has captured a lot of interest from the research community, which has proposed a variety of approaches to deal with the schemaless feature. In particular, most of the recent works focus on the widely adopted JSON format and on key/value repositories in general.

A first distinction lies in how each work approaches the problem of schema discovery. Some works aim at providing a comprehensive view of the schema variety in JSON documents; e.g., [20] proposes a reverse engineering process to derive a versioned schema model, where multiple versions of the same field are created for every intensional variation detected in the collection. Other works provide a more concise representation that tends to hide schema variety. For instance, [24] couples a clustering technique with schema matching techniques to identify a skeleton containing the smallest set of core fields, while [1] adopts regular expressions to model the variability of a field type. Our work is closer to the latter group, although our global schema captures the entire variety of fields and enables the user to choose the fields to focus on, while assisting her with quality indicators of the final queries. Several free tools have also been released to perform schema detection on different platforms (MongoDB, ElasticSearch, Couchbase, Apache Drill), although they are mostly limited to collecting the union of the fields. In a previous work [9] we followed a different approach and devised a schema profiling algorithm that explains the schema variety in a collection in terms of the extensional values found in the documents (e.g., it could find that different schemas depend on the different values for SessionType).

The most distinguishing feature of our approach is the definition of a multidimensional representation of the schema in order to enable OLAP analyses directly on the DOD. From this point of view, a work closely related to ours is [6], which proposes a schema-on-read approach for OLAP queries over DODs. This is done by building a multidimensional schema from the union of fields found in the collection; then, the OLAP experience is proposed at query time, where suggestions for roll-up and drill-down operations are provided given the query formulated by the user. Differently from our approach, [6] exclusively focuses on the multidimensional representation of JSON data and overlooks the schemaless property of DODs: in particular, inter-document variety is considered only in terms of fields with varying support (thus no schema integration is performed), and no support is given to the user to evaluate the coverage and accuracy of queries. Also, AFD detection is carried out on demand only after the user has written a query, thus it only impacts the OLAP experience. Another similar work is [7], which proposes a MapReduce-based algorithm to compute OLAP cubes on columnar stores. The approach is meant to work on a data warehouse (i.e., a database already comprising facts and dimensions); besides, it is limited to the computation of the cubes, while the OLAP querying aspect is mentioned as future work. Also [4] aims at delivering the OLAP experience, but its operational data source is a graph-based database, whose data model is entirely different from the one of DODs. Finally, [13] builds on [24] to propose a complete architecture that ingests NoSQL data and provides schema-on-read functionalities, but without mentioning multidimensional enrichment and OLAP analyses.

Since schema variety in a collection often consists of different representations of the same data (e.g., due to schema evolution or to the ingestion of data from different sources), the problem of schema discovery is often coupled with schema matching algorithms. [18] provides a comprehensive summary of the different techniques envisioned for generic schema matching (ranging from the relational world to ontologies and XML documents); it is mentioned as a baseline reference in [24], while [8] starts from there to define its own algorithm for schema matching on NoSQL stores based on subtree matching. In [21] a tool is presented to automatically identify evolution in the schema of instances in NoSQL databases: once a schema change is detected, the tool either updates the database instances to enforce schema consistency or provides code to deal with this issue on the application side. This structured approach differs from our schema-on-read scenario, which transparently handles schema differences and avoids updating the original data.

Several works have focused on bringing NoSQL back to the relational world. [17] discusses an approach to provide schema-on-read capabilities for flexible schema data stored on RDBMSs; this is done by mapping the document structure on different tables and by providing a data guide as the union of every possible field at any level. Differently from our approach, no advanced schema matching mechanism is provided. [5] proposes an algorithm to provide a generic relational encoding of arbitrary JSON documents; in particular, documents are stored in ternary relations that contain rows for every key in every document (i.e., each row stores the document id, the key name, and the key value). A more sophisticated algorithm is proposed in [8], where normalized relational schemas are automatically generated from NoSQL stores. It relies on AFD detection to build relationships between entities, and it provides its own schema matching algorithm. Based on this approach, a vision for a new paradigm called adaptive schema databases has been proposed in [22]; it is a conceptual framework that devises global schemas as time-evolving and user-dependent relational views that are mapped to local schemas via probabilistic mappings — whereas mappings are deterministic in our approach.

8 CONCLUSIONS

In this paper we have presented an original approach to OLAP on DODs. Our current implementation relies on a prototype that separately handles the different stages. In
particular, we use a customized version of the free tool variety.js for schema extraction on MongoDB collections; we rely on the BIN framework [11] to handle schema mappings and query reformulation (see Appendix); AFD detection is carried out by a simple JavaScript algorithm, which determines the presence of AFDs between couples of fields by means of count-distinct queries, adopting a smart exploration strategy that reduces the search space like in [6]; finally, OLAP queries are manually formulated. Our reference real-world collection is stored on a single machine (i7 CPU, 32 GB RAM) and contains 5M workout sessions with 6 different local schemas (mostly due to missing attributes), 35M exercises, and 85M sets. Table 1 shows the execution times for the schema extraction phase. Times are consistent with those of related approaches that perform schema extraction on JSON datasets, such as [1]; also, we note that time increases linearly with the size of the database. Given the low number of schemas, mappings have been manually defined. A sample OLAP query 𝑞 that groups documents by Facility.Chain (global support 0.38) to obtain the average amount of Exercises.ExCalories (global support 0.69) returns the following indicator values: 𝑠𝑒𝑙(𝑞) = 1 (as there is no filtering), 𝑐𝑜𝑚𝑝𝑙(𝑞) = 0.33 (lower than the group-by set support), and an average 𝑝𝑟𝑒𝑐(𝑒) of 0.99. The time for executing 𝑞 on MongoDB is about 3 minutes. Drilling down to Facility.Name (global support 1) increases 𝑐𝑜𝑚𝑝𝑙(𝑞) to 1, while the accuracy of the new query with respect to 𝑞 is 0.98.

Table 1: Execution times for schema extraction

    # records    DB size    Time
    5 K          2 MB       4 sec
    50 K         20 MB      33 sec
    500 K        197 MB     6 min
    5 M          1.7 GB     60 min

As future work, we plan to build a fully-functioning implementation, as well as to thoroughly evaluate the performance and scalability of the approach. Also, we plan to switch from a single machine to a multi-node cluster and to consider schema profiling techniques [9] to enhance the support given to the user at query time.
APPENDIX

Not only do inter-schema mappings enable the definition of a global schema, they also allow a MongoDB query formulated on the global schema to be reformulated on each local schema, which is necessary in two situations: (i) when the collection is queried to detect AFDs (Section 5), and (ii) when the user issues an OLAP query on the collection (Section 6). The query reformulation algorithm we adopt here is the one proposed by [11] in the context of business intelligence networks (BINs); it enables the rewriting of a query from a source multidimensional schema to a target multidimensional schema and has been proved to be complete and to provide all certain answers to the query. In this section we discuss why that algorithm can be reused to safely rewrite queries in both situations (i) and (ii). To this end we need to prove that the data schemas, the inter-schema mappings, and the queries which we consider in our work are a particular case of those used as a reference in the BIN context.

Data schema. The reference schema in the BIN context is a classical multidimensional schema featuring a fact, a set of hierarchies (each made of levels), and a set of measures (each coupled with an aggregation operator). The dependency graph of Definition 5.2 can be thought of as a sort of “multi-fact” multidimensional schema with no explicit distinction between levels and measures. However, when an OLAP query is formulated as in Definition 6.1, exactly one fact is implicitly determined, group-by levels are explicitly distinguished from measures, and an aggregation operator is coupled to each measure. So, from the data schema point of view, there is no difference between the context of BINs and the one of this paper.

Mappings. The primitive mappings of Definition 4.1 can be expressed, according to the BIN terminology, using either same or equi-level predicates. same predicates are used for measures, and can be annotated with an expression; since in Definition 6.1 measures are required to be numerical, the associated transcodings must be translatable into an expression. equi-level predicates are used for levels, and can be directly annotated with a transcoding. Remarkably, in [11] these two types of mappings are called exact since they enable non-approximate query reformulations. Note that array mappings are not used for query reformulation but only for determining the global schema, so they are not considered here.

Queries. An OLAP query (Definition 6.1) has a group-by set, a (conjunctive) selection predicate, and a measure. A BIN query has a group-by set, a (conjunctive) selection predicate, and an expression involving one or more measures. By simply picking a single measure and the identity expression, situation (ii) is addressed. As to situation (i), i.e., querying aimed at checking AFDs, we remark that the query for checking AFD 𝑙 ⇝ 𝑙′ can be expressed as a BIN query with group-by set {𝑙, 𝑙′} and a dummy measure, on whose result a simple COUNT DISTINCT is then executed.

Based on the considerations above, we can state that an OLAP query on the global schema can be correctly reformulated into a set of local queries, one on each local schema. Then, each local query is separately executed on the DOD; specifically, each query must target only the documents that belong to a specific local schema 𝑠. This is done in two steps. First, the information about which document has which schema (obtained in the schema extraction stage) is stored in a separate collection (called WorkoutSession-schemas in our example) in the following form: a document is created for every schema 𝑠 ∈ 𝑆(𝐷), containing an array ids with the _id of every document 𝑑 ∈ 𝐷𝑠. Then, the query on schema 𝑠 is executed by joining it with the list of identifiers in WorkoutSession-schemas. Finally, a post-processing activity is required to integrate the results coming from the different local queries.
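One way to realize this two-step execution in MongoDB (a sketch under our assumptions; the paper does not give the exact code) is to read the ids array stored for schema 𝑠 and prepend a $match stage with $in to the reformulated pipeline:

    // Illustrative sketch: run a reformulated pipeline only on the documents
    // having local schema s, via the companion WorkoutSession-schemas collection.
    function runOnSchema(schemaId, pipeline) {
      const ids = db.getCollection("WorkoutSession-schemas")
                    .findOne({ "_id": schemaId }).ids;
      return db.WS.aggregate(
        [ { $match: { "_id": { $in: ids } } } ].concat(pipeline)
      ).toArray();
    }
    // The per-schema results are then integrated in a post-processing step.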
REFERENCES

[1] Mohamed Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2017. Schema Inference for Massive JSON Datasets. In Proc. EDBT. Venice, Italy, 222–233.
[2] Carlo Batini, Maurizio Lenzerini, and Shamkant Navathe. 1986. A Comparative Analysis of Methodologies for Database Schema Integration. Comput. Surveys 18, 4 (1986), 323–364.
[3] Philip A. Bernstein, Jayant Madhavan, and Erhard Rahm. 2011.
Generic schema matching, ten years later. Proc. VLDB En-
dowment 4, 11 (2011), 695–701.
[4] Arnaud Castelltort and Anne Laurent. 2014. NoSQL Graph-
based OLAP Analysis. In Proc. KDIR. Rome, Italy, 217–224.
[5] Craig Chasseur, Yinan Li, and Jignesh M. Patel. 2013. Enabling
JSON Document Stores in Relational Systems. In Proc. WebDB.
New York, USA, 1–6.
[6] Mohamed Lamine Chouder, Stefano Rizzi, and Rachid Chalal.
In press. EXODuS: Exploratory OLAP over Document Stores.
Inf. Syst. (In press).
[7] Khaled Dehdouh. 2016. Building OLAP Cubes from Columnar
NoSQL Data Warehouses. In Proc. MEDI. Almería, Spain.
[8] Michael DiScala and Daniel J. Abadi. 2016. Automatic Genera-
tion of Normalized Relational Schemas from Nested Key-Value
Data. In Proc. SIGMOD. San Francisco, USA, 295–310.
[9] Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi. 2017.
Schema Profiling of Document Stores. In Proc. SEBD. Squillace
Lido, Italy, 1–8.
[10] Matteo Golfarelli, Simone Graziani, and Stefano Rizzi. 2016.
Starry Vault: Automating Multidimensional Modeling from
Data Vaults. In Proc. ADBIS. 137–151.
[11] Matteo Golfarelli, Federica Mandreoli, Wilma Penzo, Stefano
Rizzi, and Elisa Turricchia. 2012. OLAP query reformulation in
peer-to-peer data warehousing. Inf. Syst. 37, 5 (2012), 393–411.
[12] Matteo Golfarelli and Stefano Rizzi. 2009. Data warehouse
design: Modern principles and methodologies. McGraw-Hill,
Inc.
[13] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Con-
stance: An Intelligent Data Lake System. In Proc. SIGMOD.
San Francisco, USA, 2097–2100.
[14] Ihab F. Ilyas, Volker Markl, Peter Haas, Paul Brown, and Ashraf
Aboulnaga. 2004. CORDS: Automatic Discovery of Correlations
and Soft Functional Dependencies. In Proc. SIGMOD. 647–
658.
[15] Javier Luis Cánovas Izquierdo and Jordi Cabot. 2013. Discov-
ering implicit schemas in JSON data. In Proc. ICWE. 68–83.
[16] Hans-Joachim Lenz and Arie Shoshani. 1997. Summarizability
in OLAP and Statistical Data Bases. In Proc. Ninth Inter-
national Conference on Scientific and Statistical Database
Management.
[17] Zhen Hua Liu and Dieter Gawlick. 2015. Management of Flexi-
ble Schema Data in RDBMSs - Opportunities and Limitations
for NoSQL. In Proc. CIDR. Asilomar, USA.
[18] Erhard Rahm and Philip A. Bernstein. 2001. A survey of
approaches to automatic schema matching. VLDB J. 10, 4
(2001).
[19] Oscar Romero and Alberto Abelló. 2006. Multidimensional
Design by Examples. In Proc. DaWaK. Krakow, Poland, 85–
94.
[20] Diego Sevilla Ruiz, Severino Feliciano Morales, and Jesús García
Molina. 2015. Inferring Versioned Schemas from NoSQL
Databases and Its Applications. In Proc. ER. 467–480.
[21] Stefanie Scherzinger, Eduardo Cunha de Almeida, Thomas
Cerqueus, Leandro Batista de Almeida, and Pedro Holanda.
2016. Finding and Fixing Type Mismatches in the Evolution of
Object-NoSQL Mappings. In Proc. Workshops EDBT/ICDT.
[22] William Spoth, Bahareh Sadat Arab, Eric S. Chan, Dieter
Gawlick, Adel Ghoneimy, Boris Glavic, Beda Christoph Ham-
merschmidt, Oliver Kennedy, Seokki Lee, Zhen Hua Liu, Xing
Niu, and Ying Yang. 2017. Adaptive Schema Databases. In
Proc. CIDR. Chaminade, USA.
[23] Boris Vrdoljak, Marko Banek, and Stefano Rizzi. 2003. Design-
ing Web Warehouses from XML Schemas. In Proc. DaWaK.
89–98.
[24] Lanjun Wang, Shuo Zhang, Juwei Shi, Limei Jiao, Oktie Hassan-
zadeh, Jia Zou, and Chen Wangz. 2015. Schema management
for document stores. Proc. VLDB Endowment 8, 9 (2015),
922–933.