           OLAP Querying of Document Stores
            in the Presence of Schema Variety
                             (DISCUSSION PAPER)


     Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi

                           DISI, University of Bologna, Italy


        Abstract. Document stores are preferred to relational ones for storing
        heterogeneous data due to their schemaless nature. However, the absence
        of a unique schema adds complexity to analytical applications. In a
        previous paper we proposed an original approach to OLAP on document
        stores; its basic idea was to stop fighting against schema variety and
        welcome it as an inherent source of information wealth in schemaless
        sources. In this paper we focus on the querying phase, showing how
        queries can be directly rewritten on a heterogeneous collection in an
        inclusive way, i.e., also including the concepts present in a subset of
        the documents only.

        Keywords: NoSQL · Multidimensional Modeling · OLAP


1     Introduction

Schemaless databases, in particular document stores (DSs) such as MongoDB,
are preferred to relational ones for storing heterogeneous data with variable
schemas and structural forms; typical schema variants within a collection
consist of missing or additional fields, different names or types for a field,
and different structures for instances. The absence of a unique schema grants
flexibility to operational applications but adds complexity to analytical
applications, in which a single analysis often involves large sets of data with
different schemas. Dealing with this complexity while adopting a classical data
warehouse design approach would require a notable effort to understand the
rules that drove the use of alternative schemas, plus an integration activity
to identify a common schema to be adopted for analysis.
    In this paper we propose an approach to OLAP querying on DSs. The basic
idea is to welcome data heterogeneity and schema variety as an inherent source
of information wealth. So, instead of trying to hide this variety, we show it
to users (basically, data scientists and data enthusiasts). To the best of our
knowledge, this is the first approach to propose a form of approximate OLAP
analyses on document stores that embraces and exploits the inherent variety of
documents. OLAP querying is carried out directly on the data source, without
materializing any cube or data warehouse. Remarkably, we adopt an inclusive
solution to integration, i.e., the user can include a concept in a query even
if it is present in a subset of the documents only. We cover both inter-schema
and intra-schema variety; specifically, we cope with missing fields, different
levels of detail in instances, and different field naming.


2   Related Literature

The rise of NoSQL stores has captured a lot of interest from the research
community, which has proposed a variety of approaches to deal with the
schemaless feature. Accessing schemaless sources often requires the adoption of
data integration techniques to provide a unified view of data; as this is not
the primary focus of the paper, we refer the reader to a survey on the
subject [6].
    A distinguishing feature of our approach is the definition of a
multidimensional representation of the schema to enable OLAP analyses directly
on the DS. From this point of view, a work closely related to ours is [2],
which proposes a schema-on-read approach for multidimensional queries over DSs.
That approach differs from ours because it focuses on the multidimensional
representation of JSON data and overlooks the variety issues, and because the
detection of functional dependencies is activated on demand only after the user
has written a query. OLAP analyses on DSs are also enabled in [7], although a
simpler integration approach is proposed, with no identification of functional
dependencies nor multidimensional views.
    On the issue of schema variety in DSs, a recent work [8] builds on a simple
mapping strategy to hide the variety within a single, comprehensive query.
While this approach offers a simpler querying mechanism, it only covers a
limited set of schema variants and does not support OLAP.
    Finally, several works have focused on bringing NoSQL back to the
relational world. [11] discusses an approach to provide schema-on-read
capabilities for flexible-schema data stored on RDBMSs: it maps the document
structure onto different tables and provides a data guide as the union of every
possible field at any level. However, no advanced schema matching mechanism is
provided.


3   Approach Overview and Working Example

Figure 2 gives an overview of our approach. Although the picture suggests a
sequential execution of the stages, it simply outlines the ordering for the
first iteration. The user starts by analyzing the first results provided by the
system, then iteratively injects additional knowledge into the different stages
to refine the metadata and improve the querying effectiveness. We now provide a
short description of each stage based on an example; all the details can be
found in [3]. The example we use is based on a real-world collection of workout
sessions, WS, obtained from a worldwide company selling fitness equipment.
Figure 1 (top) shows a sample document in the collection, organized according
to three nesting levels: the first level contains information about the user;
the Exercises array contains an object for every exercise carried out during
the session; the Sets array contains an object for every set that the exercise
was split into.
        [   { "_id" : ObjectId("54a4332f44"),                     WS               WS                         WS
               "User" :                                                      _id
                                                            _id                                        _id
               { "FullName" : ”John Smith",
                                                            User.FullName    User.FullName
                  "Age" : 42 },
                                                                             User.FirstName            FirstName
               "StartedOn" : ISODate("2017-06-15"),
                                                                             User.LastName             LastName
               "Facility" :
                                                            User.Age         User.Age
               { "Name" : "PureGym Piccadilly",
                                                            StartedOn        StartedOn                 Date
                  "Chain" : "PureGym" },
                                                            Facility.Name    Facility.Name             Gym.Name
               "SessionType" : "RunningProgram",                             Facility.Chain
                                                            Facility.Chain
               "DurationMins": 90,
                                                                             Facility.City             Gym.City
               "Exercises" :
                                                                             Facility.Country          Gym.Country
               [ { "Type" : "Leg press",
                                                            SessionType      SessionType               SessionType
                       "ExCalories" : 28,
                                                            DurationMins     DurationMins              DurationSecs
                       "Sets" :
                       [ { "Reps" : 14,
                              "Weight" : 60 },                 Exercises        Exercises
                           ...                              Exercises_id     Exercises_id
                       ] },                                 Type             Type
                  { "Type" : "Tapis roulant" },             ExCalories       ExCalories
                  ...
               ]                                                 Sets             Sets                       Series
            },
            ...                                             Sets_id          Sets_id                   Series_id
        ]                                                                                              ExType
                                                            Reps             Reps                      Reps
                                                            Weight           Weight                    Weight
                                                                             SetCalories               SeriesCalories
                                                                      s1               g                      s2


Fig. 1. A JSON document in the WS collection (left), its schema (s1 ), another schema
of the same collection (s2 ), and the global schema (g)

[Figure: the Document store feeds Schema extraction, which yields the Local
schemas; Schema integration produces the Global schema and the Mappings; FD
enrichment produces the Dependency graph; Querying comprises Formulation,
Validation, Reformulation, Translation, Execution, and Merge.]

                           Fig. 2. Approach overview




 1. Schema extraction. The goal is to identify the set of distinct local
    schemas that occur inside a collection of documents. To this end we adopt
    a tree-like definition for schemas which models arrays by considering the
    union of the schemas of their elements, as done in [1]. This is a
    completely automatic stage; its implementation is loosely inspired by the
    free tool variety.js and consists of a routine that connects to the
    desired collection on MongoDB, extracts the local schemas, and writes the
    results to a triplestore (a minimal sketch of the extraction routine is
    shown after this list). With reference to our example, Figure 1 shows the
    schema s1 of the sample document: each array is listed with its child
    primitive fields, object fields are prefixed with the object key (e.g.,
    Facility.Chain), and the nesting of arrays goes from the root WS on top
    down to Sets.
 2. Schema integration. The goal here is to integrate the distinct local
    schemas extracted from a collection to obtain a single, comprehensive view
    of the latter, i.e., a global schema, and its mappings with each local
    schema. To this end we rely on both schema matching and schema mapping
    techniques: the former allow us to identify the single matches between the
    attributes, while the latter define the mappings between each local schema
    and the global one, thus enabling the rewriting of queries. A mapping
    between two (sets of) primitive fields P and P′ requires a transcoding
    function to transform values of P into values of P′; these functions
    enable query reformulation in the presence of selection predicates as well
    as the integration of the query results obtained from all documents. The
    approach we adopt for schema integration includes two steps. The first
    step is automatic and defines a preliminary global schema as the
    name-based union of all local schemas [10]. In the second step, the
    preliminary global schema is refined by merging matching (sets of) fields
    in the global schema. Existing tools (e.g., COMA 3.0 [12]) can be used to
    automatically find a list of possible matches between arrays and
    primitives; then the user will browse them and possibly define additional
    mappings. With reference to our example, Figure 1 shows the global schema
    g resulting from the integration of s1 with s2, one more schema from WS;
    mappings between arrays and primitives are represented by placing the
    mapped fields on the same row. The transcoding functions of mappings
    ⟨{Date}, {StartedOn}⟩ and ⟨{FirstName, LastName}, {User.FullName}⟩ are the
    identity function and a function that concatenates two strings,
    respectively.
 3. Schema enrichment. The goal is to propose a multidimensional view of the
    global schema to enable OLAP analyses. The main informative gap to be
    filled to this end is the identification of hierarchies, which in turn
    relies on the identification of functional dependencies (FDs) between
    fields in the global schema. By assuming the presence of identifiers at
    every nesting level, some exact FDs can be derived from the global schema
    without looking at the data. However, additional FDs can exist between
    primitive nodes; these cannot be inferred from the schema and can only be
    found by querying the data. More precisely, since DSs may contain
    incomplete and faulty data, we look for approximate FDs (AFDs) [9], i.e.,
    FDs that "mostly" hold on the data. To detect AFDs, we adapted the
    approach proposed in [4] (a simple data-driven check is sketched after
    this list). As a result, we determine a dependency graph, which provides a
    multidimensional view of the global schema in terms of the FDs between its
    primitive fields. Figure 3 shows the dependency graph for our example.
    Each primitive field f is represented as a circle whose color represents
    the support of f, i.e., the percentage of times that f occurs in the
    collection (the lighter the tone, the lower the support). Identifiers
    (e.g., _id) are shown in bold. Directed arrows represent the (A)FDs
    detected during schema enrichment; for instance, we have
    _id → Facility.Name (exact FD, in black) and
    Facility.Name ⇝ Facility.Chain (AFD, in grey). The latter FD is
    approximate because it only holds for a subset of documents.
 4. Querying. The last stage enables the formulation of multidimensional
    queries on the dependency graph and their execution on the collection.
    First of all, each formulated query is validated against the requirements
    of well-formedness proposed in the literature [13]. Then, the query is
    reformulated into multiple queries, one for each local schema in the
    collection, which are translated into the query language of the DS; the
    results presented to the user are obtained by merging the results of the
    single local queries.

[Figure: the dependency graph over the primitive fields of g; each node is a
level (with either full or partial support), identifiers are in bold, black
arrows are exact FDs (e.g., _id → Facility.Name), and grey arrows are AFDs
(e.g., Facility.Name ⇝ Facility.Chain).]

            Fig. 3. Dependency graph for the global schema in Figure 1
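
To give the flavor of the extraction stage, here is a minimal sketch of
tree-like schema extraction in JavaScript (the language of variety.js, which
inspired our routine); the schemaOf function and its output encoding are a
simplification of ours, not the prototype code, and issues such as type
conflicts or the triplestore output are not handled:

function schemaOf(v) {
  if (Array.isArray(v)) {
    // arrays are modeled by the union of their elements' schemas, as in [1]
    const variants = new Set(v.map(x => JSON.stringify(schemaOf(x))));
    return { array: [...variants].map(s => JSON.parse(s)) };
  }
  if (v !== null && typeof v === "object") {
    const fields = {};
    for (const k of Object.keys(v)) fields[k] = schemaOf(v[k]);
    return fields;
  }
  return typeof v;   // primitive fields are reduced to their type
}

// distinct local schemas of the WS collection (mongo shell)
const locals = new Set();
db.WS.find().forEach(d => locals.add(JSON.stringify(schemaOf(d))));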
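
Similarly, a rough way to test a candidate AFD on the data (a simplification
shown for illustration only; [4] uses its own detection strategy) is to
measure the fraction of documents that agree with the majority value. For
Facility.Name ⇝ Facility.Chain:

db.WS.aggregate( [
  { $match: { "Facility.Name": { $exists: true } } },
  // frequency of every (Name, Chain) combination
  { $group: { _id: { n: "$Facility.Name", c: "$Facility.Chain" },
              f: { $sum: 1 } } },
  // per Name: documents agreeing with the majority Chain vs. all documents
  { $group: { _id: "$_id.n", top: { $max: "$f" }, tot: { $sum: "$f" } } },
  { $group: { _id: null, kept: { $sum: "$top" }, all: { $sum: "$tot" } } },
  // a strength close to 1 suggests accepting the AFD
  { $project: { strength: { $divide: ["$kept", "$all"] } } }
] )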


4   Querying

In this section we describe the final querying stage of our approach. Given a
global schema g and the dependency graph M obtained by enriching g with
(A)FDs, a multidimensional query (from now on, md-query) is a quadruple
q = ⟨G, p, m, ϕ⟩ where: G is the query group-by set, i.e., a non-empty set of
primitive fields in M; p is an optional selection predicate defined as a
conjunction of Boolean predicates on primitive fields; m is the query measure,
i.e., the numeric primitive field to be aggregated; and ϕ is the operator to be
used for aggregation (e.g., avg, sum).
    For q to be well-formed, there must exist in M a single field f such that
all the other fields mentioned in q (either in G, p, or m) can be reached in M
from f. Field f is called the fact of q (denoted fact(q)) and corresponds to
the coarsest granularity of M on which q can be formulated.
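
In code, an md-query can be represented by a plain structure; the following
hypothetical JavaScript encoding (ours, not the prototype's) corresponds to the
query q1 of Example 1 below:

const q1 = {
  groupBy: ["Facility.City", "Exercises.Type"],               // G
  predicate: [{ field: "User.Age", op: "$gte", value: 60 }],  // p, a conjunction
  measure: "Exercises.Sets.Weight",                           // m
  aggregation: "avg"                                          // ϕ
};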
    Other well-formedness constraints for md-queries are introduced in [13]:
the base integrity constraint, stating that the fields in G must be
functionally independent of each other, and the summarization integrity
constraint. The base integrity constraint can be easily checked on the
dependency graph (no arcs must exist between the fields in G; a sketch of this
check is shown below). As to the summarization integrity constraint, each query
undergoes a check that can possibly return some warnings to inform the user of
potentially incorrect results. Specifically, summarization integrity entails
disjointness (to avoid double counting, the measure instances to be aggregated
must be partitioned by the group-by instances) and completeness (the union of
these partitions must constitute the entire set). Disjointness is easily
checked on the dependency graph by verifying that the granularity of measure m
is finer than that of all the fields in G. Completeness is obtained if all the
fields in G have full support, a condition that is easily violated in
heterogeneous collections. To restore completeness, we adopt at query time the
balancing strategies used for incomplete hierarchies in data warehouse design;
basically, the $ifNull operator in MongoDB is used to replace a missing value
in a field with a custom value.
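
The base-integrity check reduces to reachability on the dependency graph; a
minimal sketch follows, assuming the graph is encoded as adjacency lists over
primitive fields (an encoding we introduce here for illustration):

// true iff a directed (A)FD path leads from 'from' to 'to'
function reaches(graph, from, to, seen = new Set()) {
  if (from === to) return true;
  if (seen.has(from)) return false;
  seen.add(from);
  return (graph[from] || []).some(n => reaches(graph, n, to, seen));
}

// base integrity: no group-by field may (transitively) determine another
function baseIntegrity(graph, G) {
  return G.every(a => G.every(b => a === b || !reaches(graph, a, b)));
}

// e.g., {Facility.Name, Facility.Chain} is not a valid group-by set:
// baseIntegrity({ "Facility.Name": ["Facility.Chain"] },
//               ["Facility.Name", "Facility.Chain"])   // false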
Example 1. The following md-query on the WS collection, q1, measures the
average amount of weight lifted by elderly athletes per city and type of
exercise:

q1 = ⟨{Facility.City, Exercises.Type}, (User.Age ≥ 60), Exercises.Sets.Weight, avg⟩

We have fact(q1) = Exercises.Sets._id. Query q1 passes the validity check with
a warning, because the support of Facility.City is less than one, so balancing
is used to restore completeness. On the other hand, q1 meets the disjointness
constraint because the granularity Exercises.Sets._id of the required measure,
Exercises.Sets.Weight, is finer than both the granularities _id and
Exercises._id of the group-by fields. □
     Once a well-formed md-query q has been formulated by the user on the
dependency graph, it has to be reformulated on each local schema si to
effectively cope with inter-document variety. To this end we rely on the
approach to md-query reformulation proposed in [5] for federated data warehouse
architectures. This approach has been proved to be complete and to provide all
certain answers to the md-query. As a result, a set of local queries q^(i), one
for each local schema si, is determined (a simplified sketch is given below).
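
As a rough illustration only (the algorithm of [5] is far more general; e.g.,
it handles transcodings inside predicates and mappings between sets of fields),
reformulation can be thought of as renaming every global field to its local
counterpart:

function reformulate(q, mapping) {
  // mapping: global field name -> local field name (identity if absent)
  const loc = f => mapping[f] || f;
  return {
    groupBy: q.groupBy.map(loc),
    predicate: q.predicate.map(p => ({ ...p, field: loc(p.field) })),
    measure: loc(q.measure),
    aggregation: q.aggregation
  };
}

// e.g., toward s2 (for q2 of Example 2 below):
// reformulate(q2, { "Facility.Name": "Gym.Name",
//                   "Exercises.ExCalories": "Series.SeriesCalories" })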
     At this point, each local query q^(i) is separately executed on the DS;
specifically, q^(i) must target only the documents that belong to local schema
si. This is done in two steps. First, the information about which document has
which schema (obtained in the schema extraction stage) is stored in a separate
collection (called WS-schemas in our example) in the following form: a document
is created for every schema si, containing an array ids with the _id of every
document having schema si. Then, query q^(i) is executed by joining it with the
list of identifiers in WS-schemas. Note that, to be executed, q^(i) needs to be
translated to the MongoDB query language, which allows us to declare a
multi-stage pipeline of transformations to be carried out on the documents of a
collection.
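
In the mongo shell, this targeting step may look as follows (a sketch based on
the WS-schemas layout just described; the schema identifier "s2" and the use of
_id as schema key are our assumptions):

// fetch the identifiers of the documents having local schema s2
const ids = db.getCollection("WS-schemas").findOne({ _id: "s2" }).ids;

db.WS.aggregate( [
  { $match: { _id: { $in: ids } } },   // restrict execution to schema s2
  // ... the translated stages of the local query q^(2) follow here
] )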
     Finally, a post-processing activity is required to integrate and,
possibly, further aggregate the results coming from the different local
queries. This operation can be performed in memory, as OLAP queries usually
produce a limited number of records and the transcoding functions provide
homogeneous values.
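
For a non-distributive operator such as avg, merging requires re-weighting each
partial average by the number of values it aggregates; this is why the
translated query shown at the end of this section also returns a count-m field.
A sketch of the merge for avg (our simplification):

function mergeAvg(localResults) {
  // localResults: one result set per local query; every row carries
  // _id (transcoded group-by values), avg, and "count-m"
  const groups = new Map();
  for (const rows of localResults)
    for (const r of rows) {
      const key = JSON.stringify(r._id);
      const g = groups.get(key) || { sum: 0, count: 0 };
      g.sum += r.avg * r["count-m"];   // undo the partial average
      g.count += r["count-m"];
      groups.set(key, g);
    }
  return [...groups].map(([k, g]) => ({ _id: JSON.parse(k),
                                        avg: g.sum / g.count }));
}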
Example 2. Consider an md-query that calculates the total amount of burnt
calories by facility, excluding workout sessions that are shorter than 30
minutes:

      q2 = ⟨{Facility.Name}, (DurationMins ≥ 30), Exercises.ExCalories, sum⟩

Consider the local schemas s1 and s2 in Figure 1. The reformulation of q2 onto
s1 has no effect (i.e., q2^(1) ≡ q2); conversely, the reformulation onto s2
generates the following local query:

      q2^(2) = ⟨{Gym.Name}, (DurationSecs/60 ≥ 30), Series.SeriesCalories, sum⟩
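
When q2^(2) is translated to MongoDB, the transcoding DurationSecs/60 can be
materialized by a $project stage preceding the selection; a sketch of the first
stages (ours, using MongoDB 3.4 operators):

db.WS.aggregate( [
  // transcode seconds to minutes before applying the selection predicate
  { $project: { "DurationMins": { $divide: ["$DurationSecs", 60] },
                "Gym.Name": 1, "Series": 1 } },
  { $match: { "DurationMins": { $gte: 30 } } },
  // ... $unwind of Series and $group by Gym.Name follow
] )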
The MongoDB query obtained from q1 of Example 1 is the following; note that
the missing values of Facility.City are replaced by those of Facility.Name:
db.WS.aggregate( [
  { $unwind: "$Exercises" },
  { $unwind: "$Exercises.Sets" },
  { $match: { "User.Age": { $gte: 60 } } },
  { $project:
    { "Facility.City": { $ifNull: ["$Facility.City", "$Facility.Name"] },
      "Exercises.Type": 1,
      "Exercises.Sets.Weight": 1,
      // flag the documents in which balancing took place
      "balanced": { $cond: ["$Facility.City", false, true] } } },
  { $group:
    { "_id":
      { "FacilityCity": "$Facility.City",
        "ExercisesType": "$Exercises.Type",
        "balanced": "$balanced" },
      // $group output names cannot contain dots, hence the renaming
      "avg": { $avg: "$Exercises.Sets.Weight" },
      "count": { $sum: 1 },
      "count-m": { $sum: { $cond: ["$Exercises.Sets.Weight", 1, 0] } } } }
] )



5    Conclusions and Evaluation

In this paper we have presented an original approach to OLAP querying on DSs.
Our basic claim is that the heterogeneity and schema variety intrinsic to DSs
should be considered as a source of information wealth. At the core of our
approach are (i) the building of a global schema that maps onto the different
local schemas within a collection, (ii) the translation of this schema into
multidimensional form enhanced by the detection of approximate FDs, and
(iii) the reformulation of queries from the global schema onto the local
schemas to improve the completeness of the result.
     As a proof of concept for our approach we have developed a Java prototype
to support the main phases and tested it on a cluster of seven CentOS 6
machines, each with an 8-core i7-4790 CPU @ 3.60 GHz and 32 GB of RAM. Our
reference real-world collection, WS, is stored on MongoDB 3.4 and randomly
sharded on the cluster; it contains 5 M workout sessions with 6 different local
schemas (mostly due to missing fields), 35 M exercises, and 85 M sets.
     Query formulation and translation to MongoDB are done in negligible time.
Reformulation is done with polynomial complexity [5]. Thus, performance mainly
depends on query execution. Consider for instance the following three
md-queries:

qa = ⟨{User.Age, Facility.Chain}, TRUE, DurationMins, avg⟩
qb = ⟨{User.FullName}, (SessionType = "Advanced"), Exercises.ExCalories, sum⟩
qc = ⟨{Facility.Name, Exercises.Type}, (StartedOn ≥ 01/01/2018), User.Age, max⟩

Due to the reformulation on the local schemas, 6 local md-queries are created
from each of the three global ones. A simple optimization is done, when
possible, to merge the local queries that involve the same fields on different
local schemas; for instance, with reference to Figure 1, a query counting the
documents by SessionType can be translated into a single local query, as every
local schema has the same representation of SessionType. Due to this
optimization, qa, qb, and qc are reformulated into either 2 or 3 local queries.
The execution times in seconds for each query are shown in Figure 4; the times
for qb and qc are higher due to the necessity of unwinding arrays.

Fig. 4. Execution times of three queries (qa, qb, and qc) split into the
execution times of the required local queries

References
 1. Baazizi, M.A., Lahmar, H.B., Colazzo, D., Ghelli, G., Sartiani, C.: Schema infer-
    ence for massive JSON datasets. In: Proc. EDBT. pp. 222–233 (2017)
 2. Chouder, M.L., Rizzi, S., Chalal, R.: EXODuS: Exploratory OLAP over document
    stores. Inf. Syst. 79, 44–57 (2019)
 3. Gallinucci, E., Golfarelli, M., Rizzi, S.: Approximate OLAP of document-oriented
    databases: A variety-aware approach. Inf. Syst. 85, 114–130 (2019)
 4. Golfarelli, M., Graziani, S., Rizzi, S.: Starry vault: Automating multidimensional
    modeling from data vaults. In: Proc. ADBIS. pp. 137–151 (2016)
 5. Golfarelli, M., Mandreoli, F., Penzo, W., Rizzi, S., Turricchia, E.: OLAP query
    reformulation in peer-to-peer data warehousing. Inf. Syst. 37(5), 393–411 (2012)
 6. Golshan, B., Halevy, A.Y., Mihaila, G.A., Tan, W.: Data integration: After the
    teenage years. In: Proc. PODS. pp. 101–106 (2017)
 7. Hamadou, H.B., Gallinucci, E., Golfarelli, M.: Answering GPSJ queries in a poly-
    store: A dataspace-based approach. In: Proc. ER. pp. 189–203 (2019)
 8. Hamadou, H.B., Ghozzi, F., Péninou, A., Teste, O.: Towards schema-independent
    querying on document data stores. In: Proc. DOLAP (2018)
 9. Ilyas, I.F., Markl, V., Haas, P., Brown, P., Aboulnaga, A.: CORDS: Automatic
    discovery of correlations and soft functional dependencies. In: Proc. SIGMOD. pp.
    647–658 (2004)
10. Klettke, M., Störl, U., Scherzinger, S.: Schema extraction and structural
    outlier detection for JSON-based NoSQL data stores. In: Proc. BTW. pp.
    425–444 (2015)
11. Liu, Z.H., Gawlick, D.: Management of flexible schema data in RDBMSs - oppor-
    tunities and limitations for NoSQL. In: Proc. CIDR. Asilomar, USA (2015)
12. Maßmann, S., Raunich, S., Aumüller, D., Arnold, P., Rahm, E.: Evolution of
    the COMA match system. In: Proc. OM@ISWC. Bonn, Germany (2011)
13. Romero, O., Abelló, A.: Multidimensional design by examples. In: Proc. DaWaK.
    pp. 85–94. Krakow, Poland (2006)