OLAP Querying of Document Stores in the Presence of Schema Variety (DISCUSSION PAPER)

Matteo Francia, Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi
DISI, University of Bologna, Italy

Abstract. Document stores are preferred to relational ones for storing heterogeneous data due to their schemaless nature. However, the absence of a unique schema adds complexity to analytical applications. In a previous paper we proposed an original approach to OLAP on document stores; its basic idea was to stop fighting against schema variety and to welcome it as an inherent source of information wealth in schemaless sources. In this paper we focus on the querying phase, showing how queries can be rewritten directly on a heterogeneous collection in an inclusive way, i.e., also including the concepts present in only a subset of documents.

Keywords: NoSQL · Multidimensional Modeling · OLAP

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.

1 Introduction

Schemaless databases, in particular document stores (DSs) such as MongoDB, are preferred to relational ones for storing heterogeneous data with variable schemas and structural forms; typical schema variants within a collection consist in missing or additional fields, in different names or types for a field, and in different structures for instances. The absence of a unique schema grants flexibility to operational applications but adds complexity to analytical applications, in which a single analysis often involves large sets of data with different schemas. Dealing with this complexity while adopting a classical data warehouse design approach would require a notable effort to understand the rules that drove the use of alternative schemas, plus an integration activity to identify a common schema to be adopted for analysis.

In this paper we propose an approach to OLAP querying on DSs. The basic idea is to welcome data heterogeneity and schema variety as an inherent source of information wealth. So, instead of trying to hide this variety, we show it to users (basically, data scientists and data enthusiasts). To the best of our knowledge, this is the first approach to propose a form of approximated OLAP analyses on document stores that embraces and exploits the inherent variety of documents. OLAP querying is carried out directly on the data source, without materializing any cube or data warehouse. Remarkably, we adopt an inclusive solution to integration, i.e., the user can include a concept in a query even if it is present in only a subset of documents. We cover both inter-schema and intra-schema variety; specifically, we cope with missing fields, different levels of detail in instances, and different field naming.

2 Related Literature

The rise of NoSQL stores has captured a lot of interest from the research community, which has proposed a variety of approaches to deal with the schemaless feature. Accessing schemaless sources often requires the adoption of data integration techniques to provide a unified view of data; as this is not the primary focus of the paper, we refer the reader to a survey on the subject [6].

A distinguishing feature of our approach is the definition of a multidimensional representation of the schema to enable OLAP analyses directly on the DS.
From this point of view, a work closely related to ours is [2], which proposes a schema-on-read approach for multidimensional queries over DSs. That approach differs from ours because it focuses on the multidimensional representation of JSON data and overlooks the variety issues, and because the detection of functional dependencies is activated on demand only after the user has written a query. OLAP analyses on DSs are enabled also in [7], although a simpler integration approach is proposed, with no identification of functional dependencies nor multidimensional views.

On the issue of schema variety in DSs, a recent work [8] builds on a simple mapping strategy to hide the variety within a single, comprehensive query. While that approach offers a simpler querying mechanism, it only covers a limited set of schema variants and does not support OLAP.

Finally, several works have focused on bringing NoSQL back to the relational world. [11] discusses an approach to provide schema-on-read capabilities for flexible-schema data stored on RDBMSs: it maps the document structure onto different tables and provides a data guide as the union of every possible field at any level. However, no advanced schema matching mechanism is provided.

3 Approach Overview and Working Example

Figure 2 gives an overview of our approach. Although the picture suggests a sequential execution of the stages, it simply outlines the ordering for the first iteration: the user starts by analyzing the first results provided by the system, then iteratively injects additional knowledge into the different stages to refine the metadata and improve the querying effectiveness. We now provide a short description of each stage based on an example; all the details can be found in [3].

Fig. 2. Approach overview: the document store feeds schema extraction (producing the local schemas), schema integration (producing the global schema and its mappings), FD enrichment (producing the dependency graph), and querying (formulation, validation, reformulation, translation, execution, merge).

The example we use is based on a real-world collection of workout sessions, WS, obtained from a worldwide company selling fitness equipment. Figure 1 (left) shows a sample document in the collection (reproduced below), organized according to three nesting levels: the first level contains information about the user; the Exercises array contains an object for every exercise carried out during the session; the Sets array contains an object for every set that the exercise was split into.

Fig. 1. A JSON document in the WS collection (left), its schema (s1), another schema of the same collection (s2), and the global schema (g).
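For readability, the sample document of Figure 1 (left) is reproduced here as reconstructed from the figure (the ellipses stand for further elements not shown in the paper):

{ "_id" : ObjectId("54a4332f44"),
  "User" : { "FullName" : "John Smith", "Age" : 42 },
  "StartedOn" : ISODate("2017-06-15"),
  "Facility" : { "Name" : "PureGym Piccadilly", "Chain" : "PureGym" },
  "SessionType" : "RunningProgram",
  "DurationMins" : 90,
  "Exercises" : [
    { "Type" : "Leg press",
      "ExCalories" : 28,
      "Sets" : [ { "Reps" : 14, "Weight" : 60 }, ... ] },
    { "Type" : "Tapis roulant" },
    ... ] }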
1. Schema extraction. The goal is to identify the set of distinct local schemas that occur inside a collection of documents. To this end we adopt a tree-like definition for schemas which models arrays by considering the union of the schemas of their elements, as done in [1]. This is a completely automatic stage; its implementation is loosely inspired by the free tool variety.js and consists of a routine that connects to the desired collection on MongoDB, extracts the local schemas, and writes the results to a triplestore. With reference to our example, Figure 1 shows the schema s1 of the sample document. Each array is represented as a box, with its child primitives listed below (numeric primitives are in italics). Object fields are prefixed with the object key (e.g., Facility.Chain). The vertical lines between boxes represent nestings of arrays, with the root WS on top.

2. Schema integration. The goal here is to integrate the distinct local schemas extracted from a collection to obtain a single and comprehensive view of the latter, i.e., a global schema, and its mappings with each local schema. To this end we rely on both schema matching and schema mapping techniques: the former allow us to identify the single matches between the attributes, while the latter define the mappings between each local schema and the global one, thus enabling the rewriting of queries. A mapping between two (sets of) primitive fields P and P′ requires a transcoding function to transform values of P into values of P′; these functions enable query reformulation in the presence of selection predicates, as well as the integration of the query results obtained from all documents. The approach we adopt for schema integration includes two steps. The first step is automatic and defines a preliminary global schema as the name-based union of all local schemas [10]. In the second step, the preliminary global schema is refined by merging matching (sets of) fields in the global schema. Existing tools (e.g., Coma 3.0 [12]) can be used to automatically find a list of possible matches between arrays and primitives; then the user will browse them and possibly define additional mappings. With reference to our example, Figure 1 shows the global schema g resulting from the integration of s1 with s2, one more schema from WS; mappings between arrays and primitives are represented with dotted lines. The transcoding functions of mappings ⟨{Date}, {StartedOn}⟩ and ⟨{FirstName, LastName}, {User.FullName}⟩ are the identity function and a function that concatenates two strings, respectively (a minimal sketch of such transcodings is given after this list).

3. Schema enrichment. The goal is to propose a multidimensional view of the global schema to enable OLAP analyses. The main informative gap to be filled to this end is the identification of hierarchies, which in turn relies on the identification of functional dependencies (FDs) between fields in the global schema. By assuming the presence of identifiers at every nesting level, some exact FDs can be derived from the global schema without looking at the data. However, additional FDs can exist between primitive nodes; these cannot be inferred from the schema and can only be found by querying the data. More precisely, since DSs may contain incomplete and faulty data, we look for approximate FDs (AFDs) [9], i.e., FDs that "mostly" hold on the data. To detect AFDs, we adapted the approach proposed in [4]. As a result, we determine a dependency graph, which provides a multidimensional view of the global schema in terms of the FDs between its primitive fields. Figure 3 shows the dependency graph for our example. Each primitive field f is represented as a circle whose color is representative of the support of f, i.e., of the percentage of times that f occurs in the collection (the lighter the tone, the lower the support). Identifiers (e.g., _id) are shown in bold. Directed arrows represent the (A)FDs detected during schema enrichment; for instance, we have _id → Facility.Name (exact FD, in black) and Facility.Name → Facility.Chain (AFD, in grey). The latter FD is approximate because it only holds for a subset of documents.

Fig. 3. Dependency graph for the global schema in Figure 1 (nodes are primitive fields shaded by support; black arcs are exact FDs, grey arcs are AFDs).

4. Querying. The last stage enables the formulation of multidimensional queries on the dependency graph and their execution on the collection. First of all, each formulated query is validated against the requirements of well-formedness proposed in the literature [13]. Then, the query is reformulated into multiple queries, one for each local schema in the collection, which are translated into the query language of the DS; the results presented to the user are obtained by merging the results of the single local queries.
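To give a concrete flavor of the transcoding functions mentioned in the schema integration stage, the following MongoDB aggregation fragment sketches how values of local schema s2 could be transcoded into global-schema fields. It is illustrative only and does not reproduce the prototype's actual implementation; in particular, it assumes that FirstName and LastName are top-level fields of s2 (as suggested by the mapping ⟨{FirstName, LastName}, {User.FullName}⟩) and that the chosen separator is a blank.

// Illustrative transcodings for documents with local schema s2 (a sketch, not the
// prototype's code): FirstName and LastName are concatenated into the global field
// User.FullName, and DurationSecs is converted into DurationMins (cf. Example 2).
db.WS.aggregate([
  { $project: {
      "User.FullName": { $concat: ["$FirstName", " ", "$LastName"] },
      "DurationMins":  { $divide: ["$DurationSecs", 60] }
  } }
])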
4 Querying

In this section we describe the final querying stage of our approach. Given a global schema g and the dependency graph M obtained by enriching g with (A)FDs, a multidimensional query (from now on, md-query) is a quadruple q = ⟨G, p, m, ϕ⟩ where: G is the query group-by set, i.e., a non-empty set of primitive fields in M; p is an optional selection predicate, defined as a conjunction of Boolean predicates on primitive fields; m is the query measure, i.e., the numeric primitive field to be aggregated; ϕ is the operator to be used for aggregation (e.g., avg, sum). For q to be well-formed, there must exist in M a single field f such that all the other fields mentioned in q (either in G, p, or m) can be reached in M from f. Field f is called the fact of q (denoted fact(q)) and corresponds to the coarsest granularity of M on which q can be formulated.

Other well-formedness constraints for md-queries are introduced in [13]: the base integrity constraint, stating that the fields in G must be functionally independent of each other, and the summarization integrity constraint. The base integrity constraint can be easily checked on the dependency graph (no arcs must exist between the fields in G). As to the summarization integrity constraint, each query undergoes a check that can possibly return some warnings to inform the user of potentially incorrect results. Specifically, summarization integrity entails disjointness (to avoid double counting, the measure instances to be aggregated must be partitioned by the group-by instances) and completeness (the union of these partitions must constitute the entire set). Disjointness is easily checked on the dependency graph by verifying whether the granularity of measure m is finer than that of all the fields in G. Completeness is obtained if all the fields in G have full support, which is easily contradicted in heterogeneous collections. To restore completeness, we adopt at query time the balancing strategies used for incomplete hierarchies in data warehouse design; basically, the $ifNull operator of MongoDB is used to replace a missing value in a field with a custom value, as shown in the fragment below.
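A minimal illustration of this balancing, extracted from the MongoDB translation of Example 1 reported in full later in this section: missing Facility.City values are replaced by the corresponding Facility.Name values, and a flag records whether the replacement took place.

// Balancing fragment: if Facility.City is missing, fall back to Facility.Name
// (the custom value chosen for this field); "balanced" flags the replaced values.
{ $project: {
    "Facility.City": { $ifNull: ["$Facility.City", "$Facility.Name"] },
    "balanced":      { $cond: ["$Facility.City", false, true] }
} }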
Example 1. The following md-query on the WS collection, q1, measures the average amount of weight lifted by elderly athletes per city and type of exercise:

q1 = ⟨{Facility.City, Exercises.Type}, (User.Age ≥ 60), Exercises.Sets.Weight, avg⟩

We have fact(q1) = Exercises.Sets._id. Query q1 passes the validity check with a warning, because the support of Facility.City is less than one, so balancing is used to restore completeness. On the other hand, q1 meets the disjointness constraint because the granularity Exercises.Sets._id of the required measure, Exercises.Sets.Weight, is finer than both the granularities _id and Exercises._id of the group-by fields.

Once a well-formed md-query q has been formulated by the user on the dependency graph, it has to be reformulated on each local schema si to effectively cope with inter-document variety. To this end we rely on the approach to md-query reformulation proposed in [5] for federated data warehouse architectures. This approach has been proved to be complete and to provide all certain answers to the md-query. As a result, a set of local queries q(i), one for each local schema si, is determined.

At this point, each local query q(i) is separately executed on the DS; specifically, q(i) must target only the documents that belong to local schema si. This is done in two steps. First, the information about which document has which schema (obtained in the schema extraction stage) is stored in a separate collection (called WS-schemas in our example) in the following form: a document is created for every schema si, containing an array ids with the _id of every document having schema si. Then, query q(i) is executed by joining it with the list of identifiers in WS-schemas (a sketch of one possible realization is given below). Note that, to be executed, q(i) needs to be translated into the MongoDB query language, which allows us to declare a multi-stage pipeline of transformations to be carried out on the documents of a collection.

Finally, a post-processing activity is required to integrate and, possibly, further aggregate the results coming from the different local queries. This operation can be performed in-memory, as OLAP queries usually produce a limited number of records and the transcoding functions provide homogeneous values.
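One possible realization of this schema-based filtering is sketched below. The schema identifier field ("schema") is a hypothetical name, and the join could be realized differently in the prototype (e.g., with a $lookup stage); only the WS-schemas collection and its ids array come from the description above.

// Sketch: restrict a local query q^(i) to the documents having local schema s_i.
// The "schema" field used to identify s_i in WS-schemas is an assumed name;
// "ids" is the array of document identifiers described above.
var s = db.getCollection("WS-schemas").findOne({ "schema": "s2" });
db.WS.aggregate([
  { $match: { "_id": { $in: s.ids } } },   // keep only documents with schema s2
  // ... the stages of the reformulated local query q^(2) follow here ...
]);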
Example 2. Consider an md-query that calculates the total amount of burnt calories by facility, excluding workout sessions shorter than 30 minutes:

q2 = ⟨{Facility.Name}, (DurationMins ≥ 30), Exercises.ExCalories, sum⟩

Consider the local schemas s1 and s2 in Figure 1. The reformulation of q2 onto s1 has no effect (i.e., q2(1) ≡ q2); conversely, the reformulation onto s2 generates the following local query:

q2(2) = ⟨{Gym.Name}, (DurationSecs/60 ≥ 30), Series.SeriesCalories, sum⟩

The MongoDB query obtained from q1 of Example 1 is the following; note that the missing values of Facility.City are replaced by those of Facility.Name:

db.WS.aggregate([
  { $unwind: "$Exercises" },
  { $unwind: "$Exercises.Sets" },
  { $match: { "User.Age": { $gte: 60 } } },
  { $project: {
      "Facility.City": { $ifNull: ["$Facility.City", "$Facility.Name"] },
      "Exercises.Type": 1,
      "Exercises.Sets.Weight": 1,
      "balanced": { $cond: ["$Facility.City", false, true] }
  } },
  { $group: {
      "_id": { "FacilityCity": "$Facility.City",
               "ExercisesType": "$Exercises.Type",
               "balanced": "$balanced" },
      "Exercises.Sets.Weight": { $avg: "$Exercises.Sets.Weight" },
      "count": { $sum: 1 },
      "count-m": { $sum: { $cond: ["$Exercises.Sets.Weight", 1, 0] } }
  } }
])

5 Conclusions and Evaluation

In this paper we have presented an original approach to OLAP querying on DSs. Our basic claim is that the heterogeneity and schema variety intrinsic to DSs should be considered as a source of information wealth. At the core of our approach are (i) the building of a global schema that maps onto the different local schemas within a collection, (ii) the translation of this schema into multidimensional form, enhanced by the detection of approximate FDs, and (iii) the reformulation of queries from the global schema onto the local schemas to improve the completeness of the result.

As a proof of concept for our approach we have developed a Java prototype to support the main phases and tested it on a cluster of seven CentOS 6 machines, each with an 8-core i7-4790 CPU @ 3.60 GHz and 32 GB of RAM. Our reference real-world collection, WS, is stored on MongoDB 3.4 and randomly sharded on the cluster; it contains 5 M workout sessions with 6 different local schemas (mostly due to missing fields), 35 M exercises, and 85 M sets.

Query formulation and translation to MongoDB are done in negligible time, and reformulation has polynomial complexity [5]; thus, performance mainly depends on query execution. Consider for instance the following three md-queries:

qa = ⟨{User.Age, Facility.Chain}, TRUE, DurationMins, avg⟩
qb = ⟨{User.FullName}, (SessionType = "Advanced"), Exercises.ExCalories, sum⟩
qc = ⟨{Facility.Name, Exercises.Type}, (StartedOn ≥ 01/01/2018), User.Age, max⟩

Due to the reformulation on the local schemas, 6 local md-queries are created from each of the three global ones. A simple optimization is done, when possible, to merge the local queries that involve the same fields on different local schemas; for instance, with reference to Figure 1, a query counting the documents by SessionType can be translated into a single local query, as every local schema has the same representation of SessionType. Due to this optimization, qa, qb, and qc are reformulated into either 2 or 3 local queries. The execution times in seconds for each query are shown in Figure 4; the times for qb and qc are higher due to the necessity of unwinding arrays.

Fig. 4. Execution times of three queries (qa, qb, and qc) split into the execution times of the required local queries.

References

1. Baazizi, M.A., Lahmar, H.B., Colazzo, D., Ghelli, G., Sartiani, C.: Schema inference for massive JSON datasets. In: Proc. EDBT. pp. 222–233 (2017)
2. Chouder, M.L., Rizzi, S., Chalal, R.: EXODuS: Exploratory OLAP over document stores. Inf. Syst. 79, 44–57 (2019)
3. Gallinucci, E., Golfarelli, M., Rizzi, S.: Approximate OLAP of document-oriented databases: A variety-aware approach. Inf. Syst. 85, 114–130 (2019)
4. Golfarelli, M., Graziani, S., Rizzi, S.: Starry vault: Automating multidimensional modeling from data vaults. In: Proc. ADBIS. pp. 137–151 (2016)
5. Golfarelli, M., Mandreoli, F., Penzo, W., Rizzi, S., Turricchia, E.: OLAP query reformulation in peer-to-peer data warehousing. Inf. Syst. 37(5), 393–411 (2012)
6. Golshan, B., Halevy, A.Y., Mihaila, G.A., Tan, W.: Data integration: After the teenage years. In: Proc. PODS. pp. 101–106 (2017)
7. Hamadou, H.B., Gallinucci, E., Golfarelli, M.: Answering GPSJ queries in a polystore: A dataspace-based approach. In: Proc. ER. pp. 189–203 (2019)
8. Hamadou, H.B., Ghozzi, F., Péninou, A., Teste, O.: Towards schema-independent querying on document data stores. In: Proc. DOLAP (2018)
9. Ilyas, I.F., Markl, V., Haas, P., Brown, P., Aboulnaga, A.: CORDS: Automatic discovery of correlations and soft functional dependencies. In: Proc. SIGMOD. pp. 647–658 (2004)
10. Klettke, M., Störl, U., Scherzinger, S.: Schema extraction and structural outlier detection for JSON-based NoSQL data stores. In: Proc. BTW. pp. 425–444 (2015)
11. Liu, Z.H., Gawlick, D.: Management of flexible schema data in RDBMSs - opportunities and limitations for NoSQL. In: Proc. CIDR. Asilomar, USA (2015)
12. Maßmann, S., Raunich, S., Aumüller, D., Arnold, P., Rahm, E.: Evolution of the COMA match system. In: Proc. OM@ISWC. Bonn, Germany (2011)
13. Romero, O., Abelló, A.: Multidimensional design by examples. In: Proc. DaWaK. pp. 85–94. Krakow, Poland (2006)