Enabling Self-Service BI on Document Stores

              Mohamed L. Chouder                                  Stefano Rizzi                      Rachid Chalal
                       LMCS, ESI                           DISI, University of Bologna                  LMCS, ESI
                      Algiers,Algeria                             Bologna, Italy                       Algiers,Algeria
                m_chouder@esi.dz                            stefano.rizzi@unibo.it                  r_chalal@esi.dz


ABSTRACT                                                                     paradigm). This schemaless nature provides a flexible data
The growing use of document stores has resulted in vast                      model that has attracted developers seeking to avoid the
amounts of semi-structured data holding precious informa-                    restrictions posed by the relational model.
tion, which could be profitably integrated into existing BI                     Document stores usually collect nested, denormalized, and
systems. Unfortunately, due to their schemaless nature, doc-                 hierarchical documents in the form of collections; the docu-
ument stores are hardly accessible via direct OLAP query-                    ments within the same collection may present a structural
ing. In this paper we propose an interactive, schema-on-read                 variety due to schema flexibility and data evolution. Doc-
approach for finding multidimensional structures in docu-                    uments are self-describing and mainly encoded using the
ment stores aimed at enabling OLAP querying in the context                   semi-structured data format JSON (JavaScript Object No-
of self-service BI. Our approach works in three steps: mul-                  tation), which is also popular as an exchange format and
tidimensional enrichment, querying, and OLAP enabling;                       widely adopted in modern web APIs.
the validation of user queries and the detection of multidi-                    The growing use of document stores and the dominance
mensional structures is based on the mining of approximate                   of JSON have resulted in vast amounts of semi-structured
functional dependencies from data. The efficiency of our                     data holding precious information, which could be profitably
approach is discussed with reference to real datasets.                       integrated into existing business intelligence (BI) systems
                                                                             [1]. Indeed, multidimensional modeling and analysis has
                                                                             been recognized to be an effective way for conducting an-
CCS Concepts                                                                 alytics over big NoSQL data [6]. Unfortunately, due to their
•Information systems Ñ Data warehouses; •Applied                             schemaless nature, document stores are hardly accessible via
computing Ñ Business intelligence;                                           direct OLAP querying. Recent efforts in this area propose
                                                                             SQL-like query interfaces (e.g., Apache Drill, Spark SQL,
Keywords                                                                     Presto), which provide access to schemaless and nested data
                                                                             while offering the possibility of using traditional BI tools.
NoSQL, Document-Oriented Databases, JSON, Multidimen-                        However, none of these solutions supports multidimensional
sional modeling                                                              querying or OLAP over document stores.
                                                                                In this paper we propose an interactive, schema-on-read
1.    INTRODUCTION                                                           approach for finding multidimensional structures in docu-
   Over the past decade, companies have been adopting NoSQL                  ment stores aimed at enabling OLAP querying in the con-
databases to deal with the huge volumes of data manipu-                      text of self-service BI. Differently from a schema-on-write
lated by modern applications. NoSQL systems have emerged                     approach, which would force a fixed structure in data and
as an alternative to relational database management sys-                     load them into a data warehouse, a schema-on-read ap-
tems in many implementations [4]. They can be classi-                        proach leaves data unchanged in their structure until they
fied based on their data model, the most popular categories                  are accessed by the user [16]. We claim that, when querying
being key-value, wide-column, graph-based and document-                      document stores for self-service BI, a schema-on-read ap-
oriented; document-oriented databases (briefly, document                     proach should be preferred to a schema-on-write approach.
stores) are probably the most widely adopted so far. The                     The main reason for this is that document stores handle
common features of document stores are horizontal scala-                     high volumes of data with varied structure, which question
bility on commodity hardware and the lack of an explicit                     the ability of the traditional ETL in processing data on the
schema (according to the data first, schema later or never                   one hand, and the possibility of storing them into a data
                                                                             warehouse with a fixed schema on the other hand.
                                                                                To discover the information necessary for discovering mul-
                                                                             tidimensional concepts despite the lack of structure, we re-
                                                                             sort to data distributions, i.e., we mine approximate func-
                                                                             tional dependencies (AFDs) and use them to automate the
                                                                             design process. More specifically, our approach takes as in-
 c 2017, Copyright is with the authors. Published in the Workshop Proceed-
ings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice,         put a collection of nested documents and operates in three
Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is        phases.
permitted under the terms of the Creative Commons license CC-by-nc-nd
4.0
     1. Multidimensional Enrichment. In this phase, a                 In these approaches, multidimensional modeling is done at
        schema is extracted out of the collection and enriched     design time, mainly based on FDs expressed in the source
        with some basic multidimensional knowledge. This is        schemata as foreign keys or many-to-one relationships. Con-
        done by tentatively classifying attributes into dimen-     versely, our approach is meant to be used at query time
        sions and measures, and by relating each measure to        and automates the discovery of multidimensional concepts
        the subset of dimensions that determine its granular-      by mining FDs from data, as required by the schemaless
        ity, so as to create a draft multidimensional schema.      context.
                                                                      Multidimensional design from non-relational data.
     2. Querying. Now the user can formulate a multidi-            Our approach closely relates to previous approaches for mul-
        mensional query by picking dimensions and measures         tidimensional modeling from semi-structured data [9, 13, 28]
        of interest. This query is checked by accessing data to    in XML format, which is similar to JSON. These approaches
        ensure its multidimensional validity and correct sum-      take in input DTDs or XML schemata that provide rich in-
        marization. If the query is found to be well-formed, it    formation about XML documents (e.g., multiplicities, data
        is executed. Otherwise, the multidimensional schema        types), so they cannot operate directly on XML data not
        is refined and some possible querying alternatives are     having a schema specification. In particular, the work in [28]
        proposed to the user.                                      builds multidimensional schemata starting from an XML
     3. OLAP Enabling. This iterative phase enables the            schema but, in some cases, data is accessed to discover FDs
        user to further explore data by running an OLAP ses-       that are not expressed in the schema. Similarly, starry vault
        sion. To this end, local portions of multidimensional      [7] is a recent approach that mines FDs for multidimensional
        hierarchies (consisting of roll-up and drill-down rela-    modeling from data vaults. These are databases character-
        tionships for each level involved in the user query) are   ized by a specific data model tailored to provide historical
        built by mining AFDs from data. The user can now           storage in presence of schema evolutions. The main idea is
        apply an OLAP operator (e.g., roll-up or drill-down)       to mine approximate and temporal FDs to cope with the
        to iteratively create a new query on the collection.       issue of noisy and time-varying data, which may result in
                                                                   hidden FDs.
The advantages of the proposed approach are twofold. First,           All the above-mentioned approaches are similar in that
it enables decision makers to formulate multidimensional           they define multidimensional schemata at design time using
queries and create reports on-the-fly on document stores in        structural and additional information extracted from data
a self-service fashion, i.e., without any support from ICT         (i.e., FDs). In contrast, our approach operates at query
people. This reduces the time and effort spent in traditional      time and mostly relies on data distributions, while structural
BI settings. Second, our approach also works when nested           information is only used —when available— to intelligently
structures are not present within the input data (e.g., for        reduce the search space.
flat documents), which is relevant because a large number             OLAP on linked data. Recent works in this space pro-
of non-nested datasets are available in JSON format on the         pose to directly perform OLAP-like analysis over semantic
web (e.g., more than 11000 collections in www.data.gov).           web data. Exploratory OLAP has been defined as the pro-
The approach has been implemented on top of MongoDB,               cess that discovers, collects, integrates, and analyzes these
one of the most popular document stores.                           external data on the fly [2]. Some works in this direction ad-
   The rest of this paper is organized as follows. Section 2       dress the problem of building multidimensional hierarchies
discusses the related work and Section 3 presents document         from linked data [21, 27], which is closely related to the
stores along with the working example. Our approach is             third phase of our approach. Specifically, iMOLD [21] is
detailed in Section 4 and experimentally evaluated in Sec-         an instance-based approach that operates at query time for
tion 5. Section 6 concludes the paper and gives some future        exploratory OLAP on linked data; it aims at finding mul-
directions for research.                                           tidimensional patterns representing roll-up relationships, in
                                                                   order to enable users to extend corporate data cubes with
2.     RELATED WORK                                                new hierarchies extracted from linked data. Similarly, [27]
                                                                   proposes an algorithm for discovering hierarchies from di-
   Supply-driven multidimensional design. The au-
                                                                   mensions present in statistical linked data represented us-
tomation of multidimensional modeling has been widely ex-
                                                                   ing the RDF Data Cube (QB) vocabulary (www.w3.org/TR/
plored in the area of data warehouse design. Supply-driven
                                                                   vocab-data-cube/). Like starry vault, this algorithm mines
approaches automate the discovery of multidimensional con-
                                                                   for AFDs (here called quasi-FDs) to cope with the presence
cepts by a thorough analysis of the source data. The first
                                                                   of imperfect and partial data.
approaches proposed algorithms to build multidimensional
                                                                      The algorithm that we propose for building hierarchies dif-
schemata starting from Entity/Relationship diagrams or re-
                                                                   fers from these works in that, to ensure good performances,
lational schemata [8, 19, 12], while further approaches con-
                                                                   it looks for local portions of hierarchies for the levels involved
sidered more expressive conceptual sources such as UML
                                                                   in the user queries, thus acting in an on-demand fashion.
class diagrams [20] and ontologies [23]. Furthermore, [24]
                                                                      SQL on schemaless data. Several solutions have emerged
discussed the use of FDs for automating multidimensional
                                                                   in the industry to enable SQL querying of schemaless data
modeling from ontologies. In particular, similarly to the
                                                                   for analytic purposes. These solutions can be classified into
querying phase of our approach, [22] proposes an algorithm
                                                                   two categories: (1) relational systems enabling the storage
to check the multidimensional validity of an SQL cube query
                                                                   and management of schemaless data, and (2) systems de-
and to derive the underlying multidimensional schema. This
                                                                   signed as an SQL interface for schemaless data.
algorithm identifies multidimensional concepts from the struc-
                                                                      Solutions in the first category include RDBMSs that sup-
ture of the query itself and by following foreign keys in the
                                                                   port storage and querying of schemaless data to be used
relational schema.
along with relational data in one system [5, 17] (e.g., Or-       games :                                      games :
                                                                  {“ id” : “52f2a70eddbd75540aba7c06”,         {“ id” : “52f2a70eddbd75540aba7c06”,
acle, IBM DB2, and PostgreSQL). Current systems do not             “date” : ISODate(“1989-03-27”),              “date” : ISODate(“1989-03-27”),
impose a fixed relational schema to store and query data,          “teams” :                                    “hometeam” :
                                                                    [ {“name” : “Detroit Pistons”,               {“name” : “Detroit Pistons”,
but derive and maintain a dynamic schema to be used for                “abbreviation” : “DET”,                    “abbreviation” : “DET”,
schema-on-read SQL/JSON querying instead [17]. Other so-               “score” : 90,                              “score” : 90,
                                                                       “home” : true,                             “home” : true,
lutions that fall into this category are virtual adapters that         “won” : 1,                                 “won” : 1,
expose a relational view of the data in a document store, to           “results” : { “points” : 90, . . . },      “results” : { “points” : 90, . . . },
                                                                       “players” :                                “players” :
be used in common BI tools; an example is the MongoDB                   [ {“player” : “Isiah Thomas”,              [ {“player” : “Isiah Thomas”,
BI connector (www.mongodb.com/products/bi-connector).                      “points” : 30, . . . },                    “points” : 30, . . . },
                                                                          ...                                        ...
   The second line of solutions are SQL-like interfaces that            ]                                          ]
offer extensions to query schemaless data persisted in doc-           },                                         },
                                                                      {“name”: “Dallas Mavericks”,              “guestteam” :
ument stores or as JSON files (e.g., in Hadoop). Spark                 “abbreviation” : “DAL”,                   {“name”: “Dallas Mavericks”,
SQL [3] has been designed for relational data processing               “score” : 77,                              “abbreviation” : “DAL”,
                                                                       “home” : false,                            “score” : 77,
of native Spark distributed datasets in addition of diverse            “won” : 0,                                 “home” : false,
data sources and formats (including JSON). In order to run             “results” : { “points” :77, . . . },       “won” : 0,
                                                                       “players” :                                “results” : { “points” :77, . . . },
SQL queries on schemaless data in Spark SQL, a schema                   [ {“player” : “Rolando Blackman”,         “players” :
is automatically inferred beforehand. Apache Drill (drill.                 “points” : 23, . . . },                 [ {“player” : “Rolando Blackman”,
                                                                          ...                                         “points” : 23, . . . },
apache.org) is a distributed SQL engine that can join data              ]                                            ...
from multiple data sources, including Hadoop and NoSQL                }                                            ]
                                                                    ]                                            }
databases. It has a built-in support for JSON with colum-         }                                            }
nar in-memory execution. Drill is also able to dynamically
discover a schema during query processing, which allows to                              (a)                                       (b)
handle schemaless data without defining a schema upfront.
Finally, Presto (prestodb.io) is a distributed SQL engine,              Figure 1: Sample JSON document of a game
developed at Facebook for interactive queries, that can also
combine data from multiple kinds of data sources includ-
ing relational and non-relational databases. When querying
                                                                     Example 1. In the scope of this paper, we focus on col-
schemaless data, Presto tries to discover data types, but it
                                                                  lections of JSON documents that logically represent the same
may need a schema to be defined manually in order to run
                                                                  business entities, while expecting that their structure may
queries. All above-mentioned engines differ in their archi-
                                                                  vary. In particular, we will use as a working example a col-
tecture, support to ANSI SQL, and extensions to SQL to
                                                                  lection that models NBA games (adapted from [26]), where a
handle schemaless and nested data. These engines can con-
                                                                  document represents one game between two teams along with
nect to MongoDB, which gave us various architectural op-
                                                                  players and team results. In the sample document shown in
tions for implementation. However, none of them supports
                                                                  Figure 1a the main object is games, which holds the array
multidimensional or OLAP querying over document stores.
                                                                  of objects teams (to-many), which in turn holds the object
                                                                  results (to-one) and the array of objects players (to-many).
3.   DOCUMENT STORES                                              Figure 1b shows an alternative JSON representation of the
   A document store holds a set of databases, each organizing     same domain; the relationship between game and teams is
the storage of documents in the form of collections in a way      modeled using two separate sub-documents (home and guest
similar to rows and tables in relational databases. A docu-       team) instead of an array. Note that a simple query that
ment, also called object, consists of a set of key-value pairs    computes the average score by team would be much easier to
(or fields) mainly encoded in JSON format (www.json.org).         implement using the first representation and it would also
JSON has many variants that are used for storage and op-          perform better, since in the second representation it would
timization purposes such as BSON (www.bsonspec.org) in            require two separate queries.
MongoDB and, very recently, OSON in the Oracle DBMS
[17]. Keys are always strings, while values have the follow-
ing types: primitive (number, string, Boolean), object (or        4.      APPROACH
sub-document), array of atomic values or objects.                    As already discussed, our approach for enabling self-service
   From a conceptual point of view, a typical collection con-     BI on document stores includes three phases: multidimen-
sists of a set of business entities connected through two kinds   sional enrichment, querying, and OLAP enabling; these phases
of relationships: nesting and references. Nesting consists of     are discussed in the detail in the following subsections.
embedding objects arbitrarily within other objects, which
corresponds to two types of relationships: to-one in the case     4.1        Multidimensional Enrichment
of a nested object and to-many in the case of an array of            The aim of this phase is to define a draft multidimensional
nested objects. On the other hand, references are similar         schema on which the user will be able to formulate queries.
to foreign keys in relational databases and can be expressed      To this end, we extract the schema of the input collection
manually or using a specific mechanism such as $ref in Mon-       and enrich it with basic multidimensional knowledge. Sev-
goDB. However, when references are used, getting data re-         eral works have addressed schema extraction and discovery
quires joins which are not supported in document stores; so,      from document stores [25, 29, 14]. These works focus on the
nesting is most often used instead. Since a document store        structural variety of documents within the same collection
has a flexible schema there are a variety of ways for data        caused by their schemaless nature and by the evolution of
representation; nevertheless, the way data is modeled may         data; for instance, [14] proposes to extract a JSON schema
affect performance and data retrieval patterns.                   (www.json-schema.org) similar to a JSON document and
infer attribute and type occurrence frequencies to empha-               After the c-schema S of a collection has been defined, a
size the structural variety. In [29], all the schema versions        corresponding multidimensional schema has to be derived
in a collection are extracted and stored in a repository to be       from it. Since measures at different granularities can pos-
used within a schema management framework. In order to               sibly be included in the documents (e.g., in our example,
have a single view of the distinct schemata in a collection,         points at the player’s level and score at the team level), the
the authors propose a relaxed form of schema called skele-           definition we provide relates each single measure to a specific
ton that better captures the core and prominent fields and           set of dimension.
filters out unfrequent ones. This would be useful for us since
we rely on a single view of the collection; however, the com-          Definition 2 (Md-Schema). A           multidimensional
putational cost for building skeletons is high, making them          schema (briefly, md-schema) is a triple M “ pH, M, gq
unsuitable in real-time contexts.                                    where:
   Since schema discovery is not the focus of this paper, we
                                                                        ‚ H is a finite set of hierarchies; each hierarchy hi P H
adopt a simple algorithm that builds a tree-like schema in-
                                                                          is associated to a set Li of categorical levels and a roll-
cluding all the fields that appear in the documents [3]. This                                                               L
                                                                          up partial
                                                                                Ť order ľi of Li . Each element G P 2 , where
algorithm works in a single-pass over data to infer a tree
                                                                          L “ i Li and at most one level in G is taken from
structure of fields and their corresponding types (fields with
                                                                          each hierarchy hi , is called a group-by set of H.
varying types are generalized to String).
   The tree-like collection schema that will be the input for           ‚ M is a finite set of numerical measures.
the subsequent steps of our approach is defined as follows.
                                                                        ‚ g is a function relating each measure in M to its di-
   Definition 1 (Collection Schema). Given a collec-                      mensions, i.e., to the set of levels that determine its
tion D, its schema (briefly, c-schema) is a tree S “ pK, Eq               granularity: g : M Ñ G where G is the set of all group-
where:                                                                    by sets of H.

   ‚ K is the union of the sets of keys appearing in the                At first, a draft md-schema Mdraf t is built from S by
     documents of D;                                                 tentatively labelling all attributes of types date, string, and
                                                                     Boolean as dimensions (i.e., as levels of different hierarchies),
   ‚ r P K is the distinguished root node and is labelled with       and all attributes of types integer and decimal as measures.
     the collection name;                                            Date dimensions are then decomposed into different categor-
                                                                     ical levels giving rise to standard temporal hierarchies (e.g.,
   ‚ E is a set of arcs that includes (i) an arc r Ñ k for           date ľ month ľ year). Note that no further attempt to dis-
     each key/value pair in the root r, and (ii) an arc k1 Ñ         cover hierarchies is made at this stage, since discovering the
     k2 iff k2 appears as a key in an array of nested objects        FDs involving all attributes would be computationally too
     having key k1 .                                                 expensive. As a consequence, in Mdraf t two levels l and l1
                                                                     might be erroneously modeled as two separate dimensions,
   Figure 2 shows the c-schema to which the document of
                                                                     while they should be actually part of the same hierarchy
Figure 1a conforms. A path from the root key to a leaf
                                                                     because l ľ l1 .
key is called an attribute. In denoting attributes we use
                                                                        The user can contribute to this step by manually changing
the dot notation omitting the root key (since a single col-
                                                                     the label of some attributes, since in some cases a numeric
lection is involved) but including the keys corresponding to
                                                                     attribute can be used as a dimension, and a non-numeric
nested objects; so, for instance, the c-schema in Figure 2
                                                                     attribute can be used as a measure. The first situation has
contains attributes id, date, teams.name, teams.results.points,
                                                                     no impact on the following steps, since a group-by set can
teams.players.player, teams.players.points, etc. The depth of
                                                                     also include numeric attributes. Conversely, the second sit-
an attribute a is the number of arcs included in its path,
                                                                     uation requires that the values of the non-numeric attribute
and is denoted by depthpaq (e.g., depthpidq “ 1); the set of
                                                                     are transformed into numerical values to be supported by
attributes at depth δ is denoted by Attrspδq.
                                                                     aggregation functions.
                                                                        To complete the definition of Mdraf t , the granularity map-
                      games
                                                                     ping g must be built. We recall from [22] that an md-schema
                                                                     should comply with the multidimensional space arrangement
                                                                     constraint, stating that each instance of a measure is related
                                                                     to one instance of each dimension. In the c-schema S, at-
                                                                     tributes in Attrspδq are related by to-many multiplicity to
            id        date    teams
                                                                     those in Attrspδ ` 1q, so a measure cannot be related to
                                                                     levels having a higher depth. Therefore, gpmq is set by con-
                                                                     necting each measure m to the group-by set G that contains
                                                                     all levels l P L such that depthpmq ě depthplq.
name   abbreviation   home    score   results.points players   ...
                                                                        Example 2. In this example and in the following ones
                                                                     we will refer to the representation in Figure 1a. Attribute
                                                                     teams.score is numerical, so it is assumed to be a mea-
                                         player      points    ...   sure. It is depthpteams.scoreq “ 2, therefore teams.score
                                                                     is tentatively associated in Mdraf t with group-by set G “
                                                                     tteams.name, teams.abbreviation, teams.home, id, dateu. All
            Figure 2: NBA games c-schema                             the measures that have the same depth of teams.score
(e.g., teams.won and teams.results.points) are also asso-        Algorithm 1 Querying
ciated with G.      Similarly, measure teams.players.points      Require: A c-schema S, an md-schema M, an md-query q “
has depth 3, so it is associated with group-by set                   pGq , Mq , Σq on M
                                                                 Ensure: A (refined) md-schema M, a (valid) md-query q on M
G1 “ tteams.players.player, teams.name, teams.abbreviation,       1: Mq Ð CheckM easurespMq , Gq , Mq
teams.home, id, dateu.                                            2: if Mq “ H then
                                                                  3:    Mrec “ RecommendM easurespGq , Mq
4.2    Querying                                                   4:    if Mrec ‰ H then
                                                                  5:        update Mq based on the user’s choice
   This phase is aimed at supporting a non-expert user in for-    6: if Mq ‰ H then
mulating a well-formed multidimensional query. In a schema-       7:    if |Gq | ą 1 then
on-write approach, queries are formulated on a complete md-       8:        Gq “ CheckGroupBypMq , Gq , M, Sq
schema whose correctness is ensured by the designer. Con-         9:        if |Gq | “ 1 then
                                                                 10:             Lrec “ RecommendLevelspGq , Mq
versely, in a schema-on-read approach —our case—, the md-        11:             update Gq and M based on the user’s choice
schema is defined at read-time by the query itself. Though       12:     Σ Ð CheckSummarizationpq, Mq
this approach gives more flexibility because each user can       13:     Executepqq
“force” her own multidimensional view onto the data, it re-      14: else
                                                                 15:     ask the user to change the md-query group-by set Gq
quires a further check to ensure that the FDs implied by the
query are not contradicted by data; so, querying becomes
an iterative process where the underlying md-schema is pro-
                                                                 to be used in the next phase. Algorithm 1 works as fol-
gressively refined together with the query.
                                                                 lows. Firstly, procedure CheckMeasures drops from Mq the
   Definition 3 (Md-Query). A multidimensional query             measures, if any, that are not compatible with Gq based on
(briefly, md-query) q on md-schema M “ pH, M, gq is a            their granularity (Line 1). If no measure is left, some alter-
triple q “ pGq , Mq , Σq where                                   native (compatible) measures are suggested by Recommend-
                                                                 Measures (Lines 2–5). If some measures Mrec are found,
   ‚ Gq P G is the md-query group-by set;                        the user interacts to approve their inclusion in the md-query
                                                                 (Line 5). Then, if the md-query group-by set includes either
   ‚ Mq Ď M is the set of required measures;                     2 or 3 levels (Lines 7–11), procedure CheckGroupBy is called
   ‚ Σ is a function that associates each measure m P Mq         to drop from Gq the levels, if any, that are not compliant
     with an aggregation operator.                               with the base integrity constraint and update M accord-
                                                                 ingly (Line 8). If just one level is left in Gq , procedure
   Starting from the draft md-schema Mdraf t , the user for-     RecommendLevels is called to look for additional group-by
mulates an md-query q by choosing one to three dimensions        levels (Line 10) and possibly include them in the md-query
(Gq ), one or more measures of interest (Mq ), and an ag-        (Line 11). Finally, procedure CheckSummarization checks
gregation operator for each measure (Σ). However, to be          the aggregation operators (Line 12). We allow the resulting
considered well-formed, q (and the md-schema q is formu-         md-query to be executed with a single group-by level, but
lated on) should comply with the following constraints [22]:     not without measures (Lines 6 and 13); in the latter case,
                                                                 the user is asked to choose a new group-by set (Line 15).
  71 The base integrity constraint, stating that the levels
     in the group-by set are orthogonal, i.e., functionally      4.2.1      Checking Measures
     independent on each other.                                     The goal of this procedure is to ensure that all the mea-
                                                                 sures in Mq can be correctly aggregated at group-by set Gq ;
  72 The summarization integrity constraint, which requires      specifically, our goal is to avoid that for some m P Mq there
     disjointness (the measure instances to be aggregated        is a to-many relationship between m and a level l P Gq ,
     are partitioned by the group-by instances), complete-       because this would violate disjointness and lead to double
     ness (the union of these partitions constitutes the en-     counting when executing q. Measure check is based on the
     tire set), and compatibility (the aggregation operator      g component of M, which expresses the granularity of each
     chosen for each measure is compatible with the type         measure.
     of that measure) [15].                                         To explain how this is done, we observe that the roll-up
How to carry out these validity checks is discussed in the       partial orders on the hierarchies H of M induce a partial
following subsections.                                           order ľH on the set G of the group-by sets of H, defined as
                                                                 the product order of the single roll-up orders.1 Measure m is
   Example 3. A possible md-query on the draft md-schema         compatible with Gq iff gpmq ľH Gq ; indeed, if this condition
described in Example 2 is the one characterized by Gq “          is satisfied, the granularity expressed by the group-by set is
tteams.players.player, dateu, Mq “ tteams.scoreu, and Σ “        coarser than the one at which m is defined, meaning that
Sum. Unfortunately, aggregating on teams.score would re-         m can be safely aggregated at Gq . The measures that are
sult in double counting since each instance of teams.score is    found not to be compatible with Gq are removed from Mq .
related to multiple instances of teams.players.player (a team
has several players). So, this query violates disjointness and      Example 4. Considering again the query in Example 3,
is not well-formed.                                              it is gpteams.scoreq ńH Gq (because teams.players.player is
                                                                 not in gpteams.scoreq). Then, q is not well-formed and
  The pseudo-code of the querying phase is sketched in Al-       teams.score is dropped from Mq .
gorithm 1. It starts from the md-query q and takes as input      1
                                                                  The product order of n total orders is a partial order on
both the c-schema and the md-schema. The output of the           the Cartesian product of the n totally ordered sets, such
algorithm is a valid md-query based on a refined md-schema       that px1 , . . . , xn q ľ py1 , . . . , yn q iff xi ľ yi for i “ 1, . . . , n.
   This check is performed at the schema level, so some ex-                     (c) Finally, when l1 Ñ l and l2 Ñ l,
ceptions may arise when the relationship between a measure
                                                                                    i. if the relationship between l1 and l2 is many-to-
m and Gq is hidden in data. This may happen when arrays
                                                                                       many, they are both retained in Gq ; conceptu-
are not used to model to-many relationships, or when m de-
                                                                                       ally, l is a convergence [8] and the md-schema
pends either on a level not present in Gq or on a subset of
                                                                                       must be updated by adding l1 ľ l and l2 ľ l.
the levels in Gq [18]. In these cases, m may be non-additive
or even non-aggregable on Gq , so the user’s knowledge of                          ii. if the relationship between l1 and l2 is many-to-
the application domain is required to manually fix summa-                              one, then the three levels in Gq belong to the
rization (see Section 4.2.5).                                                          same hierarchy (l1 ľ l2 ľ l) and only the level
                                                                                       with highest cardinality, l1 , is kept in Gq .
4.2.2     Checking Group-by Set
   This process is aimed at ensuring completeness and base
                                                                        4.2.3     Recommending Measures
integrity.                                                                 When there is no compatible measure left in Mq , this pro-
   The completeness condition is violated when, for some                cedure looks for alternative measures. Specifically, it returns
level l P Gq , there is no instance corresponding to one or             all the measures in the md-schema that are compatible with
more instances of a measure in Mq ; from a conceptual point             Gq , i.e., all m P M such that gpmq ľH Gq .
of view, l is classified as optional [8]. In a schema-on-write
approach this problem is fixed by aggregating all “dangling”               Example 5. With reference to Example 4, procedure
measures into an ad-hoc group of l. Similarly, in our schema-           RecommendM easures returns teams.players.points, since it
on-read approach, these instances are grouped into a null               is gpteams.players.pointsq ľH Gq .
instance of l thus restoring the completeness condition; then
for instance, the $ifNull operator of MongoDB could be used             4.2.4     Recommending Levels
to replace null values with an ad-hoc one when executing q.               Recommending additional group-by levels is done when
   Base integrity requires that the levels in Gq are mutually           originally the user had selected two or three group-by levels,
orthogonal. So, from this point of view, md-query q is valid            but only one of them was left in Gq after checking the group-
only if, for each pair of levels l, l1 P Gq , there is no FD            by set. For a given candidate level l, it requires to check
between l and l1 , i.e., there is a many-to-many relationship           that the base integrity constraint is met between l and some
between them. An FD is a to-one relationship, which we will             other level in M, and that the measure check is not violated.
denote with l Ñ l1 to emphasize that values of l functionally           The first level found is proposed to the user, who has the
determine the values of l1 . Checking base integrity requires           choice to use it or to proceed looking for other levels (up to
to access data in order to retrieve AFDs between the levels,            a maximum of three levels in the group-by set).
in Gq , that are not related by a to-many multiplicity in the
c-schema (see Section 4.3 for more details about AFDs).                    Example 6. Let Gq         “    tid, dateu and Mq      “
   We recall that 1 ď |Gq | ď 3:                                        tteams.players.pointsu.    The relationship between the
                                                                        levels in Gq is many-to-one, so date is removed. In this
  1. If |Gq | “ 1, orthogonality is obvious.                            case, to recommend levels we may look for a many-to-many
  2. If |Gq | “ 2, the relationship between the two levels              relationship between id and teams.name, teams.abbreviation,
     in Gq is checked. If it turns out to be many-to-many               teams.home, and teams.players.player. Since teams.name
     (e.g., if Gq “ tdate, teams.nameu), the base integrity             does not violate the measure check and has a many-to-many
     constraint is met and both levels remain in Gq . If it             relationship with id, it is recommended to the user as a
     turns out to be many-to-one (e.g., if Gq “ tid, dateu),            possible group-by level.
     then there is a roll-up relationship between the two
     levels in Gq . Only the level at the “many” side (id) re-          4.2.5     Checking Summarization
     mains in Gq and the md-schema is modified by placing                  Compatibility states that measures cannot be aggregated
     these levels in the same hierarchy (id ľ date). A spe-             along hierarchies using any aggregation operators. In prin-
     cial case is when the relationship is one-to-one (e.g.,            ciple, checking summarization would require knowing the
     if Gq “ tteams.name, teams.abbreviationu); here, one of            measure category (flow, stock, or value-per-unit), the type
     the levels is kept in Gq while the other is considered             of group-by levels (temporal or non-temporal), and the ag-
     as a descriptive attribute.                                        gregation operator (Sum, Avg, Min, Max, Count, etc.) [15].
                                                                        Knowing the measure category and type of group-by levels
  3. In case |Gq | “ 3, there are three possibilities:                  could be used to recommend some aggregation operators,
                                                                        but still the user would have to choose among a number
        (a) If many-to-many relationships are found between             of potential options; besides, correctly classifying a mea-
            each pair of levels, then the constraint is met             sure into flow, stock, or value-per-unit may be hard for non-
            and all levels remain in Gq (e.g., if Gq “                  skilled users. Therefore, we prefer to enable users to directly
            tdate, teams.name, teams.players.playeru).                  pick an aggregation operator for each measure in Mq .
        (b) Let many-to-many relationships be found between                Summarization is violated when a measure m P Mq has
            two pairs of levels pl, l1 q, pl, l2 q, and a many-to-one   finer granularity than another measure m1 P Mq , since m
            relationship be found between the remaining pair            would be double-counted when executing q. In this case,
            pl1 , l2 q (e.g., if Gq “ tteams.name, id, dateu, since     the user is warned that she will get erroneous results, and
            id Ñ date). In this case, l1 and l2 are placed in the       she may choose to change the aggregation operator for m
            same hierarchy in the md-schema (l1 ľ l2 ), and             (e.g., using Min, Max or Avg instead of Sum will give the
            l2 is removed from Gq .                                     correct result) or to drop m from Mq .
   Example 7. Let Mq “ tteams.score, teams.players.ptsu.             4.3.1    Roll-up Discovery
These measures have different granularities, so if the user             Let l P Gq ; discovering possible roll-ups for l requires to
has chosen to aggregate teams.score (which has finer gran-           mine the AFDs of type l ; l1 , with l1 P LzGq and depthpl1 q ď
ularity) using Sum she will get double counting. Then, she           depthplq. Let R be the set of all right-hand sides for these
can either change the aggregation operator for teams.score           AFDs. To avoid useless checks, the AFDs whose right-hand
to Avg or drop teams.score from Mq .                                 side l1 has higher cardinality than l are not checked, since
                                                                     they clearly cannot hold. One exception is when |l| “ |l1 | or
4.3    OLAP Enabling                                                 |l| Á |l1 | (due to approximation): in this case both l1 ; l
   The goal of this phase is to refine the md-schema by dis-         and l ; l1 must be checked, since the relationship between
covering some hierarchies, so that the user is enabled to in-        l and l1 may be one-to-one (so l1 is a descriptive attribute
teract with data in an OLAP fashion. Completely building             for l). These bidirectional checks are made at the end of the
all the hierarchies would require to mine all FDs between            process to prioritize the discovery of “true” hierarchies. The
levels, which would be computationally too expensive. For            AFDs in R are checked until one is found to hold, say l ; l1 .
this reason, we only build local portions of hierarchies for         Then, the md-schema M is updated by placing these levels
the levels in the group-by set of the previously-formulated          in the same hierarchy (l ľ l1 ) and the user is notified of the
md-query q. Besides, again for complexity reasons, we only           ability of performing a roll-up from l.
mine simple FDs (i.e., those relating single attributes rather
than attribute sets), which are mostly common in multidi-               Example 8. Let Gq “ tid, teams.nameu and Mq “
mensional schemata.                                                  tteams.score, teams.wonu. The search space for level id
   Specifically, the idea is to mine, for each level l P Gq for      only includes date, since it is the only level for which
which no roll-up relationships have been discovered yet, the         depthpdateq ď depthpidq. Since |id| ą |date|, id ; date
FDs of either type l Ñ l1 (to enable a roll-up of l) or l1 Ñ l (to   is checked. This is found to be true, so the md-schema is
enable a drill-down of l). Then, if the user applies a roll-up       updated with id ľ date. The search space for teams.name
or drill-down, a new md-query q 1 is formed and the process          consists of teams.abbreviation and teams.home; by query-
is iterated to further extend the hierarchies. Remarkably,           ing data we find that |teams.name| “ |teams.abbreviation| ą
using FDs to discover hierarchies guarantees that q 1 satisfies      |teams.home|.     Firstly, teams.name ; teams.home is
the conditions discussed earlier, specifically disjointness and      checked, and found not to hold. Then, teams.name ;
completeness.                                                        teams.abbreviation and teams.abbreviation ; teams.name are
   FDs are not explicitly modeled in a document store, so            checked; both are found to hold, giving rise to a descriptive
we must resort to data. Since document stores host large             attribute.
amounts of data, we can reasonably assume that the mined
FDs are representative enough of the application domain.             4.3.2    Drill-Down Discovery
However, schemaless data commonly present some errors
                                                                        Here the set of candidate levels for checking AFDs of type
and missing values which may hide some FDs. The tool we
                                                                     l1 ; l includes the levels in LzGq whose depth is greater
adopt to cope with this issue are approximate FDs (AFDs)
                                                                     or equal than l, i.e., such that depthpl1 q ě depthplq. As in
[10, 11], which “almost hold” on data, instead of traditional
                                                                     roll-up discovery, an AFD whose left-hand side l1 has lower
ones. Unfortunately, this entails an approximate satisfac-
                                                                     cardinality than l is not checked, since it cannot hold. If
tion of the disjointness and completeness conditions, possi-
                                                                     l1 ; l is found to hold, l1 ľ l is added to M and the user is
bly leading to non-disjoint and incomplete hierarchies.
                                                                     notified of the ability of drilling down from l.
   Definition 4 (Approximate FD). Given two levels l                    Note that drilling down from l to l1 could produce a new
and l1 , let strengthpl, l1 q denote the ratio between the number    md-query q 1 whose group-by set, Gq1 “ Gq ztlu Y tl1 u, is not
of unique values of l and the number of unique values of ll1 .       compatible with some of the measures in Mq1 “ Mq . Since
We will say that AFD l ; l1 holds if strengthpl, l1 q ě ,           checking compatibility is computationally cheaper than check-
where  is a user-defined threshold [11].                            ing AFDs (as explained in Section 4.2.1, it can be done
                                                                     based on the partial order of group-by sets, without access-
   At a given time during a user’s session, let M be the cur-        ing data), the AFD for a candidate level l1 is checked only if
rent version of the md-schema and q be the last md-query             Gq1 is found to be compatible with at least one of the mea-
formulated. Then, the search space for mining AFDs in-               sures in Mq1 . Of course, if l1 ; l is found to hold and the
cludes the levels that are not in Gq and are not involved            user decides to drill-down to l1 , the incompatible measures
in roll-up relationships in M. To reduce the search com-             in Mq1 must be dropped.
plexity we take into account the structural clues provided
by the c-schema. Specifically, let l and l1 be two levels, such         Example 9. Consider again Example 8.                  The
that depthplq ă depthpl1 q; as already stated, the attributes        search space for id includes date, teams.home, and
in Attrspδq are related by to-many multiplicity to those in          teams.players.player (we recall that teams.abbreviation has
Attrspδ ` 1q, so l ­; l1 . For this reason, checking if l and        already been added as a descriptive attribute). Since id
l1 should be placed in the same hierarchy only requires to           has the highest cardinality, there are no AFDs to check.
check if l1 ; l. In addition, we avoid checking trivial and          On the other hand, the search space for teams.name
transitive AFDs since we explore one hierarchy at a time.            consists of teams.home and teams.players.player, where
Finally, the result of all AFD checks performed is stored            |teams.home| ă |teams.name| ă |teams.players.player|. AFD
in a meta-data repository, to be used to avoid checking the          teams.home ; teams.name is not checked because of its
same AFDs twice.                                                     cardinality; the same for teams.players.player ; teams.name,
   In the following subsections, roll-up and drill-down dis-         since Gq1 “ tid, teams.players.playeru is not compatible with
covery are detailed.                                                 the measures in Mq1 .
  The md-schema resulting after OLAP enabling in Exam-              In this section, we consider five md-queries on the games
ples 8 and 9 is shown in DFM notation [8] in Figure 3. Note         dataset, q1 to q5 , and four on the DBLP dataset, q6 to q9
that, at the end of the user’s session, md-schemata can be          (the group-by sets of all queries are shown in Table 1). Then,
stored for reuse and sharing.                                       we measure the time required by the querying and OLAP-
                                                                    enabling phases. We also discuss the efficiency of our algo-
                                                                    rithm based on the number of performed and skipped checks
     Year
                                                                    in each phase. The results obtained are proposed below.
 Month                       Game
                   id                  Teams.name                   5.1      Querying
                         Teams.score                                   To evaluate the efficiency of the validation of an md-query,
                         Teams.won             Teams.abbreviation
            Date
                                                                    we measure the time of the group-by check since it is the
                                                                    only step that requires to access data.2 In Table 1 we show,
                                                                    for our nine md-queries, the number of checks avoided using
             Figure 3: The games md-schema                          the c-schema (#Avoided), and the total time for validating
                                                                    the md-query (tquerying ). In order to measure tquerying , the
                                                                    AFD checks required by a single md-query are executed in
5.      EXPERIMENTAL EVALUATION                                     parallel. We can easily conclude that (i) our approach effec-
                                                                    tively reduces the number of checks by relying on the dataset
  We evaluate the performance of our approach using two
                                                                    structure, and (ii) the md-query validation time depends on
real-world datasets :
                                                                    the number of levels in the group-by set and on their depths.
     ‚ Games. The working example dataset has been col-             Remarkably, the md-query validation time is still reasonable
       lected by Sports Reference LLC [26]. It contains around      (max 50 secs) even with a large dataset, which makes our
       32K nested documents representing NBA games in the           approach suitable for real-time usage.
       period of 1985–2013. Each document represents a game
       between two teams with at least 11 players each. It              Dataset   q    Gq                      #Avoided   tquerying
       contains 46 attributes; 40 of them are numeric and                         q1   id, teams.name            1/2       0.181
       represent team and player results.                                         q2   date, teams.name          1/2       0.177
                                                                                  q3   id,teams.name,            2/6       0.211
     ‚ DBLP. This dataset contains 5.4M documents scraped                              teams.home
       from DBLP (dblp.uni-trier.de/xml/) in XML for-                   Games     q4   teams.players.player,     3/6       1.364
       mat and converted into JSON. Documents are flat and                             date, teams.name
       represent eight kinds of publications including confer-                    q5   teams.players.player,     2/6       1.157
                                                                                       teams.name,
       ence proceedings, journal articles, books, thesis, etc.
                                                                                       teams.home,
       So, they have common attributes such as title, au-
       thor, type, year and uncommon ones such as journal                         q6   journal, year             0/2       3.616
                                                                                  q7   year, type                0/2       11.843
       and booktitle.                                                   DBLP
                                                                                  q8   author, year              1/2       36.685
   Implementation. Each dataset has been loaded as a col-                         q9   type, year, author        2/6       49.993
lection in a local MongoDB instance running on Intel Core i5
CPU at 3.3 GHz and 4 GB of RAM machine with Ubuntu
14.04 OS. We have focused on the parts of our approach              Table 1: Total time for the querying phase (in sec-
that require accessing data to retrieve multiplicities of inter-    onds)
level relationships by means of count distinct queries,
since the purely algorithmic parts have no relevant compu-
tational complexity. Query execution can be delegated ei-           5.2      OLAP Enabling
ther to the document store (MongoDB in our architecture)               For OLAP enabling, we measure the time to check each
or to query engines such as Spark SQL, Drill, and Presto.           single AFD and the time to retrieve the cardinality of each
Based on the tests we performed, these engines are signifi-         level. Tables 2 and 3 show, for each AFD checked between
cantly slower than native querying, plausibly because of the        a pair of levels, the number of processed documents, the
different query execution plans. In addition, these engines         AFD strength, and the checking time. The difference in the
sometimes return incorrect results when it comes to count           number of documents (or the data size) for a given dataset
distinct queries involving nested attributes. Therefore, we         is to due to the flattening operation that is performed when
defined two queries based on the native query language of           executing QAF D . The results clearly show that the querying
MongoDB:                                                            time tAF D depends on the levels and their depths. This
     1. The first one (QAF D ) checks the AFD between a pair        is apparent for the AFDs involving the author attribute in
        of levels and returns its approximation using the for-      Table 3, which have a higher execution time. This is due
        mula of Definition 4.                                       to the amount of memory allowed by MongoDB for group-
                                                                    by queries, which is limited to 100 MB; when a query does
     2. The second one (Qcard ) returns the cardinality of a        not fit in memory, temporary files must be written on disk,
        level.                                                      which dramatically slows execution down.
  Methodology. In the previous section, we showed how               2
                                                                     The time for actually executing each md-query is not con-
our approach validates md-queries and iteratively derives an        sidered here, since the optimization of each md-query is out
md-schema that satisfies the multidimensional constraints.          of the paper scope.
 l ; l1                          #Docs      strengthpl, l1 q   tAF D       Table 4 shows for each level its cardinality and the time
 id ; date                                      100%            0.113   to retrieve it. From these results, we can conclude that the
                                    32K
 date ; id                                      13.96%          0.065   efficiency of checking cardinalities also depends on the lev-
 t.name ; id                                    0.05%           0.181   els. tcard is normally small, since MongoDB can use indexes
 t.name ; date                                  0.05%           0.177   instead of collection scans. The high value of tcard for the
                                    64K
 t.name ; t.abbreviation                        100%            0.175   level teams.players.player is due to the non-use of nested at-
 t.name ; t.home                                 50%            0.168   tribute indexes to answer the corresponding query. As to
 t.abbreviation ; id                            0.05%           0.169   level author, Qcard failed to return the result because the
 t.abbreviation ; date                          0.05%           0.169   BSON document size (16MB) was exceeded. So we had to
                                    64K
 t.abbreviation ; t.name                        100%            0.161   define a different query that could not benefit from indexes
 t.abbreviation ; t.home                        50%             0.163
                                                                        and suffered from the memory limit discussed earlier.
 t.home ; id                                     0.01%          0.169      Using cardinalities to reduce the search space significantly
 t.home ; date                                   0.02%          0.159   improves the overall performance, since checking a cardinal-
                                    64K
 t.home ; t.name                                 2.77%          0.179
 t.home ; t.abbreviation                         2.77%          0.169   ity is faster than checking an AFD. Besides, the number of
                                                                        feasible AFDs when using cardinalities (8, i.e., their strength
 t.players.player ; id                          0.34%           1.285   is shown in bold in Table 2) is less than half the total number
 t.players.player ; date                        0.34%           1.278
 t.players.player ; t.name          644K       34.74%           1.113   of all checked AFDs (19, i.e., the number of rows in Table
 t.players.player ; t.abbrev.                  34.74%           1.087   2). The same applies for the DBLP dataset, where only 10
 t.players.player ; t.home                     50.54%           1.054   of 16 AFDs are feasible.
                                                                           Finally, Table 5 shows the overall performance of the OLAP
                                                                        enabling phase for our nine md-queries (for all levels in the
Table 2: Time for AFD checks on the Games dataset                       query): the number of roll-up and drill-down relationships
(in seconds)                                                            discovered, the number of checks avoided in roll-up and
                                                                        drill-down discovery, and the total time spent. The latter
 l ; l1                         #Docs      strengthpl, l1 q    tAF D    (tOLAP ) is calculated as the sum of the times for checking
                                                                        each AFD, plus the time for checking all cardinalities. The
 year ; type                                  21.76%           10.945
 year ; journal                 5.4M           0.35%            3.373   results show that our approach for discovering hierarchies
 year ; booktitle                              0.16%            4.461   effectively reduces the number of AFD checks by using car-
                                                                        dinalities and by relying on the dataset structure. Further-
 type ; year                                   2.07%           11.211
 type ; journal                 5.4M           0.12%            8.181   more, the time required by OLAP enabling is reasonable for
 type ; booktitle                              0.03%            4.824   all md-queries (2 minutes at most), which proves that our
                                                                        performances fit real-time contexts.
 journal ; year                                7.09%           3.365
 journal ; type                 5.4M          99.94%           8.166
 journal ; booktitle                       no relationship     1.841     Dataset    q   #Roll-up, #Drill-down     #Avoided     tOLAP
 booktitle ; year                             29.02%           4.517               q1             1,0             5/8, 6/8     2.657
 booktitle ; type               5.4M          54.97%           4.949               q2             0,1             6/8 , 5/8    2.657
 booktitle ; journal                       no relationship     1.852     Games     q3             1,0             7/9, 5/9     3.706
                                                                                   q4             0,1             5/9, 7/9     3.685
 author ; type                                42.98%           43.231              q5             0,0             7/9, 7/9     2.513
 author ; year                                36.13%           36.685
                                12M
 author ; journal                             42.31%           15.401              q6             1,0             4/6 , 2/6   89.940
 author ; booktitle                           31.92%           18.881     DBLP
                                                                                   q7             0,1             6/6 , 0/6   113.287
                                                                                   q8             0,0             3/6 , 3/6   107.401
                                                                                   q9             0,1             4/6 , 2/6   67.653
Table 3: Time for AFD checks on the DBLP dataset
(in seconds)
                                                                        Table 5: Total time for the OLAP enabling phase
                                                                        (in seconds)
 Dataset     l                                   |l|           tcard
             id                                31686           0.124
             date                              4424            0.019
             teams.name                          36            0.057    6.   CONCLUSION
 Games       teams.abbreviation                  36            0.056
                                                                           In this paper we have proposed an interactive schema-on-
             teams.home                          2             0.049
             teams.players.player              2191            0.622    read approach for enabling OLAP on document stores. To
             All levels                          —             0.927    this end, AFDs are mined and used to automate the discov-
                                                                        ery of multidimensional structures. The user interaction is
             author                          1859242           12.298
             journal                           1632            0.011
                                                                        limited to the selection of multidimensional concepts and to
             booktitle                        10811            0.063    a proper choice of the aggregation operators. After validat-
 DBLP                                                                   ing user queries from the multidimensional point of view and
             year                               82             0.001
             type                                8             0.001    refining the underlying multidimensional schema, we adopt
             All levels                         —              12.374   a smart strategy to efficiently build local portions of hierar-
                                                                        chies aimed at enabling OLAP-style user interaction in the
                                                                        form of roll-ups and drill-downs.
 Table 4: Time for cardinality checks (in seconds)                         Overall, the experiments we conducted show that the per-
                                                                        formances of our approach are in line with the requirements
of a real-time user interaction. However, some relevant is-     [13] M. R. Jensen, T. H. Møller, and T. B. Pedersen.
sues still need to be explored and are part of our future            Specifying OLAP cubes on XML data. Journal of
work. To improve performance, we plan to further optimize            Intelligent Information Systems, 17(2-3):255–280,
the algorithms proposed. Besides, to increase effectiveness,         2001.
the evolution of data and schemata in a collection must be      [14] M. Klettke, U. Störl, and S. Scherzinger. Schema
considered; we intend to address this issues by searching for        extraction and structural outlier detection for
temporal FDs.                                                        JSON-based NoSQL data stores. In Proc. BTW, pages
                                                                     425–444, 2015.
7.   REFERENCES                                                 [15] H. J. Lenz and A. Shoshani. Summarizability in
 [1] A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli,           OLAP and statistical databases. In Proc. SSDBM,
     J.-N. Mazón, F. Naumann, T. B. Pedersen, S. Rizzi,             pages 132–143, 1997.
     J. Trujillo, P. Vassiliadis, and G. Vossen. Fusion         [16] Z. H. Liu and D. Gawlick. Management of flexible
     cubes: towards self-service business intelligence.              schema data in RDBMSs - opportunities and
     International Journal of Data Warehousing and                   limitations for NoSQL. In Proc. CIDR, 2015.
     Mining, 9(2):66–88, 2013.                                  [17] Z. H. Liu, B. Hammerschmidt, D. McMahon, Y. Lu,
 [2] A. Abelló, O. Romero, T. B. Pedersen, R. Berlanga,             and H. J. Chang. Closing the functional and
     V. Nebot, M. J. Aramburu, and A. Simitsis. Using                performance gap between SQL and NoSQL. In Proc.
     semantic web technologies for exploratory OLAP: A               SIGMOD, pages 227–238, 2016.
     survey. IEEE Transactions on Knowledge and Data            [18] J. N. Mazón, J. Lechtenbörger, and J. Trujillo. A
     Engineering, 27(2):571–588, 2015.                               survey on summarizability issues in multidimensional
 [3] M. Armbrust, A. Ghodsi, M. Zaharia, R. S. Xin,                  modeling. Data and Knowledge Engineering,
     C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng,               68(12):1452–1469, 2009.
     T. Kaftan, and M. J. Franklin. Spark SQL: Relational       [19] C. Phipps and K. C. Davis. Automating data
     data processing in spark. In Proc. SIGMOD, pages                warehouse conceptual schema design and evaluation.
     1383–1394, 2015.                                                In Proc. DMDW, pages 23–32, 2002.
 [4] R. Cattell. Scalable SQL and NoSQL data stores.            [20] N. Prat, J. Akoka, and I. Comyn-Wattiau. A
     ACM SIGMOD Record, 39(4):12–27, 2011.                           UML-based data warehouse design method. Decision
 [5] C. Chasseur, Y. Li, and J. Patel. Enabling JSON                 Support Systems, 42(3):1449–1473, 2006.
     document stores in relational systems. In Proc.            [21] S. Rizzi, E. Gallinucci, M. Golfarelli, O. Romero, and
     WebDB, pages 1–6, 2013.                                         A. Abelló. Towards exploratory OLAP on linked data.
 [6] A. Cuzzocrea, L. Bellatreche, and I.-Y. Song. Data              In Proc. SEBD, pages 86–93, 2016.
     warehousing and OLAP over big data: current                [22] O. Romero and A. Abelló. Multidimensional design by
     challenges and future research directions. Proc.                examples. In Proc. DaWaK, pages 85–94, 2006.
     DOLAP, pages 67–70, 2013.                                  [23] O. Romero and A. Abelló. Automating
 [7] M. Golfarelli, S. Graziani, and S. Rizzi. Starry vault:         multidimensional design from ontologies. In Proc.
     Automating multidimensional modeling from data                  DOLAP, pages 1–8, 2007.
     vaults. In Proc. ADBIS, pages 137–151, 2016.               [24] O. Romero, D. Calvanese, A. Abelló, and
 [8] M. Golfarelli, D. Maio, and S. Rizzi. The dimensional           M. Rodrı́guez-Muro. Discovering functional
     fact model: A conceptual model for data warehouses.             dependencies for multidimensional design. In Proc.
     International Journal of Cooperative Information                DOLAP, pages 1–8, 2009.
     Systems, 7(2-3):215–247, 1998.                             [25] D. S. Ruiz, S. F. Morales, and Jesus Garcia Molina.
 [9] M. Golfarelli, S. Rizzi, and B. Vrdoljak. Data                  Inferring versioned schemas from NoSQL databases
     warehouse design from XML sources. In Proc.                     and its applications. In Proc. ER, pages 467–480, 2015.
     DOLAP, pages 40–47, 2001.                                  [26] K. Valeri. https://thecodebarbarian.wordpress.com,
[10] Y. Huhtala, J. Kärkkäinen, P. Porkka, and                     2014.
     H. Toivonen. TANE: An efficient algorithm for              [27] J. Varga, A. Vaisman, O. Romero, L. Etcheverry,
     discovering functional and approximate dependencies.            T. B. Pedersen, and C. Thomsen. Dimensional
     Computer Journal, 42(2):100–111, 1999.                          enrichment of statistical linked open data. Journal of
[11] I. F. Ilyas, V. Markl, P. Haas, P. Brown, and                   Web Semantics, 40:22–51, 2016.
     A. Aboulnaga. Cords: Automatic discovery of                [28] B. Vrdoljak, M. Banek, and S. Rizzi. Designing web
     correlations and soft functional dependencies. In Proc.         warehouses from XML schemas. In Proc. DaWaK,
     of SIGMOD, pages 647–658, 2004.                                 pages 89–98, 2003.
[12] C. S. Jensen, T. Holmgren, and T. B. Pedersen.             [29] L. Wang, O. Hassanzadeh, S. Zhang, J. Shi, L. Jiao,
     Discovering multidimensional structure in relational            J. Zou, and C. Wang. Schema management for
     data. In Proc. DaWaK, pages 138–148, 2004.                      document stores. VLDB Journal, 8(9):922–933, 2015.