Enabling Self-Service BI on Document Stores Mohamed L. Chouder Stefano Rizzi Rachid Chalal LMCS, ESI DISI, University of Bologna LMCS, ESI Algiers,Algeria Bologna, Italy Algiers,Algeria m_chouder@esi.dz stefano.rizzi@unibo.it r_chalal@esi.dz ABSTRACT paradigm). This schemaless nature provides a flexible data The growing use of document stores has resulted in vast model that has attracted developers seeking to avoid the amounts of semi-structured data holding precious informa- restrictions posed by the relational model. tion, which could be profitably integrated into existing BI Document stores usually collect nested, denormalized, and systems. Unfortunately, due to their schemaless nature, doc- hierarchical documents in the form of collections; the docu- ument stores are hardly accessible via direct OLAP query- ments within the same collection may present a structural ing. In this paper we propose an interactive, schema-on-read variety due to schema flexibility and data evolution. Doc- approach for finding multidimensional structures in docu- uments are self-describing and mainly encoded using the ment stores aimed at enabling OLAP querying in the context semi-structured data format JSON (JavaScript Object No- of self-service BI. Our approach works in three steps: mul- tation), which is also popular as an exchange format and tidimensional enrichment, querying, and OLAP enabling; widely adopted in modern web APIs. the validation of user queries and the detection of multidi- The growing use of document stores and the dominance mensional structures is based on the mining of approximate of JSON have resulted in vast amounts of semi-structured functional dependencies from data. The efficiency of our data holding precious information, which could be profitably approach is discussed with reference to real datasets. integrated into existing business intelligence (BI) systems [1]. Indeed, multidimensional modeling and analysis has been recognized to be an effective way for conducting an- CCS Concepts alytics over big NoSQL data [6]. Unfortunately, due to their •Information systems Ñ Data warehouses; •Applied schemaless nature, document stores are hardly accessible via computing Ñ Business intelligence; direct OLAP querying. Recent efforts in this area propose SQL-like query interfaces (e.g., Apache Drill, Spark SQL, Keywords Presto), which provide access to schemaless and nested data while offering the possibility of using traditional BI tools. NoSQL, Document-Oriented Databases, JSON, Multidimen- However, none of these solutions supports multidimensional sional modeling querying or OLAP over document stores. In this paper we propose an interactive, schema-on-read 1. INTRODUCTION approach for finding multidimensional structures in docu- Over the past decade, companies have been adopting NoSQL ment stores aimed at enabling OLAP querying in the con- databases to deal with the huge volumes of data manipu- text of self-service BI. Differently from a schema-on-write lated by modern applications. NoSQL systems have emerged approach, which would force a fixed structure in data and as an alternative to relational database management sys- load them into a data warehouse, a schema-on-read ap- tems in many implementations [4]. They can be classi- proach leaves data unchanged in their structure until they fied based on their data model, the most popular categories are accessed by the user [16]. We claim that, when querying being key-value, wide-column, graph-based and document- document stores for self-service BI, a schema-on-read ap- oriented; document-oriented databases (briefly, document proach should be preferred to a schema-on-write approach. stores) are probably the most widely adopted so far. The The main reason for this is that document stores handle common features of document stores are horizontal scala- high volumes of data with varied structure, which question bility on commodity hardware and the lack of an explicit the ability of the traditional ETL in processing data on the schema (according to the data first, schema later or never one hand, and the possibility of storing them into a data warehouse with a fixed schema on the other hand. To discover the information necessary for discovering mul- tidimensional concepts despite the lack of structure, we re- sort to data distributions, i.e., we mine approximate func- tional dependencies (AFDs) and use them to automate the design process. More specifically, our approach takes as in- c 2017, Copyright is with the authors. Published in the Workshop Proceed- ings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, put a collection of nested documents and operates in three Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is phases. permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0 1. Multidimensional Enrichment. In this phase, a In these approaches, multidimensional modeling is done at schema is extracted out of the collection and enriched design time, mainly based on FDs expressed in the source with some basic multidimensional knowledge. This is schemata as foreign keys or many-to-one relationships. Con- done by tentatively classifying attributes into dimen- versely, our approach is meant to be used at query time sions and measures, and by relating each measure to and automates the discovery of multidimensional concepts the subset of dimensions that determine its granular- by mining FDs from data, as required by the schemaless ity, so as to create a draft multidimensional schema. context. Multidimensional design from non-relational data. 2. Querying. Now the user can formulate a multidi- Our approach closely relates to previous approaches for mul- mensional query by picking dimensions and measures tidimensional modeling from semi-structured data [9, 13, 28] of interest. This query is checked by accessing data to in XML format, which is similar to JSON. These approaches ensure its multidimensional validity and correct sum- take in input DTDs or XML schemata that provide rich in- marization. If the query is found to be well-formed, it formation about XML documents (e.g., multiplicities, data is executed. Otherwise, the multidimensional schema types), so they cannot operate directly on XML data not is refined and some possible querying alternatives are having a schema specification. In particular, the work in [28] proposed to the user. builds multidimensional schemata starting from an XML 3. OLAP Enabling. This iterative phase enables the schema but, in some cases, data is accessed to discover FDs user to further explore data by running an OLAP ses- that are not expressed in the schema. Similarly, starry vault sion. To this end, local portions of multidimensional [7] is a recent approach that mines FDs for multidimensional hierarchies (consisting of roll-up and drill-down rela- modeling from data vaults. These are databases character- tionships for each level involved in the user query) are ized by a specific data model tailored to provide historical built by mining AFDs from data. The user can now storage in presence of schema evolutions. The main idea is apply an OLAP operator (e.g., roll-up or drill-down) to mine approximate and temporal FDs to cope with the to iteratively create a new query on the collection. issue of noisy and time-varying data, which may result in hidden FDs. The advantages of the proposed approach are twofold. First, All the above-mentioned approaches are similar in that it enables decision makers to formulate multidimensional they define multidimensional schemata at design time using queries and create reports on-the-fly on document stores in structural and additional information extracted from data a self-service fashion, i.e., without any support from ICT (i.e., FDs). In contrast, our approach operates at query people. This reduces the time and effort spent in traditional time and mostly relies on data distributions, while structural BI settings. Second, our approach also works when nested information is only used —when available— to intelligently structures are not present within the input data (e.g., for reduce the search space. flat documents), which is relevant because a large number OLAP on linked data. Recent works in this space pro- of non-nested datasets are available in JSON format on the pose to directly perform OLAP-like analysis over semantic web (e.g., more than 11000 collections in www.data.gov). web data. Exploratory OLAP has been defined as the pro- The approach has been implemented on top of MongoDB, cess that discovers, collects, integrates, and analyzes these one of the most popular document stores. external data on the fly [2]. Some works in this direction ad- The rest of this paper is organized as follows. Section 2 dress the problem of building multidimensional hierarchies discusses the related work and Section 3 presents document from linked data [21, 27], which is closely related to the stores along with the working example. Our approach is third phase of our approach. Specifically, iMOLD [21] is detailed in Section 4 and experimentally evaluated in Sec- an instance-based approach that operates at query time for tion 5. Section 6 concludes the paper and gives some future exploratory OLAP on linked data; it aims at finding mul- directions for research. tidimensional patterns representing roll-up relationships, in order to enable users to extend corporate data cubes with 2. RELATED WORK new hierarchies extracted from linked data. Similarly, [27] proposes an algorithm for discovering hierarchies from di- Supply-driven multidimensional design. The au- mensions present in statistical linked data represented us- tomation of multidimensional modeling has been widely ex- ing the RDF Data Cube (QB) vocabulary (www.w3.org/TR/ plored in the area of data warehouse design. Supply-driven vocab-data-cube/). Like starry vault, this algorithm mines approaches automate the discovery of multidimensional con- for AFDs (here called quasi-FDs) to cope with the presence cepts by a thorough analysis of the source data. The first of imperfect and partial data. approaches proposed algorithms to build multidimensional The algorithm that we propose for building hierarchies dif- schemata starting from Entity/Relationship diagrams or re- fers from these works in that, to ensure good performances, lational schemata [8, 19, 12], while further approaches con- it looks for local portions of hierarchies for the levels involved sidered more expressive conceptual sources such as UML in the user queries, thus acting in an on-demand fashion. class diagrams [20] and ontologies [23]. Furthermore, [24] SQL on schemaless data. Several solutions have emerged discussed the use of FDs for automating multidimensional in the industry to enable SQL querying of schemaless data modeling from ontologies. In particular, similarly to the for analytic purposes. These solutions can be classified into querying phase of our approach, [22] proposes an algorithm two categories: (1) relational systems enabling the storage to check the multidimensional validity of an SQL cube query and management of schemaless data, and (2) systems de- and to derive the underlying multidimensional schema. This signed as an SQL interface for schemaless data. algorithm identifies multidimensional concepts from the struc- Solutions in the first category include RDBMSs that sup- ture of the query itself and by following foreign keys in the port storage and querying of schemaless data to be used relational schema. along with relational data in one system [5, 17] (e.g., Or- games : games : {“ id” : “52f2a70eddbd75540aba7c06”, {“ id” : “52f2a70eddbd75540aba7c06”, acle, IBM DB2, and PostgreSQL). Current systems do not “date” : ISODate(“1989-03-27”), “date” : ISODate(“1989-03-27”), impose a fixed relational schema to store and query data, “teams” : “hometeam” : [ {“name” : “Detroit Pistons”, {“name” : “Detroit Pistons”, but derive and maintain a dynamic schema to be used for “abbreviation” : “DET”, “abbreviation” : “DET”, schema-on-read SQL/JSON querying instead [17]. Other so- “score” : 90, “score” : 90, “home” : true, “home” : true, lutions that fall into this category are virtual adapters that “won” : 1, “won” : 1, expose a relational view of the data in a document store, to “results” : { “points” : 90, . . . }, “results” : { “points” : 90, . . . }, “players” : “players” : be used in common BI tools; an example is the MongoDB [ {“player” : “Isiah Thomas”, [ {“player” : “Isiah Thomas”, BI connector (www.mongodb.com/products/bi-connector). “points” : 30, . . . }, “points” : 30, . . . }, ... ... The second line of solutions are SQL-like interfaces that ] ] offer extensions to query schemaless data persisted in doc- }, }, {“name”: “Dallas Mavericks”, “guestteam” : ument stores or as JSON files (e.g., in Hadoop). Spark “abbreviation” : “DAL”, {“name”: “Dallas Mavericks”, SQL [3] has been designed for relational data processing “score” : 77, “abbreviation” : “DAL”, “home” : false, “score” : 77, of native Spark distributed datasets in addition of diverse “won” : 0, “home” : false, data sources and formats (including JSON). In order to run “results” : { “points” :77, . . . }, “won” : 0, “players” : “results” : { “points” :77, . . . }, SQL queries on schemaless data in Spark SQL, a schema [ {“player” : “Rolando Blackman”, “players” : is automatically inferred beforehand. Apache Drill (drill. “points” : 23, . . . }, [ {“player” : “Rolando Blackman”, ... “points” : 23, . . . }, apache.org) is a distributed SQL engine that can join data ] ... from multiple data sources, including Hadoop and NoSQL } ] ] } databases. It has a built-in support for JSON with colum- } } nar in-memory execution. Drill is also able to dynamically discover a schema during query processing, which allows to (a) (b) handle schemaless data without defining a schema upfront. Finally, Presto (prestodb.io) is a distributed SQL engine, Figure 1: Sample JSON document of a game developed at Facebook for interactive queries, that can also combine data from multiple kinds of data sources includ- ing relational and non-relational databases. When querying Example 1. In the scope of this paper, we focus on col- schemaless data, Presto tries to discover data types, but it lections of JSON documents that logically represent the same may need a schema to be defined manually in order to run business entities, while expecting that their structure may queries. All above-mentioned engines differ in their archi- vary. In particular, we will use as a working example a col- tecture, support to ANSI SQL, and extensions to SQL to lection that models NBA games (adapted from [26]), where a handle schemaless and nested data. These engines can con- document represents one game between two teams along with nect to MongoDB, which gave us various architectural op- players and team results. In the sample document shown in tions for implementation. However, none of them supports Figure 1a the main object is games, which holds the array multidimensional or OLAP querying over document stores. of objects teams (to-many), which in turn holds the object results (to-one) and the array of objects players (to-many). 3. DOCUMENT STORES Figure 1b shows an alternative JSON representation of the A document store holds a set of databases, each organizing same domain; the relationship between game and teams is the storage of documents in the form of collections in a way modeled using two separate sub-documents (home and guest similar to rows and tables in relational databases. A docu- team) instead of an array. Note that a simple query that ment, also called object, consists of a set of key-value pairs computes the average score by team would be much easier to (or fields) mainly encoded in JSON format (www.json.org). implement using the first representation and it would also JSON has many variants that are used for storage and op- perform better, since in the second representation it would timization purposes such as BSON (www.bsonspec.org) in require two separate queries. MongoDB and, very recently, OSON in the Oracle DBMS [17]. Keys are always strings, while values have the follow- ing types: primitive (number, string, Boolean), object (or 4. APPROACH sub-document), array of atomic values or objects. As already discussed, our approach for enabling self-service From a conceptual point of view, a typical collection con- BI on document stores includes three phases: multidimen- sists of a set of business entities connected through two kinds sional enrichment, querying, and OLAP enabling; these phases of relationships: nesting and references. Nesting consists of are discussed in the detail in the following subsections. embedding objects arbitrarily within other objects, which corresponds to two types of relationships: to-one in the case 4.1 Multidimensional Enrichment of a nested object and to-many in the case of an array of The aim of this phase is to define a draft multidimensional nested objects. On the other hand, references are similar schema on which the user will be able to formulate queries. to foreign keys in relational databases and can be expressed To this end, we extract the schema of the input collection manually or using a specific mechanism such as $ref in Mon- and enrich it with basic multidimensional knowledge. Sev- goDB. However, when references are used, getting data re- eral works have addressed schema extraction and discovery quires joins which are not supported in document stores; so, from document stores [25, 29, 14]. These works focus on the nesting is most often used instead. Since a document store structural variety of documents within the same collection has a flexible schema there are a variety of ways for data caused by their schemaless nature and by the evolution of representation; nevertheless, the way data is modeled may data; for instance, [14] proposes to extract a JSON schema affect performance and data retrieval patterns. (www.json-schema.org) similar to a JSON document and infer attribute and type occurrence frequencies to empha- After the c-schema S of a collection has been defined, a size the structural variety. In [29], all the schema versions corresponding multidimensional schema has to be derived in a collection are extracted and stored in a repository to be from it. Since measures at different granularities can pos- used within a schema management framework. In order to sibly be included in the documents (e.g., in our example, have a single view of the distinct schemata in a collection, points at the player’s level and score at the team level), the the authors propose a relaxed form of schema called skele- definition we provide relates each single measure to a specific ton that better captures the core and prominent fields and set of dimension. filters out unfrequent ones. This would be useful for us since we rely on a single view of the collection; however, the com- Definition 2 (Md-Schema). A multidimensional putational cost for building skeletons is high, making them schema (briefly, md-schema) is a triple M “ pH, M, gq unsuitable in real-time contexts. where: Since schema discovery is not the focus of this paper, we ‚ H is a finite set of hierarchies; each hierarchy hi P H adopt a simple algorithm that builds a tree-like schema in- is associated to a set Li of categorical levels and a roll- cluding all the fields that appear in the documents [3]. This L up partial Ť order ľi of Li . Each element G P 2 , where algorithm works in a single-pass over data to infer a tree L “ i Li and at most one level in G is taken from structure of fields and their corresponding types (fields with each hierarchy hi , is called a group-by set of H. varying types are generalized to String). The tree-like collection schema that will be the input for ‚ M is a finite set of numerical measures. the subsequent steps of our approach is defined as follows. ‚ g is a function relating each measure in M to its di- Definition 1 (Collection Schema). Given a collec- mensions, i.e., to the set of levels that determine its tion D, its schema (briefly, c-schema) is a tree S “ pK, Eq granularity: g : M Ñ G where G is the set of all group- where: by sets of H. ‚ K is the union of the sets of keys appearing in the At first, a draft md-schema Mdraf t is built from S by documents of D; tentatively labelling all attributes of types date, string, and Boolean as dimensions (i.e., as levels of different hierarchies), ‚ r P K is the distinguished root node and is labelled with and all attributes of types integer and decimal as measures. the collection name; Date dimensions are then decomposed into different categor- ical levels giving rise to standard temporal hierarchies (e.g., ‚ E is a set of arcs that includes (i) an arc r Ñ k for date ľ month ľ year). Note that no further attempt to dis- each key/value pair in the root r, and (ii) an arc k1 Ñ cover hierarchies is made at this stage, since discovering the k2 iff k2 appears as a key in an array of nested objects FDs involving all attributes would be computationally too having key k1 . expensive. As a consequence, in Mdraf t two levels l and l1 might be erroneously modeled as two separate dimensions, Figure 2 shows the c-schema to which the document of while they should be actually part of the same hierarchy Figure 1a conforms. A path from the root key to a leaf because l ľ l1 . key is called an attribute. In denoting attributes we use The user can contribute to this step by manually changing the dot notation omitting the root key (since a single col- the label of some attributes, since in some cases a numeric lection is involved) but including the keys corresponding to attribute can be used as a dimension, and a non-numeric nested objects; so, for instance, the c-schema in Figure 2 attribute can be used as a measure. The first situation has contains attributes id, date, teams.name, teams.results.points, no impact on the following steps, since a group-by set can teams.players.player, teams.players.points, etc. The depth of also include numeric attributes. Conversely, the second sit- an attribute a is the number of arcs included in its path, uation requires that the values of the non-numeric attribute and is denoted by depthpaq (e.g., depthpidq “ 1); the set of are transformed into numerical values to be supported by attributes at depth δ is denoted by Attrspδq. aggregation functions. To complete the definition of Mdraf t , the granularity map- games ping g must be built. We recall from [22] that an md-schema should comply with the multidimensional space arrangement constraint, stating that each instance of a measure is related to one instance of each dimension. In the c-schema S, at- tributes in Attrspδq are related by to-many multiplicity to id date teams those in Attrspδ ` 1q, so a measure cannot be related to levels having a higher depth. Therefore, gpmq is set by con- necting each measure m to the group-by set G that contains all levels l P L such that depthpmq ě depthplq. name abbreviation home score results.points players ... Example 2. In this example and in the following ones we will refer to the representation in Figure 1a. Attribute teams.score is numerical, so it is assumed to be a mea- player points ... sure. It is depthpteams.scoreq “ 2, therefore teams.score is tentatively associated in Mdraf t with group-by set G “ tteams.name, teams.abbreviation, teams.home, id, dateu. All Figure 2: NBA games c-schema the measures that have the same depth of teams.score (e.g., teams.won and teams.results.points) are also asso- Algorithm 1 Querying ciated with G. Similarly, measure teams.players.points Require: A c-schema S, an md-schema M, an md-query q “ has depth 3, so it is associated with group-by set pGq , Mq , Σq on M Ensure: A (refined) md-schema M, a (valid) md-query q on M G1 “ tteams.players.player, teams.name, teams.abbreviation, 1: Mq Ð CheckM easurespMq , Gq , Mq teams.home, id, dateu. 2: if Mq “ H then 3: Mrec “ RecommendM easurespGq , Mq 4.2 Querying 4: if Mrec ‰ H then 5: update Mq based on the user’s choice This phase is aimed at supporting a non-expert user in for- 6: if Mq ‰ H then mulating a well-formed multidimensional query. In a schema- 7: if |Gq | ą 1 then on-write approach, queries are formulated on a complete md- 8: Gq “ CheckGroupBypMq , Gq , M, Sq schema whose correctness is ensured by the designer. Con- 9: if |Gq | “ 1 then 10: Lrec “ RecommendLevelspGq , Mq versely, in a schema-on-read approach —our case—, the md- 11: update Gq and M based on the user’s choice schema is defined at read-time by the query itself. Though 12: Σ Ð CheckSummarizationpq, Mq this approach gives more flexibility because each user can 13: Executepqq “force” her own multidimensional view onto the data, it re- 14: else 15: ask the user to change the md-query group-by set Gq quires a further check to ensure that the FDs implied by the query are not contradicted by data; so, querying becomes an iterative process where the underlying md-schema is pro- to be used in the next phase. Algorithm 1 works as fol- gressively refined together with the query. lows. Firstly, procedure CheckMeasures drops from Mq the Definition 3 (Md-Query). A multidimensional query measures, if any, that are not compatible with Gq based on (briefly, md-query) q on md-schema M “ pH, M, gq is a their granularity (Line 1). If no measure is left, some alter- triple q “ pGq , Mq , Σq where native (compatible) measures are suggested by Recommend- Measures (Lines 2–5). If some measures Mrec are found, ‚ Gq P G is the md-query group-by set; the user interacts to approve their inclusion in the md-query (Line 5). Then, if the md-query group-by set includes either ‚ Mq Ď M is the set of required measures; 2 or 3 levels (Lines 7–11), procedure CheckGroupBy is called ‚ Σ is a function that associates each measure m P Mq to drop from Gq the levels, if any, that are not compliant with an aggregation operator. with the base integrity constraint and update M accord- ingly (Line 8). If just one level is left in Gq , procedure Starting from the draft md-schema Mdraf t , the user for- RecommendLevels is called to look for additional group-by mulates an md-query q by choosing one to three dimensions levels (Line 10) and possibly include them in the md-query (Gq ), one or more measures of interest (Mq ), and an ag- (Line 11). Finally, procedure CheckSummarization checks gregation operator for each measure (Σ). However, to be the aggregation operators (Line 12). We allow the resulting considered well-formed, q (and the md-schema q is formu- md-query to be executed with a single group-by level, but lated on) should comply with the following constraints [22]: not without measures (Lines 6 and 13); in the latter case, the user is asked to choose a new group-by set (Line 15). 71 The base integrity constraint, stating that the levels in the group-by set are orthogonal, i.e., functionally 4.2.1 Checking Measures independent on each other. The goal of this procedure is to ensure that all the mea- sures in Mq can be correctly aggregated at group-by set Gq ; 72 The summarization integrity constraint, which requires specifically, our goal is to avoid that for some m P Mq there disjointness (the measure instances to be aggregated is a to-many relationship between m and a level l P Gq , are partitioned by the group-by instances), complete- because this would violate disjointness and lead to double ness (the union of these partitions constitutes the en- counting when executing q. Measure check is based on the tire set), and compatibility (the aggregation operator g component of M, which expresses the granularity of each chosen for each measure is compatible with the type measure. of that measure) [15]. To explain how this is done, we observe that the roll-up How to carry out these validity checks is discussed in the partial orders on the hierarchies H of M induce a partial following subsections. order ľH on the set G of the group-by sets of H, defined as the product order of the single roll-up orders.1 Measure m is Example 3. A possible md-query on the draft md-schema compatible with Gq iff gpmq ľH Gq ; indeed, if this condition described in Example 2 is the one characterized by Gq “ is satisfied, the granularity expressed by the group-by set is tteams.players.player, dateu, Mq “ tteams.scoreu, and Σ “ coarser than the one at which m is defined, meaning that Sum. Unfortunately, aggregating on teams.score would re- m can be safely aggregated at Gq . The measures that are sult in double counting since each instance of teams.score is found not to be compatible with Gq are removed from Mq . related to multiple instances of teams.players.player (a team has several players). So, this query violates disjointness and Example 4. Considering again the query in Example 3, is not well-formed. it is gpteams.scoreq ńH Gq (because teams.players.player is not in gpteams.scoreq). Then, q is not well-formed and The pseudo-code of the querying phase is sketched in Al- teams.score is dropped from Mq . gorithm 1. It starts from the md-query q and takes as input 1 The product order of n total orders is a partial order on both the c-schema and the md-schema. The output of the the Cartesian product of the n totally ordered sets, such algorithm is a valid md-query based on a refined md-schema that px1 , . . . , xn q ľ py1 , . . . , yn q iff xi ľ yi for i “ 1, . . . , n. This check is performed at the schema level, so some ex- (c) Finally, when l1 Ñ l and l2 Ñ l, ceptions may arise when the relationship between a measure i. if the relationship between l1 and l2 is many-to- m and Gq is hidden in data. This may happen when arrays many, they are both retained in Gq ; conceptu- are not used to model to-many relationships, or when m de- ally, l is a convergence [8] and the md-schema pends either on a level not present in Gq or on a subset of must be updated by adding l1 ľ l and l2 ľ l. the levels in Gq [18]. In these cases, m may be non-additive or even non-aggregable on Gq , so the user’s knowledge of ii. if the relationship between l1 and l2 is many-to- the application domain is required to manually fix summa- one, then the three levels in Gq belong to the rization (see Section 4.2.5). same hierarchy (l1 ľ l2 ľ l) and only the level with highest cardinality, l1 , is kept in Gq . 4.2.2 Checking Group-by Set This process is aimed at ensuring completeness and base 4.2.3 Recommending Measures integrity. When there is no compatible measure left in Mq , this pro- The completeness condition is violated when, for some cedure looks for alternative measures. Specifically, it returns level l P Gq , there is no instance corresponding to one or all the measures in the md-schema that are compatible with more instances of a measure in Mq ; from a conceptual point Gq , i.e., all m P M such that gpmq ľH Gq . of view, l is classified as optional [8]. In a schema-on-write approach this problem is fixed by aggregating all “dangling” Example 5. With reference to Example 4, procedure measures into an ad-hoc group of l. Similarly, in our schema- RecommendM easures returns teams.players.points, since it on-read approach, these instances are grouped into a null is gpteams.players.pointsq ľH Gq . instance of l thus restoring the completeness condition; then for instance, the $ifNull operator of MongoDB could be used 4.2.4 Recommending Levels to replace null values with an ad-hoc one when executing q. Recommending additional group-by levels is done when Base integrity requires that the levels in Gq are mutually originally the user had selected two or three group-by levels, orthogonal. So, from this point of view, md-query q is valid but only one of them was left in Gq after checking the group- only if, for each pair of levels l, l1 P Gq , there is no FD by set. For a given candidate level l, it requires to check between l and l1 , i.e., there is a many-to-many relationship that the base integrity constraint is met between l and some between them. An FD is a to-one relationship, which we will other level in M, and that the measure check is not violated. denote with l Ñ l1 to emphasize that values of l functionally The first level found is proposed to the user, who has the determine the values of l1 . Checking base integrity requires choice to use it or to proceed looking for other levels (up to to access data in order to retrieve AFDs between the levels, a maximum of three levels in the group-by set). in Gq , that are not related by a to-many multiplicity in the c-schema (see Section 4.3 for more details about AFDs). Example 6. Let Gq “ tid, dateu and Mq “ We recall that 1 ď |Gq | ď 3: tteams.players.pointsu. The relationship between the levels in Gq is many-to-one, so date is removed. In this 1. If |Gq | “ 1, orthogonality is obvious. case, to recommend levels we may look for a many-to-many 2. If |Gq | “ 2, the relationship between the two levels relationship between id and teams.name, teams.abbreviation, in Gq is checked. If it turns out to be many-to-many teams.home, and teams.players.player. Since teams.name (e.g., if Gq “ tdate, teams.nameu), the base integrity does not violate the measure check and has a many-to-many constraint is met and both levels remain in Gq . If it relationship with id, it is recommended to the user as a turns out to be many-to-one (e.g., if Gq “ tid, dateu), possible group-by level. then there is a roll-up relationship between the two levels in Gq . Only the level at the “many” side (id) re- 4.2.5 Checking Summarization mains in Gq and the md-schema is modified by placing Compatibility states that measures cannot be aggregated these levels in the same hierarchy (id ľ date). A spe- along hierarchies using any aggregation operators. In prin- cial case is when the relationship is one-to-one (e.g., ciple, checking summarization would require knowing the if Gq “ tteams.name, teams.abbreviationu); here, one of measure category (flow, stock, or value-per-unit), the type the levels is kept in Gq while the other is considered of group-by levels (temporal or non-temporal), and the ag- as a descriptive attribute. gregation operator (Sum, Avg, Min, Max, Count, etc.) [15]. Knowing the measure category and type of group-by levels 3. In case |Gq | “ 3, there are three possibilities: could be used to recommend some aggregation operators, but still the user would have to choose among a number (a) If many-to-many relationships are found between of potential options; besides, correctly classifying a mea- each pair of levels, then the constraint is met sure into flow, stock, or value-per-unit may be hard for non- and all levels remain in Gq (e.g., if Gq “ skilled users. Therefore, we prefer to enable users to directly tdate, teams.name, teams.players.playeru). pick an aggregation operator for each measure in Mq . (b) Let many-to-many relationships be found between Summarization is violated when a measure m P Mq has two pairs of levels pl, l1 q, pl, l2 q, and a many-to-one finer granularity than another measure m1 P Mq , since m relationship be found between the remaining pair would be double-counted when executing q. In this case, pl1 , l2 q (e.g., if Gq “ tteams.name, id, dateu, since the user is warned that she will get erroneous results, and id Ñ date). In this case, l1 and l2 are placed in the she may choose to change the aggregation operator for m same hierarchy in the md-schema (l1 ľ l2 ), and (e.g., using Min, Max or Avg instead of Sum will give the l2 is removed from Gq . correct result) or to drop m from Mq . Example 7. Let Mq “ tteams.score, teams.players.ptsu. 4.3.1 Roll-up Discovery These measures have different granularities, so if the user Let l P Gq ; discovering possible roll-ups for l requires to has chosen to aggregate teams.score (which has finer gran- mine the AFDs of type l ; l1 , with l1 P LzGq and depthpl1 q ď ularity) using Sum she will get double counting. Then, she depthplq. Let R be the set of all right-hand sides for these can either change the aggregation operator for teams.score AFDs. To avoid useless checks, the AFDs whose right-hand to Avg or drop teams.score from Mq . side l1 has higher cardinality than l are not checked, since they clearly cannot hold. One exception is when |l| “ |l1 | or 4.3 OLAP Enabling |l| Á |l1 | (due to approximation): in this case both l1 ; l The goal of this phase is to refine the md-schema by dis- and l ; l1 must be checked, since the relationship between covering some hierarchies, so that the user is enabled to in- l and l1 may be one-to-one (so l1 is a descriptive attribute teract with data in an OLAP fashion. Completely building for l). These bidirectional checks are made at the end of the all the hierarchies would require to mine all FDs between process to prioritize the discovery of “true” hierarchies. The levels, which would be computationally too expensive. For AFDs in R are checked until one is found to hold, say l ; l1 . this reason, we only build local portions of hierarchies for Then, the md-schema M is updated by placing these levels the levels in the group-by set of the previously-formulated in the same hierarchy (l ľ l1 ) and the user is notified of the md-query q. Besides, again for complexity reasons, we only ability of performing a roll-up from l. mine simple FDs (i.e., those relating single attributes rather than attribute sets), which are mostly common in multidi- Example 8. Let Gq “ tid, teams.nameu and Mq “ mensional schemata. tteams.score, teams.wonu. The search space for level id Specifically, the idea is to mine, for each level l P Gq for only includes date, since it is the only level for which which no roll-up relationships have been discovered yet, the depthpdateq ď depthpidq. Since |id| ą |date|, id ; date FDs of either type l Ñ l1 (to enable a roll-up of l) or l1 Ñ l (to is checked. This is found to be true, so the md-schema is enable a drill-down of l). Then, if the user applies a roll-up updated with id ľ date. The search space for teams.name or drill-down, a new md-query q 1 is formed and the process consists of teams.abbreviation and teams.home; by query- is iterated to further extend the hierarchies. Remarkably, ing data we find that |teams.name| “ |teams.abbreviation| ą using FDs to discover hierarchies guarantees that q 1 satisfies |teams.home|. Firstly, teams.name ; teams.home is the conditions discussed earlier, specifically disjointness and checked, and found not to hold. Then, teams.name ; completeness. teams.abbreviation and teams.abbreviation ; teams.name are FDs are not explicitly modeled in a document store, so checked; both are found to hold, giving rise to a descriptive we must resort to data. Since document stores host large attribute. amounts of data, we can reasonably assume that the mined FDs are representative enough of the application domain. 4.3.2 Drill-Down Discovery However, schemaless data commonly present some errors Here the set of candidate levels for checking AFDs of type and missing values which may hide some FDs. The tool we l1 ; l includes the levels in LzGq whose depth is greater adopt to cope with this issue are approximate FDs (AFDs) or equal than l, i.e., such that depthpl1 q ě depthplq. As in [10, 11], which “almost hold” on data, instead of traditional roll-up discovery, an AFD whose left-hand side l1 has lower ones. Unfortunately, this entails an approximate satisfac- cardinality than l is not checked, since it cannot hold. If tion of the disjointness and completeness conditions, possi- l1 ; l is found to hold, l1 ľ l is added to M and the user is bly leading to non-disjoint and incomplete hierarchies. notified of the ability of drilling down from l. Definition 4 (Approximate FD). Given two levels l Note that drilling down from l to l1 could produce a new and l1 , let strengthpl, l1 q denote the ratio between the number md-query q 1 whose group-by set, Gq1 “ Gq ztlu Y tl1 u, is not of unique values of l and the number of unique values of ll1 . compatible with some of the measures in Mq1 “ Mq . Since We will say that AFD l ; l1 holds if strengthpl, l1 q ě , checking compatibility is computationally cheaper than check- where  is a user-defined threshold [11]. ing AFDs (as explained in Section 4.2.1, it can be done based on the partial order of group-by sets, without access- At a given time during a user’s session, let M be the cur- ing data), the AFD for a candidate level l1 is checked only if rent version of the md-schema and q be the last md-query Gq1 is found to be compatible with at least one of the mea- formulated. Then, the search space for mining AFDs in- sures in Mq1 . Of course, if l1 ; l is found to hold and the cludes the levels that are not in Gq and are not involved user decides to drill-down to l1 , the incompatible measures in roll-up relationships in M. To reduce the search com- in Mq1 must be dropped. plexity we take into account the structural clues provided by the c-schema. Specifically, let l and l1 be two levels, such Example 9. Consider again Example 8. The that depthplq ă depthpl1 q; as already stated, the attributes search space for id includes date, teams.home, and in Attrspδq are related by to-many multiplicity to those in teams.players.player (we recall that teams.abbreviation has Attrspδ ` 1q, so l ­; l1 . For this reason, checking if l and already been added as a descriptive attribute). Since id l1 should be placed in the same hierarchy only requires to has the highest cardinality, there are no AFDs to check. check if l1 ; l. In addition, we avoid checking trivial and On the other hand, the search space for teams.name transitive AFDs since we explore one hierarchy at a time. consists of teams.home and teams.players.player, where Finally, the result of all AFD checks performed is stored |teams.home| ă |teams.name| ă |teams.players.player|. AFD in a meta-data repository, to be used to avoid checking the teams.home ; teams.name is not checked because of its same AFDs twice. cardinality; the same for teams.players.player ; teams.name, In the following subsections, roll-up and drill-down dis- since Gq1 “ tid, teams.players.playeru is not compatible with covery are detailed. the measures in Mq1 . The md-schema resulting after OLAP enabling in Exam- In this section, we consider five md-queries on the games ples 8 and 9 is shown in DFM notation [8] in Figure 3. Note dataset, q1 to q5 , and four on the DBLP dataset, q6 to q9 that, at the end of the user’s session, md-schemata can be (the group-by sets of all queries are shown in Table 1). Then, stored for reuse and sharing. we measure the time required by the querying and OLAP- enabling phases. We also discuss the efficiency of our algo- rithm based on the number of performed and skipped checks Year in each phase. The results obtained are proposed below. Month Game id Teams.name 5.1 Querying Teams.score To evaluate the efficiency of the validation of an md-query, Teams.won Teams.abbreviation Date we measure the time of the group-by check since it is the only step that requires to access data.2 In Table 1 we show, for our nine md-queries, the number of checks avoided using Figure 3: The games md-schema the c-schema (#Avoided), and the total time for validating the md-query (tquerying ). In order to measure tquerying , the AFD checks required by a single md-query are executed in 5. EXPERIMENTAL EVALUATION parallel. We can easily conclude that (i) our approach effec- tively reduces the number of checks by relying on the dataset We evaluate the performance of our approach using two structure, and (ii) the md-query validation time depends on real-world datasets : the number of levels in the group-by set and on their depths. ‚ Games. The working example dataset has been col- Remarkably, the md-query validation time is still reasonable lected by Sports Reference LLC [26]. It contains around (max 50 secs) even with a large dataset, which makes our 32K nested documents representing NBA games in the approach suitable for real-time usage. period of 1985–2013. Each document represents a game between two teams with at least 11 players each. It Dataset q Gq #Avoided tquerying contains 46 attributes; 40 of them are numeric and q1 id, teams.name 1/2 0.181 represent team and player results. q2 date, teams.name 1/2 0.177 q3 id,teams.name, 2/6 0.211 ‚ DBLP. This dataset contains 5.4M documents scraped teams.home from DBLP (dblp.uni-trier.de/xml/) in XML for- Games q4 teams.players.player, 3/6 1.364 mat and converted into JSON. Documents are flat and date, teams.name represent eight kinds of publications including confer- q5 teams.players.player, 2/6 1.157 teams.name, ence proceedings, journal articles, books, thesis, etc. teams.home, So, they have common attributes such as title, au- thor, type, year and uncommon ones such as journal q6 journal, year 0/2 3.616 q7 year, type 0/2 11.843 and booktitle. DBLP q8 author, year 1/2 36.685 Implementation. Each dataset has been loaded as a col- q9 type, year, author 2/6 49.993 lection in a local MongoDB instance running on Intel Core i5 CPU at 3.3 GHz and 4 GB of RAM machine with Ubuntu 14.04 OS. We have focused on the parts of our approach Table 1: Total time for the querying phase (in sec- that require accessing data to retrieve multiplicities of inter- onds) level relationships by means of count distinct queries, since the purely algorithmic parts have no relevant compu- tational complexity. Query execution can be delegated ei- 5.2 OLAP Enabling ther to the document store (MongoDB in our architecture) For OLAP enabling, we measure the time to check each or to query engines such as Spark SQL, Drill, and Presto. single AFD and the time to retrieve the cardinality of each Based on the tests we performed, these engines are signifi- level. Tables 2 and 3 show, for each AFD checked between cantly slower than native querying, plausibly because of the a pair of levels, the number of processed documents, the different query execution plans. In addition, these engines AFD strength, and the checking time. The difference in the sometimes return incorrect results when it comes to count number of documents (or the data size) for a given dataset distinct queries involving nested attributes. Therefore, we is to due to the flattening operation that is performed when defined two queries based on the native query language of executing QAF D . The results clearly show that the querying MongoDB: time tAF D depends on the levels and their depths. This 1. The first one (QAF D ) checks the AFD between a pair is apparent for the AFDs involving the author attribute in of levels and returns its approximation using the for- Table 3, which have a higher execution time. This is due mula of Definition 4. to the amount of memory allowed by MongoDB for group- by queries, which is limited to 100 MB; when a query does 2. The second one (Qcard ) returns the cardinality of a not fit in memory, temporary files must be written on disk, level. which dramatically slows execution down. Methodology. In the previous section, we showed how 2 The time for actually executing each md-query is not con- our approach validates md-queries and iteratively derives an sidered here, since the optimization of each md-query is out md-schema that satisfies the multidimensional constraints. of the paper scope. l ; l1 #Docs strengthpl, l1 q tAF D Table 4 shows for each level its cardinality and the time id ; date 100% 0.113 to retrieve it. From these results, we can conclude that the 32K date ; id 13.96% 0.065 efficiency of checking cardinalities also depends on the lev- t.name ; id 0.05% 0.181 els. tcard is normally small, since MongoDB can use indexes t.name ; date 0.05% 0.177 instead of collection scans. The high value of tcard for the 64K t.name ; t.abbreviation 100% 0.175 level teams.players.player is due to the non-use of nested at- t.name ; t.home 50% 0.168 tribute indexes to answer the corresponding query. As to t.abbreviation ; id 0.05% 0.169 level author, Qcard failed to return the result because the t.abbreviation ; date 0.05% 0.169 BSON document size (16MB) was exceeded. So we had to 64K t.abbreviation ; t.name 100% 0.161 define a different query that could not benefit from indexes t.abbreviation ; t.home 50% 0.163 and suffered from the memory limit discussed earlier. t.home ; id 0.01% 0.169 Using cardinalities to reduce the search space significantly t.home ; date 0.02% 0.159 improves the overall performance, since checking a cardinal- 64K t.home ; t.name 2.77% 0.179 t.home ; t.abbreviation 2.77% 0.169 ity is faster than checking an AFD. Besides, the number of feasible AFDs when using cardinalities (8, i.e., their strength t.players.player ; id 0.34% 1.285 is shown in bold in Table 2) is less than half the total number t.players.player ; date 0.34% 1.278 t.players.player ; t.name 644K 34.74% 1.113 of all checked AFDs (19, i.e., the number of rows in Table t.players.player ; t.abbrev. 34.74% 1.087 2). The same applies for the DBLP dataset, where only 10 t.players.player ; t.home 50.54% 1.054 of 16 AFDs are feasible. Finally, Table 5 shows the overall performance of the OLAP enabling phase for our nine md-queries (for all levels in the Table 2: Time for AFD checks on the Games dataset query): the number of roll-up and drill-down relationships (in seconds) discovered, the number of checks avoided in roll-up and drill-down discovery, and the total time spent. The latter l ; l1 #Docs strengthpl, l1 q tAF D (tOLAP ) is calculated as the sum of the times for checking each AFD, plus the time for checking all cardinalities. The year ; type 21.76% 10.945 year ; journal 5.4M 0.35% 3.373 results show that our approach for discovering hierarchies year ; booktitle 0.16% 4.461 effectively reduces the number of AFD checks by using car- dinalities and by relying on the dataset structure. Further- type ; year 2.07% 11.211 type ; journal 5.4M 0.12% 8.181 more, the time required by OLAP enabling is reasonable for type ; booktitle 0.03% 4.824 all md-queries (2 minutes at most), which proves that our performances fit real-time contexts. journal ; year 7.09% 3.365 journal ; type 5.4M 99.94% 8.166 journal ; booktitle no relationship 1.841 Dataset q #Roll-up, #Drill-down #Avoided tOLAP booktitle ; year 29.02% 4.517 q1 1,0 5/8, 6/8 2.657 booktitle ; type 5.4M 54.97% 4.949 q2 0,1 6/8 , 5/8 2.657 booktitle ; journal no relationship 1.852 Games q3 1,0 7/9, 5/9 3.706 q4 0,1 5/9, 7/9 3.685 author ; type 42.98% 43.231 q5 0,0 7/9, 7/9 2.513 author ; year 36.13% 36.685 12M author ; journal 42.31% 15.401 q6 1,0 4/6 , 2/6 89.940 author ; booktitle 31.92% 18.881 DBLP q7 0,1 6/6 , 0/6 113.287 q8 0,0 3/6 , 3/6 107.401 q9 0,1 4/6 , 2/6 67.653 Table 3: Time for AFD checks on the DBLP dataset (in seconds) Table 5: Total time for the OLAP enabling phase (in seconds) Dataset l |l| tcard id 31686 0.124 date 4424 0.019 teams.name 36 0.057 6. CONCLUSION Games teams.abbreviation 36 0.056 In this paper we have proposed an interactive schema-on- teams.home 2 0.049 teams.players.player 2191 0.622 read approach for enabling OLAP on document stores. To All levels — 0.927 this end, AFDs are mined and used to automate the discov- ery of multidimensional structures. The user interaction is author 1859242 12.298 journal 1632 0.011 limited to the selection of multidimensional concepts and to booktitle 10811 0.063 a proper choice of the aggregation operators. After validat- DBLP ing user queries from the multidimensional point of view and year 82 0.001 type 8 0.001 refining the underlying multidimensional schema, we adopt All levels — 12.374 a smart strategy to efficiently build local portions of hierar- chies aimed at enabling OLAP-style user interaction in the form of roll-ups and drill-downs. Table 4: Time for cardinality checks (in seconds) Overall, the experiments we conducted show that the per- formances of our approach are in line with the requirements of a real-time user interaction. However, some relevant is- [13] M. R. Jensen, T. H. Møller, and T. B. Pedersen. sues still need to be explored and are part of our future Specifying OLAP cubes on XML data. Journal of work. To improve performance, we plan to further optimize Intelligent Information Systems, 17(2-3):255–280, the algorithms proposed. Besides, to increase effectiveness, 2001. the evolution of data and schemata in a collection must be [14] M. Klettke, U. Störl, and S. Scherzinger. Schema considered; we intend to address this issues by searching for extraction and structural outlier detection for temporal FDs. JSON-based NoSQL data stores. In Proc. BTW, pages 425–444, 2015. 7. REFERENCES [15] H. J. Lenz and A. Shoshani. Summarizability in [1] A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, OLAP and statistical databases. In Proc. SSDBM, J.-N. Mazón, F. Naumann, T. B. Pedersen, S. Rizzi, pages 132–143, 1997. J. Trujillo, P. Vassiliadis, and G. Vossen. Fusion [16] Z. H. Liu and D. Gawlick. Management of flexible cubes: towards self-service business intelligence. schema data in RDBMSs - opportunities and International Journal of Data Warehousing and limitations for NoSQL. In Proc. CIDR, 2015. Mining, 9(2):66–88, 2013. [17] Z. H. Liu, B. Hammerschmidt, D. McMahon, Y. Lu, [2] A. Abelló, O. Romero, T. B. Pedersen, R. Berlanga, and H. J. Chang. Closing the functional and V. Nebot, M. J. Aramburu, and A. Simitsis. Using performance gap between SQL and NoSQL. In Proc. semantic web technologies for exploratory OLAP: A SIGMOD, pages 227–238, 2016. survey. IEEE Transactions on Knowledge and Data [18] J. N. Mazón, J. Lechtenbörger, and J. Trujillo. A Engineering, 27(2):571–588, 2015. survey on summarizability issues in multidimensional [3] M. Armbrust, A. Ghodsi, M. Zaharia, R. S. Xin, modeling. Data and Knowledge Engineering, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, 68(12):1452–1469, 2009. T. Kaftan, and M. J. Franklin. Spark SQL: Relational [19] C. Phipps and K. C. Davis. Automating data data processing in spark. In Proc. SIGMOD, pages warehouse conceptual schema design and evaluation. 1383–1394, 2015. In Proc. DMDW, pages 23–32, 2002. [4] R. Cattell. Scalable SQL and NoSQL data stores. [20] N. Prat, J. Akoka, and I. Comyn-Wattiau. A ACM SIGMOD Record, 39(4):12–27, 2011. UML-based data warehouse design method. Decision [5] C. Chasseur, Y. Li, and J. Patel. Enabling JSON Support Systems, 42(3):1449–1473, 2006. document stores in relational systems. In Proc. [21] S. Rizzi, E. Gallinucci, M. Golfarelli, O. Romero, and WebDB, pages 1–6, 2013. A. Abelló. Towards exploratory OLAP on linked data. [6] A. Cuzzocrea, L. Bellatreche, and I.-Y. Song. Data In Proc. SEBD, pages 86–93, 2016. warehousing and OLAP over big data: current [22] O. Romero and A. Abelló. Multidimensional design by challenges and future research directions. Proc. examples. In Proc. DaWaK, pages 85–94, 2006. DOLAP, pages 67–70, 2013. [23] O. Romero and A. Abelló. Automating [7] M. Golfarelli, S. Graziani, and S. Rizzi. Starry vault: multidimensional design from ontologies. In Proc. Automating multidimensional modeling from data DOLAP, pages 1–8, 2007. vaults. In Proc. ADBIS, pages 137–151, 2016. [24] O. Romero, D. Calvanese, A. Abelló, and [8] M. Golfarelli, D. Maio, and S. Rizzi. The dimensional M. Rodrı́guez-Muro. Discovering functional fact model: A conceptual model for data warehouses. dependencies for multidimensional design. In Proc. International Journal of Cooperative Information DOLAP, pages 1–8, 2009. Systems, 7(2-3):215–247, 1998. [25] D. S. Ruiz, S. F. Morales, and Jesus Garcia Molina. [9] M. Golfarelli, S. Rizzi, and B. Vrdoljak. Data Inferring versioned schemas from NoSQL databases warehouse design from XML sources. In Proc. and its applications. In Proc. ER, pages 467–480, 2015. DOLAP, pages 40–47, 2001. [26] K. Valeri. https://thecodebarbarian.wordpress.com, [10] Y. Huhtala, J. Kärkkäinen, P. Porkka, and 2014. H. Toivonen. TANE: An efficient algorithm for [27] J. Varga, A. Vaisman, O. Romero, L. Etcheverry, discovering functional and approximate dependencies. T. B. Pedersen, and C. Thomsen. Dimensional Computer Journal, 42(2):100–111, 1999. enrichment of statistical linked open data. Journal of [11] I. F. Ilyas, V. Markl, P. Haas, P. Brown, and Web Semantics, 40:22–51, 2016. A. Aboulnaga. Cords: Automatic discovery of [28] B. Vrdoljak, M. Banek, and S. Rizzi. Designing web correlations and soft functional dependencies. In Proc. warehouses from XML schemas. In Proc. DaWaK, of SIGMOD, pages 647–658, 2004. pages 89–98, 2003. [12] C. S. Jensen, T. Holmgren, and T. B. Pedersen. [29] L. Wang, O. Hassanzadeh, S. Zhang, J. Shi, L. Jiao, Discovering multidimensional structure in relational J. Zou, and C. Wang. Schema management for data. In Proc. DaWaK, pages 138–148, 2004. document stores. VLDB Journal, 8(9):922–933, 2015.